All published articles of this journal are available on ScienceDirect.
Inter-rater and Intra-Rater Reliability Test with Goodenough-Harris Drawing Test
Abstract
Introduction:
There are various methods to measure intelligence in children, one of which is drawing. The Goodenough-Harris Drawing Test (GHDT) is the first cognitive measurement tool developed based on drawing. Assigning quantitative scores to drawings is challenging, especially in controlling the level of subjectivity. Psychometric reliability evidence is therefore required to make sure the results are free from rater bias.
Methods:
Drawings of 799 children (kindergarten, 4 to 6 years old = 412; primary school, 7 to 9 years old = 387; boys = 388, girls = 411) were examined to investigate inter-rater and intra-rater reliability in GHDT scoring. The drawings were scored by two raters who had received special training on the scoring system based on the manual. Pearson correlation was used to analyze scoring reliability.
Results:
The correlation between the scores of the two raters was 0.840 (p<.001) for the man drawing and 0.844 (p<.001) for the woman drawing. In addition, intra-rater reliability ranged from 0.880 to 0.963.
Conclusion:
The study showed that GHDT scores are highly reliable when assigned by raters trained on the GHDT scoring manual. In addition, these results showed that the scoring criteria were applied consistently.
1. BACKGROUND
Measurement of intelligence in children generally aims to predict the success of children in learning and adapting [1, 2]. Various methods are used to assess children's intelligence, ranging from drawing to answering questions or telling stories [3, 4]. Using drawing as a measurement method is considered to have many advantages. Not only is drawing an enjoyable activity that children rarely resist [5]; it also has a high correlation with concept development and general intelligence [6]. It is believed that drawing provides rich insight into young children's thinking [7].
Human figures are considered the most familiar object to be drawn by children of any age and educational level [8]. Florence Goodenough developed a formal intelligence test that uses human figure drawing as measurement items. In her book, Measurement of Intelligence by Drawings, published in 1926, Goodenough describes the research and development process of what she called the Goodenough Draw-a-Man Test (GDAMT). The GDAMT is a non-verbal test used to measure the level of cognitive maturity by looking at the detailed execution of human figure drawings [9]. There were 51 strict indicators for rating each drawing, which were later developed and improved by Harris, Goodenough's student and colleague, into 73 indicators for the man figure and 71 indicators for the woman figure [10]. This test is known as the Goodenough-Harris Drawing Test (GHDT). Not only did Harris add score indicators, but he also extended the instructions to include Draw-a-Woman (DAW), Self-Portrait-Man (SPM; for boys), and Self-Portrait-Woman (SPW; for girls). These additions were expected to make the scoring system more objective than the original version. However, Harris only provided a scoring system for Draw-a-Man and Draw-a-Woman; the Self-Portrait scoring system has not yet been specified [11].
Goodenough-Harris Drawing Test (GHDT) is the first cognitive measurement tool developed based on drawing [12]. This method has no right or wrong answer and no time limit, and is non-invasive and nonthreatening. Hence, children can respond sincerely and comfortably [13]. It is believed that a child draws what he knows, not what he sees [14]. This 'knowing' sense is defined as intellectual maturity tapped by the drawing test [13]. Through the GHDT scoring system, a total test score will be generated and described as the level of non-verbal cognitive development [15].
Until today, the use of drawing as an objective measurement method is still debated [11]. Several previous studies have shown that this method is a valid and reliable measurement method [12]. However, more studies are required to give robust information on whether the scoring system achieves reliability and validity [11, 13]. Regardless of the existing set of guidelines and standards, there is still a salient problem in the scoring and interpretation system of Human Figure Drawing tests [16]. One contested problem is the subjective bias of the rater when assigning scores. This bias arises from differences in raters' backgrounds, such as education, professional training, experience, or point of view. Projective tests, such as the GHDT, are susceptible to misleading interpretation when the interpreter is influenced by social stereotypes [17].
When giving a quantitative score in a drawing test, it is challenging to control the level of the rater's subjectivity. Psychometric reliability evidence is therefore required to make sure that this subjectivity bias does not affect the rating results. The classic analysis method for determining test reliability is to obtain two scores from a group of participants and correlate the scores using a correlation coefficient [18]. In this context, the two scores can be acquired from (a) a set of children's Draw-a-Man (DAM) and Draw-a-Woman (DAW) drawings scored twice by the same rater (intra-rater reliability), (b) two separate scores obtained from the same child's drawing rated by two different raters (inter-rater reliability), or (c) scores obtained from the same child at two different times and assessed by the same rater (test-retest) [19]. Reliability testing is a technique generally used to measure the relationship between the scores given by several raters. Consistency between the scores given by the two raters is a prerequisite to ensure the accuracy of the assessment. Inter-rater agreement and inter-rater reliability are the two indices used to ensure the assessment's consistency [20].
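For reference, the correlation between two sets of scores described above is commonly computed as the Pearson product-moment coefficient. The notation below is an assumption introduced for illustration and does not come from the GHDT manual: x_i and y_i are the two scores obtained for drawing i (from two raters, or from two scoring occasions of one rater), and n is the number of drawings.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The coefficient ranges from -1 to +1; values close to +1 indicate that drawings scored high on one occasion (or by one rater) also tend to be scored high on the other.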
Several studies show that the Goodenough-Harris scoring system has good inter-rater and intra-rater reliability [13, 21-23]. The inter-rater reliability coefficient of the GHDT ranged from .92 to .98 for man and woman figure drawings [13]. The GHDT scoring system and its cultural bias have also been studied in Indonesia. The results showed that the scoring system is not relevant to children's culture, especially in the Special Region of Yogyakarta (DIY) [24]. That study suggested conducting further research to develop an appropriate scoring system for Indonesian children. Another Indonesian study tested the validity of the GHDT items and declared all indicators valid for both DAM and DAW. The Cronbach's alpha internal consistency reliability test showed values of 0.92 (DAM) and 0.867 (DAW). That study suggested psychometric testing on a larger sample [23].
Inter-rater scoring of drawing tests is carried out to reduce the uncertainty of quantitative scoring. The inter-rater evaluation index is based on a comparison of the variance of scores between different raters, whereas the intra-rater evaluation index is based on a comparison of the variance of scores between two scorings by the same rater. The most popular method for testing inter-rater reliability is correlation, expressed through a coefficient such as Pearson's r, Kendall's tau, or Spearman's rho [25]. The drawing method offers many advantages, especially simple procedures and an enjoyable experience for children, but practitioners and researchers also need clear information about its psychometric strength. This study aims to determine the inter-rater and intra-rater reliability of GHDT quantitative scoring results.
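As a minimal sketch of the correlation approach mentioned above, the following Python code computes Pearson's r and Spearman's rho between two raters' scores using only the standard library. The score lists and all function names are hypothetical and chosen for illustration; they are not data or procedures from this study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical raw scores given by two raters to the same eight drawings.
rater_1 = [21, 34, 28, 40, 17, 30, 25, 38]
rater_2 = [23, 31, 27, 42, 15, 33, 24, 36]

inter_rater_r = pearson(rater_1, rater_2)
inter_rater_rho = spearman(rater_1, rater_2)
print(f"Pearson r = {inter_rater_r:.3f}, Spearman rho = {inter_rater_rho:.3f}")
```

Pearson's r is sensitive to the actual score values, while Spearman's rho only uses their rank order, so the two can differ when raters agree on ordering but not on the size of score differences.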
2. METHODS
2.1. Participants
The study sample included 799 subjects from the general Indonesian kindergarten and primary school student population (kindergarten, 4 to 6 years old = 412; primary school, 7 to 9 years old = 387; boys = 388, girls = 411). An overview of participants' demographics is presented in Table 1.
Table 1. Categorization of research subjects (N = 799).
Categorization | Category | Total | Percentage |
---|---|---|---|
Gender | Male | 388 | 48.56% |
Gender | Female | 411 | 51.44% |
Grade | Kindergarten | 412 | 51.56% |
Grade | Primary School | 387 | 48.44% |
Age | 4 – 6 years old | 412 | 51.56% |
Age | 7 – 9 years old | 387 | 48.44% |
2.2. Instrument
For the Goodenough-Harris Drawing Test, each subject was given three sheets of unlined paper (8 1/2 x 11 inches) and a pencil with an eraser. The subjects were asked to draw a man, a woman, and themselves. No time limit was set, but subjects were usually able to finish in less than fifteen minutes.
2.3. Procedure
Before data gathering, the relevant teachers were contacted for consent, permission, and discussion about assessment plans. The teachers and parents were informed of the nature and procedures of the assessment. Five psychologists were recruited and familiarized themselves with the test procedures, the class environment, and the participants in the days before testing. The participants were assessed in class groups of at most 15 children. Each psychologist was responsible for approximately 5 children. The standardized instruction was given in Bahasa Indonesia. Testing took approximately one hour per class period.
Two drawings (Draw a Man and Draw a Woman) were scored by two raters according to the rules prescribed by the Harris scoring system. For the inter-rater analysis, two different raters were involved, both of whom already had knowledge of the scoring techniques. Each rater scored all items of both drawings using the scoring manual as a guideline. The GHDT standard score was obtained from the conversion table for the child's gender. The raters then converted the scores, so that each of the DAM and DAW drawings had a score from 1 to 10.
The intra-rater analysis also involved the two raters. Each rater scored 43 drawing samples, rating both drawings (DAM and DAW) according to the scoring manual. Three months later, each rater rescored the same drawings, again following the manual.
3. RESULTS
All data were analyzed using JASP 0.13.1.0 software for Windows. First, descriptive statistics (means, standard deviations, and frequency tables) were calculated to provide an overview of the data. Second, to study intra-rater and inter-rater reliability, Pearson's correlation analysis was conducted to measure the strength and direction of the relationship between rater scores.
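The descriptive step can be sketched outside JASP as well. The example below is an illustration only: the scores are hypothetical, the `describe` helper is not part of any tool used in the study, and the sample standard deviation (n - 1 denominator) is assumed.

```python
import statistics

# Hypothetical standard scores assigned by each rater (not study data).
scores_by_rater = {
    "R1": [88, 92, 75, 101, 84, 96],
    "R2": [86, 90, 78, 99, 85, 93],
}


def describe(scores):
    """Return the descriptive statistics reported for one group of scores."""
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "std_dev": statistics.stdev(scores),  # sample SD, n - 1 denominator (assumed)
        "minimum": min(scores),
        "maximum": max(scores),
    }


summary = {rater: describe(scores) for rater, scores in scores_by_rater.items()}
for rater, stats in summary.items():
    print(rater, {name: round(value, 3) for name, value in stats.items()})
```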
Table 2. Descriptive statistics of GHDT scores by rater (R1, R2), drawing (DAM = Draw A Man, DAW = Draw A Woman), and gender.
Statistic | R1 DAM Boys | R1 DAM Girls | R1 DAW Boys | R1 DAW Girls | R2 DAM Boys | R2 DAM Girls | R2 DAW Boys | R2 DAW Girls |
---|---|---|---|---|---|---|---|---|
N | 388 | 411 | 388 | 411 | 388 | 411 | 388 | 411 |
Mean | 88.912 | 90.827 | 83.711 | 86.399 | 88.189 | 89.25 | 83.028 | 84.201 |
Median | 87 | 90 | 81 | 84 | 86 | 88 | 81 | 83 |
Std. Dev | 17.67 | 16.201 | 14.041 | 14.644 | 16.709 | 14.295 | 13.077 | 13.149 |
Sum | 34409 | 37330 | 32396 | 35510 | 34041 | 36414 | 32049 | 34270 |
Minimum | 56 | 60 | 52 | 62 | 56 | 58 | 52 | 56 |
Maximum | 152 | 151 | 144 | 171 | 152 | 139 | 131 | 152 |
According to the GHDT Scale Score (SS) results in Table 2, the average values for girls are higher than for boys for both raters. In general, the average values for Draw A Man are higher than for Draw A Woman for both raters. The highest score (171) was given by Rater 1 to a Draw A Woman drawing by a girl. The minimum score is 52, for Draw A Woman drawings by boys, for both raters.
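Table 3. Inter-rater reliability (Pearson correlation between Rater 1 and Rater 2).
Comparison | Variable | Reliability (r) | p |
---|---|---|---|
R1 vs R2 | DAM | 0.840 | < .001 |
R1 vs R2 | DAW | 0.844 | < .001 |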
The raw score range available to each rater is 0 to 73 for DAM and 0 to 71 for DAW. The results of the inter-rater reliability analysis are shown in Table 3. All correlations were positive and significant (p < .001). The inter-rater correlation for DAW (r = 0.844) is slightly higher than that for DAM (r = 0.840). Both reliability coefficients fall in the high correlation category.
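To illustrate why coefficients of this size are significant in a sample this large, the sketch below converts each reported inter-rater r into a t statistic with n - 2 degrees of freedom. It assumes n = 799 scored pairs per drawing (the full sample; the effective number of valid pairs may be smaller) and uses a normal approximation for the p-value, which is reasonable only for large n. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def t_statistic(r, n):
    """t statistic for testing H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (adequate only for large n)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


n_drawings = 799  # assumption: every child contributes one scored pair per drawing
reported_r = {"DAM": 0.840, "DAW": 0.844}

for label, r in reported_r.items():
    t = t_statistic(r, n_drawings)
    p = approx_two_sided_p(t)
    print(f"{label}: r = {r:.3f}, t({n_drawings - 2}) = {t:.1f}, p < .001: {p < 0.001}")
```

Because the t statistic grows with the square root of the sample size, even moderate correlations would be significant with several hundred drawings; the size of r, not the p-value, carries the information about reliability.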
The results of the intra-rater reliability analysis are shown in Table 4. Both variables tested were significant (p < .001). The reliability value for R1 is higher than for R2, for both DAM and DAW. For DAM, the r-value is 0.942 for R1 and 0.880 for R2. For DAW, the r-value is 0.963 for R1 and 0.909 for R2.
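Table 4. Intra-rater reliability (Pearson correlation between first and second scoring by the same rater).
Rater | DAM r | DAM p | DAW r | DAW p |
---|---|---|---|---|
R1 | 0.942 | < .001 | 0.963 | < .001 |
R2 | 0.880 | < .001 | 0.909 | < .001 |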
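Because the intra-rater coefficients rest on a much smaller set of drawings, it can help to look at their uncertainty. The sketch below computes approximate 95% confidence intervals with the Fisher z-transformation. This is an illustration rather than an analysis performed in the study: it assumes each coefficient is based on n = 43 rescored drawings, and the function name `fisher_ci` is hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson r via the Fisher z-transformation."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = atanh(r)  # Fisher z-transformed correlation
    standard_error = 1 / sqrt(n - 3)
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    return tanh(z - critical * standard_error), tanh(z + critical * standard_error)


n_rescored = 43  # assumption: drawings rescored by each rater after three months
intra_rater_r = {
    ("R1", "DAM"): 0.942,
    ("R1", "DAW"): 0.963,
    ("R2", "DAM"): 0.880,
    ("R2", "DAW"): 0.909,
}

for (rater, drawing), r in intra_rater_r.items():
    low, high = fisher_ci(r, n_rescored)
    print(f"{rater} {drawing}: r = {r:.3f}, 95% CI approx. [{low:.3f}, {high:.3f}]")
```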
4. DISCUSSION
A drawing made by a child is a symbol of the important concepts the child understands about the object, and this understanding of objects develops when children start primary school, as their cognitive maturity strengthens and their thinking becomes more logical. Goodenough assumes that children's drawings are complex and contain many meaningful marks that carry a great deal of information about their intellectual component, which links theoretically to the psychometric study of intelligence [26]. By basing a measurement on human figure drawing, Goodenough tried to capture information related to children's cognitive maturity levels. However, quantifying children's drawings to produce standardized scores for cognitive measurement is a complex challenge. Strong psychometric evidence is needed to support score reliability.
The descriptive analysis showed that the GHDT average IQ score for girls is higher than for boys, for both raters. This is in line with previous research in which the GHDT average IQ score for girls was higher than for boys [8, 27, 28]. Most Human Figure Drawing researchers, such as Goodenough, Koppitz, Harris, and Machover, also found that the drawings of girls in the primary grades are superior to those of boys [29]. In contrast, no score difference between the sexes was found in other countries such as India [30] and Greece [31]. Further research is needed to examine cultural bias in the GHDT. The descriptive analysis also showed that DAW scores are lower than DAM scores for both girls and boys. The Harris scoring system has 73 scoring indicators for DAM and 71 for DAW, so the number of indicators differs only slightly between the two drawings. Female figures may be more difficult for children to draw than male figures because they have a higher level of variety and demand for detail.
This study showed that the inter-rater reliability for Draw A Man (DAM) and Draw A Woman (DAW) has a strong positive correlation. The coefficient is 0.840 for DAM and 0.844 for DAW; both coefficients are significant (p < .001), indicating that the scoring carried out by the two raters has good reliability. Coefficients in the range 0.70 to 0.89 indicate high correlation [25, 26]. The inter-rater correlations were lower than those reported by Yater, where the Pearson correlation between the scores of two raters was 0.951 for DAM and 0.954 for DAW [32]. DAM inter-rater correlations in other studies are similarly high. The average interscorer correlation obtained when two examiners scored identical drawings was +.90 [29]. The inter-rater reliability coefficient for DAM was reported as around 0.88 [21]. A previous study also showed that the coefficient between raters using the GHDT scoring system was 0.95 (for the mentally retarded sample), 0.86 (for the learning disabled group), and 0.91 (for the normal sample) [22]. In a study of 131 children from various socioeconomic backgrounds in the UK, the inter-rater reliability coefficient ranged from 0.83 to 0.93 in the sub-groups and from 0.86 to 0.91 in the total sample [30]. Inter-rater agreement was also good for Draw-A-Person (DAP) scores in South African preschool children, especially in children with lower DAP IQ scores [33]. Based on these results, it can be concluded that the scores given by the two raters are consistent with each other.
This study also showed that each rater scored consistently over time, for both DAM and DAW. Rater 1 showed a reliability coefficient of 0.942 for Draw a Man and 0.963 for Draw a Woman, while Rater 2 showed 0.880 for Draw a Man and 0.909 for Draw a Woman. All these correlations were high and significant (p < .001). Dunn reported an intra-rater reliability coefficient of 0.93 in previous research [21]. McCarthy also studied the consistency of GHDT assessment by the same rater and found a correlation of 0.95 [34]. Furthermore, the inter-rater and intra-rater reliability coefficients reported by Yater ranged from 0.750 to 0.917 [32]. Taken together, these results show that the raters in this study scored both the Draw a Man and Draw a Woman drawings consistently.
CONCLUSION
Inter-rater and intra-rater reliability testing in this study showed positive and significant correlation coefficients (p < .001), which supports the reliability of GHDT scoring. From the inter-rater correlation coefficients, it can be concluded that the scores given by the two raters are consistent. From the intra-rater correlations, it can be concluded that each rater gave consistent scores for both Draw a Man and Draw a Woman drawings. A separate study is required to examine cultural bias in the GHDT.
LIST OF ABBREVIATIONS
GHDT | = Goodenough-Harris Drawing Test |
GDAMT | = Goodenough Draw-a-Man Test |
DAW | = Draw-a-Woman |
DAM | = Draw-a-Man |
ETHICS APPROVAL AND CONSENT TO PARTICIPATE
This study was ethically approved by the Bandung Institute of Technology and the Indonesia University of Education, and the study was conducted according to APA ethical standards.
HUMAN AND ANIMAL RIGHTS
No animals were used in this research. All human research procedures followed were in accordance with the ethical standards of the committee responsible for human experimentation (institutional and national), and with the Helsinki Declaration of 1975, as revised in 2013.
CONSENT FOR PUBLICATION
Informed consent was obtained from the participants.
STANDARDS OF REPORTING
STROBE guidelines were followed.
AVAILABILITY OF DATA AND MATERIALS
The data analyzed during the current study are available from the corresponding author [M.T.] on reasonable request.
FUNDING
None.
CONFLICT OF INTEREST
The authors declare no conflict of interest, financial or otherwise.
ACKNOWLEDGEMENTS
Declared none.