Chapter 2

Table of Contents


Chapter 3

APPENDIX

Estimating Reliability

Estimating Reliability - Forced-Choice Assessment

The split-half model and the Kuder-Richardson formula for estimating reliability will be described here. Given the demands on time and the need for all assessment to be relevant, school practitioners are unlikely to utilize a test-retest or equivalent forms procedure to establish reliability.

Reliability Estimation Using a Split-half Methodology

The split-half design in effect creates two comparable test administrations. The items in a test are split into two tests that are equivalent in content and difficulty. Often this is done by splitting the test into odd- and even-numbered items. This assumes that the assessment is homogeneous in content. Once the test is split, reliability is estimated as the correlation between the two half-tests, with an adjustment for test length.

Other things being equal, the longer the test, the more reliable it will be when reliability concerns internal consistency. This is because the sample of behavior is larger. In split-half, the Spearman-Brown formula can be used to correct the correlation between the two halves, as if the correlation had been computed on two tests the length of the full test (before it was split).

For demonstration purposes a small sample set is employed here: a test of 40 items given to 10 students. The items are divided into even-numbered (X) and odd-numbered (Y) items, forming two simultaneous assessments.

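| Student | Score (40) | X Even (20) | Y Odd (20) | x | y | x² | y² | xy |
|---|---|---|---|---|---|---|---|---|
| A | 40 | 20 | 20 | 4.8 | 4.2 | 23.04 | 17.64 | 20.16 |
| B | 28 | 15 | 13 | -0.2 | -2.8 | 0.04 | 7.84 | 0.56 |
| C | 35 | 19 | 16 | 3.8 | 0.2 | 14.44 | 0.04 | 0.76 |
| D | 38 | 18 | 20 | 2.8 | 4.2 | 7.84 | 17.64 | 11.76 |
| E | 22 | 10 | 12 | -5.2 | -3.8 | 27.04 | 14.44 | 19.76 |
| F | 20 | 12 | 8 | -3.2 | -7.8 | 10.24 | 60.84 | 24.96 |
| G | 35 | 16 | 19 | 0.8 | 3.2 | 0.64 | 10.24 | 2.56 |
| H | 33 | 16 | 17 | 0.8 | 1.2 | 0.64 | 1.44 | 0.96 |
| I | 31 | 12 | 19 | -3.2 | 3.2 | 10.24 | 10.24 | -10.24 |
| J | 28 | 14 | 14 | -1.2 | -1.8 | 1.44 | 3.24 | 2.16 |
| Mean | 31.0 | 15.2 | 15.8 | | | | | |
| Sum | | | | | | 95.60 | 143.60 | 73.40 |
| SD | | 3.26 | 3.99 | | | | | |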


From this information it is possible to calculate a correlation using the Pearson Product-Moment Correlation Coefficient, a statistical measure of the degree of relationship between the two halves.

Pearson Product-Moment Correlation Coefficient:

$$r = \frac{\sum xy}{(N-1)\,SD_x\,SD_y}$$

where

x is each student's score on the even-numbered items minus the mean score on the even-numbered items.
y is each student's score on the odd-numbered items minus the mean score on the odd-numbered items.
N is the number of students.
SD is the standard deviation. This is computed by

  • squaring the deviation (e.g., x²) for each student,
  • summing the squared deviations (e.g., Σx²),
  • dividing this total by the number of students minus 1 (N-1), and
  • taking the square root.

For the example data, r = 73.40 / (9 × 3.26 × 3.99) ≈ 0.63.

The Spearman-Brown formula is usually applied in determining reliability using split halves. When applied, it steps the correlation between the two halves up to the full number of items, thus giving a reliability estimate for the number of items in the original test:

$$r_{\text{full}} = \frac{2r}{1 + r}$$

For the example data, this gives 2(0.63) / (1 + 0.63) ≈ 0.77.
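The split-half calculation can be checked with a few lines of code. The following is a minimal sketch in Python, using the even and odd half-test scores from the example table; the function names `pearson_r` and `spearman_brown` are illustrative choices, not part of any particular package.

```python
from math import sqrt

# Half-test scores from the example table (students A through J).
even = [20, 15, 19, 18, 10, 12, 16, 16, 12, 14]  # X: even-numbered items
odd = [20, 13, 16, 20, 12, 8, 19, 17, 19, 14]    # Y: odd-numbered items


def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [v - mean_x for v in xs]
    dy = [v - mean_y for v in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    # Standard deviations use N - 1 in the denominator.
    sd_x = sqrt(sum(a * a for a in dx) / (n - 1))
    sd_y = sqrt(sum(b * b for b in dy) / (n - 1))
    return sum_xy / ((n - 1) * sd_x * sd_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to the full test length."""
    return 2 * r_half / (1 + r_half)


r_half = pearson_r(even, odd)
r_full = spearman_brown(r_half)
print(round(r_half, 2), round(r_full, 2))
```

With the example data the half-test correlation rounds to 0.63 and the stepped-up estimate to 0.77.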

    Estimating Reliability using the Kuder-Richardson Formula 20

    Kuder and Richardson devised a procedure for estimating the reliability of a test in 1937. It has become the standard for estimating reliability for single administration of a single form. Kuder-Richardson measures inter-item consistency. It is tantamount to doing a split-half reliability on all combinations of items resulting from different splitting of the test. When schools have the capacity to maintain item level data, the KR20, which is a challenging set of calculations to do by hand, is easily computed by a spreadsheet or basic statistical package.


    The rationale for Kuder and Richardson's most commonly used procedure is roughly equivalent to:

    1) Securing the mean intercorrelation among the k items in the test,
    2) Considering this to be the reliability coefficient for the typical item in the test,
    3) Stepping up this average with the Spearman-Brown formula to estimate the reliability coefficient of an assessment of k items.

     

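    Item responses (1 = correct, 0 = incorrect) for N = 10 students on k = 12 items:

    | Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | X (score) | x = X - mean | x² |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | A | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 11 | 4.5 | 20.25 |
    | B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 10 | 3.5 | 12.25 |
    | C | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 9 | 2.5 | 6.25 |
    | D | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 7 | 0.5 | 0.25 |
    | E | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 7 | 0.5 | 0.25 |
    | F | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 6 | -0.5 | 0.25 |
    | G | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5 | -1.5 | 2.25 |
    | H | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | -2.5 | 6.25 |
    | I | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | -2.5 | 6.25 |
    | J | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | -4.5 | 20.25 |
    | Σ | 9 | 9 | 8 | 7 | 7 | 5 | 5 | 5 | 4 | 3 | 2 | 1 | 65 | 0 | 74.50 |
    | p-value | 0.9 | 0.9 | 0.8 | 0.7 | 0.7 | 0.5 | 0.5 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 | | | |
    | q-value | 0.1 | 0.1 | 0.2 | 0.3 | 0.3 | 0.5 | 0.5 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | | | |
    | pq | 0.09 | 0.09 | 0.16 | 0.21 | 0.21 | 0.25 | 0.25 | 0.25 | 0.24 | 0.21 | 0.16 | 0.09 | | | |

    Mean = 6.5; Σx² = 74.50; Σpq = 2.21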

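    Here, the variance of the total scores is

    $$s^2 = \frac{\sum x^2}{N-1} = \frac{74.50}{9} = 8.28$$

    Kuder-Richardson Formula 20

    $$KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s^2}\right)$$

    where

      p is the proportion of students passing a given item,
      q is the proportion of students that did not pass a given item (q = 1 - p),
      s² is the variance of the total score on this assessment: x is the student score minus the mean score, x is squared and the squares are summed (Σx²), and the summed squares are divided by the number of students minus 1 (N-1),
      k is the number of items on the test.

    For the example,

    $$KR_{20} = \frac{12}{11}\left(1 - \frac{2.21}{8.28}\right) \approx 0.80$$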
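As the text notes, KR20 is tedious by hand but straightforward for a spreadsheet or a short script. The following is a minimal sketch in Python that works directly from the item-level responses in the example; the function name `kr20` is an illustrative choice. It follows the definitions given here: p and q are item proportions, and the total-score variance divides by N-1.

```python
# Item responses from the example (rows are students A through J,
# columns are items 1 through 12; 1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0],  # B
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # C
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0],  # D
    [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],  # F
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # G
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # H
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # I
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # J
]


def kr20(rows):
    """Kuder-Richardson Formula 20 for right/wrong item data."""
    n = len(rows)       # number of students (N)
    k = len(rows[0])    # number of items (k)
    totals = [sum(row) for row in rows]
    mean = sum(totals) / n
    # Total-score variance with N - 1 in the denominator.
    variance = sum((t - mean) ** 2 for t in totals) / (n - 1)
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in rows) / n  # proportion passing item j
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / variance)


kr20_value = kr20(responses)
print(round(kr20_value, 2))
```

For the example data the result rounds to 0.80.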

    Estimating Reliability Using the Kuder-Richardson Formula 21

    When item level data or technological assistance is not available to assist in the computation of a large number of cases and items, the simpler, and sometimes less precise, reliability estimate known as Kuder-Richardson Formula 21 is an acceptable general measure of internal consistency. The formula requires only the test mean (M), the variance (s²), and the number of items on the test (k); as before, N is the number of students. It assumes that all items are of approximately equal difficulty.

    For this example, the data set used for computation of the KR 20 is repeated.

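    | Student (N=10) | X (score) | x = X - mean (score - mean) | x² |
    |---|---|---|---|
    | A | 11 | 4.5 | 20.25 |
    | B | 10 | 3.5 | 12.25 |
    | C | 9 | 2.5 | 6.25 |
    | D | 7 | 0.5 | 0.25 |
    | E | 7 | 0.5 | 0.25 |
    | F | 6 | -0.5 | 0.25 |
    | G | 5 | -1.5 | 2.25 |
    | H | 4 | -2.5 | 6.25 |
    | I | 4 | -2.5 | 6.25 |
    | J | 2 | -4.5 | 20.25 |

    mean = 6.5

    Σx² = 74.50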


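    Variance

    $$s^2 = \frac{\sum x^2}{N-1} = \frac{74.50}{9} = 8.28$$

    Kuder-Richardson Formula 21

    $$KR_{21} = \frac{k}{k-1}\left(1 - \frac{M(k-M)}{k\,s^2}\right)$$

      M - the assessment mean (6.5)
      k - the number of items in the assessment (12)
      s² - variance (8.28).

    Therefore, in the example:

    $$KR_{21} = \frac{12}{11}\left(1 - \frac{6.5\,(12-6.5)}{12 \times 8.28}\right) \approx 0.70$$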

    The ratio M(k - M) / (k s²) in KR21 is a mathematical approximation of the ratio Σpq / s² in KR20. The formula simplifies the computation but will usually yield, as evidenced, a lower estimate of reliability. The differences are not great on a test with all items of about the same difficulty.
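Because KR21 needs only the total scores, it can be computed without item-level data. The following is a minimal sketch in Python using the ten total scores from the example; the function name `kr21` is an illustrative choice.

```python
def kr21(scores, k):
    """Kuder-Richardson Formula 21 from total scores and item count k."""
    n = len(scores)
    mean = sum(scores) / n
    # Variance with N - 1 in the denominator, as in the text.
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return k / (k - 1) * (1 - mean * (k - mean) / (k * variance))


# Total scores for students A through J on the 12-item test.
scores = [11, 10, 9, 7, 7, 6, 5, 4, 4, 2]
kr21_value = kr21(scores, k=12)
print(round(kr21_value, 2))
```

For the example data the result rounds to 0.70, lower than the KR20 estimate for the same test, as the text describes.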

    In addition to the split-half reliability estimates and the Kuder-Richardson formulas (KR20, KR21) described above, there are many other ways to compute a reliability index. One of the most commonly used reliability coefficients is Cronbach's alpha (α). It is based on the internal consistency of the items in a test. It is flexible and can be used with test formats that have more than one correct answer. The split-half estimates and KR20 are interchangeable with Cronbach's alpha in the following sense. When the items are divided into two parts and the scores and variances of the two parts are calculated, the split-half formula is algebraically equivalent to Cronbach's alpha. When each item has only one correct answer and is scored right or wrong, KR20 is algebraically equivalent to Cronbach's alpha. Therefore, the split-half and KR20 reliability estimates may be considered special cases of Cronbach's alpha.
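The equivalence between KR20 and Cronbach's alpha for right/wrong items can be illustrated numerically. The following is a minimal sketch in Python using the standard form of alpha, k/(k-1) × (1 - sum of item variances / total-score variance); the helper names `variance` and `cronbach_alpha` are illustrative. One assumption is worth stating plainly: the equality is exact only when the item variances and the total-score variance use the same denominator. For a right/wrong item, pq is the variance computed with N in the denominator, so a hand calculation that pairs Σpq with an N-1 total-score variance comes out slightly higher than the value this sketch returns.

```python
# Item responses from the KR20 example (students A through J, items 1-12).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0],  # B
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # C
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0],  # D
    [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],  # F
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # G
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # H
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # I
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # J
]


def variance(values, ddof):
    """Variance with N - ddof in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


def cronbach_alpha(rows, ddof=1):
    """Cronbach's alpha; works for right/wrong or multi-point items."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows], ddof) for j in range(k)]
    total_var = variance([sum(row) for row in rows], ddof)
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


alpha = cronbach_alpha(responses)

# KR20 with the same (N) denominator for the total-score variance.
n, k = len(responses), len(responses[0])
sum_pq = 0.0
for j in range(k):
    p = sum(row[j] for row in responses) / n
    sum_pq += p * (1 - p)
total_var_n = variance([sum(row) for row in responses], ddof=0)
kr20_same_denominator = k / (k - 1) * (1 - sum_pq / total_var_n)

print(round(alpha, 4), round(kr20_same_denominator, 4))
```

The two printed values agree, and alpha itself does not depend on whether N or N-1 is used, provided the same choice is made for the items and for the total score.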

    Given the universe of concerns which daily confront school administrators and classroom teachers, the importance is not in knowing how to derive a reliability estimate, whether using split halves, KR20 or KR21. The importance is in knowing what the information means in evaluating the validity of the assessment. A high reliability coefficient is no guarantee that the assessment is well-suited to the outcome. It does tell you if the items in the assessment are strongly or weakly related with regard to student performance. If all the items are variations of the same skill or knowledge base, the reliability estimate for internal consistency should be high. If multiple outcomes are measured in one assessment, the reliability estimate may be lower. That does not mean the test is suspect. It probably means that the domains of knowledge or skills assessed are somewhat diverse and a student who knows the content of one outcome may not be as proficient relative to another outcome. 

    Establishing Interrater Agreement for Performance-Based or Product Assessments (Complex Generated Response Assessments)

    In performance-based assessment, where scoring requires some judgment, an important type of reliability is agreement among those who evaluate the quality of the product or performance relative to a set of stated criteria. Preconditions of interrater agreement are:

      1) A scoring scale or rubric which is clear and unambiguous in what it demands of the student by way of demonstration.

      2) Evaluators who are fully conversant with the scale and how the scale relates to the student performance, and who are in agreement with other evaluators on the application of the scale to the student demonstration.


    The end result is that all evaluators are of a common mind with regard to the student performance, that this common mind is reflected in the scoring scale or rubric, and that all evaluators give the demonstration the same or nearly the same ratings. This consistency of rating is called interrater reliability. Unless the scale was constructed by those who are employing it, with extensive discussion during its construction, training is necessary to establish this common perspective.

    Training evaluators for consistency should include:

    • A discussion of the rating scale by all participating evaluators so that a common interpretation of the scale emerges and so diverse interpretations can be resolved or referred to an authority for determination.
    • The opportunity to review sample demonstrations which have been anchored to a particular score on the scale. These representative works were selected by a committee for their clarity in demonstrating a value on the scale. This will provide operational models for the raters who are being trained.
    • Opportunities to try out the scale and discuss the ratings. The results can be used to further refine common understanding. Additional rounds of scoring can be used to eliminate any evaluator who cannot enter into agreement relative to the scale.

    Gronlund (1985) indicated that "rater error" can be related to:

      1) Personal bias which may occur when a rater is consistently using only part of the scoring scale, either in being overly generous, overly severe or evidencing a tendency to the center of the scale in scoring.

      2) A "halo effect" which may occur when a rater's overall perception of a student positively or negatively colors the rating given to a student.

      3) A logical error which may occur when a rater confuses distinct elements of an analytic scale. This confounds rating on the items.

    Proper training on an unambiguous scoring rubric is a necessary condition for establishing reliability for student performance or product. When evaluation of the product or performance begins in earnest, it is necessary that a percentage of student work be double scored by two different raters to give an indication of agreement among evaluators. The sample of performances or products that are scored by two independent evaluators must be large enough to establish confidence that scoring is consistent. The smaller the number of cases, the larger the percentage of cases that should be double scored. When the data on the double-scored assessments are available, it is possible to compute a correlation of the raters' scores using the Pearson Product-Moment Correlation Coefficient. This correlation indicates the relationship between the two scores given for each student. A correlation of .6 or higher would indicate that the scores given to the students are highly related.
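The correlation between two raters can be computed in the same way as the correlation between two test halves. The following is a minimal sketch in Python that uses the double-scored ratings for students A through J tabulated later in this appendix; the function name `pearson_r` is illustrative. It uses the form Σxy / √(Σx² · Σy²), which is algebraically the same as dividing Σxy by (N-1) times the two standard deviations.

```python
from math import sqrt

# Double-scored ratings for students A through J.
rater1 = [6, 5, 3, 4, 2, 7, 6, 5, 3, 7]
rater2 = [6, 5, 4, 4, 3, 7, 6, 5, 4, 7]


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sum_xx = sum((a - mean_x) ** 2 for a in xs)
    sum_yy = sum((b - mean_y) ** 2 for b in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


rater_r = pearson_r(rater1, rater2)
print(round(rater_r, 2))
```

For these ratings the correlation rounds to 0.98. A high correlation shows that the two raters rank students similarly; it does not by itself show that they assign the same scores, which is why an agreement percentage is a useful companion measure.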

    Another method of indicating the relationship between the two scores is to establish a rater agreement percentage: take the assessments that have been double scored and calculate the percentage of cases where there has been exact agreement between the two raters. If the scale is analytic and rather extensive, percent of agreement can be determined for the number of cases where the scores are in exact agreement or adjacent to each other (within one point on the scale). Agreement levels should be at 80% or higher to establish a claim for interrater agreement.


    Establishing Rater Agreement Percentages

    Two important decisions which precede the establishment of a rater agreement percentage are:

    • How close do scores by raters have to be to count as in "agreement?" In a limited holistic scale, (e.g., 1-4 points) it is most likely that you will require exact agreement among raters. If an analytic scale is employed with 30 to 40 points, it may be determined that exact and adjacent scores will count as being in agreement.
    • What percentage of agreement will be acceptable to ensure reliability? 80% agreement is promoted as a minimum standard above, but circumstances relative to the use of the scale may warrant exercising a lower level of acceptance. The choice of an acceptable percentage of agreement must be established by the school or district. It is advisable that the decision be consultative.

    After agreement and the acceptable percentage of agreement have been established, list the ratings given to each student by each rater for comparison:

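    | Student | Score: Rater 1 | Score: Rater 2 | Agreement |
    |---|---|---|---|
    | A | 6 | 6 | X |
    | B | 5 | 5 | X |
    | C | 3 | 4 | |
    | D | 4 | 4 | X |
    | E | 2 | 3 | |
    | F | 7 | 7 | X |
    | G | 6 | 6 | X |
    | H | 5 | 5 | X |
    | I | 3 | 4 | |
    | J | 7 | 7 | X |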

    Dividing the number of cases where the two raters' scores are in agreement (7) by the total number of cases (10) gives the rater agreement percentage (70%).
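The agreement percentage is simple enough to compute by hand, but a short function makes it easy to switch between exact and adjacent agreement. The following is a minimal sketch in Python using the ratings from the table; the function name `agreement_percentage` and its `tolerance` parameter are illustrative choices, with `tolerance=0` meaning exact agreement and `tolerance=1` counting scores within one point as agreeing.

```python
# Ratings for students A through J from the table.
rater1 = [6, 5, 3, 4, 2, 7, 6, 5, 3, 7]
rater2 = [6, 5, 4, 4, 3, 7, 6, 5, 4, 7]


def agreement_percentage(a, b, tolerance=0):
    """Percent of cases where two raters' scores differ by <= tolerance."""
    if len(a) != len(b):
        raise ValueError("both raters must score the same set of cases")
    agreeing = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return 100 * agreeing / len(a)


exact = agreement_percentage(rater1, rater2)                   # exact agreement
adjacent = agreement_percentage(rater1, rater2, tolerance=1)   # exact or adjacent
print(exact, adjacent)
```

For the table, exact agreement is 70% and exact-or-adjacent agreement is 100%, which shows how much the earlier decision about what counts as agreement affects the result.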

    When there are more than two teachers, the consistency of ratings for two teachers at a time can be calculated with the same method. For example, if three teachers are employed as raters, rater agreement percentages should be calculated for

    • Rater 1 and Rater 2
    • Rater 1 and Rater 3
    • Rater 2 and Rater 3

    Every pairwise agreement percentage should meet or exceed the acceptable level of agreement. If there is occasion to use more than two raters for the same assessment performance or product, an analysis of variance using the scorers as the independent variable can be computed using the sum of squares.
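The pairwise comparison for three raters can be sketched as follows in Python. The scores for Rater 1 and Rater 2 come from the table above; the scores for Rater 3 are hypothetical values invented purely for illustration, and the function name `pairwise_agreement` is likewise an illustrative choice.

```python
from itertools import combinations

ratings = {
    "Rater 1": [6, 5, 3, 4, 2, 7, 6, 5, 3, 7],
    "Rater 2": [6, 5, 4, 4, 3, 7, 6, 5, 4, 7],
    "Rater 3": [6, 5, 3, 4, 3, 7, 5, 5, 4, 7],  # hypothetical scores
}


def pairwise_agreement(ratings, tolerance=0):
    """Agreement percentage for every pair of raters."""
    results = {}
    for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
        agreeing = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
        results[(name_a, name_b)] = 100 * agreeing / len(a)
    return results


acceptable = 80  # minimum acceptable percentage, set by the school or district
for pair, percent in pairwise_agreement(ratings).items():
    status = "meets" if percent >= acceptable else "falls below"
    print(pair, percent, status, "the acceptable level")
```

Each pair is reported separately, so a single rater who is out of step with the others shows up as low agreement in every pair that includes that rater.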

    In discussion of the various forms of performance assessment, it has been suggested how two raters can examine the same performance to establish a reliability score. Unless at least two independent raters have evaluated the performance or product for a significant sampling of students, it is not possible to obtain evidence that the score obtained is accurate to the stated criteria.

