APPENDIX

Estimating Reliability - Forced-Choice Assessment

The split-half model and the Kuder-Richardson formula for estimating reliability will be described here. Given the demands on time and the need for all assessment to be relevant, school practitioners are unlikely to utilize a test-retest or equivalent forms procedure to establish reliability.

Reliability Estimation Using a Split-half Methodology

The split-half design in effect creates two comparable test administrations. The items in a test are split into two tests that are equivalent in content and difficulty. Often this is done by splitting the test into odd- and even-numbered items. This assumes that the assessment is homogeneous in content.

Once the test is split, reliability is estimated as the correlation between the two half-tests, with an adjustment for test length. Other things being equal, the longer the test, the more reliable it will be when reliability concerns internal consistency. This is because the sample of behavior is larger. In split-half, the Spearman-Brown formula can be used to correct the correlation between the two halves, as if the correlation had been computed between two tests the length of the full test (before it was split), as shown below.

For demonstration purposes a small sample set is employed here: a test of 40 items for 10 students. The items are divided into even (X) and odd (Y) to form two simultaneous assessments.

Student | Score (40) | X Even (20) | Y Odd (20) |    x |    y |    x² |     y² |     xy
--------|------------|-------------|------------|------|------|-------|--------|-------
A       |         40 |          20 |         20 |  4.8 |  4.2 | 23.04 |  17.64 |  20.16
B       |         28 |          15 |         13 | -0.2 | -2.8 |  0.04 |   7.84 |   0.56
C       |         35 |          19 |         16 |  3.8 |  0.2 | 14.44 |   0.04 |   0.76
D       |         38 |          18 |         20 |  2.8 |  4.2 |  7.84 |  17.64 |  11.76
E       |         22 |          10 |         12 | -5.2 | -3.8 | 27.04 |  14.44 |  19.76
F       |         20 |          12 |          8 | -3.2 | -7.8 | 10.24 |  60.84 |  24.96
G       |         35 |          16 |         19 |  0.8 |  3.2 |  0.64 |  10.24 |   2.56
H       |         33 |          16 |         17 |  0.8 |  1.2 |  0.64 |   1.44 |   0.96
I       |         31 |          12 |         19 | -3.2 |  3.2 | 10.24 |  10.24 | -10.24
J       |         28 |          14 |         14 | -1.2 | -1.8 |  1.44 |   3.24 |   2.16
MEAN    |       31.0 |        15.2 |        15.8 |      |      |       |        |
SUM     |            |             |             |      |      | 95.60 | 143.60 |  73.40
SD      |            |        3.26 |        3.99 |      |      |       |        |

From this information it is possible to calculate a correlation using the Pearson Product-Moment Correlation Coefficient, a statistical measure of the degree of relationship between the two halves.

Pearson Product-Moment Correlation Coefficient:

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}} $$

where x is each student's score on the even-numbered items minus the mean score on the even-numbered items, and y is the corresponding deviation for the odd-numbered items.

The Spearman-Brown formula is usually applied in determining reliability using split halves. When applied, it steps the correlation between the two halves up to the full number of items, thus giving a reliability estimate for the number of items in the original test.
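
The split-half steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pearson_r` and `spearman_brown` are invented here, the scores are copied from the table above, and the Spearman-Brown step uses the standard form r_full = 2r / (1 + r) for a test of doubled length.

```python
from math import sqrt

# Half-test scores copied from the table above (students A-J).
even = [20, 15, 19, 18, 10, 12, 16, 16, 12, 14]  # X: even-numbered items
odd = [20, 13, 16, 20, 12, 8, 19, 17, 19, 14]    # Y: odd-numbered items


def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [v - mean_x for v in xs]  # the x column of the table
    dy = [v - mean_y for v in ys]  # the y column of the table
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)


def spearman_brown(r_half, factor=2):
    """Project a correlation to a test `factor` times as long (illustrative helper)."""
    return factor * r_half / (1 + (factor - 1) * r_half)


r_half = pearson_r(even, odd)
r_full = spearman_brown(r_half)
print(f"correlation between halves: {r_half:.2f}")
print(f"Spearman-Brown estimate for the full test: {r_full:.2f}")
```

With the table's data the half-test correlation comes out near 0.63 and the stepped-up estimate near 0.77.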

Estimating Reliability using the Kuder-Richardson Formula 20

Kuder and Richardson devised a procedure for estimating the reliability of a test in 1937. It has become the standard for estimating reliability for a single administration of a single form. Kuder-Richardson measures inter-item consistency. It is tantamount to doing a split-half reliability on all combinations of items resulting from different splittings of the test. When schools have the capacity to maintain item-level data, the KR20, which is a challenging set of calculations to do by hand, is easily computed with a spreadsheet or basic statistical package.

The rationale for Kuder and Richardson's most commonly used procedure is roughly equivalent to:
1) Securing the mean inter-correlation of the k items in the test,

ITEM (k): 1 = correct, 0 = incorrect. X is the student's score; x = X - mean (score minus mean).

Student  |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11 |  12 |  X |    x |    x²
---------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|----|------|------
A        |   1 |   1 |   1 |   1 |   1 |   1 |   1 |   0 |   1 |   1 |   1 |   1 | 11 |  4.5 | 20.25
B        |   1 |   1 |   1 |   1 |   1 |   1 |   1 |   1 |   0 |   1 |   1 |   0 | 10 |  3.5 | 12.25
C        |   1 |   1 |   1 |   1 |   1 |   1 |   1 |   1 |   1 |   0 |   0 |   0 |  9 |  2.5 |  6.25
D        |   1 |   1 |   1 |   0 |   1 |   1 |   0 |   1 |   1 |   0 |   0 |   0 |  7 |  0.5 |  0.25
E        |   1 |   1 |   1 |   1 |   1 |   0 |   0 |   1 |   1 |   0 |   0 |   0 |  7 |  0.5 |  0.25
F        |   1 |   1 |   1 |   0 |   0 |   1 |   1 |   0 |   0 |   1 |   0 |   0 |  6 | -0.5 |  0.25
G        |   1 |   1 |   1 |   1 |   0 |   0 |   1 |   0 |   0 |   0 |   0 |   0 |  5 | -1.5 |  2.25
H        |   1 |   1 |   0 |   1 |   0 |   0 |   0 |   1 |   0 |   0 |   0 |   0 |  4 | -2.5 |  6.25
I        |   1 |   1 |   1 |   0 |   1 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |  4 | -2.5 |  6.25
J        |   0 |   0 |   0 |   1 |   1 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |  2 | -4.5 | 20.25
Σ        |   9 |   9 |   8 |   7 |   7 |   5 |   5 |   5 |   4 |   3 |   2 |   1 | 65 |    0 | 74.50
P-values | 0.9 | 0.9 | 0.8 | 0.7 | 0.7 | 0.5 | 0.5 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 |    |      |
Q-values | 0.1 | 0.1 | 0.2 | 0.3 | 0.3 | 0.5 | 0.5 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |    |      |
pq       | 0.09| 0.09| 0.16| 0.21| 0.21| 0.25| 0.25| 0.25| 0.24| 0.21| 0.16| 0.09|    |      |

Σx² = 74.50
Σpq = 2.21

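Kuder-Richardson formula 20:

$$ KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s^2}\right) $$

Here, k is the number of items, Σpq is the sum of the pq values across the items, and s² is the variance of the students' total scores (X).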

Student (N=10) |  X | x = X - mean |    x²
---------------|----|--------------|------
A              | 11 |          4.5 | 20.25
B              | 10 |          3.5 | 12.25
C              |  9 |          2.5 |  6.25
D              |  7 |          0.5 |  0.25
E              |  7 |          0.5 |  0.25
F              |  6 |         -0.5 |  0.25
G              |  5 |         -1.5 |  2.25
H              |  4 |         -2.5 |  6.25
I              |  4 |         -2.5 |  6.25
J              |  2 |         -4.5 | 20.25

mean = 6.5
Σx² = 74.50

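The variance (s²) is computed from Σx².

Kuder-Richardson formula 21:

$$ KR_{21} = \frac{k}{k-1}\left(1 - \frac{M(k-M)}{k s^2}\right) $$

where M is the assessment mean (6.5), k is the number of items (12), and s² is the variance of the total scores.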
Therefore, in the example:
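
As a rough check, both estimates can be recomputed from the item-level table with a short Python sketch. Two things here are assumptions rather than statements from the text: the variance divisor is taken to be N - 1 (the choice that reproduces the SD values printed in the split-half table), and all variable names are illustrative. Dividing by N instead would give somewhat lower values for both coefficients.

```python
# Item responses copied from the table above: rows are students A-J,
# columns are items 1-12 (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0],  # B
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # C
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0],  # D
    [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],  # F
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # G
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # H
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # I
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # J
]

n = len(responses)      # number of students (N)
k = len(responses[0])   # number of items

# Total score per student, its mean (M), and the sum of squared deviations.
totals = [sum(row) for row in responses]
mean = sum(totals) / n
sum_x2 = sum((t - mean) ** 2 for t in totals)

# ASSUMPTION: variance uses the N - 1 divisor; the source does not show it.
variance = sum_x2 / (n - 1)

# p = proportion answering each item correctly, q = 1 - p.
p_values = [sum(row[i] for row in responses) / n for i in range(k)]
sum_pq = sum(p * (1 - p) for p in p_values)

kr20 = (k / (k - 1)) * (1 - sum_pq / variance)
kr21 = (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

print(f"mean = {mean:.1f}, sum of x^2 = {sum_x2:.2f}, sum of pq = {sum_pq:.2f}")
print(f"KR20 = {kr20:.2f}")
print(f"KR21 = {kr21:.2f}")
```

Under the N - 1 assumption, KR20 is about 0.80 and KR21 about 0.70, so the KR21 shortcut gives the lower estimate on these data.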
The ratio [M(k - M)] / ks² in KR21 is a mathematical approximation of the ratio Σpq/s² in KR20. The formula simplifies the computation but will usually yield, as evidenced, a lower estimate of reliability. The differences are not great on a test with all items of about the same difficulty.

In addition to the split-half reliability estimates and the Kuder-Richardson formulas (KR20, KR21) described above, there are many other ways to compute a reliability index. One of the most commonly used reliability coefficients is Cronbach's alpha (α). It is based on the internal consistency of the items in a test. It is flexible and can be used with test formats that have more than one correct answer.

The split-half estimates and KR20 are interchangeable with Cronbach's alpha under certain conditions. When the test is divided into two parts and the scores and variances of the two parts are calculated, the split-half formula is algebraically equivalent to Cronbach's alpha. When the test format has only one correct answer, KR20 is algebraically equivalent to Cronbach's alpha. Therefore, the split-half and KR20 reliability estimates may be considered special cases of Cronbach's alpha.

Given the universe of concerns which daily confront school administrators and classroom teachers, what matters is not knowing how to derive a reliability estimate, whether using split halves, KR20 or KR21. What matters is knowing what the information means in evaluating the validity of the assessment.

A high reliability coefficient is no guarantee that the assessment is well-suited to the outcome. It does tell you whether the items in the assessment are strongly or weakly related with regard to student performance. If all the items are variations of the same skill or knowledge base, the reliability estimate for internal consistency should be high. If multiple outcomes are measured in one assessment, the reliability estimate may be lower. That does not mean the test is suspect. It probably means that the domains of knowledge or skills assessed are somewhat diverse and a student who knows the content of one outcome may not be as proficient relative to another outcome.

Establishing Interrater Agreement for Performance-Based or Product Assessment

In performance-based assessment, where scoring requires some judgment, an important type of reliability is agreement among those who evaluate the quality of the product or performance relative to a set of stated criteria. Preconditions of interrater agreement are:
1) A scoring scale or rubric which is clear and unambiguous in what it demands of the student by way of demonstration.
2) Evaluators who are fully conversant with the scale and how the scale relates to the student performance, and who are in agreement with other evaluators on the application of the scale to the student demonstration.
The end result is that all evaluators are of a common mind with regard to the student performance, that this common mind is reflected in the scoring scale or rubric, and that all evaluators give the demonstration the same or nearly the same ratings. The consistency of rating is called interrater reliability. Unless the scale was constructed by those who are employing it, with extensive discussion during its construction, training is a necessity to establish this common perspective.
Training evaluators for consistency should include:
Gronlund (1985) indicated that "rater error" can be related to:
1) Personal bias, which may occur when a rater consistently uses only part of the scoring scale, whether by being overly generous, overly severe, or tending toward the center of the scale.
2) A "halo effect," which may occur when a rater's overall perception of a student positively or negatively colors the rating given to that student.
3) A logical error, which may occur when a rater confuses distinct elements of an analytic scale. This confounds the ratings on the items.

Proper training to an unambiguous scoring rubric is a necessary condition for establishing reliability for student performance or product.

When evaluation of the product or performance begins in earnest, it is necessary that a percentage of student work be scored independently by two different raters to give an indication of agreement among evaluators. The sample of performances or products that are scored by two independent evaluators must be large enough to establish confidence that scoring is consistent. The smaller the number of cases, the larger the percentage of cases that should be double scored.

When the data on the double-scored assessments are available, it is possible to compute a correlation of the raters' scores using the Pearson Product-Moment Correlation Coefficient. This correlation indicates the relationship between the two scores given for each student. A correlation of 0.6 or higher would indicate that the scores given to the students are highly related.

Another method of indicating the relationship between the two scores is to establish a rater agreement percentage: take the assessments that have been double scored and calculate the number of cases where there has been exact agreement between the two raters. If the scale is analytic and rather extensive, percent of agreement can be determined for the number of cases where the scores are in exact agreement or adjacent to each other (within one point on the scale). Agreement levels should be at 80% or higher to establish a claim for interrater agreement.

Establishing Rater Agreement Percentages

Two important decisions which precede the establishment of a rater agreement percentage are:
Once the definition of agreement and the acceptable percentage of agreement have been established, list the ratings given to each student by each rater for comparison:

Student | Score: Rater 1 | Score: Rater 2 | Agreement
--------|----------------|----------------|----------
A       |              6 |              6 | X
B       |              5 |              5 | X
C       |              3 |              4 |
D       |              4 |              4 | X
E       |              2 |              3 |
F       |              7 |              7 | X
G       |              6 |              6 | X
H       |              5 |              5 | X
I       |              3 |              4 |
J       |              7 |              7 | X

Dividing the number of cases where the two raters' scores are in agreement (7) by the total number of cases (10) gives the rater agreement percentage (70%).
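
The same calculation can be sketched in Python. The helper name `agreement_percentage` and its `tolerance` parameter are illustrative assumptions, not part of the source; a tolerance of 0 counts exact agreement, and a tolerance of 1 counts scores that are exact or adjacent (within one point).

```python
# Ratings copied from the table above (students A-J).
rater_1 = [6, 5, 3, 4, 2, 7, 6, 5, 3, 7]
rater_2 = [6, 5, 4, 4, 3, 7, 6, 5, 4, 7]


def agreement_percentage(scores_a, scores_b, tolerance=0):
    """Percent of cases where the two scores differ by at most `tolerance`."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both raters must score the same cases")
    agreements = sum(
        1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance
    )
    return 100 * agreements / len(scores_a)


exact = agreement_percentage(rater_1, rater_2)
adjacent = agreement_percentage(rater_1, rater_2, tolerance=1)
print(f"exact agreement: {exact:.0f}%")
print(f"exact or adjacent agreement: {adjacent:.0f}%")
```

On these data exact agreement is 70%, below the 80% level named earlier, while exact-or-adjacent agreement is 100%, which shows why the definition of agreement has to be settled before the percentage is interpreted.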
When there are more than two teachers, the consistency of ratings for two teachers at a time can be calculated with the same method. For example, if three teachers are employed as raters, rater agreement percentages should be calculated for each pair of raters: Rater 1 and Rater 2, Rater 1 and Rater 3, and Rater 2 and Rater 3.

Every pairwise percentage should meet or exceed the acceptable level of agreement. If there is occasion to use more than two raters for the same assessment performance or product, an analysis of variance using the scorers as the independent variable can be computed using the sum of squares.

In the discussion of the various forms of performance assessment, it has been suggested how two raters can examine the same performance to establish a reliability score. Unless at least two independent raters have evaluated the performance or product for a significant sampling of students, it is not possible to obtain evidence that the score obtained is accurate to the stated criteria.
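
The pairwise comparison for more than two raters can be sketched as follows. The ratings for Rater 3 are hypothetical, made up purely for illustration, as are the function and variable names; the 80% threshold is the level named earlier in the text.

```python
from itertools import combinations

ratings = {
    "Rater 1": [6, 5, 3, 4, 2, 7, 6, 5, 3, 7],  # from the table above
    "Rater 2": [6, 5, 4, 4, 3, 7, 6, 5, 4, 7],  # from the table above
    "Rater 3": [6, 4, 4, 4, 3, 7, 5, 5, 4, 7],  # HYPOTHETICAL third rater
}
ACCEPTABLE = 80.0  # acceptable agreement level, in percent


def exact_agreement(scores_a, scores_b):
    """Percent of cases where two raters gave exactly the same score."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100 * matches / len(scores_a)


# One agreement percentage for every pair of raters.
pairwise = {
    (first, second): exact_agreement(ratings[first], ratings[second])
    for first, second in combinations(ratings, 2)
}

for (first, second), percent in pairwise.items():
    verdict = "meets" if percent >= ACCEPTABLE else "falls below"
    print(f"{first} vs {second}: {percent:.0f}% ({verdict} the {ACCEPTABLE:.0f}% level)")
```

With three raters there are three pairs to check; with four raters there would be six, so the number of comparisons grows quickly as raters are added.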