PEP 6305 Measurement in Health & Physical Education

Topic 11: Reliability

Section 11.1

n   This Topic has 3 Sections.

Reading

n   Vincent & Weir, Statistics in Kinesiology, 4th ed., Chapter 13, "Quantifying Reliability".

n   Also, the "Reliability" PDF reading posted in Blackboard.

Purpose

n   To discuss the principles of reliability and measurement error.

n   To demonstrate the estimation of reliability and the standard error of measurement.

Objectivity

 

n   Objectivity concerns how a test is scored. It depends on two factors:

o  A defined scoring system.

o  Individuals (called judges or raters) who have been trained to score the test.

n   An objective test is one to which two or more competent judges assign the same value when scoring it. In other words, objectivity reflects whether the judges agree on the rating.

o  The training of the judges and the scoring system are both important for achieving high reliability.

o   For example, if multiple judges score an event such as gymnastics or diving, you need a scoring system that specifies not only which aspects of the performance are important but also how to assign points (a scale), and the judges must be trained on what to observe and how to award those points.

n   Objectivity is sometimes referred to as interrater reliability.

n   An objective rating is always better than a subjective rating because less measurement error is introduced.

o  Differences between multiple test scorers introduce measurement error.

o  Increasing measurement error decreases objectivity (interrater reliability); decreasing objectivity, in turn, decreases validity (validity is the final course Topic). A simple sketch of estimating interrater agreement follows this list.
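
To make interrater agreement concrete, the sketch below correlates the scores two judges assign to the same ten performances. This is a minimal, hypothetical illustration in Python: the scores are made-up numbers, and plain Pearson correlation is used as a simple agreement index rather than any procedure prescribed by the course.

    # Hypothetical sketch: interrater reliability (objectivity) indexed by
    # the correlation between two judges' scores of the same performances.
    import statistics

    judge_a = [8.5, 7.0, 9.0, 6.5, 8.0, 7.5, 9.5, 6.0, 8.5, 7.0]  # made-up scores
    judge_b = [8.0, 7.5, 9.0, 6.0, 8.5, 7.0, 9.0, 6.5, 8.0, 7.5]

    # statistics.correlation() (Python 3.10+) returns Pearson's r.
    r = statistics.correlation(judge_a, judge_b)
    print(f"Interrater agreement (Pearson r): {r:.2f}")

A value near 1.0 would indicate that the judges score performers almost identically; a value near 0 would indicate that the scoring system or judge training is not producing consistent ratings.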

 

Reliability and Measurement Error

n   Reliability concerns how accurately a test represents variation between subjects.

n   Measurement theory: An observed score (X) consists of two components, a true score (T) and an error score (E):

o  X = T + E

o  T (true score) is the measure of the ability or characteristic that we are interested in.

o  E (error score) is measurement error, which is anything that is NOT the thing we want to measure.

n   There are several possible sources of measurement error:

o  Measurement unit or scale: a test's unit of measurement may be too large to measure the characteristic precisely. For example, if subjects are rated on a three-point scale (poor, average, excellent), it is impossible to distinguish between subjects who receive the same rating, even though it is unlikely that they all have exactly the same ability: not everyone rated "excellent" performs exactly the same, so giving them all the same score introduces some error (deviation from their "true" score or ability).

o  Subject inconsistency: a person could have a good or bad day when being tested, which means their performance differs from day to day.

o  Poor test conditions: noise or other distractions during a written test, or a slippery surface during a running test. Poor conditions, rather than the subject's ability, affect the subject's score.

o  Poorly constructed test: badly written test questions that no one can understand force the test takers to guess, and guessing is not a measure of academic ability.

o  Poor test equipment: the equipment (e.g., the gas analyzer when measuring VO2) is inconsistent because it is malfunctioning or improperly calibrated.

n   Theoretically, you could determine a person's true score (T) by calculating the mean of an infinite number of tests.

o  While it is not possible to administer a test an infinite number of times, it is important to understand that the mean of several administrations (or trials) is the most accurate representation of the subject's true score. Why? In general (but not always), the more trials you average, the closer the mean score comes to the subject's true score -- that is, the more reliable it is.

o  Measurement error is assumed to be random and normally distributed. Thus, the mean error over several trials will tend toward 0, with predictable variability on either side of 0. (The sketch below illustrates both points.)
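
A small simulation can make the X = T + E model concrete. The sketch below is hypothetical (the true score and the error SD are invented numbers), but it shows that when error is random and normally distributed, the mean error across many trials is close to 0 and the mean observed score is close to the true score.

    # Hypothetical simulation of measurement theory: X = T + E.
    # Error is random and normally distributed with mean 0, so the mean of
    # many trials converges toward the subject's true score.
    import random

    random.seed(1)
    true_score = 50.0                              # T: the ability we want to measure
    trials = [true_score + random.gauss(0, 5.0)    # X = T + E; error SD assumed to be 5
              for _ in range(1000)]

    mean_observed = sum(trials) / len(trials)
    print(f"Mean of 1000 trials: {mean_observed:.2f} (true score = {true_score:.2f})")
    print(f"Mean error: {mean_observed - true_score:+.2f} (close to 0)")

With only a handful of trials the mean error would typically be larger, which is why averaging more trials generally gives a more reliable estimate of the true score.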

Interpreting Test Reliability

n   A reliability coefficient represents the proportion of total test variance that reflects true score differences among the subjects.

o  A reliability coefficient can range from a value of 0.0 (all the variance is measurement error) to a value of 1.00 (no measurement error). In reality, all tests have some error, so reliability is never 1.00.

o  A test with high reliability (≥0.70) is desired, because lower reliability indicates that a large proportion of test variance is measurement error.

o  If test reliability is 0 and test scores are used to assign grades, a student's grade would be assigned purely by chance, similar to flipping a coin or rolling dice!

n   High reliability indicates that the test is measuring something; validity studies (Topic 12) determine what the test is measuring.

o  High test reliability is required for test validity. A test cannot be valid if it is not reliable.

o  Low reliability means most of the observed test variance is measurement error, i.e., due to chance.

o  If test variance is largely due to chance, the test is not measuring anything.

o  If a test is not measuring anything, it cannot be a valid measure of anything.

Determining Reliability

n   Test reliability is always established for a defined population; reliability of a test in one population may not be the same as in other populations.

o  Test variance is central to reliability.

o  Since a test score (X) consists of true and error components, the total variance (σx²) of a test administered to a group consists of true score variance (σt²) and error variance (σe²):

         σx² = σt² + σe²

     Test Variance = True Score Variance + Error Variance

n   To illustrate, suppose the SD of a test administered to students was 2.0 (thus, total test variance = 2.0² = 4.0). All of the students guessed on every question, which means that getting the questions correct was due to luck or chance (σe² = 4.0). Since guessing is completely random and has nothing to do with ability (true score), there would be no true score variance (σt² = 0) and the components would be:

o  Total Variance = True Score Variance + Error Variance: 4.0 = 0.0 + 4.0

n   Test reliability (Rxx) is calculated from these variances:

         Rxx = (σx² - σe²) / σx² = σt² / σx²

o  For this example, test reliability would be: Rxx = [(4.0 - 4.0)/4.0] = 0/4.0 = 0.0

o  The reliability of this test is 0, which means that all of the test variance was due to measurement error. The test did not measure anything.

n   As another example of test reliability, let us assume we have a test with a total variance of 40 and an error variance of 4. In that case:

o  The reliability of the test would be: Rxx = [(40 - 4)/40] = 36/40 = 0.90.

o  The test reliability would be 0.90: 90% of the test variance is attributable to true score differences, and only 10% of the total test variance is due to measurement error. This test has good reliability for detecting differences among subjects in the ability or trait being measured. (The short code check below reproduces both calculations.)
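
The same arithmetic is easy to verify in a few lines of code. The sketch below simply re-applies Rxx = (total variance - error variance) / total variance to the two examples above; it is an illustration, not a new procedure.

    # Re-check of the two worked examples:
    # Rxx = (total variance - error variance) / total variance
    def reliability(total_var: float, error_var: float) -> float:
        return (total_var - error_var) / total_var

    print(reliability(4.0, 4.0))    # 0.0 -> all of the variance is measurement error
    print(reliability(40.0, 4.0))   # 0.9 -> 90% of the variance is true score variance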

Types of Reliability

Stability or "Test-Retest" Reliability

n   Involves administering the same test on two or more different occasions.

n   Typically the tests are administered within a 7-day period to ensure that the true score does not change during the testing period.

n   This method can be used with any test, but is often used with tests that cannot be administered twice within the same day. An example would be an endurance test like the 1.5-mile run.

n   The stability reliability of a scorer (i.e., comparing multiple scores assigned by a single judge) is called intrarater reliability (how is this different from interrater reliability, or objectivity?). A minimal sketch of estimating test-retest reliability follows.
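
As an informal illustration, stability can be indexed by correlating the scores from the two administrations. The data below are made up (1.5-mile run times in minutes on day 1 and day 7), and plain Pearson correlation is used only to keep the sketch short; see the assigned reading for the coefficients actually used to quantify reliability in this course.

    # Hypothetical sketch: test-retest (stability) reliability indexed by the
    # correlation between two administrations of a 1.5-mile run test.
    import statistics

    day_1 = [11.8, 13.2, 10.5, 12.9, 14.1, 11.2, 12.4, 13.8]  # minutes (made-up)
    day_7 = [12.0, 13.0, 10.8, 12.7, 14.3, 11.5, 12.2, 13.9]

    r = statistics.correlation(day_1, day_7)   # Pearson r, Python 3.10+
    print(f"Test-retest correlation: {r:.2f}")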

Internal Consistency Reliability

n   This type of reliability involves obtaining multiple measures within a day, usually at a single testing session. Examples are:

n   Written test. The items are the multiple measures. The person's score is the sum of all items answered correctly.

n   Psychological instrument. These survey or interview instruments consist of several items that are often scored with 1 to 5 points. The person responds on a 5-point scale describing how characteristic the described behavior is of them. The person's score is usually the sum of all items. (A short example of quantifying this kind of consistency follows this list.)

n   Judge's ratings. A judge rates the performances of several individuals. Some examples are: figure skaters, divers, gymnasts; allied health students performing clinical procedures; or high school students who try out for the drill team or cheerleading squad. The judge rates several aspects of the performance, such as the components of a figure skating routine or competitive dive; each aspect is rated independently of the other aspects. The final score is typically the sum or average of the judge's ratings. (Comparing the final scores between multiple judges is objectivity.)
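
One common way to summarize internal consistency is coefficient (Cronbach's) alpha, which compares the item variances with the variance of the total score. The sketch below is a hypothetical illustration only, using made-up responses to a 4-item, 5-point instrument; it is not necessarily the coefficient used in this course.

    # Hypothetical sketch: internal consistency of a 4-item, 5-point instrument,
    # summarized with coefficient (Cronbach's) alpha:
    #   alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    import statistics

    # Each row is one respondent's answers to the 4 items (made-up data).
    responses = [
        [4, 5, 4, 4],
        [2, 3, 2, 3],
        [5, 5, 4, 5],
        [3, 3, 3, 2],
        [1, 2, 2, 1],
        [4, 4, 5, 4],
    ]

    k = len(responses[0])                               # number of items
    items = list(zip(*responses))                       # one tuple of scores per item
    totals = [sum(person) for person in responses]      # each person's total score

    sum_item_var = sum(statistics.pvariance(item) for item in items)
    alpha = k / (k - 1) * (1 - sum_item_var / statistics.pvariance(totals))
    print(f"Coefficient alpha: {alpha:.2f}")

Values near 1.0 indicate that the items vary together, i.e., they appear to measure the same underlying trait.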

Click to go to the next section (Section 11.2)
