
NAPLAN data is not comparable across school years

Recent political commentary suggests Australia has not performed as well as expected in the latest round of NAPLAN testing. Such commentary rests on the belief that NAPLAN scores can be compared from one year to the next.

But new research, to be published next month in the journal Quality Assurance in Education, shows that NAPLAN results cannot be compared across years. It is therefore not reasonable for politicians to say NAPLAN results have plateaued, because year-to-year comparisons are not reliable.

The study – which used NAPLAN scores from 2008 to 2012 – questions the reliability of NAPLAN as a tool for charting individual student progress across school years, let alone that of whole year groups.

The study collected data for nearly 10,000 students in over 110 primary schools in Australia. It looked at the influences on student performance of gender, language background and the NAPLAN test itself (the nature of the actual test in a particular year).

While gender was found to be the most influential, followed by language background, the test itself was also found to be a factor, with the average score attained by students fluctuating significantly from year to year. This variability, the study concluded, was likely to be a consequence of differences in the tests themselves rather than a reflection of student performance.

Flaws in the test

There are two conflicting views about the reliability and use of NAPLAN scores to compare individuals from one year to the next.

On one view, the comparison is sound because NAPLAN tests share some questions from one year to the next. Student performance on these common questions can be used to standardise each test as a whole, which provides the mechanism for comparing one year's test with the next.

On the other view, Professor Margaret Wu from the University of Melbourne is sceptical that NAPLAN scores can be used to compare individuals or schools from year to year.

She argues that comparisons of national cohorts are problematic due to the large random fluctuations and error margins implicit in such comparisons.

Each NAPLAN test is short, at only 40 questions. The subset of common questions used to standardise one test against another is therefore too small to do the job reliably.

As with any test, some error in measurement is expected. The errors can arise for a number of reasons. One is that the answer marked as correct may not in fact be the best answer, which confuses students. A test's error rate measures how well the instrument (the test) reproduces the same result when it is administered again.

In the case of NAPLAN, that would mean the same group of students getting the same result, the school getting the same result, and any system (a state Department of Education or a Catholic Education Office, for example) getting the same result.

In the NAPLAN test the measurement error is large mainly because the test is short. Even if the test from one year could be compared with the test from another, the errors inherent in individual test scores would mean such a comparison would be unreliable.

Margaret Wu states that the fluctuation in NAPLAN scores can be as much as ±5.2. This follows from a standard error of measurement of about 2.6.

This means we can be about 95% confident that if the same students sat the same test again (with no new learning between sittings), each result would fall within ±5.2 (2 × 2.6) of the original score. This represents nearly 12% variability for each individual score.
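The arithmetic behind the ±5.2 figure can be sketched in a few lines. This is only an illustration, assuming the common rule of thumb that a 95% margin is roughly twice the standard error of measurement. The function names and the example score of 30 are hypothetical and do not come from the study.

```python
def margin_95(sem, multiplier=2.0):
    """Approximate 95% margin of error from a standard error of measurement.

    The multiplier of 2 is the rule of thumb used in the article
    (the exact normal-distribution value would be about 1.96).
    """
    return multiplier * sem


def score_band(score, sem, multiplier=2.0):
    """Return the (low, high) band in which a retest score would likely fall."""
    margin = margin_95(sem, multiplier)
    return (score - margin, score + margin)


sem = 2.6                          # standard error of measurement quoted in the article
margin = margin_95(sem)            # 2 x 2.6
low, high = score_band(30, sem)    # 30 is a hypothetical observed score

print(f"margin: +/-{margin:.1f}")
print(f"a score of 30 could plausibly be anywhere from {low:.1f} to {high:.1f}")
```

Any single score is therefore better read as a band roughly ten points wide than as an exact number.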

The standard error of measurement depends on the test reliability, meaning the capacity of the test to produce consistent and robust results.
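One way to see this dependence is the classical test theory relationship, in which the standard error of measurement equals the spread of scores multiplied by the square root of one minus the reliability. The sketch below assumes that relationship applies. The score spread of 8 and the reliability values are hypothetical, chosen only to show the trend, and are not NAPLAN figures.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability).

    score_sd    -- standard deviation of observed scores
    reliability -- reliability coefficient between 0 and 1
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


hypothetical_sd = 8.0  # assumed spread of scores, for illustration only

# A less reliable test yields a larger measurement error for the same spread.
for reliability in (0.70, 0.80, 0.90):
    sem = standard_error_of_measurement(hypothetical_sd, reliability)
    print(f"reliability {reliability:.2f} -> SEM {sem:.2f}")
```

Shorter tests tend to have lower reliability, which is one reason a 40-question test carries a relatively large error.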

Some researchers argue that the NAPLAN test's large margin of error makes comparisons across years inaccurate.

For example, if one student gets 74% in a test, another gets 70%, and the error is 5, then the first mark is really 74 ± 5 and the second is 70 ± 5.

The two ranges (69 to 79 and 65 to 75) overlap substantially, so it is not really possible to say that a score of 74 is meaningfully different from a score of 70.
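The overlap in the 74 versus 70 example can be checked directly. The helper names below are hypothetical and the sketch simply treats each mark as a range of plus or minus the stated error.

```python
def score_interval(score, error):
    """Return the (low, high) range implied by a mark and its error."""
    return (score - error, score + error)


def intervals_overlap(a, b):
    """True if two (low, high) ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


first = score_interval(74, 5)    # (69, 79)
second = score_interval(70, 5)   # (65, 75)

# The shared region runs from the larger lower bound to the smaller upper bound.
shared = (max(first[0], second[0]), min(first[1], second[1]))  # (69, 75)

print("overlap:", intervals_overlap(first, second))
print("shared range:", shared)
```

With an error of 5, the two marks would need to differ by more than 10 before their ranges stopped overlapping.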

The implication is that, once this error is taken into account across a whole cohort, it is difficult to say categorically that one set of marks differs from another.


There are various implications for using NAPLAN results to compare students, schools or even state performance.

The “My School” website data, for example, should be viewed with caution by parents when making decisions about their children’s schooling.

Teachers and principals should not be judged based on NAPLAN findings and, as others have argued, more formative (assessment during learning) rather than summative (assessment at the end of a learning cycle) measures for providing teaching and learning feedback should be explored.

NAPLAN is not good for the purpose for which it was intended. However, it makes politicians feel they are doing something to promote literacy and numeracy.
