Leaving Cert: Code must always be rigorously tested to ensure it does what’s needed

Software tends to get caught up in jargon. Artificial Intelligence, machine learning, algorithms.

At its simplest, an algorithm is a set of instructions, or a recipe, that a computer follows. Computers are essentially stupid and have to be told exactly what to do and what not to do. Your shampoo bottle contains an algorithm: wash hair, rinse, repeat.

A robot following these instructions would use all the shampoo as the ‘repeat’ instruction means that it would keep repeating the earlier steps until all the shampoo was gone. Programmers often talk about 90 per cent of a developer’s work being defensive - to stop bad things happening. To stop the robot emptying the shampoo bottle.
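The shampoo example can be sketched in a few lines of Python. This is only an illustration: the function names, the 10ml dose and the limit of two washes are invented for the example. The first function reads "repeat" literally, the second adds the kind of defensive limit a programmer would put in.

```python
def wash_hair_literal(shampoo_ml, dose_ml=10):
    """Follow 'wash, rinse, repeat' literally: keep going while shampoo remains."""
    washes = 0
    while shampoo_ml >= dose_ml:  # nothing tells the robot when to stop
        shampoo_ml -= dose_ml     # wash and rinse
        washes += 1               # repeat
    return washes, shampoo_ml


def wash_hair_defensive(shampoo_ml, dose_ml=10, max_washes=2):
    """Same recipe, with an explicit stopping rule and a check on the input."""
    if dose_ml <= 0:
        # Defensive check: a zero or negative dose would make no sense
        raise ValueError("dose_ml must be positive")
    washes = 0
    while washes < max_washes and shampoo_ml >= dose_ml:
        shampoo_ml -= dose_ml
        washes += 1
    return washes, shampoo_ml


print(wash_hair_literal(250))    # the robot empties the bottle
print(wash_hair_defensive(250))  # the robot stops after two washes
```

With a 250ml bottle the literal version washes 25 times and leaves nothing, while the defensive version washes twice and leaves 230ml.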

The ideal way to build software is to write out the detailed list of things that the computer needs to do and then turn this into code. This code then needs to be rigorously tested to make sure it really does what it needs to do (spacecraft have crashed because a few lines of code weren’t written or tested properly). It takes a lot of test planning, test cases and test results to check the models against, and it needs people who are expert in testing. This isn’t trivial, and it works well when you know exactly what you need to do.

The HSE’s Covid Tracker app is a good example of this done well. To massively oversimplify: ping phones within a few metres, note the codes from those phones every 15 minutes and, if someone who was in close contact develops Covid, use the app to notify the people they were near.

The problem comes when you don’t know exactly what you want and you spend time figuring it out. If you’re painting a wall you’ll usually get a few tester pots, decide on a colour and paint. And if you change your mind later you need to repaint the whole wall again. This appears to be partially what happened with the Leaving Cert calculated grades. There was a desire for the distribution of grades this year to be similar to previous years.

To get this result at least 20 different models were created, some with multiple variations, to try to create a broadly similar pattern of results to previous years.

The process was designed to keep the spread of grades close to previous years (similar numbers of H1s, H2s, H3s etc), while keeping the system fair. This feels a bit like painting one wall 20 times to try to get its colour as close as possible to the colour of another wall (the grades from previous years).

Previous school histories

In addition, amid “public disquiet” in other countries, previous school histories were removed from the models late in the process. This made the grade matching even harder statistically. This level of change in software development projects that are under severe time constraints frequently leads to problems.

It is clear that a huge amount of work went into the calculated grades process over a relatively short period of time. The report from the Department of Education’s expert group on calculated grades recognises that “statistical prediction models are inherently biased”. Because of this there may have been a focus on testing the overall model, comparing overall results to previous years, without a detailed testing focus on either schools or individual students. Given the overall output looked broadly correct, the specific error in the code was missed and may never have been tested for.

There appears to be no discussion in any of the documentation published as to how the models were tested other than by reference to overall comparison to grade distribution in previous years.
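To see why a check on the overall distribution alone can miss problems, consider the small Python sketch below. It is not the Department's method: the two schools, the grades and the function names are all invented for illustration, and "reference" simply stands for whatever benchmark a tester chooses to compare against.

```python
from collections import Counter


def grade_distribution(grades):
    """Share of each grade in a list, e.g. {'H1': 0.25, 'H2': 0.25, ...}."""
    counts = Counter(grades)
    total = len(grades)
    return {grade: count / total for grade, count in counts.items()}


def max_shift(dist_a, dist_b):
    """Largest change in share for any single grade between two distributions."""
    all_grades = set(dist_a) | set(dist_b)
    return max(abs(dist_a.get(g, 0.0) - dist_b.get(g, 0.0)) for g in all_grades)


def pooled(by_school):
    """Combine every school's grades into one list."""
    return [g for grades in by_school.values() for g in grades]


# Invented data: a reference set of grades and a model's output, per school
reference = {
    "School A": ["H1", "H2", "H3", "H4"],
    "School B": ["H1", "H2", "H3", "H4"],
}
model_output = {
    "School A": ["H1", "H1", "H2", "H2"],  # pushed up
    "School B": ["H3", "H3", "H4", "H4"],  # pushed down
}

# Test 1: compare the overall distributions only
overall_shift = max_shift(
    grade_distribution(pooled(reference)),
    grade_distribution(pooled(model_output)),
)

# Test 2: compare the distributions school by school
per_school_shift = {
    school: max_shift(
        grade_distribution(reference[school]),
        grade_distribution(model_output[school]),
    )
    for school in reference
}

print(overall_shift)     # 0.0: the overall picture looks identical
print(per_school_shift)  # each school has shifted by 0.25
```

In this made-up case the overall test passes perfectly, yet every student in one school has moved up and every student in the other has moved down. Only the school-level test shows it.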

Many questions

At this point there are lots of questions that need to be addressed. Why were so many models run? Did anyone notice the extreme levels of downgrades in some schools compared to others? Why was this specific problem discovered now? Is this tied to the cases currently before the courts? How were the models tested and validated originally? Were external teams and testing experts used to support code review and quality checking of the code? Why weren’t the overall algorithms, the models and the assumptions in these models published? In contrast, the HSE Covid Tracker app has set a gold standard globally for clear, open-source code that is being shared and used internationally.

Given the role of the Leaving Cert in Irish society and the impact of this key life event for thousands of Irish students, these questions, which go well beyond a single error, urgently need to be answered.

Dermot Casey is an innovation and technology expert with over 25 years’ experience working across multinationals and early-stage technology companies.