Charlie discusses the recent IT failure of NATS which caused numerous flight delays and gives an insight into why we should plan for a potential IT failure in our organisations.
The failure of the NATS (National Air Traffic Services) computer was one of the big stories of the week, with the import of one flight plan being enough to crash the air traffic system and cause flight chaos lasting several days. The incident occurred on Monday, and I was at Luton Airport on Thursday, waiting for a delayed flight to Glasgow. The delay has given me time and inspiration to write this week’s bulletin.
With cyber threats and power outages the focus of the moment for us business continuity folks, I think we have forgotten about a good old threat: the computer outage.
I’d like to share a few thoughts with you as I reflect on the NATS incident:
- I never believe anyone who says it couldn’t happen, especially technology personnel. It appears that one inputted flight plan was enough to crash the entire air traffic system for the UK for several hours. You would think that, as this system is so critical to the safety of life, there would be backups and backups of backups, and so the ‘glitch’ would be momentary. So, lesson number one: never believe technology personnel, or anyone else, who say it can’t happen. If ‘it couldn’t happen’ were true, all of us business continuity professionals would be out of a job, and the Titanic wouldn’t have sunk!
- When a complex, always-on system runs 24/7, the ripples of a failure take days to sort out. If flights only ran from 9-5, airlines could do some evening work to catch up. Aircraft do not make money sitting on the tarmac, so airlines work them as hard as possible, and any delay has a knock-on effect on later flights. There is also very little spare capacity in the system to accommodate passengers from delayed and cancelled flights. Lesson number two: if a system runs at full or near capacity, then you have to accept that the impact will last for days.
- I suspect there was a sigh of relief among the airlines that it was not their IT system that had failed and caused the issue. British Airways is one of the airlines that has had a number of computer failures, leading to delayed and cancelled flights and angry passengers. Airlines have been quick to blame NATS and, in some instances, have cited an “event beyond their control” to avoid paying compensation or providing food and accommodation. There is always a danger in blaming a supplier for a disaster affecting your customers, as it was you who chose the supplier. In this case, the airlines have no choice but to use NATS, so they can legitimately blame them. Lesson number three: if the incident is not your fault, think through the implications before you blame someone else.
- Customers also need to be aware of what they are buying. If you pay £53 for a flight and it is delayed or cancelled, you can’t be surprised when the airline tries to wriggle out of paying the compensation due under the law, which is often more than the cost of the flight, and out of paying for a £100 hotel room. With this incident, passengers who paid a lot more than £53 will also have been delayed, but I think as a consumer, you have to be aware of the value proposition you are entering into when buying a cheap flight. It’s brilliant if it all works, but beware of the impact if it doesn’t. Lesson number four: if it’s very cheap, there’s a reason.
- Can airlines ever provide enough information? As soon as there’s a delay to a flight, people say they didn’t get any information or updates from the airline. You would think that with all the tools and experience they have, they would know to give out as much information as possible. Perhaps they can never win. Lesson number five: if you think you are informing your customers, then do the same again four times, and that might satisfy some of those wanting an update.
- Don’t forget the manual workarounds. My understanding is that NATS can carry out air traffic control manually, but of course, they cannot supervise as many flights as they can using a computer. We must also acknowledge that some activities can’t be done manually and are not performed until the system is up and running again. As business continuity planners, we should, where possible, ensure that there are manual workarounds for activities that will be impacted if IT fails. This might require some detailed planning and instructions. Lesson number six: make sure that, whenever possible, manual workarounds are in place.
The conclusion from this incident is that we mustn’t forget IT failure as one of the risks we should be planning for, even if those who manage IT say, ‘This could never happen’.
 NATS Holdings, formerly National Air Traffic Services and commonly referred to as NATS, provides en-route air traffic control services to flights within the UK flight information regions and the Shanwick Oceanic Control Area.