Frustration to salvation: a code to end computer crashes

Remember people told you to back things up? Remember? Sybren A. Stuvel/Flickr

Computer crashes and software bugs are infuriating and – usually belatedly – teach us the value of regular back-ups. But could they be a thing of the past?

We’ve all been there: the work of days or weeks destroyed in an instant. It might have been a crashed word processor, a crashed operating system, the computer eating your homework, destroying your lab results, your report …

But what if software is responsible for more than data? What if your money or your life depended on a particular piece of software not failing?

Stock exchanges and banking systems are controlled by software, and every time you drive a car or get on a plane you put your life in the hands of software systems.

Of course, safety-critical systems are built with much more care than ordinary software, but bugs still occur. Bug testing – the main method of software assurance in use today – has the following problem: not finding any more bugs doesn’t mean they’re not there. Maybe you just haven’t looked long or hard enough.

So what can we do about it?

Well, there is another method: “formal verification”. At its heart, software is just a form of logic. Using a mathematical proof you can demonstrate that a piece of software will always behave in a certain way.

This process, known as formal verification, can show the absence of bugs in a piece of software, but the technique has its own limitations. Until now, it has been too difficult to apply this method directly to code of any real complexity.

This is what our team from NICTA and UNSW has done: we have demonstrated a formal proof of correctness for a complex, critical piece of code, the seL4 microkernel.

(A microkernel is a minimal operating system (OS) kernel that controls computer hardware and memory access for the rest of the OS.)

With about 10,000 lines of code, seL4’s size is typical for a critical piece of software, even if it is tiny for an operating system (Linux, for example, is millions of lines of code long).

seL4’s main job is to provide fault isolation: if two components are running on top of the microkernel – a user interface and a medical sensor, say – and the user interface crashes, the medical sensor will be unaffected.

This is a great strategy for managing software bugs: if you cannot eliminate them entirely, at least contain them where they are annoying, but not lethal.

While there can be no final guarantee that a particular computer system will run our code properly, we’ve been able to significantly increase the amount of trust we can put into pieces of software of seL4’s size.

People have been working to achieve something like this for more than 30 years and now, finally, it is a reality.

If all goes well, the first commercial product built on seL4 may appear as early as 2012. This technology has a range of potential uses, including the development of:

  • enhanced security and safety for existing software products
  • financial applications running securely on consumer hardware
  • secure mobile communications in defence and in the enterprise and consumer sectors
  • improved medical implant devices

But while we’re busy trying to roll this code out into commercial applications, remember you can avoid (some of) the hassle of annoying glitches and crashes: back up your data regularly.

The seL4 kernel is available for download for academic use and for evaluation. More information on the proof and its assumptions can be found on the project web site.