Imagine taking a thousand copies of a phone book, shredding them all together, then trying to use the overlapping pieces to reconstruct a copy.
This is a simple problem compared to assembling the human genome, which has about 3 billion “letters”.
Now imagine trying to piece these together on a desktop computer. Sound impossible? The Computational Genomics group at NICTA have produced a new software program called Gossamer to do just that.
Sequences of DNA “letters” are the instruction books from which all living cells operate, and understanding them is crucial in advancing our understanding of complex diseases, especially cancer.
The past few years have seen a revolution in our ability to gather DNA sequence information. Scientists can, for a few tens of thousands of dollars and a few weeks’ effort, gather the data that took the Human Genome Project over a decade and billions of dollars – and it’s getting cheaper and faster all the time.
The technology that does this, so-called “Second Generation Sequencing”, yields millions of randomly selected short fragments of DNA, with about a hundred letters in each fragment.
The problem is that these sequencing machines give us no clues about how the individual fragments fit together. In some cases we can match them against a reference sequence (e.g. the human genome) in order to begin to analyse them, but for many important kinds of analysis we need to piece them together to reconstruct the original sequence: a process called assembly.
For bacteria – which usually have a few million letters in their DNA – this process is not too hard. For humans – with approximately 3 billion letters – it’s vastly more difficult.
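To make the idea of assembly concrete, here is a toy sketch in Python. It uses a simple greedy strategy: repeatedly join the two fragments that overlap the most. This is an illustration only, not the method Gossamer uses; the function names `overlap` and `greedy_assemble`, the `min_len` parameter and the example sequence are all invented for this sketch. Real sequencing data also contains errors and repeated regions, which this sketch ignores.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Toy assembler: keep merging the pair of fragments with the largest overlap.

    Returns the remaining sequences once no pair overlaps by `min_len` or more.
    """
    # Drop exact duplicates and fragments wholly contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags
        # Join the best pair, writing the shared letters only once.
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


if __name__ == "__main__":
    # Three made-up, shuffled fragments of the made-up sequence AGCTTAGGCATCCGA.
    reads = ["CATCCGA", "AGCTTAG", "TAGGCAT"]
    print(greedy_assemble(reads))  # prints ['AGCTTAGGCATCCGA']
```

The sketch compares every fragment with every other fragment on each pass. That is fine for three fragments, but it hints at why millions of fragments from a 3-billion-letter genome demand far more careful use of time and memory.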
We are certainly not the only researchers to have built software for this purpose. Indeed, there are many such programs, and some of them work very well.
But when it comes to assembling the genomes of complex organisms (such as humans), other assembly programs typically require large computing infrastructure – a supercomputer, or a large cluster of computers. This kind of infrastructure is expensive to build and maintain, and not always very accessible.
As a result, some researchers don’t gather data because they cannot analyse it, or they use roundabout methods that have significant shortcomings.
To address this problem, we dipped into our theoretical computer science “toolbox” and asked the following question: what is the minimum amount of computer memory required to solve the assembly problem?
It’s a question no-one had asked before, and the answer was surprising. We found the theoretical minimum is well within the capacity of a good workstation computer that you might have on your desk.
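As a rough illustration of how a “minimum memory” question can be posed, consider a simplified framing that is assumed here rather than taken from the article: the assembler must remember which distinct short DNA words of length k appear in the data, out of the 4^k words that are possible. Counting how many different sets it might need to tell apart gives a lower bound in bits. The function name `min_bits_for_subset` and the numbers in the example are hypothetical.

```python
import math


def min_bits_for_subset(universe: int, chosen: int) -> float:
    """Fewest bits that can distinguish every `chosen`-element subset of a
    `universe`-element set: log2 of the binomial coefficient C(universe, chosen).
    """
    if not 0 <= chosen <= universe:
        raise ValueError("chosen must be between 0 and universe")
    # Work with log-gamma so very large numbers do not overflow.
    ln_count = (
        math.lgamma(universe + 1)
        - math.lgamma(chosen + 1)
        - math.lgamma(universe - chosen + 1)
    )
    return ln_count / math.log(2)


if __name__ == "__main__":
    k = 25                            # hypothetical word length
    distinct_words = 3_000_000_000    # hypothetical count, for illustration only
    bits = min_bits_for_subset(4 ** k, distinct_words)
    print(f"about {bits / 8 / 1e9:.1f} gigabytes")
```

The inputs above are placeholders, so the printed figure says nothing about Gossamer’s actual memory use. The point is only that a bound of this kind can be calculated before any software is written.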
Using state-of-the-art research, we produced a program for assembling DNA that uses, in some cases, one thousandth of the memory required by popular alternatives.
This technology has several benefits. Scientists who previously struggled to get access to a powerful enough computer to analyse their sequence data can now do so. Scientists with existing supercomputer access can now do more analyses, perhaps studying more samples.
Moving out of the lab
In the near future we expect this kind of sequencing will move out of the research laboratory and into the pathology sector. For this to be economically feasible, pathology companies will have to be able to analyse the sequence data efficiently. Gossamer is one of the pieces of technology that will enable this transition.
The development of Gossamer is important for two reasons:
First and foremost, it’s a useful tool that will help biomedical scientists increase their understanding of complex diseases.
A second, more abstract reason is that Gossamer shows theory is important. Without a large body of theoretical research to turn to, we would have been unlikely to discover the kinds of techniques used in Gossamer.
Gossamer is available for download at the NICTA website.