HPC in genomics

On 26 June 2000, a rough draft of the human genome was published and announced by US President Bill Clinton and UK Prime Minister Tony Blair. It took three more years, until April 2003, before an essentially complete human genome was published [1]. In May 2006 the sequence of the last chromosome was completed and published. The numbers associated with the Human Genome Project are impressive: mapping the complete human genetic code took 13 years, about 3,000 million dollars, and an international collaboration among US, UK, Japanese, French, German and Chinese universities.

From the high-performance computing (HPC) point of view, the task was equally challenging. The human genetic code consists of about 25,000 genes and around 3,000 million nucleotides. The raw data produced exceeded terabytes of disk space (in 1997, in the middle of the project, a commodity hard drive held about 1 GB). Nowadays, the raw data produced by a sequencing machine for an entire human genome takes about 30 TB of disk space. After processing and assembly it can be squeezed down to about 90 GB, which holds the sequence together with statistics about the accuracy of the process. If you store just the sequence of A, C, G and T letters, without any information about the accuracy (which is of course not 100% [1]), the file is around 1.5 GB.
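To make the storage arithmetic concrete, here is a minimal sketch in Python of how a plain sequence of A, C, G and T letters could be packed at 2 bits per letter, and of the file sizes that follow. The 2-bit encoding, the helper names (`pack`, `unpack`, `plain_size_gb`) and the diploid count of 6,000 million letters are assumptions made for illustration; the text does not say how the 1.5 GB figure was obtained.

```python
# Illustrative sketch only: the 2-bit encoding and all names are assumptions,
# not the storage format used by any particular sequencing pipeline.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODING = "ACGT"


def pack(sequence: str) -> bytes:
    """Pack a string of A/C/G/T letters into bytes, four letters per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        packed[i // 4] |= ENCODING[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` letters from a packed sequence."""
    return "".join(
        DECODING[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


def plain_size_gb(n_bases: int, bits_per_base: int) -> float:
    """Size in GB (10**9 bytes) of a sequence stored without accuracy data."""
    return n_bases * bits_per_base / 8 / 1e9


HAPLOID_BASES = 3_000_000_000      # "around 3,000 million nucleotides"
DIPLOID_BASES = 2 * HAPLOID_BASES  # assumption: both copies of each chromosome

if __name__ == "__main__":
    print(plain_size_gb(HAPLOID_BASES, 8))  # one text character per letter
    print(plain_size_gb(HAPLOID_BASES, 2))  # 2 bits per letter, one copy
    print(plain_size_gb(DIPLOID_BASES, 2))  # 2 bits per letter, both copies
```

Under these assumptions, one text character per letter gives about 3 GB for a single copy of the genome, while 2 bits per letter gives about 0.75 GB for one copy and about 1.5 GB for both copies, which is one way to arrive at a figure of the order quoted above.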

Storage is not the only requirement, however: substantial computing power is needed both for this squeezing and, later, to process, analyze and classify the data. The data must be transformed into information.

Sequencing technologies have come a long way in a few years. For example, scientists at the Wellcome Trust’s Sanger Institute in Cambridge can today sequence a complete human genome in 13 hours at a cost of 10,000 dollars. This advance in sequencing has driven a shift from the analysis of the genome in general to the genetic analysis of the individual. Along these lines, the Wellcome Trust has launched the UK10K project, which intends to map the genetic code of 4,000 healthy people and 6,000 people with different diseases, so that researchers can try to find genetic variations associated with particular diseases.

Projects of this kind produce terabytes of data every week, and simply storing it is a challenge for HPC. As mentioned above, storage is not the whole story: mining the data is a computationally very intensive task that requires state-of-the-art computing. The last piece is the design of software that displays all the information obtained in a way that is intuitive for the researcher.

It seems that sequencing technologies will be able to push the price of sequencing a complete human genetic code below 1,000 dollars, the threshold below which sequencing will make clinical applications possible, and HPC will be involved in the process. So far, the knowledge and applications derived from the Human Genome Project have fallen short of what was expected in the short term, but it is hoped that next-generation sequencing technology will lead to a better understanding of the genome [2].

[1] Human Genome Project.

[2] How much data is a human genome? It depends how you store it.

[3] Hall, Stephen S. Revolución aplazada. Investigación y Ciencia, December 2010.

Human Genome Project at Wikipedia