Data overload

16 February 2010

Genomics, transcriptomics, metabolomics and of course immunomics. The suffix -omics is synonymous these days with large amounts of data. Data storage and processing have become a major issue in biology: in 2009 the Sanger Institute data storage centre contained 4 petabytes, or roughly 4 quadrillion bytes. The decreasing cost of genome sequencing and other techniques is only going to worsen this problem. As we generate increasing volumes of data each day, do we really know what we’re doing when it comes to its analysis and presentation? Do we have the tools that we need?

Science and its sister journals have, this week, produced focus issues which discuss this important subject (see here) and detail recently developed tools for those struggling to deal with their giga-, tera-, or petabytes of data. The articles also discuss the issues surrounding data transparency and sharing.

Science Signaling focuses on pathway analysis, featuring Pathline, a tool for visualising and analysing multiple data sets over time, particularly in functional genomics. Science talks about the use of ‘cloud computing’ in science, featuring Galaxy, a bioinformatics platform, and also the tools developed by the Microsoft Computational Science Group. Finally, Science Translational Medicine looks at data ownership in clinical trials (see here).

In the EU, the ELIXIR project, a major upgrade of the EMBL-EBI services, is well underway. A pan-European initiative, it will provide a stable platform for publicly accessible databases of biological information, as well as a network to help us survive the data deluge. However, the project isn't scheduled to be completed until 2016, so we must hope that our current systems can cope until then.

We have, it is clear, been caught off-guard when it comes to dealing with our data. Though budgets are squeezed, this is an area where we must invest resources before we are lost in the deluge.

