Drowning in the data or the calm before the genome sequence storm? #PAG

High Throughput DNA Sequencing at PAG

My abiding memory of the Plant and Animal Genome (#PAG) conference this year will be not so much the talks, the many individual discussions I have had, or the quantities of great food in southern California. It will be the huddles of people in corners earnestly discussing what to do with gigabases of short DNA sequence reads from their organism of choice. Perhaps reflecting this sequence deluge, I’m finding it unusually hard to note an ‘action’ from many talks. A surprising number stick rather rigidly to their published positions, or as Julian Catchen tweeted, “PIs cut and paste their grant reports into a PowerPoint talk, drone on like it’s a department meeting.” And despite an audience that is hugely connected and computer-literate, with a third of it on-line, the Twitter feed (#PAG) from nearly 3000 people is minimal, with few nuggets coming across.

Looking back at previous PAGs, integrating principles spanning all of plant biology have emerged, with each conference bringing a big idea that has changed my research or thinking. This is the 19th annual meeting, and over the years we have seen the similarity of genes over huge phylogenetic distances; functional genomics identifying the purpose of nearly every gene; new marker systems giving insight into the diversity of crops and their wild relatives; the ubiquity of whole-genome duplication or polyploidy events providing the basis for the evolution of plants; micro and small RNAs proving to be critical controls; the utility of genetic maps in understanding diversity; and universal genome browsers, databases and web tools to extract information. After the Science and Nature papers are out, these ideas go on to shape a substantial majority of the papers we publish in Annals of Botany, and I hope we will publish papers from several of the presentations made here. But unlike at previous meetings, this week I’m finding it hard to see what NEW areas we will be publishing from the results being presented. Indeed, sitting at the back, I find myself mentally composing ‘return-without-review’ letters: largely in line with expectation from other species … acknowledge the huge amount of work/data but can’t see new principles … essential that the work addresses important questions … I come to conferences to hear about work in progress or incomplete, so this is something I hope to see in a presentation, but I rather suspect this criticism will remain valid for some papers where the hard work has not been completed.

Of course, one also (primarily?) comes to a conference for the individual discussions, and these have been as exciting as ever, with plenty of new ideas discussed, thoughts of collaboration, and updates on what people are doing. With the bigger picture of sequence data, I hope we are seeing a pause before the real biology emerges from genome-wide sequencing – or even better, can someone convince me that still being convalescent from swine flu has turned me into an old cynic and I’m missing the paradigm shift.

  • I’d be very interested to know if there was any consensus (or at least ideas) generated about what to do with all of this data: before studies of the real biology can start, these data need to be shared somehow. As someone at the BGI, I appreciate that we are responsible for much of this data flood. Whilst it means our computational biologists will not be running out of work any time soon, there is obviously a bottleneck in disseminating the raw data and assembled genomes, one that is only going to get worse in this era of 10K animal and plant genome projects.

    Getting more of the raw data into the short read archives, rather than waiting for assembled genomes to be published, is one thing that can help the community — not least because publication may not be feasible for every single genome in a ‘K-genome’ project, especially if many of them eventually receive ‘return-without-review’ letters. Beyond that, do people feel there is a need for new mechanisms of data distribution and accreditation, such as data DOIs? Are there types of data currently handled poorly (or not at all) by the existing repositories that people would particularly like to share? At the BGI we are currently exploring ways of putting our large cloud-computing resources to better use here, and of working with organisations such as DataCite to issue data DOIs, but any other thoughts or ideas on this issue would be gratefully received (scott.edmunds@genomics.org.cn).

  • The article by Athel Cornish-Bowden, CNRS, in the current issue of The Biochemist concludes: “as long as biologists continue to think that studying systems means collecting huge amounts of data in the absence of a global view of the organism, they will not be progressing towards a real understanding of how organisms stay alive.” See ‘Systems Biology: how far has it come?’, http://www.bit.ly/ehEJgU

  • Interesting commentary in Nature about programs for the analysis of next-generation sequencing (NGS) data: “Sequencing DNA on an industrial scale is no longer difficult: the challenge is in assembling a full genome from the multitude of short, overlapping snippets that second-generation sequencing machines churn out.” …

    http://www.nature.com/news/2011/110323/full/471425a.html

    News, updates, and results from the Assemblathon project can be found at the Assemblathon website, http://assemblathon.org/. Details of Assemblathon 2 are being finalised, and it should start in May 2011.