Key Moments
Genomic data is growing rapidly in volume and complexity, faster than traditional centralized methods can keep pace with, which calls for new infrastructure such as the DAS protocol to integrate data from many sources.
Key Insights
The Ensembl project started from scratch to handle the human genome data, storing it in an RDBMS and providing an API for web and programmatic access.
The human genome sequence is around 3 gigabases; the Wellcome Trust Sanger Institute, where Ensembl is developed, sequenced about one-third of it.
The Ensembl database contains approximately 30 different genomes, ranging from yeast to large mammalian genomes.
The size of assembled genome sequences is growing with a 13-month doubling time, while the archive for raw, unassembled sequence data doubles every 11 months and is currently 35 terabytes.
New sequencing technologies can produce 100-300x more data per machine, with costs reduced by a factor of 10, aiming for $1,000 per genome with higher quality.
The DAS (Distributed Annotation System) protocol is presented as a solution for data integration, allowing users to integrate data from multiple distributed servers on the fly rather than relying on a central repository.
The challenge of annotating and accessing the human genome
The human genome sequence, first available around 2000, presented a significant scaling challenge for the young field of bioinformatics. At 30 times the size of previously sequenced genomes, it required new systems to store, analyze, and provide access to the data. The Ensembl project was started to address this by developing an RDBMS for data storage, a pipeline for pre-computed analyses, and an API for both web-based and programmatic access. The Wellcome Trust Sanger Institute, where the speaker works, sequenced one-third of the human genome as part of a public international partnership. The institute is a large center with a substantial computing facility, funded by the Wellcome Trust, which gives a sense of the 'big science' scale involved. Ensembl has since expanded beyond the human genome to around 30 genomes, ranging from large mammalian genomes down to reference organisms such as yeast. The core of Ensembl's work is organizing sequence data, which arrives in millions of pieces, into a usable coordinate system, analogous to Google Maps but in one dimension. More than 80 types of information, including genes, are layered on top of this sequence. Gene annotation is a particularly hard problem because gene structures in higher organisms are fragmented.
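The idea of placing sequenced pieces onto a single one-dimensional coordinate system can be sketched as follows. This is a minimal illustration, not Ensembl's actual data model: the `Placement` class, its fields, and the piece names are hypothetical, and positions are assumed to be 1-based and inclusive.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one sequenced piece sits on an assembled chromosome (hypothetical model)."""
    piece: str
    chrom: str
    chrom_start: int  # chromosome coordinate of the leftmost base covered by the piece
    length: int       # number of bases in the piece
    strand: int       # +1 if placed forward, -1 if placed reverse-complemented


def to_chromosome(p: Placement, local_pos: int) -> tuple[str, int]:
    """Convert a 1-based position inside a piece to a chromosome coordinate."""
    if not 1 <= local_pos <= p.length:
        raise ValueError(f"position {local_pos} is outside piece {p.piece}")
    if p.strand == 1:
        return p.chrom, p.chrom_start + local_pos - 1
    # On the reverse strand, the piece's first base maps to its rightmost chromosome base.
    return p.chrom, p.chrom_start + p.length - local_pos


# Two illustrative pieces laid end to end on chromosome "1".
placements = {
    "pieceA": Placement("pieceA", "1", chrom_start=1, length=1000, strand=1),
    "pieceB": Placement("pieceB", "1", chrom_start=1001, length=500, strand=-1),
}

print(to_chromosome(placements["pieceA"], 10))  # ('1', 10)
print(to_chromosome(placements["pieceB"], 1))   # ('1', 1500)
```

Once every piece has a placement, any annotation computed on a piece can be reported in chromosome coordinates, which is what lets different kinds of information be layered on the same axis.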
Ensembl's infrastructure and data handling
Ensembl's database is built on MySQL with Perl and Java APIs that expose layered objects for elements such as genes. The system is improved continuously, with a release every two months that often includes schema changes to accommodate new data or to refactor how existing data is stored. The project is open source: the code is available under a BSD license and data dumps are freely accessible. Development happens in small sub-teams, and the release cycle has moved from one month to a more sustainable two months to balance development against updates. The project maintains a healthy 'paranoia' about its relevance and usage, monitoring API accesses and web page views, which matters for securing ongoing funding from organizations such as the Wellcome Trust.
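To make the idea of genes stored relationally against a coordinate system concrete, here is a small sketch. It is an assumption-laden illustration rather than the Ensembl schema: the table and column names are simplified and hypothetical, the data is invented, and SQLite stands in for MySQL so the example runs without a server.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seq_region (
    seq_region_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,      -- e.g. a chromosome name
    length        INTEGER NOT NULL
);
CREATE TABLE gene (
    gene_id           INTEGER PRIMARY KEY,
    seq_region_id     INTEGER NOT NULL REFERENCES seq_region(seq_region_id),
    seq_region_start  INTEGER NOT NULL,  -- 1-based, inclusive
    seq_region_end    INTEGER NOT NULL,
    seq_region_strand INTEGER NOT NULL,  -- +1 or -1
    label             TEXT
);
""")

# Invented example rows.
conn.execute("INSERT INTO seq_region VALUES (1, '1', 5000)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 100, 900, 1, "geneA"),
        (2, 1, 1200, 2500, -1, "geneB"),
        (3, 1, 4000, 4800, 1, "geneC"),
    ],
)


def genes_in_region(conn, region_name, start, end):
    """Return (label, start, end) for genes overlapping [start, end] on a region."""
    return conn.execute(
        """
        SELECT g.label, g.seq_region_start, g.seq_region_end
        FROM gene AS g JOIN seq_region AS s USING (seq_region_id)
        WHERE s.name = ?
          AND g.seq_region_start <= ?   -- gene starts before the window ends
          AND g.seq_region_end   >= ?   -- gene ends after the window starts
        ORDER BY g.seq_region_start
        """,
        (region_name, end, start),
    ).fetchall()


print(genes_in_region(conn, "1", 800, 1300))
```

An API layer would typically wrap queries like this in objects so that callers ask for "genes on this slice" without seeing the SQL, which also insulates them from schema changes between releases.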
Exponential growth in genomic data
The amount of genomic data continues to grow exponentially. Assembled genome sequence has followed a roughly 13-month doubling time for many years. A newer archive of raw, unassembled sequence data is doubling every 11 months and has reached 35 terabytes, making it one of the larger Oracle databases. This growth is comparable to Moore's Law in computing, and the speaker treats sequencing technology as another form of information processing with seemingly unbounded potential. He notes that the field has only scratched the surface of what could be sequenced in the natural world. The focus is shifting from sequencing one representative individual to collecting data across many individuals, driven by a revolution in sequencing technology. New machines can produce 100 to 300 times more data, and costs have already fallen tenfold, with further reductions expected. The goal is $1,000 per genome at higher quality, a target that is still a couple of orders of magnitude away but within sight.
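The doubling times quoted above translate into yearly growth factors with a little arithmetic. The sketch below assumes clean exponential growth, which is an idealization; the function names are illustrative.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which a quantity grows in 12 months, given its doubling time."""
    return 2.0 ** (12.0 / doubling_months)


def projected_size(current: float, doubling_months: float, months_ahead: float) -> float:
    """Size after `months_ahead` months of steady exponential growth."""
    return current * 2.0 ** (months_ahead / doubling_months)


# Assembled sequence: 13-month doubling; raw archive: 11-month doubling.
print(round(annual_growth_factor(13), 2))
print(round(annual_growth_factor(11), 2))

# A 35 TB archive doubling every 11 months, projected 22 months (two doublings) ahead.
print(projected_size(35.0, 11, 22))  # 140.0
```

Under this assumption the assembled data grows by roughly 1.9x per year and the raw archive by roughly 2.1x per year, so the raw archive pulls further ahead of the assembled data each year.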
Future of human health research and data interpretation
The future of human health research will be increasingly reliant on this massive influx of genomic data. The process involves taking a reference genome, layering variation data, identifying genes, analyzing variations within those genes, and relating them to medicine. The ability to sequence an individual's genome is becoming achievable, allowing for integration with existing databases to understand an individual's genetic makeup. While complete interpretation is not yet possible, the understanding will grow as collective databases expand. This has practical implications, such as identifying adverse drug reactions, which are a significant cause of mortality. Ensembl is already integrating resequencing data and dealing with large databases, providing data mining interfaces and APIs for access, though download issues and the need for flexibility are acknowledged.
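The workflow described here (take a reference, layer variation on it, and look at which variants fall inside genes) can be sketched in a few lines. Everything in this example is hypothetical: the reference string, gene names, and variants are invented, and real pipelines handle strands, transcripts, and far larger data.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int


class Variant(NamedTuple):
    pos: int    # 1-based position of the first reference base affected
    ref: str    # allele expected in the reference
    alt: str    # allele observed in the individual


def variants_by_gene(reference: str, genes: list[Gene], variants: list[Variant]) -> dict:
    """Group variants by the gene(s) whose interval contains their position."""
    result = {g.name: [] for g in genes}
    for v in variants:
        # Sanity check: the stated reference allele must match the reference sequence.
        if reference[v.pos - 1 : v.pos - 1 + len(v.ref)] != v.ref:
            raise ValueError(f"reference mismatch at position {v.pos}")
        for g in genes:
            if g.start <= v.pos <= g.end:
                result[g.name].append(v)
    return result


reference = "ACGTACGTACGTACGTACGT"  # invented 20-base reference
genes = [Gene("geneA", 3, 8), Gene("geneB", 12, 18)]
variants = [Variant(5, "A", "G"), Variant(10, "C", "T"), Variant(15, "G", "A")]

hits = variants_by_gene(reference, genes, variants)
print(hits["geneA"])  # [Variant(pos=5, ref='A', alt='G')]
print(hits["geneB"])  # [Variant(pos=15, ref='G', alt='A')]
```

The variant at position 10 falls between the two genes and is therefore not reported under either; relating the remaining variants to medicine is the interpretation step that the summary says is still incomplete.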
The Distributed Annotation System (DAS) for data integration
A major challenge is how to integrate data from various sources without creating data monopolies or overwhelming scientists. The DAS protocol, proposed by Lincoln Stein, offers a decentralized approach. Instead of users depositing data into a central system, DAS allows external contributors to serve their own data, which can then be integrated on the fly by clever viewers. This system uses common infrastructure to synchronize coordinate systems, ensuring that data from different providers can be combined seamlessly. This is a stark contrast to linking models where users navigate between different websites with varying interfaces. DAS promotes standardized servers and viewers that integrate data, giving users control over what information they display. This allows smaller groups to contribute their data without the overhead of establishing a competing central database, and it enables users to access and combine data from multiple sources transparently, even projecting gene variations onto protein structures. Currently, around 200 DAS servers are in the registry, supporting diverse coordinate systems and even enabling servers to build upon others.
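The client-side integration that DAS enables can be sketched as below. This is a hedged illustration, not a conforming DAS client: the XML is a trimmed-down form modeled on the DAS feature response, the URL pattern reflects an assumption about the DAS 1.x `features` command, the server names are hypothetical, and the responses are inline strings so the example runs without a network.

```python
import xml.etree.ElementTree as ET

# Simplified responses from two hypothetical servers for the same segment.
RESPONSE_GENES = """<DASGFF><GFF><SEGMENT id="1" start="1000" stop="2000">
  <FEATURE id="gene1" label="Example gene">
    <TYPE id="gene">gene</TYPE><START>1100</START><END>1800</END>
  </FEATURE>
</SEGMENT></GFF></DASGFF>"""

RESPONSE_VARIANTS = """<DASGFF><GFF><SEGMENT id="1" start="1000" stop="2000">
  <FEATURE id="snp1" label="Example SNP">
    <TYPE id="variation">variation</TYPE><START>1500</START><END>1500</END>
  </FEATURE>
  <FEATURE id="snp2" label="Another SNP">
    <TYPE id="variation">variation</TYPE><START>1050</START><END>1050</END>
  </FEATURE>
</SEGMENT></GFF></DASGFF>"""


def features_url(base: str, source: str, segment: str, start: int, stop: int) -> str:
    """Build a feature request URL (pattern assumed from the DAS 1.x conventions)."""
    return f"{base}/das/{source}/features?segment={segment}:{start},{stop}"


def parse_features(xml_text: str, provider: str) -> list:
    """Extract features from one response, tagging each with its provider."""
    features = []
    for seg in ET.fromstring(xml_text).iter("SEGMENT"):
        for feat in seg.iter("FEATURE"):
            features.append({
                "provider": provider,
                "segment": seg.get("id"),
                "id": feat.get("id"),
                "type": feat.findtext("TYPE"),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
            })
    return features


def integrate(responses: dict) -> list:
    """Merge features from several providers onto the shared coordinate system."""
    merged = []
    for provider, xml_text in responses.items():
        merged.extend(parse_features(xml_text, provider))
    return sorted(merged, key=lambda f: (f["segment"], f["start"], f["end"]))


merged = integrate({"genes-server": RESPONSE_GENES, "variation-server": RESPONSE_VARIANTS})
for f in merged:
    print(f["segment"], f["start"], f["end"], f["type"], f["id"], f["provider"])
```

The merge only works because both providers report positions against the same reference coordinates, which is the role of the shared infrastructure described above; the viewer, not a central database, decides which providers to combine.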
Evolution of genomic interpretation and prediction
The field is moving toward a deeper understanding of genomics, with the aim of interpreting individual mutations. This will likely involve highly CPU-intensive computation, similar to approaches used in protein structure prediction. There, methods span a spectrum from comparative modeling (inferring structure from known related sequences) to pure physics-based simulation. Intermediate and hybrid approaches, such as the fragment-based assembly popularized by the Baker group, are proving more practical. In genome annotation, ab initio gene prediction has limitations, while evidence-based methods can be automated, albeit with lower accuracy. Comparative genomics was initially expected to simplify gene identification, but it has shown that non-gene regions can also be conserved because they contain regulatory elements. The speaker suggests that understanding gene structures requires predicting motifs and signals, which may demand significant computational power. The CASP competitions highlight different strategies: some groups rely on evolutionary information and use less CPU, while others use costly fragment-based methods that yield high accuracy, sometimes approaching crystallographic resolution.
Addressing data sharing and credit allocation
The open nature of the human genome project has been a driver for data sharing. However, the increasing volume and diversity of databases pose a risk of scientists getting lost. There is a need to find ways to split data and presentation to foster competition and identify optimal tools for visualization. Emerging open models, like the hypothetical 'Fister', suggest potential difficulties with large-scale decentralized projects. Science has always been cooperative, but the current data richness in biology demands better handling mechanisms. Practical issues include increasing integration and processing bandwidth. While the DAS protocol provides a framework for interoperability, challenges remain in evolving protocols for broader adoption. The NCBI, for instance, does not currently use DAS, though UC Santa Cruz has some DAS capabilities. The evolution of protocols is ongoing, with efforts to create more flexible specifications for handling metadata and distributed searches. The speaker acknowledges questions about how to ensure servers remain active and how to monitor usage, especially as governments evaluate the value of such resources for continued funding.
Common Questions
Ensembl is one of the leading genome browsers and provides access to genomic data, including gene annotations and comparative genomics information. It plays a crucial role in organizing and making vast amounts of genomic sequence data accessible to researchers worldwide.
Mentioned in this video
An institute located near the Sanger Institute, equivalent to NCBI, housing around 400-500 bioinformaticians.
A research group successful at the CASP competition, known for popularizing fragment-based assembly in protein structure prediction.
An EU-funded project in Europe that adopted DAS as a mechanism for exchanging data among bioinformaticians working on protein sequences.
An institute that sequenced one-third of the human genome as part of a public international partnership and is funded by the Wellcome Trust.
The project that sequenced the human genome, with the first sequences appearing in 1995 and the human genome around 2000.
A direct competitor to Ensembl, offering a simpler browser engine.
The National Center for Biotechnology Information, mentioned as an equivalent to the European Bioinformatics Institute and a competitor in genome browsers.
Refers to the 22 autosomes and X and Y chromosomes, which vary in length and are used as a coordinate system for addressing genomic information.
Data from different mouse strains has been incorporated, allowing for projection of information between incomplete sequences.
Used as an example of exponential growth in information processing and computers, paralleled by the growth in sequencing data.
Contrasted with higher organisms, bacterial genomes are described as continuous and made of a single unit, making gene identification simpler.
Mentioned as the type of database used for Ensembl's raw unassembled sequence archive, which is currently 35 terabytes.
Used as an analogy for Ensembl's 1D coordinate system, contrasting with Google Maps' 2D system for placing annotations on top of sequences.
The database system on which Ensembl is based, utilizing Perl and Java APIs.
Standard protein sequences used as a reference for annotation, which can be linked to protein structures.
One of the major genome browsers providing access to genomic data, now containing around 30 genomes.