Why is gene annotation in higher organisms challenging?

Gene annotation is challenging because in higher organisms, genes are fragmented, unlike the continuous genes found in bacteria. This fragmentation makes it difficult to identify gene boundaries accurately, and even automated algorithms produce significant uncertainty.

How is Ensembl funded and what is its data policy?

Ensembl is funded by the Wellcome Trust and adheres to a policy of complete openness: all of its data and code are freely and publicly available, with the code released under a BSD license.

What is the significance of the exponential growth in sequencing data?

The exponential growth in sequencing data, doubling every 11-13 months, means an unprecedented amount of genomic information is becoming available. This growth is driven by new technologies and is expected to revolutionize human health research and development.

What is the DAS (Distributed Annotation System) and how does it differ from other data access models?

DAS is a system that allows different data providers to serve their data via synchronized coordinate systems, enabling viewers to integrate information on the fly. It contrasts with centralized models by being distributed, allowing for greater flexibility and user control over data integration.

What are the main approaches to protein structure prediction?

Approaches range from comparative modeling, which infers structure based on known related sequences, to pure physics simulations. Fragment-based assembly is proving practical, offering a balance between accuracy and computational cost.

How does Ensembl handle continuous updates and schema changes?

Ensembl has a rigorous release cycle, updating its data and schema every two months. They manage changes through patch files and a culture of continuous improvement, allowing external users to update their own systems accordingly.

Can protein structures be used to annotate genomes using DAS?

Yes, if protein structures can be related to standard protein sequences like UniProt, then existing annotations against those sequences can be displayed on models of the protein structures using DAS.
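As a rough illustration of the idea in this answer, the sketch below projects sequence-level annotations onto structure residues once a mapping between the two numbering schemes is known. The function name, the mapping, and the annotation labels are all hypothetical, and this is not how DAS itself encodes the data; it only shows the coordinate projection step under those assumptions.

```python
def project_annotations(annotations, seq_to_structure):
    """Project sequence annotations onto structure residue numbers.

    annotations: list of (start, end, label) in sequence coordinates
                 (1-based, inclusive).
    seq_to_structure: dict mapping a sequence position to a structure
                      residue number; positions not resolved in the
                      structure are simply absent.
    """
    projected = []
    for start, end, label in annotations:
        residues = [seq_to_structure[pos]
                    for pos in range(start, end + 1)
                    if pos in seq_to_structure]
        if residues:  # drop annotations that fall entirely outside the structure
            projected.append((label, residues))
    return projected


# Hypothetical example: sequence positions 3..8 are resolved in the
# structure and numbered 1..6 there.
mapping = {seq_pos: seq_pos - 2 for seq_pos in range(3, 9)}
annotations = [(1, 4, "signal"), (5, 6, "active_site"), (10, 12, "tail")]

result = project_annotations(annotations, mapping)
print(result)
```

Annotations that only partly overlap the resolved region are trimmed to the residues that exist in the structure, and annotations with no overlap are dropped.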

Keeping Up With The Human Genome

Google Talks

Education, 38 min video

Aug 22, 2012

TL;DR

Genomic data is growing rapidly in volume and complexity, necessitating new infrastructure such as the DAS protocol to integrate data from many sources, because traditional centralized methods struggle to keep pace.

Key Insights

The Ensembl project started from scratch to handle the human genome data, storing it in an RDBMS and providing an API for web and programmatic access.

The human genome sequence is estimated to be around 3 gigabases, with Ensembl currently annotating about one-third of it.

The Ensembl database contains approximately 30 different genomes, ranging from yeast to large mammalian genomes.

The size of assembled genome sequences is growing with a 13-month doubling time, while the archive for raw, unassembled sequence data doubles every 11 months and is currently 35 terabytes.

New sequencing technologies can produce 100-300x more data per machine, with costs reduced by a factor of 10, aiming for $1,000 per genome with higher quality.

The DAS (Distributed Annotation System) protocol is presented as a solution for data integration, allowing users to integrate data from multiple distributed servers on the fly rather than relying on a central repository.

The challenge of annotating and accessing the human genome

The human genome sequence, first produced around 2000, presented a significant scaling challenge for the young field of bioinformatics. Being 30 times larger than previous genomes, it required new systems to store, analyze, and provide access to the immense amount of data. The Ensembl project was initiated to address this by developing an RDBMS for data storage, a pipeline for pre-computed analyses, and an API for both web-based and programmatic access. The Wellcome Trust Sanger Institute, with which the speaker is affiliated, sequenced one-third of the human genome as part of a public international partnership. The institute is a large center with a substantial computer facility and is funded by the Wellcome Trust, which gives a sense of the scale of 'big science' involved. Ensembl has expanded beyond the human genome to include around 30 genomes, most of them large, ranging down to small reference genomes such as yeast. The core of Ensembl's work involves organizing sequence data, which is delivered in millions of pieces, into a usable coordinate system, analogous to Google Maps but in one dimension. Over 80 different types of information, including genes, are layered on top of this sequence data. Gene annotation is a particularly difficult challenge because genes in higher organisms are fragmented.
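The one-dimensional coordinate system described above can be sketched as a set of features, each placed on a chromosome by a start and end position, with a query that returns everything overlapping a region. This is a minimal sketch, not Ensembl's API: the Feature class, the overlapping function, and the example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # inclusive
    kind: str   # layer the feature belongs to, e.g. "gene" or "repeat"
    name: str


def overlapping(features, chromosome, start, end):
    """Return features on `chromosome` overlapping [start, end], sorted by position."""
    hits = [f for f in features
            if f.chromosome == chromosome and f.start <= end and f.end >= start]
    return sorted(hits, key=lambda f: (f.start, f.end))


# Hypothetical features from two annotation layers.
features = [
    Feature("1", 100, 500, "gene", "geneA"),
    Feature("1", 450, 900, "repeat", "rep1"),
    Feature("1", 2000, 2500, "gene", "geneB"),
    Feature("2", 100, 500, "gene", "geneC"),
]

hits = overlapping(features, "1", 400, 1000)
print([f.name for f in hits])  # ['geneA', 'rep1']
```

A linear scan is enough for illustration; a real system holding millions of features would need an index on chromosome and position.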

Ensembl's infrastructure and data handling

Ensembl's database is built on MySQL with Perl and Java APIs, featuring layered objects for elements like genes. The system undergoes continuous improvement, with updates every two months, often involving schema changes to accommodate new data or refactor storage methods. The project is open-source, with code available under a BSD license and data dumps freely accessible. Development occurs in small sub-teams, and the release cycle has lengthened from one month to a more sustainable two months to balance development against updates. The project maintains a healthy 'paranoia' about its relevance and usage, monitoring API accesses and web page views, which is crucial for securing ongoing funding from organizations like the Wellcome Trust.
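The summary elsewhere mentions that schema changes are distributed as patch files. One common way to make such patches safe to re-run is to record which ones have been applied, as sketched below. This is an illustration only: it uses Python's built-in sqlite3 instead of MySQL so that it runs self-contained, and the patch names, the applied_patch table, and the apply_patches function are hypothetical, not Ensembl's actual mechanism.

```python
import sqlite3

# Hypothetical ordered patches: (name, SQL statement).
PATCHES = [
    ("patch_001_add_gene",
     "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT)"),
    ("patch_002_add_biotype",
     "ALTER TABLE gene ADD COLUMN biotype TEXT"),
]


def apply_patches(conn, patches):
    """Apply patches not yet recorded in applied_patch; return their names."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_patch (name TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT name FROM applied_patch")}
    newly_applied = []
    for name, sql in patches:
        if name in done:
            continue  # already applied in an earlier release
        conn.execute(sql)
        conn.execute("INSERT INTO applied_patch (name) VALUES (?)", (name,))
        newly_applied.append(name)
    conn.commit()
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = apply_patches(conn, PATCHES)   # applies both patches
second_run = apply_patches(conn, PATCHES)  # nothing left to apply
columns = [row[1] for row in conn.execute("PRAGMA table_info(gene)")]
print(first_run, second_run, columns)
```

An external user who mirrors the database could run the same routine after each release and pick up only the patches that are new to their copy.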

Exponential growth in genomic data

The amount of genomic data is not static; it continues to grow exponentially. The assembled human genome sequence follows a rough 13-month doubling time, a trend that has persisted for a long time. Concurrently, a new archive for raw, unassembled sequence data is doubling every 11 months and has reached 35 terabytes, making it one of the larger Oracle databases. This relentless growth is comparable to Moore's Law in computing, highlighting that sequencing technology is just another form of information processing with seemingly unbounded potential. The speaker notes that the field has only scratched the surface of what could be sequenced in the natural world. The focus is shifting from sequencing a representative individual to collecting data across many individuals, driven by a revolution in sequencing technology. New machines can produce 100-300 times more data, and costs have already decreased tenfold, with further reductions anticipated. The goal is to reach $1,000 per genome with higher quality, a target that, while still a couple of orders of magnitude away, is within sight.
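To make the doubling times concrete, the sketch below converts them into yearly growth factors and estimates how long a given increase would take. The function names are hypothetical, and the projection for the 35-terabyte archive is an extrapolation that assumes the 11-month doubling time simply continues, which the talk does not guarantee.

```python
import math


def annual_growth_factor(doubling_months):
    """Factor by which the data grows in 12 months for a given doubling time."""
    return 2 ** (12 / doubling_months)


def months_to_grow(factor, doubling_months):
    """Months needed to grow by `factor` at a constant doubling time."""
    return doubling_months * math.log2(factor)


assembled = annual_growth_factor(13)  # assembled sequence: roughly 1.9x per year
raw = annual_growth_factor(11)        # raw archive: roughly 2.1x per year

# Extrapolation only: time for a 35 TB archive to reach 1000 TB
# if the 11-month doubling time were to hold.
months_to_petabyte = months_to_grow(1000 / 35, 11)

print(f"assembled: {assembled:.2f}x per year")
print(f"raw:       {raw:.2f}x per year")
print(f"35 TB to 1000 TB: about {months_to_petabyte:.0f} months")
```

The two-month difference in doubling time looks small, but it compounds: the raw archive grows by a little over 2x each year while the assembled sequence grows by a little under 2x.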

Future of human health research and data interpretation

The future of human health research will be increasingly reliant on this massive influx of genomic data. The process involves taking a reference genome, layering variation data, identifying genes, analyzing variations within those genes, and relating them to medicine. The ability to sequence an individual's genome is becoming achievable, allowing for integration with existing databases to understand an individual's genetic makeup. While complete interpretation is not yet possible, the understanding will grow as collective databases expand. This has practical implications, such as identifying adverse drug reactions, which are a significant cause of mortality. Ensembl is already integrating resequencing data and dealing with large databases, providing data mining interfaces and APIs for access, though download issues and the need for flexibility are acknowledged.

The Distributed Annotation System (DAS) for data integration

A major challenge is how to integrate data from various sources without creating data monopolies or overwhelming scientists. The DAS protocol, proposed by Lincoln Stein, offers a decentralized approach. Instead of users depositing data into a central system, DAS allows external contributors to serve their own data, which can then be integrated on the fly by clever viewers. This system uses common infrastructure to synchronize coordinate systems, ensuring that data from different providers can be combined seamlessly. This is a stark contrast to linking models where users navigate between different websites with varying interfaces. DAS promotes standardized servers and viewers that integrate data, giving users control over what information they display. This allows smaller groups to contribute their data without the overhead of establishing a competing central database, and it enables users to access and combine data from multiple sources transparently, even projecting gene variations onto protein structures. Currently, around 200 DAS servers are in the registry, supporting diverse coordinate systems and even enabling servers to build upon others.
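The integration model described above can be sketched as follows: independent sources each answer region queries, and a viewer merges the answers, but only from sources that share the requested coordinate system. This is a simplified in-memory sketch, not the DAS wire protocol; the Source and Annotation types, the make_source and integrate functions, and the coordinate system labels are all hypothetical.

```python
from typing import Callable, List, NamedTuple


class Annotation(NamedTuple):
    source: str
    start: int
    end: int
    label: str


class Source(NamedTuple):
    name: str
    coordinate_system: str  # hypothetical label, e.g. an assembly version
    fetch: Callable[[str, int, int], List[Annotation]]


def make_source(name, coordinate_system, records):
    """Build a source from {segment: [(start, end, label), ...]}."""
    def fetch(segment, start, end):
        return [Annotation(name, s, e, label)
                for s, e, label in records.get(segment, [])
                if s <= end and e >= start]
    return Source(name, coordinate_system, fetch)


def integrate(sources, coordinate_system, segment, start, end):
    """Merge annotations from every source that uses the given coordinate system."""
    merged = []
    for src in sources:
        if src.coordinate_system != coordinate_system:
            continue  # coordinates would not line up, so skip this source
        merged.extend(src.fetch(segment, start, end))
    return sorted(merged, key=lambda a: (a.start, a.end, a.source))


genes = make_source("lab_genes", "assembly_X",
                    {"chr1": [(100, 500, "geneA")]})
variants = make_source("lab_variants", "assembly_X",
                       {"chr1": [(250, 250, "variant1"), (5000, 5000, "variant2")]})
legacy = make_source("old_lab", "assembly_W",
                     {"chr1": [(120, 480, "legacy_gene")]})

view = integrate([genes, variants, legacy], "assembly_X", "chr1", 1, 1000)
print([a.label for a in view])  # ['geneA', 'variant1']
```

The user chooses which sources to pass to the viewer, which mirrors the point that control over what is displayed sits with the user and not with a central repository.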

Evolution of genomic interpretation and prediction

The field is moving towards a deeper understanding of genomics, aiming to interpret individual mutations. This will likely involve highly CPU-intensive computations, similar to approaches seen in protein structure prediction. In protein structure prediction, there is a spectrum from comparative modeling (inferring structure from known related sequences) to pure physics-based simulations. Intermediate and hybrid approaches, like fragment-based assembly popularized by the Baker group, are proving more practical. In genome annotation, ab initio gene prediction has limitations, while evidence-based methods can be automated, albeit with lower accuracy. Comparative genomics, initially thought to simplify gene identification, has revealed that even non-gene regions can be similar due to regulatory elements. The speaker suggests that understanding gene structures requires predicting motifs and signals, potentially involving significant computational power. CASP competitions highlight different strategies: some groups focus on evolutionary information with less CPU, while others use costly fragment-based methods that yield high accuracy, sometimes approaching crystallographic resolution.

Addressing data sharing and credit allocation

The open nature of the human genome project has been a driver for data sharing. However, the increasing volume and diversity of databases pose a risk of scientists getting lost. There is a need to find ways to split data and presentation to foster competition and identify optimal tools for visualization. Emerging open models, like the hypothetical 'Fister', suggest potential difficulties with large-scale decentralized projects. Science has always been cooperative, but the current data richness in biology demands better handling mechanisms. Practical issues include increasing integration and processing bandwidth. While the DAS protocol provides a framework for interoperability, challenges remain in evolving protocols for broader adoption. The NCBI, for instance, does not currently use DAS, though UC Santa Cruz has some DAS capabilities. The evolution of protocols is ongoing, with efforts to create more flexible specifications for handling metadata and distributed searches. The speaker acknowledges questions about how to ensure servers remain active and how to monitor usage, especially as governments evaluate the value of such resources for continued funding.

Common Questions

Ensembl is one of the leading genome browsers and provides access to genomic data, including gene annotations and comparative genomics information. It plays a crucial role in organizing and making vast amounts of genomic sequence data accessible to researchers worldwide.

Topics

Technology & Innovation, Science & Mathematics, Data Sharing, Distributed Systems, Comparative Genomics, Data Integration, Protein Structure Prediction, Genome Annotation

Mentioned in this video

Organizations

European Bioinformatics Institute

An institute located near the Sanger Institute, equivalent to NCBI, housing around 400-500 bioinformaticians.

Baker Group

A research group successful at the CASP competition, known for popularizing fragment-based assembly in protein structure prediction.

BioSapiens

An EU-funded project in Europe that adopted DAS as a mechanism for exchanging data among bioinformaticians working on protein sequences.

Wellcome Trust Sanger Institute

An institute that sequenced one-third of the human genome as part of a public international partnership and is funded by the Wellcome Trust.

Human Genome Project

The project that sequenced the human genome. The first genome sequences appeared in 1995, and the human genome followed around 2000.

UC Santa Cruz

A direct competitor to Ensembl, offering a simpler browser engine.

NCBI

The National Center for Biotechnology Information, mentioned as an equivalent to the European Bioinformatics Institute and a competitor in genome browsers.

Concepts

Human Chromosomes

Refers to the 22 autosomes and X and Y chromosomes, which vary in length and are used as a coordinate system for addressing genomic information.

Mouse Strains

Data from different mouse strains has been incorporated, allowing for projection of information between incomplete sequences.

Moore's Law

Used as an example of exponential growth in information processing and computers, paralleled by the growth in sequencing data.

Bacterial Genomes

Contrasted with higher organisms, bacterial genomes are described as continuous and made of a single unit, making gene identification simpler.

Legislation & Policy

BSD License

The license under which Ensembl's code is available, indicating its open-source nature.

Software & Apps

Oracle Database

Mentioned as the type of database used for Ensembl's raw unassembled sequence archive, which is currently 35 terabytes.

Google Maps

Used as an analogy for Ensembl's coordinate system: annotations are placed on top of the sequence much as they are on a map, but in one dimension rather than two.

MySQL

The database system on which Ensembl is based, utilizing Perl and Java APIs.

UniProt

Standard protein sequences used as a reference for annotation, which can be linked to protein structures.

Ensembl

One of the major genome browsers providing access to genomic data, now containing around 30 genomes.

People

Timothy Hubbard

A lead on the Ensembl project, responsible for annotating one-third of the human genome.

Lincoln Stein

Proposed the DAS protocol and received the first grant to set up client-server libraries for it.

Locations

Chimpanzee

Mentioned as a close relative in the family tree of genomes, illustrating genome relationships.
