ABSTRACT
The volume of bioinformatics data available on the Web for molecular biology is constantly growing. Accessing these data and using them jointly is essential for new discoveries in biology. This article gives the reader the pointers needed to identify reference bioinformatics databases for molecular biology, explains the problems raised by the joint use of these distributed and highly heterogeneous data, surveys the systems that offer unified data access, and guides users in choosing a system that meets their needs.
AUTHORS
- Sarah COHEN-BOULAKIA: Senior Lecturer HDR - Doctorate from Université Paris Sud - Inria, Institute of Computational Biology, Montpellier, France - Laboratoire de recherche en informatique, CNRS UMR 8623 Université Paris Sud, Orsay, France
- Patrick VALDURIEZ: Research Director - Doctorate from Paris 6 University - Inria, LIRMM, Institute of Computational Biology, Montpellier, France
INTRODUCTION
Molecular biology is a discipline that studies the mechanisms of living organisms at the molecular level: understanding the mechanisms governing cell activity, determining the functional role of a group of proteins or identifying a set of genes involved in a disease. Advances in knowledge of molecular biology are closely linked to progress in multiple fields: biology, chemistry, physics, electronics, mathematics and computer science.
Since the early 1990s, new technologies have emerged, such as high-throughput analysis techniques. These technologies generate an extremely large amount of data. In this context, the size of a genome is the quantity of DNA contained in one copy of the genome, measured in nucleotides (usually expressed in megabases, one megabase being one million nucleotides). In 2015, sequencing techniques enabled a single machine to sequence 200 human genomes in a week, at a cost of $0.03 per megabase, whereas the Human Genome Project took 12 years to sequence the first human genome, involving hundreds of laboratories and costing an estimated $10,000 per megabase.
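To make the per-megabase figures above more concrete, the short sketch below converts them into an approximate cost per human genome. The genome size of about 3,100 megabases is an assumption introduced here for illustration (it is not stated in this article), and the function and constant names are hypothetical. The results are back-of-the-envelope values derived only from the per-megabase costs quoted above.

```python
# Assumption (not from the article): one copy of the human genome
# is taken to be roughly 3,100 megabases.
MEGABASES_PER_HUMAN_GENOME = 3_100

COST_PER_MB_2015 = 0.03      # dollars per megabase, as quoted for 2015
COST_PER_MB_HGP = 10_000.0   # dollars per megabase, as quoted for the Human Genome Project


def cost_per_genome(cost_per_mb: float,
                    genome_mb: float = MEGABASES_PER_HUMAN_GENOME) -> float:
    """Return the sequencing cost of one genome, in dollars."""
    return cost_per_mb * genome_mb


if __name__ == "__main__":
    print(f"2015 machine: about ${cost_per_genome(COST_PER_MB_2015):,.0f} per genome")
    print(f"Human Genome Project rate: about ${cost_per_genome(COST_PER_MB_HGP):,.0f} per genome")
    print(f"Cost ratio per megabase: about {COST_PER_MB_HGP / COST_PER_MB_2015:,.0f}x")
```

Under these assumptions, the per-megabase cost falls by a factor of several hundred thousand between the two periods.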
Since the early 2010s, many laboratories have been equipped with this type of machine. As a result, between 2010 and 2015, the volume of sequencing data generated doubled every five months.
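The doubling rate above implies very rapid compound growth. As a minimal sketch, the function below computes the multiplicative growth of a volume that doubles at a fixed interval. Treating 2010 to 2015 as 60 months is an assumption made here, and the function name is hypothetical.

```python
def growth_factor(months: float, doubling_months: float = 5.0) -> float:
    """Multiplicative growth after `months` for a volume that doubles
    every `doubling_months` months."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Assumption: the 2010-2015 period is counted as 60 months,
    # i.e. 12 successive doublings.
    print(f"Growth over 60 months: x{growth_factor(60):.0f}")
```

With a doubling every five months, 60 months correspond to 12 doublings, that is, a volume multiplied by 2^12 = 4096.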
Moreover, the data generated in this way, referred to as "raw data", are not sufficient on their own to understand the various mechanisms of living organisms. They must be complemented by further analyses: not only conventional biological experiments, but also computational analyses, which in turn generate very large volumes of bioinformatics data.
All the raw data and the results of their analysis are stored in biological databases, available (more often than not) on the web. The number and content of these databases are growing considerably. These rapidly evolving databases are both distributed across the web and highly heterogeneous: each database has its own data format and structure, the data they contain reflect different areas of expertise, and the scientific terms used to describe the data often differ from one database to another. Nevertheless, they contain a wealth of information and are therefore highly complementary.
The ability to query, compare and reconcile bioinformatics data is essential to the advancement of knowledge in molecular biology. Exploiting this volume and diversity of distributed, highly heterogeneous and constantly evolving information is a real challenge.
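The kind of heterogeneity described above can be illustrated with a toy example. The sketch below takes two records that describe the same gene in two imaginary databases, with different field names, vocabularies and units, and maps them to a shared schema so that they can be compared. Everything in it (database names, field names, values, the target schema) is hypothetical and chosen only for illustration; it does not reproduce the format of any real database or the approach of any system surveyed in this article.

```python
# Hypothetical records for the same gene in two imaginary databases.
record_db_a = {"gene_symbol": "GENE1", "organism": "Homo sapiens", "seq_len_bp": 1200}
record_db_b = {"name": "gene1", "species": "H. sapiens", "length": "1.2 kb"}

# Per-source mapping from local field names to a shared (assumed) schema.
FIELD_MAPS = {
    "db_a": {"gene_symbol": "symbol", "organism": "species", "seq_len_bp": "length_bp"},
    "db_b": {"name": "symbol", "species": "species", "length": "length_bp"},
}

# Different terms used for the same organism.
SPECIES_SYNONYMS = {"h. sapiens": "Homo sapiens", "homo sapiens": "Homo sapiens"}

UNIT_SCALE = {"bp": 1, "kb": 1_000}


def parse_length_bp(value) -> int:
    """Convert a length given as an int (base pairs) or as '<number> <unit>'."""
    if isinstance(value, int):
        return value
    number, unit = value.split()
    return int(round(float(number) * UNIT_SCALE[unit.lower()]))


def normalize(record: dict, source: str) -> dict:
    """Rename fields and harmonize values according to the shared schema."""
    out = {FIELD_MAPS[source][field]: value for field, value in record.items()}
    out["symbol"] = out["symbol"].upper()
    out["species"] = SPECIES_SYNONYMS[out["species"].lower()]
    out["length_bp"] = parse_length_bp(out["length_bp"])
    return out


if __name__ == "__main__":
    a = normalize(record_db_a, "db_a")
    b = normalize(record_db_b, "db_b")
    print(a)
    print(b)
    print("Same entity after reconciliation:", a == b)
```

Even in this tiny case, reconciliation requires three separate mappings (structure, terminology and units), which hints at why doing it across many evolving databases is difficult.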
In this article, our aim is to provide an overview of the current state of the art in bioinformatics databases for molecular biology, and above all to offer guidance on how to choose the right solution...
KEYWORDS
Information retrieval | Public bioinformatics databases | Standards and systems for managing and querying bioinformatics data
This article is included in
Bioprocesses and bioproductions
Querying and managing bioinformatics data for molecular biology
Bibliography
Databases
Sites of the main databases cited in this document
DDBJ http://www.ddbj.nig.ac.jp/ (page consulted on January 20, 2015)
Ensembl http://www.ensembl.org/index.html (page consulted on January 20,...
Events
DILS: Data Integration in the Life Sciences
Annual international conference dedicated to the field of biological data integration
SSDBM: Statistical and Scientific Data Base Management
Annual international conference dedicated to data management in scientific and statistical databases
Standards and norms
Standards for representing provenance
PROV-Overview http://www.w3.org/TR/prov-overview/ (page consulted on January 20, 2015)
The XML (Extensible Markup Language) standard http://www.w3.org/XML/...