Bioperl databases 1.5
How to use databases with Bioperl
This document describes how to use Bioperl with databases. Bioperl can work with a number of sequence formats and lets users convert easily between them. Bioperl can also index flat files in some of these standard formats, allowing very fast retrieval. You can also use Bioperl to retrieve sequences from remote databases via the Web or to build and query in-house relational databases (RDBs).
Scripts are provided to get you started with some of these approaches, for example bp_index.pl and bp_fetch.pl, which are described later in this document.
The core of the backend system is found in the following modules:
Bio::DB::BioSeqI is the abstract interface for the databases (hence the I). Bio::DB::GenBank and Bio::DB::GenPept are concrete implementations for network access to the GenBank and GenPept databases at NCBI, and others, using HTTP as a protocol. See the Bio::DB::GenBank manpage, the Bio::DB::GenPept manpage, and REMOTE SEQUENCE RETRIEVAL (Bio::DB::*) for more information.
The Index modules EMBL and Fasta, as they are designed as sequence databases, conform to the Bio::DB::BioSeqI interface, meaning they can be used wherever a Bio::DB::BioSeqI is expected.
Flat file indexing of Fasta files is also provided by Bio::DB::Fasta. Please see the Bio::DB::Fasta manpage for more information; this module has some useful features not contained in Bio::Index::Fasta.
Bioperl offers a number of modules for retrieving sequences over the network. The available remote databases include GenBank, GenPept, GDB, EMBL, SwissProt, XEMBL, and remote Ace servers. A typical method is
get_Seq_by_id($id)
which returns a Seq object, or
get_Stream_by_id($ref_to_array_of_ids)
which returns a SeqIO object. See the Bio::DB::GenBank manpage, the Bio::DB::GenPept manpage, the Bio::DB::GDB manpage, the Bio::DB::EMBL manpage, the Bio::DB::SwissProt manpage, the Bio::DB::XEMBL manpage, the Bio::DB::Ace manpage, and the Bio::SeqIO manpage for more information.
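The two retrieval methods above can be sketched as follows. This is a minimal example, assuming network access to NCBI is available; the accession numbers are illustrative placeholders and not part of the original text.

```perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;

# Accession numbers below are illustrative; substitute your own.
my $gb = Bio::DB::GenBank->new();

# get_Seq_by_id() returns a single Seq object
my $seq = $gb->get_Seq_by_id('J00522');
defined $seq or die "Could not retrieve J00522\n";
printf "%s is %d residues long\n", $seq->display_id, $seq->length;

# get_Stream_by_id() takes a reference to an array of IDs and returns
# a SeqIO stream, which is read one sequence at a time
my $stream = $gb->get_Stream_by_id(['J00522', 'AF303112']);
my $out    = Bio::SeqIO->new(-format => 'fasta', -fh => \*STDOUT);
while (my $entry = $stream->next_seq) {
    $out->write_seq($entry);
}
```

The stream form is usually preferable for more than a handful of IDs, since only one sequence object is held in memory at a time.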
If you want to use Bioperl indices of Fasta, EMBL/SwissProt .dat files, SwissPfam, GenBank, or Blast files, then the bp_fetch.PLS and bp_index.PLS scripts (installed as bp_fetch.pl and bp_index.pl) are great ways to start off; reading the scripts also shows you how to use the Bioperl indexing modules. The two scripts coordinate using two environment variables:
BIOPERL_INDEX - directory where the indices are kept
BIOPERL_INDEX_TYPE - type of DBM file to use for the index
The basic way of indexing a database, once BIOPERL_INDEX has been set up, is to go
bp_index.pl <index-name> <filenames as full path>
e.g., for Fasta files
bp_index.pl est /nfs/somewhere/fastafiles/est*.fa
Or, for EMBL/SwissProt files
bp_index.pl -fmt=EMBL swiss /nfs/somewhere/swiss/swissprot.dat
To retrieve sequences from the index go
bp_fetch.pl <index-name>:<id>
e.g.,
bp_fetch.pl est:AA01234
or
bp_fetch.pl swiss:VAV_HUMAN
bp_fetch.pl also has other options to connect to GenBank across the network.
To set up indexing from scratch, first create a directory to hold the indices (any directory will do):
mkdir /nfs/datadisk/bioperlindex/
Then put
setenv BIOPERL_INDEX /nfs/datadisk/bioperlindex/
setenv BIOPERL_INDEX_TYPE DB_File
in .cshrc or .tcshrc (or set and export the same variables in .bashrc if you use bash). Another possible BIOPERL_INDEX_TYPE is SDBM_File, which should come with a standard Perl installation.
Next, build an index:
bp_index.pl swissprot /nfs/datadisk/swiss/swissprot.dat
etc. You are now ready to use bp_fetch.pl. See the Bio::Index::Fasta manpage, the Bio::Index::GenBank manpage, the Bio::Index::Blast manpage, the Bio::Index::EMBL manpage, the Bio::Index::SwissPfam manpage, and the Bio::Index::Swissprot manpage for more.
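The same indexing and retrieval can be done directly from Perl with Bio::Index::Fasta. The following is a minimal sketch, not the code of the scripts themselves; it writes a tiny made-up Fasta file to a temporary directory so that it is self-contained, and the file names and sequence data are assumptions for illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;
use Bio::Index::Fasta;

# Write a tiny two-record Fasta file to index (illustrative data only)
my $dir   = tempdir(CLEANUP => 1);
my $fasta = File::Spec->catfile($dir, 'est.fa');
open my $fh, '>', $fasta or die "Cannot write $fasta: $!";
print $fh ">AA01234 example EST\nACGTACGTAC\n",
          ">AA05678 another EST\nGGGTTTAAAC\n";
close $fh;

# Build the index; -write_flag => 1 opens the index file for writing
my $index_file = File::Spec->catfile($dir, 'est.idx');
my $inx = Bio::Index::Fasta->new(-filename   => $index_file,
                                 -write_flag => 1);
$inx->make_index($fasta);

# Retrieve a record by ID; fetch() returns a sequence object
my $seq = $inx->fetch('AA01234');
print $seq->display_id, "\t", $seq->seq, "\n";
```

In real use the index file would live in a permanent directory (for example the one named by BIOPERL_INDEX) so that it can be reopened later without rebuilding.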
Flat file indexing of Fasta files is also provided by Bio::DB::Fasta. Please see the Bio::DB::Fasta manpage for more information; this module provides some functionality absent from Bio::Index::Fasta.
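One such feature of Bio::DB::Fasta is fetching a subsequence without loading the whole record. A minimal sketch follows, again using a small made-up Fasta file in a temporary directory; the file name and sequence data are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;
use Bio::DB::Fasta;

# Create a small Fasta file (illustrative data only)
my $dir   = tempdir(CLEANUP => 1);
my $fasta = File::Spec->catfile($dir, 'est.fa');
open my $fh, '>', $fasta or die "Cannot write $fasta: $!";
print $fh ">AA01234 example EST\nACGTACGTAC\n";
close $fh;

# Opening the file builds an index next to it on first use
my $db = Bio::DB::Fasta->new($fasta);

# Fetch residues 3 to 6 (1-based, inclusive) as a plain string
my $sub = $db->seq('AA01234', 3 => 6);
print "Subsequence: $sub\n";

# Or fetch a lightweight sequence object for the whole entry
my $entry = $db->get_Seq_by_id('AA01234');
print "Length: ", $entry->length, "\n";
```

This is most useful for large sequences such as chromosomes, where reading only the requested range avoids holding the full sequence in memory.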
The bioperl-db package works in conjunction with the BioSQL relational schema (http://obda.open-bio.org/). The most recent version of the bioperl-db package can be obtained at http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-db/?cvsroot=bioperl
The bioperl-db and BioSQL packages integrate neatly with Bioperl's objects and contain tables for sequences, entries, sequence features, taxonomic information, references, keywords, and more. Using a database adaptor factory, one can create "persistent objects" from Bioperl objects. Persistent objects know how to store, update, and remove themselves in the database. For example,
use Bio::SeqIO;
use Bio::DB::BioDB;

# create the database-specific adaptor factory
my $db = Bio::DB::BioDB->new(-database  => 'biosql',
                             # user, pwd, driver, host ...
                             -dbcontext => $dbc);

# open stream of objects parsed from flatfile
my $stream = Bio::SeqIO->new(-fh     => \*STDIN,
                             -format => 'genbank');

while (my $seq = $stream->next_seq()) {
    # convert to persistent object
    my $pseq = $db->create_persistent($seq);
    # $pseq now implements Bio::DB::PersistentObjectI
    # in addition to what $seq implemented before
    # insert into datastore
    $pseq->create();
}
There was a presentation at BOSC03 in Brisbane on bioperl-db, including how to code several use cases. You may want to consult the slides at http://www.open-bio.org/bosc2003/slides/Persistent_Bioperl_BOSC03.pdf.
Similarly, one can query the database with IDs or using query objects. See Bio::DB::Query::BioQuery and Bio::DB::Query::QueryConstraint for examples.
With bioperl-db and BioSQL installed you will also be able to load sequence data, Gene Ontology (GO) data, and NCBI taxonomic data into your own RDB.
BioSQL is a joint project of the OpenBio efforts: bioperl, biopython, biojava, and bioruby. BioSQL creates a single relational schema that's accessible using any of these packages. To use BioSQL with Perl you'll need Bioperl, version 1.2 or later, bioperl-db from www.bioperl.org, and BioSQL. The BioSQL package can be obtained at www.open-bio.org. BioSQL currently supports the MySQL, PostgreSQL, and Oracle database servers. Consult the BioSQL package for more details on installing and using BioSQL.
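A lookup by ID can be sketched as below. This assumes a running BioSQL instance that already holds sequence data; the connection settings, accession number, and namespace are hypothetical placeholders, and the bioperl-db manpages should be consulted for the exact adaptor interface in your version.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::DB::BioDB;

# Connection settings are placeholders for a local BioSQL instance
my $db = Bio::DB::BioDB->new(-database => 'biosql',
                             -driver   => 'mysql',
                             -host     => 'localhost',
                             -dbname   => 'biosql',
                             -user     => 'bioperl',
                             -pass     => 'secret');

# A template object carrying the unique key (accession plus namespace)
my $template = Bio::Seq->new(-accession_number => 'J00522',
                             -namespace        => 'genbank');

# Ask the adaptor for sequence objects to find the matching entry
my $adaptor = $db->get_object_adaptor('Bio::SeqI');
my $pseq    = $adaptor->find_by_unique_key($template);

if ($pseq) {
    printf "%s: %s\n", $pseq->accession_number, $pseq->desc || '';
} else {
    print "No matching entry in the database\n";
}
```

The object returned is a persistent object, so it can be modified and written back to the database in the same way as the objects created when loading.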
An alternative view of sequence feature data is provided by the Bio::DB::GFF module, part of the core package. This module takes a set of sequence features stored in GFF (gene-finding format) and loads them into a relational database optimized for positional queries. This allows a variety of data mining operations, such as finding sequence features that overlap or that lie within a certain distance of each other. The module can also store large contiguous sequences and extract subsequences rapidly.
See the Bio::DB::GFF manpage, the Bio::DB::GFF::RelSegment manpage, the Bio::DB::GFF::Feature manpage, and the Bio::DB::GFF::Adaptor::dbi manpage.
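A positional query against a Bio::DB::GFF database might look like the following sketch. It assumes a MySQL database that has already been loaded with GFF data; the DSN, the landmark name, and the feature type are placeholders for illustration.

```perl
use strict;
use warnings;
use Bio::DB::GFF;

# DSN and landmark name are placeholders for an already-loaded database
my $db = Bio::DB::GFF->new(-adaptor => 'dbi::mysql',
                           -dsn     => 'dbi:mysql:elegans');

# Select the first 100,000 bases of the landmark
my $segment = $db->segment('ZK909', 1 => 100_000)
    or die "Landmark ZK909 not found\n";

# List every transcript feature overlapping the segment
for my $feature ($segment->features('transcript')) {
    printf "%s\t%d\t%d\n",
        $feature->method, $feature->start, $feature->end;
}

# The underlying DNA is available if sequence was loaded with the features
my $dna = $segment->dna;
print "Segment DNA length: ", length($dna), "\n" if defined $dna;
```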
The Open Biological Database Access (OBDA) system provides a single, configurable retrieval interface to sequence databases regardless of their format: relational (as in BioSQL), local indexed flat file, or remote and Internet-accessible. The user creates a single configuration file, a "registry", that describes the data sources, and OBDA handles the rest. See the OBDA_Access HOWTO for more information (http://bioperl.org/HOWTOs) or the Bio::DB::Registry manpage.
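Retrieval through the registry can be sketched as follows. The registry entry shown in the comments, the service name, and the accession number are illustrative assumptions; the example needs such a registry file to exist and, for a remote service, network access.

```perl
use strict;
use warnings;
use Bio::DB::Registry;

# Assumes a registry file such as ~/.bioinformatics/seqdatabase.ini
# containing an entry like this (values are illustrative):
#
#   VERSION=1.00
#
#   [embl]
#   protocol=biofetch
#   location=http://www.ebi.ac.uk/cgi-bin/dbfetch
#   dbname=embl

my $registry = Bio::DB::Registry->new();
print "Configured services: ", join(', ', $registry->services), "\n";

# The returned object hides whether the source is local or remote
my $db  = $registry->get_database('embl');
my $seq = $db->get_Seq_by_acc('J02231');
print $seq->display_id, "\n" if defined $seq;
```

Because the script only names the service, switching from a remote server to a local index or a BioSQL database means editing the registry file, not the code.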