Preparing contigs for GenBank submission

From BioPerl
Jump to: navigation, search

(see bioperl-l thread starting here)

  • The key issue in this scrap is how creating a GenBank record for a Bio::DB::GFF persisted object is not quite as simple as reading from the DB and writing to SeqIO. Read on for the particulars (responses are from Jason Stajich and Don Gilbert). --Ed.

Erich Schwarz discusses contig submissions sans Sequin:

I have newly sequenced contigs, with CDS predictions, loaded loaded into a MySQL database via Bio::DB::GFF. I'd like to export each contig, with its annotated CDSes, into a single GenBank-formatted record for each contig.

and notes the following:

I came up with code that would let me export protein translations of the contigs' CDSes in GenBank format...

use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::GFF;
 
my $query_database = $ARGV[0];
my $dna = q{};
my $db = Bio::DB::GFF->new( -dsn => $query_database);
 
my $gb_file = 'example.gb';
my $seq_out = Bio::SeqIO->new( -file => ">$gb_file", -format => 'genbank');
 
my @contigs = sort 
              map { $_->display_id }
              $db->features( -types => 'contig:assembly' );
 
foreach my $contig (@contigs) { 
    my $segment1 = $db->segment($contig);
    my @p_txs = $segment1->features('processed_transcript');
    foreach my $p_t (sort @p_txs) {
        $dna = q{};
        my @CDSes = $p_t->CDS;
        my $cds_name = $CDSes[0]->display_id();
        foreach my $cds (@CDSes) { 
            # $cds->seq == Bio::PrimarySeq, *not* plain nt seq.!
            $dna = $dna . $cds->seq->seq;
        }
        my $full_cds = Bio::Seq->new( -display_id => $cds_name, 
                                      -seq => $dna, );
        my $prot = $full_cds->translate;
        $seq_out->write_seq($prot);
    }
}

Returning to this, I tried using $db->get_Seq_by_id($contig) to give me a Bio::Seq object for each contig (which I could then output into GenBank form), but that proved futile.


Jason Stajich provides some details:

If you want to get a specific segment you just do what you already have in your code:

my $segment = $db->segment($contig_name);

Or you can iterate through all the features - depends on how you named your segments/contigs/chromsomes, I named mine "contig:scaffold" for type:source:

my $iterator = $dbh->get_seq_stream(-type=>'scaffold');
while (my $s = $iterator->next_seq) {
 ...
}

Now you should be able to pass this segment object to

 $seqio->write_seq($segment);

However, Bio::DB::GFF::Feature doesn't implement the whole Bio::SeqI API so you probably have to create your own sequence and move the features over:

my $iterator = $dbh->get_seq_stream(-type=>'scaffold');
while (my $s = $iterator->next_seq) {
  my $seq = Bio::Seq->new();
  $seq->primary_seq($s->seq);
  for my $feature ( $s->features('processed_transcript') ) {
    my $f = Bio::SeqFeature::Generic->new(-location => $feature->location,
                                          -primary_tag => $feature->primary_tag,
			                  -source_tag  => $feature->primary_tag,
			                  -score => $feature->score,
			                  -seq_id => $feature->seq_id);
    $f->add_tag_value('locus_tag',$feature->name);
    # might also add all other tag/value pairs from this feature like  
    # DBXREF, etc.
    # can derive a CDS feature from this feature as well.
    # or perhaps derive a gene feature that only has start/end for the feature
    # or might add a translation tag/value pair for CDS features
    $seq->add_SeqFeature($f);
  }
  $out->write_seq($seq);
}

I suspect you'll have to edit the feature objects some to a) remove the ones you don't want to output, b) add additional info like translation frame..., c) add in other annotation information that may or may not be encoded as tag/values that [NCBI requires].


Don Gilbert suggests persisting in Chado rather than Bio::DB::GFF:

If the Bio::DB::GFF database to Genbank submission route doesn't get you where you want, you can also look at storing your data in a GMOD Chado database, then using Bulkfiles to produce the Genbank Submission file set.

Find a GenBank Submit output from Chado dbs in this tool release

  GMODTools-1.2b.zip      20-Jun-2008 

adding (in progress) [the] Genbank Submission table writer, bulkfiles -format=genbanktbl, with output suited to submit to NCBI per these specifications

See also the GMODTools wiki and this test case with genbank-submit output.


[back to top]


Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox