Adding duplicate sequences to an alignment object

From BioPerl

(see bioperl-l thread here)

manni122 asks:

I am trying to read in a file with multiple pairwise alignments. Some IDs appear more than once, so when I run my code I get the message:

--- MSG: Replacing one sequence xxx ---

Is there a way to read the data even when several sequences share the same name?


Chris Fields suggests (with comment):

The NSE (Name.version/start-end) is used to distinguish the sequences from one another, so as long as each sequence differs from the others in at least one of accession, version, start, or end, there should be no replacement (and no warning).

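use strict;
use warnings;
use Bio::LocatableSeq;
use Bio::SimpleAlign;
use Bio::AlignIO;

my $aln = Bio::SimpleAlign->new();
# No -file or -fh is given, so the alignment is written to STDOUT.
my $out = Bio::AlignIO->new(-format => 'clustalw');

# Same ID and same sequence each time; only the version differs,
# which is enough to give every sequence a distinct NSE.
for my $v (1..10) {
    my $ls = Bio::LocatableSeq->new(-id       => 'ABCD1234',
                                    -version  => $v,
                                    -alphabet => 'dna',
                                    -seq      => '--atg---gta--');
    $aln->add_seq($ls);
}
$out->write_aln($aln);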

with output...

CLUSTAL W(1.81) multiple sequence alignment


ABCD1234.1/1-6         --atg---gta--
ABCD1234.2/1-6         --atg---gta--
ABCD1234.3/1-6         --atg---gta--
ABCD1234.4/1-6         --atg---gta--
ABCD1234.5/1-6         --atg---gta--
ABCD1234.6/1-6         --atg---gta--
ABCD1234.7/1-6         --atg---gta--
ABCD1234.8/1-6         --atg---gta--
ABCD1234.9/1-6         --atg---gta--
ABCD1234.10/1-6        --atg---gta--
                         ***   ***

and comments:

If you think about it, that's a feature. Any single sequence that appears in an alignment more than once is either (1) matching multiple regions (i.e. repeats, motifs, etc.), so the location varies, or (2) a modified sequence, so the version changes (support for the latter is fairly new). Beyond that, one has to question the logic of including multiple copies of exactly the same sequence record in a multiple alignment, so unless additional information distinguishing the potential duplicates is provided, we assume the duplication is unintentional (and erroneous) and punt.
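Case (1) can be sketched in the same way as the version example: keep the ID fixed and vary the location instead. This is only a minimal illustration, not part of the original thread; the coordinates 1-6 and 101-106 are made-up values standing in for two hits of the same motif, and the ranges are chosen so that each end matches the six residues in the gapped string.

```perl
use strict;
use warnings;
use Bio::LocatableSeq;
use Bio::SimpleAlign;

my $aln = Bio::SimpleAlign->new();

# Hypothetical coordinates: the same 6-residue motif found at two
# different places in ABCD1234. No version is set, so start/end are
# the only things telling the two sequences apart.
for my $range ([1, 6], [101, 106]) {
    my ($start, $end) = @$range;
    my $ls = Bio::LocatableSeq->new(-id       => 'ABCD1234',
                                    -start    => $start,
                                    -end      => $end,
                                    -alphabet => 'dna',
                                    -seq      => '--atg---gta--');
    $aln->add_seq($ls);
}

# Print the NSE of every sequence the alignment kept.
my @seqs = $aln->each_seq;
print $_->get_nse, "\n" for @seqs;
```

If both sequences are kept, the loop at the end should print two different NSE strings; if the start and end were identical, the second add_seq call would be expected to trigger the replacement message from the question.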

Weighing the options, I would rather have the warning indicating a problem than nothing at all.
