Chromosomes from an .agp file plus contigs

From BioPerl
Jump to: navigation, search

(see bioperl-l here and here)

The sound of one man coding...

Russell Smithies asks the rhetorical question:

Does anyone have a script for building chromosomes from an .agp file and a directory full of contigs?


and his answer:

 use Bio::DB::Fasta;
 use Bio::Seq;
 use Bio::SeqIO; 
 
 open(AGP,"Mt2.0_pgp.agp") or die $!;
 my @chr = ();
 
 my $db = Bio::DB::Fasta->new("contigs.fa");
 
 while(<AGP>){
 	chomp;
 	split /\s/;
 	# extend temp string if it's too short
 	do{$chr[$_[0]] .= ' ' x 1_000_000;}while length $chr[$_[0]] < $_[2] ;
 	if($_[4] !~ m/N/){
 		($start,$stop) = $_[8] eq '+'?($_[6], $_[7]):($_[7], $_[6]);
		$s = substr $chr[$_[0]], $_[1], $_[9], $db->seq($_[5],$start,$stop);
	}else{
		$s = substr $chr[$_[0]], $_[1], $_[5], "N" x $_[5] ;
	}
 } 
 
 #remove any trailing whitespace
 @chr = map{s/\s+//g;$_}@chr;
 
 #print the sequence. chromosomes are chr0 -> chr8
 foreach(0..$#chr){
     my $seqobj = Bio::Seq->new( -display_id => "chr$_", -seq => $chr[$_]);
     my $seq_out = Bio::SeqIO->new('-file' => ">chr$_.fa",'-format' => 'fasta');
     $seq_out->write_seq($seqobj);
 }


Russell adds: Please excuse my hacky use of substrings but this .agp file had overlapping runs of 'N' and this was the easiest way to deal with it; e.g.

0	1	50000	1	N	50000	clone	yes
0	50001	167645	2	F	AC144644.3	1	117645	+	117645
0	167646	217645	3	N	50000	clone	yes		
0	217646	317645	4	N	100000	contig	no		
0	317646	367645	5	N	50000	clone	yes		
0	367646	411754	6	F	AC146805.17	1	44109	+	44109

to the #top

Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox