Getting all k-mer combinations of residues

From BioPerl
Jump to: navigation, search

(See the bioperl-l thread beginning here. See also Counting k-mers in large sets of large sequences.)

Contents

Question

Marco Blanchette asks:

Does anyone have a little function that I could use to generate all possible k-mer DNA sequences? For instance all possible 3-mer (AAA, AAT, AAC, AAG, etc...). I need something that I could input the value of k and get all possible sequences...

...and receives replies from Michael Eisen, Jamie Estill, Dave Messina, Mark Jensen, Chris Fields (#CF1, #CF2), Diego Riaño-Pachón, Malcolm Cook, Russell Smithies, Heikki, and Jay Hannah:

version 1 (Michael Eisen)

 #!/usr/bin/perl
 #
 $k = shift;
 @bases = ('A','C','G','T');
 @words = @bases;
 for $i (1..$k-1)
 {
    undef @newwords;
    foreach $w (@words)
    {
         foreach $b (@bases)
         {
             push (@newwords,$w.$b);
         }
    }
    undef @words;
    @words = @newwords;
 }
 foreach $w (@words)
 {
     print "$w\n";
 }

version 2 (Jamie Estill)

SeqIO works great for this. I've used something like the following. This is part of a larger program, so some of this not relevant to what you need ...

 $fasta_in = "your_file.fasta";
 $k = 3; 
 my $in_seq_num = 0;
 my $inseq = Bio::SeqIO->new( -file => "<$fasta_in",
                 -format => 'fasta');
   while (my $seq = $inseq->next_seq) {
    $in_seq_num++;
    if ($in_seq_num == 2) {
        print "\a";
        die "Input file should be a single sequence record\n";
    }
    # Calculate base cooridate data
    my $seq_len = $seq->length();
    my $max_start = $seq->length() - $k;
    # Print some summary data
    print STDERR "\n==============================\n" if $verbose;
    print STDERR "SEQ LEN: $seq_len\n" if $verbose;
    print STDERR "MAX START: $max_start\n" if $verbose;
    print STDERR "==============================\n" if $verbose;
    # CREATE FASTA FILE OF ALL K LENGTH OLIGOS
    # IN THE INPUT SEQUENCE
    print STDERR "Creating oligo fasta file\n" if $verbose;
    open (FASTAOUT, ">$temp_fasta") ||
        die "Can not open temp fasta file:\n $temp_fasta\n";
    for $i (0..$max_start) {
        $start_pos = $i + 1;
        $end_pos = $start_pos + $k - 1;
        my $oligo = $seq->subseq($start_pos, $end_pos);
        # Set counts array to zero
        $counts[$i] = 0;       
        print FASTAOUT ">$start_pos\n";
        print FASTAOUT "$oligo\n";
 
    }
    close (FASTAOUT);
 }

version 3 (Dave Messina)

Dave Messina writes:

Here's some code to generate and print all possible nmers. I'm really just using the module Math::Combinatorics to do all the dirty work here, so probably won't be as fast as if you wrote a custom recursive function as you suggest.

 #!/usr/local/bin/perl
 use strict;
 use warnings;
 use Math::Combinatorics;
 # do all codons (3-mers) as an example
 generate_possible_kmers(3);
 =head2 generate_possible_kmers
   Title   : generate_possible_kmers
   Usage   : generate_possible_kmers(n)
   Function: create and print the list of possible DNA kmers
   Returns : none
   Args    : n - the length of the desired 'mer'
 =cut
 sub generate_possible_kmers {
   my ($n) = @_;
   my $alphabet = [ qw( A C G T ) ];
   my $words_per_row = 10;
   my $i=0;
   my $o = Math::Combinatorics->new( count=>$n, data=>$alphabet,
   frequency=>[$n,$n,$n,$n] );
   while ( my @x = $o->next_multiset ) {
     my $p = Math::Combinatorics->new( data=>\@x , frequency=>[map{1} @x] );
        while ( my @y = $p->next_string ) {
          print join('', @y), ' ';
          $i++;
          if (($i % $words_per_row) == 0) { print "\n"; }
        }
   }
 }

version 4 (Mark Jensen)

Mark Jensen fires off: A little sloppy, but it recurses and is general---

 # ex...
 @combs = doit(3, [ qw( A T G C ) ]);
 1;
 # code
 sub doit {
     my ($n, $sym) = @_;
     my $a = [];
     doit_guts($n, $sym, $a, '');
     return map {$_ || ()} @$a;
 }
 sub doit_guts {
     my ($n, $sym, $store, $str)  = @_;
     if (!$n) {
         return $str;
     }
     else {
         foreach my $s (@$sym) {
             push @$store, doit_guts($n-1, $sym, $store, $str.$s);
         }
     }
 }

version 5 (Chris Fields)

Chris Fields comments:

To add to the pile: 'Mark Jason Dominus (mjd) tackles this problem in Higher-Order Perl using iterators, which also allows other nifty bits like 'give variants of A(CTG)T(TGA)', where anything in parentheses are wild-cards. The nice advantage of the iterator approach is you don't tank memory for long strings. Furthermore, as a bonus, you can now download the book for free: [1] The relevant chapter is here (p. 135): [2]

Here is the example code:

#!/usr/bin/perl -w
 
use strict;
use warnings;
use 5.010;
 
my $string = 'A(CT)G(AC)';
 
my $it = make_genes($string);
 
while (my $new = NEXTVAL($it) ) {
    say $new;
}
 
sub make_genes {
    my $pat = shift;
    my @tokens = split /[()]/, $pat;
    for (my $i = 1; $i < @tokens; $i += 2) {
        $tokens[$i] = [0, split(//, $tokens[$i])];
    }
    my $FINISHED = 0;
    return sub {
        return if $FINISHED;
        my $finished_incrementing = 0;
        my $result = "";
        for my $token (@tokens) {
            if (ref $token eq "") { # plain string
                $result .= $token;
            } else { # wildcard
                my ($n, @c) = @$token;
                $result .= $c[$n];
                unless ($finished_incrementing) {
                    if ($n == $#c) { $token->[0] = 0 }
                    else { $token->[0]++; $finished_incrementing = 1 }
                }
            }
        }
        $FINISHED = 1 unless $finished_incrementing;
        return $result;
    }
}
 
sub NEXTVAL { $_[0]->() }

version 6 (Diego Riaño-Pachón)

Diego Riaño-Pachón offers:

Just another option: why not just counting? Any DNA sequence is a number in base four (4),isn't it? So you have to count from zero to the number (in base 4) that you want.

 ############################################################################################
 ##DNA word can be represented as base 4 numbers, 
 ##this is an idea from Ingo Dreyer, I just implemented it in Perl
 ##The following hashes help in doing the transcoding from and to base 4
 ############################################################################################
 
 use Math::BigInt;
 use Math::BaseArith;
 
 my %transcoded_nt;
 $transcoded_nt{'A'}=0;
 $transcoded_nt{'C'}=1;
 $transcoded_nt{'G'}=2;
 $transcoded_nt{'T'}=3;
 
 my %transcode_base4;
 $transcode_base4{0}='A';
 $transcode_base4{1}='C';
 $transcode_base4{2}='G';
 $transcode_base4{3}='T';
 
 my @alphabet=sort keys %transcoded_nt;
 my $lengthalph=scalar(@alphabet);
 my $x= Math::BigInt->new($lengthalph);
 
 my $number_of_words=$x->bpow(scalar(keys %matrix));
 
 foreach my $i (0..$number_of_words-1){
  my @base_four_word= encode($i,[4,4,4,4,4,4,4,4,4,4,4]);
  my $transcoded_word=join('',@base_four_word);
  while (my ($position,$val) = each %transcode_base4){
   $transcoded_word=~s/$position/$val/g;
  }
 }


version 7 (Malcolm Cook)

Malcolm Cook gives it a bash:

just to round things out.... here's a quick bash script that does the same...

#!/bin/bash
k=$1				
s=$( printf "%${k}s" ); # a string with $k blanks
s=${s// /{A,T,G,C\}};   # substitute '{A,T,G,C}' for each of the k blanks
echo 'kmers using bash to expand:' $s > /dev/stderr
bash -c "echo  $s";     # let brace expanion of inferior bash compute the cross product

oh... and... if you need the results in a perl array, and you're running under some unix, try the even terser:

#!/usr/bin/env perl
my $k = shift;
my @kmer = split / /, `echo @{['{A,T,G,C}' x $k]}`;

finally (?|!), here's a perl one-liner that outputs one kmer per line, with switches for the length, $k, and the alphabet, $a:

perl -se '$,="\n"; print split / /, qx/echo @{["{$a}" x $k]}/;' -- -k=4 -a='A,T,G,C'

version 8 (Russell Smithies)

Russell Smithies parries with an elegant:

recursive use of map:

print "[", join(", ", @$_), "]\n" for
permute([ qw( A T G C ) ],[ qw( A T G C ) ],[ qw( A T G C ) ],[ qw( A T G C ) ]); 

sub permute {
  my $last = pop @_;
  unless (@_) {
    return map [$_], @$last;
  }
  return map { my $left = $_; map [@$left, $_], @$last } permute(@_);
} 

version 9 (Chris Fields)

Chris Fields delivers this coup de grâce :

Perl 6 (20 random 20-mers):

use v6;

say [~] <A T G C>.pick(20, :repl) for 1..20;

By the way, this works with the latest Rakudo. --Chris Fields 01:29, 6 January 2009 (UTC)

Heikki's favorites

from Heikki Lehvaslaiho: Thank you for everyone for these entertaining entries. In my books, Michel Eisen's version wins with sheer clarity. Recursion is always recommendable, too. Below are cleaned versions of these two. Feel free to improve them further.

 #!/usr/bin/env perl
 use warnings;
 use strict;
 sub kmers ($;$) {
     my $k = shift;
     my $alphabet = shift || [ qw( A T G C ) ];
     my @bases = @$alphabet;
     my @words = @bases;
     for ( 1 .. --$k ) {
         my @newwords;
         foreach my $w (@words) {
             foreach my $b (@bases) {
                 push (@newwords, $w.$b);
             }
         }
         @words = @newwords;
     }
     return @words;
 }
 my $k = shift;
 die "positive integer needed as the argument!"
     unless $k > 0 and $k =~ /^\d$/;
 map {print "$_\n"} kmers($k);
 #!/usr/bin/env perl
 use warnings;
 use strict;
 sub kmers ($;$) {
     my $n = shift;
     my $sym = shift || [ qw( A T G C ) ];
     sub kmers_guts {
         my ($n, $sym, $store, $str)  = @_;
         if ($n) {
             foreach my $s (@$sym) {
                 push @$store, kmers_guts($n-1, $sym, $store, $str.$s);
             }
         } else {
             return $str;
         }
     }
     my $a = [];
     kmers_guts($n, $sym, $a, '');
     return map {$_ || ()} @$a;
 }
 my $k = shift;
 die "positive integer needed as the argument!"
      unless $k =~ /^\d$/;
 map {print "$_\n"} kmers($k);

version 10 (Jay Hannah)

Here's an alternate to version 5. I think versions 5 and 10 are the only ones that use iterators, which is mandatory for extremely large sets so you don't waste / run out of memory. Using arrays "every possible DNA strand of lengths 1 to 14 would require 5GB of memory"! In contrast, iterators consume only a trivial amount of memory. --Jhannah 13:56, 20 February 2009 (UTC)

=head2 gen_permutate
 
An iterator for all possible DNA sequences between lengths $min and $max.
 
  my @DNA = qw/A C T G/;
  my $seq = gen_permutate(3, 4, @DNA);
  while ( my $strand = $seq->() ) {
     print "$strand\n";
  }
 
  # http://www.perl.com/pub/a/2005/06/16/iterators.html
  # Modified by jhannah
 
=cut
 
sub gen_permutate {
   my ($min, $max, @list) = @_;
   my @curr = ($#list) x ($min - 1);
 
   return sub {
       if ( (join '', map { $list[ $_ ] } @curr) eq $list[ -1 ] x @curr ) {
           @curr = (0) x (@curr + 1);
       }
       else {
           for ( my $pos = $#curr; $pos >= 0; $pos-- ) {
               ++$curr[ $pos ], last if $curr[ $pos ] < $#list;
               $curr[ $pos ] = 0;
           }
       }
       return undef if @curr > $max;
       return join '', map { $list[ $_ ] } @curr;
   };
}

version 11 ( Ian Korf)

(from an email to Russell Smithies)
This isn't the prettiest way, but it might be the most efficient. The strategy is based on the idea that you can't get much faster than a hard-coded loop.

foreach $c1 (@alphabet) {
     foreach $c2 (@alphabet) {
          foreach $c3 (@alphabet) {
              $h{"$c1$c2$c3"} = 1;
          }
     }
}

The kmer() function below simply builds up the same code for any k and any alphabet, executes it, and returns a reference to a hash. I don't use this method myself because when I work with k-mers, I tend to work in C, and then I use an iterative numeric method.

@alphabet = qw(A C G T);
$kmers = kmer(3, \@alphabet);
 
sub kmer {
	($k, $a) = @_;
	for (1..$k) {$w .= "\$c$_"}
	for (1..$k) {$code .= "for \$c$_ (\@\$a) {"}
	eval $code . "\$h{\"$w\"} = 1" . "}" x $k;
	\%h;
}


[back to top]


Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox