Suffix trees from thin air

From BioPerl
Jump to: navigation, search
  • Russell sent me this very cool hack. He found the original version by Ian Korf here; I couldn't resist hacking it further. --Ed.

Here's a beautiful use of the eval string flavor of eval: creating a suffix tree from a dictionary of words. The tree lookup seems like falling off a log as well; but there's a hidden danger, read on.

I took the original routine and BioPerled it up a little:

  • My little "improvement" getting rid of the original END value was too clever by half, as pointed out by Aaron. That leads to hammering of complete words by later suffixes, or of suffixes by later short words. So, I revert to Ian's original... --Ed.
package Suftree;
use strict;
 
sub new {
    my $class = shift;
    my @words = @_;
    my $self = {};
    bless($self, $class);
    $self->readDictionary(@words);
    return $self;
}
 
sub readDictionary {
    my $self = shift;
    my @words = @_;
    my %D;
    my $i = 0;
    for (@words) {
	chomp;
	my $a = $_;
	my $string = join('', map { "\{'${_}'\}" } split('',$a)).'{END}';
	eval "\$D$string = 1";  # <<===  cool way to make a suffix tree!
    }
    $self->dict(\%D);
} 
 
sub dict {
    my $self = shift;
    return $self->{_dict} = shift if (@_);
    return $self->{_dict};
}
 
# more to come...


  • Aaron also notes that there ain't no suffixes in this suffix tree, but that $i=0 looks suspicious. Hmm. So could replace readDictionary above with the following
sub readDictionary {
    my $self = shift;
    my @words = @_;
    my %D;
    for (@words) {
	chomp;
	my @a = '*'.split('');
        while ( shift @a ) {
	    my $string = join('', map { "\{'${_}'\}" } @a).'{END}';
	    eval "\$D$string = 1";  # <<===  cool way to make a suffix tree!
        }
    }
    $self->dict(\%D);
}
  • Now all the recursive search business below is moot, which I suppose is the point of a suffix tree anyway. --Ed.

Left-end search

Observe the method readDictionary for the trick. Here, Ian builds a hash of hash of ... assignments as a string**, then evals it. He is really leveraging Perl's automatic reference creation to cause Perl to build the entire data structure internally. Brilliant! But in that very thing lies dormant the seed of cruft...

Here are two search methods off the Suftree object:

package Suftree;
use strict;
 
# return -1 if $str DNE in dict; 0 if substring exists; 
# 1 if whole word is present
 
sub search {
    my $self = shift;
    my $str = shift;
    my $l = join('', map { "\{'${_}'\}" } split('',$str));
    if (eval "exists \$self->dict->$l") {
	return 1 if eval "exists \$self->dict->$l->\{END\}";
	return 0;
    }
    return -1;
}
 
sub search2 {
    my $self = shift;
    my $str = shift;
    my $d = $self->dict;
    for (split('',$str)) {
	last if !defined($d = $$d{$_});
    }
    for ($d) {
	!defined && do {return -1};
        grep(/END/,keys %$d) && do {return 1};
	return 1;
    }
}

Both return the right thing. search() is cool, it uses an eval string trick too, while search2() looks like somebody's C homework. But suppose we do

use Suftree;
use strict;
 
my $D = Suftree->new(<DATA>);
 
print $D->search('AGGA');               #returns 0
print $D->search('AAGGAGGGTAAAAATGA');  #returns 1
print $D->search('CGGA');               #returns -1, but look out!
 
print $D->search2('AGGA');              #returns 0
print $D->search2('AAGGAGGGTAAAAATGA'); #returns 1
print $D->search2('CGGA');              #returns -1
 
__DATA__
AAGGAGGTAAAAATGA
AAGGAGGTAAAATGA
AGGAGGTAAAAATGA
ATGGAGGTAAAAATGA
AAGGAGGGTAAAAATGA

In the failed search, $D->search('CGGA') , the evaled string in the lookup is

$D->dict->{C}{G}{G}{A}

What does Perl do? It creates

$D->dict->{C}
$D->dict->{C}->{G}

and

$D->dict->{C}->{G}->{G}

before it finds out that

$D->dict->{C}->{G}->{G}->{A}

does not exist. These extra refs are unfortunately now stored in the dictionary, and, e.g., $D->search('CG') will be successful, but not reflect the data. search2() avoids this by explicitly checking the existence of each branch, and jumping from the loop when the branch DNE.

Unconstrained search

  • Can use the second form of readDictionary above, and use search2(), or...

Using a recursive approach, we can check for the presence of any substring in the dictionary.

$D->recurse_search('C');     # returns -1
$D->recurse_search('GGTA');  # returns 0
$D->recurse_search('ATGA');  # returns 1
package Suftree;
use strict
 
# returns -1: not found; 0: internal substring; 1: right-end anchored substring
 
sub recurse_search{
my $self  = shift;
my $str = shift;
_recurse_hash($self,$self->dict,0,$str);
}
 
sub _recurse_hash {
    my ($self, $hash, $depth, $str) = @_;
    my $ret;
    $ret = $self->search2_hash($str, $hash);
    return $ret unless $ret < 0;
    for my $k (keys %$hash) {
	$ret = _recurse_hash($self,$$hash{$k},$depth+1,$str);
	return $ret unless $ret < 0;
    }
    return -1;
}
 
sub search2_hash {
    my $self = shift;
    my $str = shift;
    my $d = shift;
    for (split('',$str)) {
	last if !defined($d = $$d{$_});
    }
    for ($d) {
	!defined && do {return -1};
	ref($d) && do { return 0 };
	return 1;
    }
}

Footnote

  • **I compacted the original code using join and map.
Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox