Sharing large arrays among multiple threads

From BioPerl

(See the bioperl-l discussion [1]. See also Counting k-mers in large sets of large sequences.)

Marco Blanchette poses:

I am using the Perl threads module to multithread several of my computing jobs on my workstation. I now need to run many operations against the same humongous array (more than 2 × 10^6 items). Each iteration is short, but there are a lot of them, and every time a thread is created I pass the huge array to the function and a fresh copy of the array is made. Thus, a huge amount of resources (time and memory) is wasted creating data structures that each thread uses but never modifies.

The logical alternative is to use shared memory, where all threads would have access to the same copy of the huge array. In principle Perl provides such a mechanism through the module threads::shared, but I am unable to understand how to use the shared variables. Does anyone have experience with threads::shared to share?



Jonathan Crabtree responds:

Here is a short test program, which runs correctly on perl 5.8.8 and may help to illustrate how the Perl threads::shared module expects you to create and share nested data structures. You have to manually share any nested references and I think that the order in which the sharing calls are made may also be significant:

 #!/usr/bin/perl
 
 use strict;
 use warnings;
 use threads;
 use threads::shared;
 
 # threads::shared test/demo program
 # creates a shared 2-dimensional array and checks that it can be seen
 # in a thread
 # tested in perl v5.8.8 built for i486-linux-gnu-thread-multi
 
 ## ----------------------------------------
 ## globals
 ## ----------------------------------------
 
 # set the width and height of the 2d array to this value:
 my $ARRAY_SIZE = 10;
 
 ## ----------------------------------------
 ## main program
 ## ----------------------------------------
 
 # calls to &share take place in here, so a shared value is returned
 my $array = &make_shared_array();
 
 # print array contents before running thread
 print "shared array before running thread:\n";
 &check_and_print_array($array);
 
 # run thread
 my $thr = threads->create(\&do_the_job, $array);
 
 my $retval = $thr->join();
 print "join() returned: $retval\n";
 
 # print array contents after running thread
 print "shared array after running thread:\n";
 &check_and_print_array($array);
 
 exit(0);
 
 ## ----------------------------------------
 ## subroutines
 ## ----------------------------------------
 
 sub make_shared_array {
     # outermost array object must be made shared first;
     # the & form bypasses share()'s prototype so that it accepts an
     # anonymous array reference
     # (named $outer rather than $a, which sort() treats specially)
     my $outer = &share([]);
 
     for (my $i = 0; $i < $ARRAY_SIZE; ++$i) {
         # each of the rows must be explicitly shared
         my $row = &share([]);
         # and then added to the containing array
         $outer->[$i] = $row;
         # assign each cell a unique integer for verification purposes
         my $base = $i * $ARRAY_SIZE;
         for (my $j = 0; $j < $ARRAY_SIZE; ++$j) {
             $row->[$j] = $base + $j;
         }
     }
     return $outer;
 }
 
 # print out the array, checking that its dimensions match what we expect
 sub check_and_print_array {
     my $arr = shift;
     die "not an array" if ((ref $arr) ne 'ARRAY');
     my $nr = scalar(@$arr);
     die "wrong number of rows in array" if ($nr != $ARRAY_SIZE);
 
     for (my $i = 0; $i < $nr; ++$i) {
         my $row = $arr->[$i];
         die "row $i not an array" if ((ref $row) ne 'ARRAY');
         my $nc = scalar(@$row);
         die "wrong number of columns in row $i" if ($nc != $ARRAY_SIZE);
 
         for (my $j = 0; $j < $nc; ++$j) {
             my $val = $row->[$j];
             printf("%10s", $val);
         }
 
         print "\n";
     }
 }
 
 # work to execute in the thread
 sub do_the_job {
     my $var = shift;
 
     # print the array once more in the thread
     print "shared array in thread:\n";
     &check_and_print_array($var);
 
     return "do_the_job returned ok";
 }

When I run it (on Ubuntu) the output looks like this:

 shared array before running thread:
          0         1         2         3         4         5         6         7         8         9
         10        11        12        13        14        15        16        17        18        19
         20        21        22        23        24        25        26        27        28        29
         30        31        32        33        34        35        36        37        38        39
         40        41        42        43        44        45        46        47        48        49
         50        51        52        53        54        55        56        57        58        59
         60        61        62        63        64        65        66        67        68        69
         70        71        72        73        74        75        76        77        78        79
         80        81        82        83        84        85        86        87        88        89
         90        91        92        93        94        95        96        97        98        99
 shared array in thread:
          0         1         2         3         4         5         6         7         8         9
         10        11        12        13        14        15        16        17        18        19
         20        21        22        23        24        25        26        27        28        29
         30        31        32        33        34        35        36        37        38        39
         40        41        42        43        44        45        46        47        48        49
         50        51        52        53        54        55        56        57        58        59
         60        61        62        63        64        65        66        67        68        69
         70        71        72        73        74        75        76        77        78        79
         80        81        82        83        84        85        86        87        88        89
         90        91        92        93        94        95        96        97        98        99
 join() returned: do_the_job returned ok
 shared array after running thread:
          0         1         2         3         4         5         6         7         8         9
         10        11        12        13        14        15        16        17        18        19
         20        21        22        23        24        25        26        27        28        29
         30        31        32        33        34        35        36        37        38        39
         40        41        42        43        44        45        46        47        48        49
         50        51        52        53        54        55        56        57        58        59
         60        61        62        63        64        65        66        67        68        69
         70        71        72        73        74        75        76        77        78        79
         80        81        82        83        84        85        86        87        88        89
         90        91        92        93        94        95        96        97        98        99

I haven't verified that doing this actually yields the memory savings you're looking for, but I don't see why it shouldn't.
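
For the flat, read-only array described in the original question, a simpler arrangement may be enough. The following is only a minimal sketch, not a tested solution to Marco's workload: the names `@big`, `sum_range`, `$N_ITEMS` and `$N_WORKERS` are hypothetical, the array is kept small for illustration, and summing stands in for whatever the real per-iteration work is. The array is declared with the `:shared` attribute before it is filled, and each thread receives only an index range rather than the array itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical sizes, for illustration only; the original question
# mentions more than 2 x 10^6 items.
my $N_ITEMS   = 1_000;
my $N_WORKERS = 4;

# Declare the array as shared *before* filling it. According to the
# threads::shared documentation, calling share() on an array that is
# already populated loses its contents.
my @big :shared;
$big[$_] = $_ for 0 .. $N_ITEMS - 1;

# Each worker is handed two integers (its index range) and reads the
# shared array directly; the array is never passed as an argument.
sub sum_range {
    my ($from, $to) = @_;
    my $sum = 0;
    $sum += $big[$_] for $from .. $to;
    return $sum;
}

# Split the index space into roughly equal chunks, one per worker.
my $chunk = int(($N_ITEMS + $N_WORKERS - 1) / $N_WORKERS);
my @threads;
for my $w (0 .. $N_WORKERS - 1) {
    my $from = $w * $chunk;
    my $to   = $from + $chunk - 1;
    $to = $N_ITEMS - 1 if $to > $N_ITEMS - 1;
    next if $from > $to;
    # Assign to a scalar so the thread is created in scalar context,
    # matching the scalar-context join() below.
    my $thr = threads->create(\&sum_range, $from, $to);
    push @threads, $thr;
}

# Collect the partial results.
my $total = 0;
$total += $_->join() for @threads;
print "total: $total\n";
```

Because the workers only read `@big`, no `lock()` calls are used here; if any thread wrote to the array, locking would need to be added. Newer releases of threads::shared also document a `shared_clone` function that copies a nested structure into shared storage in one call, which may be worth checking against your installed version as an alternative to sharing each row by hand. Whether either approach saves memory in practice would still need to be measured.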
