fastasplit.pl

10/12/2014 14:00

This Perl script splits a very large FASTA sequence file into several smaller FASTA files, each containing a number of sequences that you choose. This is particularly useful for parallelizing your downstream analyses.
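To make the splitting arithmetic concrete, here is a minimal standalone sketch that only computes which sequence ranges end up in which output file. It does not read any FASTA data and does not need BioPerl. The helper name `chunk_ranges` and the file name `input.fasta` are assumptions made for illustration; they are not part of fastasplit.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of fastasplit.pl): returns the
# [start, end] sequence ranges covered by each output file.
sub chunk_ranges {
    my ($total, $per_file) = @_;
    my @ranges;
    for (my $start = 1; $start <= $total; $start += $per_file) {
        my $end = $start + $per_file - 1;
        $end = $total if $end > $total;   # the last file may be shorter
        push @ranges, [$start, $end];
    }
    return @ranges;
}

# Example: 108,243 sequences split into files of 10,000 sequences.
# "input.fasta" is a placeholder input name.
my @ranges = chunk_ranges(108_243, 10_000);
printf "%d files\n", scalar @ranges;
printf "input.fasta_trunc_%d-%d\n", @$_ for @ranges;
```

With these numbers the sketch lists 11 file names, from input.fasta_trunc_1-10000 to input.fasta_trunc_100001-108243.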

 

#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;
use Bio::SeqIO;

=head1 Name

fastasplit.pl

=head1 Synopsis

This script splits a multi-FASTA file into a set of smaller multi-FASTA files of a size you choose. For example, a multi-FASTA file with 108,243 sequences can be split into 11 files: 10 files of 10,000 sequences and one file of 8,243. Each output file has the same name as the input file, followed by _trunc_startNumber-endNumber. In this example, the final file would be infileName_trunc_100001-108243.

=head1 Usage

fastasplit.pl fastaFileName Number

where Number is the number of sequences in each output file (except the last one, which contains the remaining sequences).

The output files will be created in the same directory as the input file.

=cut



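my $infile =  shift or die "Usage: $0 <fastaFileName> <numberOfSeqsInEachFile>\n";
my $numSeqs = shift or die "Usage: $0 <fastaFileName> <numberOfSeqsInEachFile>\n";
# The chunk size must be a positive integer.
die "Number of sequences per file must be a positive integer\n" unless $numSeqs =~ /^[1-9]\d*$/;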

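# A gzipped input is decompressed in place and recompressed once the split is done.
my $compressed = 0;
if ($infile =~ m/\.gz$/)
{
  # List form of system() avoids the shell, so unusual file names are safe.
  system("gunzip", $infile) == 0 or die "gunzip failed on $infile\n";
  $compressed = 1;
  $infile =~ s/\.gz$//;
}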


my $in  = Bio::SeqIO->new(-file => $infile,
                       -format => 'Fasta');

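my $total = 0;      # number of sequences written so far
my $startNum;       # index of the first sequence in the current output file
my ($out, $outfile);

while (my $seq = $in->next_seq())
{
    # Open a new output file only when there is a sequence to put in it,
    # so that an input whose size is an exact multiple of $numSeqs does
    # not leave an empty trailing file.
    if ($total % $numSeqs == 0)
    {
        $out->close() if $out;
        $startNum = $total + 1;
        $outfile = $infile . "_trunc_" . $startNum . "-" . ($total + $numSeqs);
        $out = Bio::SeqIO->new(-file => ">$outfile",
                               -format => 'Fasta');
    }
    $out->write_seq($seq);
    $total++;
}

# The last file may hold fewer than $numSeqs sequences: rename it so that
# its name ends with the true index of the last sequence.
if ($out)
{
    $out->close();
    my $newOutfileName = $infile . "_trunc_" . $startNum . "-" . $total;
    if ($newOutfileName ne $outfile)
    {
        move($outfile, $newOutfileName)
            or die "Cannot rename $outfile to $newOutfileName: $!\n";
    }
}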
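# Restore the original compressed state of the input file.
if ($compressed) { system("gzip", $infile) == 0 or warn "gzip failed on $infile\n" }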