removeEmptySeqs.pl

10/12/2014 14:08

Ce script perl élimine toutes les séquences "vides". (Cela est utilie si vous télécharger de grandes bases de données comme GenBank).

 

#!/usr/bin/perl
#script to pull out any empty sequence entries in a multifasta file
#When using entrez at the NCBI to retrieve fasta formatted sequences, I find that there are entries
#returned where there is no sequence data. You cannot format fasta files with entries with no sequence
#data in this. This script runs through a fasta file and sends all sequence entries
#to a file in the same directory the script is running in. The output filename has _noEmpties
#appended to it.
#warning: there is no error checking in this script.
use strict;
use warnings;
use Bio::SeqIO;

=head1 Name

removeEmptySeqs.pl

=head1 Synopsis

This script runs through a fasta file and sends all sequence entries with sequences with length >= 1
to a file in the same directory the script is running in. It can take a compressed or uncompressed file as input.

=head1 Usage

removeEmptySeqs.pl fastaFileName

The output file will have the same name as the input file, with _noEmpties added to then end of the filename.

The output file will be created in the same directory as the input file.

=cut

my $infile =  shift or die "Usage: $0 <fastaFileName>\n";

my $compressed = 0;
if ($infile =~ m/\.gz$/)
{
  system("gunzip $infile");
  $compressed = 1;
  $infile =~ s/\.gz//;
}

my $errorfile = "emptySeqList.txt";

my $outfile = $infile . "_noEmpties";

my $in  = Bio::SeqIO->new(-file => $infile,
                       -format => 'Fasta');
                       
my $out = Bio::SeqIO->new(-file => ">$outfile",
                       -format => 'Fasta');

my $err =   Bio::SeqIO->new(-file => ">$errorfile",
                        -format => 'Fasta');
                       
my $count = 0;
while  (my $seq = $in->next_seq())
{

  my $length = $seq->length();

  if ($length > 0)
  {
       $out->write_seq($seq);
   }
  else
  {
      $err->write_seq($seq);
      $count ++;
  }
}


print "There were $count empty sequences in $infile. The header lines have been saved to $errorfile\n\n";

if ($compressed) { system("gzip $infile") }