removeEmptySeqs.pl
Ce script perl élimine toutes les séquences "vides". (Cela est utilie si vous télécharger de grandes bases de données comme GenBank).
#!/usr/bin/perl
#script to pull out any empty sequence entries in a multifasta file
#When using entrez at the NCBI to retrieve fasta formatted sequences, I find that there are entries
#returned where there is no sequence data. You cannot format fasta files with entries with no sequence
#data in this. This script runs through a fasta file and sends all sequence entries
#to a file in the same directory the script is running in. The output filename has _noEmpties
#appended to it.
#warning: there is no error checking in this script.
use strict;
use warnings;
use Bio::SeqIO;
=head1 Name
removeEmptySeqs.pl
=head1 Synopsis
This script runs through a fasta file and sends all sequence entries with sequences with length >= 1
to a file in the same directory the script is running in. It can take a compressed or uncompressed file as input.
=head1 Usage
removeEmptySeqs.pl fastaFileName
The output file will have the same name as the input file, with _noEmpties added to then end of the filename.
The output file will be created in the same directory as the input file.
=cut
my $infile = shift or die "Usage: $0 <fastaFileName>\n";
my $compressed = 0;
if ($infile =~ m/\.gz$/)
{
system("gunzip $infile");
$compressed = 1;
$infile =~ s/\.gz//;
}
my $errorfile = "emptySeqList.txt";
my $outfile = $infile . "_noEmpties";
my $in = Bio::SeqIO->new(-file => $infile,
-format => 'Fasta');
my $out = Bio::SeqIO->new(-file => ">$outfile",
-format => 'Fasta');
my $err = Bio::SeqIO->new(-file => ">$errorfile",
-format => 'Fasta');
my $count = 0;
while (my $seq = $in->next_seq())
{
my $length = $seq->length();
if ($length > 0)
{
$out->write_seq($seq);
}
else
{
$err->write_seq($seq);
$count ++;
}
}
print "There were $count empty sequences in $infile. The header lines have been saved to $errorfile\n\n";
if ($compressed) { system("gzip $infile") }