rejectSeqsWithTooManyNs.pl
Ce script écrit en perl permet de trier vos séquences nucléiques en format fasta en fonction de leur proportion en "N".
Attention de noter la proportion "à l'anglaise", c'est à dire 0.01 et non 0,01 si vous souhaitez fixer un seuil de 1% de N.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
use POSIX;
=head1 Name
rejectSeqsWithTooManyNs.pl
=head1 Synopsis
This script will look through each sequence in a fasta file and count how many "N" characters there are. If "N" characters make up
greater than 10% of your sequence,
then it will write the rejected sequence(s) into a file with the same name as your input file, with a suffix ".rejected". Sequences
with less than 10% n characters are written to a file with the name of your input file, with a suffix ".accepted".
=head1 Usage
rejectSeqsWithTooManyns.pl fastaFileName proportionNs
The final argument is the proportion of N characters allowed. Any sequence containing a higher proportion will be rejected. This number needs to be between 0 and 1.
Three output files are created:
fastaFileName.rejected contains the sequences with too many n characters.
fastaFileName.accepted contains the sequences that do not contain too many n characters
summary.txt contains a listing of the sequence file name, and the sequence ids in that file containing too many n characters,
how many n characters it contains and what percentage of the total length of the sequence this is.
The number of sequences with too many N characte are written to screen.
The script has been written so that if you use a for loop, or a foreach loop, to run it on a number of sequence files, each will get its
own .rejected and .accepted file, but the summary information will be writen to the same summary.txt file.
=head1 Changing the values
The character being searched for can easily be changed. The lines you are looking for to do this are:
my $count = ($seqString =~ tr/N//)
Change the "n" to the character you are interested in.
=cut
my $Usage = "\n$0 <fastafile> <proportionNs>\n\n";
my $infile = shift or die $Usage;
my $propNs = shift or die $Usage;
unless (-e $infile) {die "Can't find file $infile: $!\n\n";}
if (($propNs < 0) or ($propNs >1)) {die "The proportion of Ns should be a number between 0 and 1";}
my $in = Bio::SeqIO->new(-file => $infile , '-format' => 'Fasta');
my $rejects = $infile . "_gt_" . $propNs . "_rejected";
my $okaySeqs = $infile . "_gt_" . $propNs . "_accepted";
my $summary = "summaryFile.txt";
open (REJECT,">$rejects") or warn "couldn't open $rejects: $!\n";
open (ACCEPT,">$okaySeqs") or warn "couldn't open $okaySeqs: $!\n";
open (SUMMARY, ">>$summary") or warn "couldn't open $summary: $!\n";
#open (OUT25,">$outfile25") or warn "couldn't open $outfile25: $!\n";
print SUMMARY "$infile\n";
my $rejectCount = 0;
while ( my $seq = $in->next_seq() ) {
#print "Sequence ",$seq->id," first 10 bases ",$seq->subseq(1,10),"\n";
my $seqString = $seq->seq();
my $length = length($seqString);
my $count = ($seqString =~ tr/N//); #are there any n characters? If so, count 'em
if (($length >0) && ($count/$length > $propNs))
{
print SUMMARY "Sequence " . $seq->id . " has $count n's and this is " . floor($count*100/$length) . " percent of the sequence. Rejected.\n";
print REJECT ">" . $seq->id . "\n";
print REJECT $seq->seq() . "\n";
$rejectCount++;
}
elsif ($length == 0)
{
print SUMMARY "Sequence " . $seq->id . " has no bases remaining. Rejected.\n";
print REJECT ">" . $seq->id . "\n";
print REJECT $seq->seq() . "\n";
$rejectCount++;
}
else
{
print ACCEPT ">" . $seq->id . "\n";
print ACCEPT $seq->seq() . "\n";
}
}
my $percentNs = $propNs * 100;
print "$rejectCount sequences in $infile had more than " . $percentNs . "N's.\n";
close REJECT;
close ACCEPT;
close SUMMARY;