rejectSeqsWithTooManyNs.pl

10/12/2014 13:56

Ce script écrit en perl permet de trier vos séquences nucléiques en format fasta en fonction de leur proportion en "N".

Attention de noter la proportion "à l'anglaise", c'est à dire 0.01 et non 0,01 si vous souhaitez fixer un seuil de 1% de N.

 

#!/usr/bin/perl

use strict;
use warnings;

use Bio::SeqIO;
use POSIX;

=head1 Name

rejectSeqsWithTooManyNs.pl

=head1 Synopsis

This script will look through each sequence in a fasta file and count how many "N" characters there are. If "N" characters make up
greater than 10% of your sequence,
then it will write the rejected sequence(s) into a file with the same name as your input file, with a suffix ".rejected". Sequences
with less than 10% n characters are written to a file with the name of your input file, with a suffix ".accepted".

=head1 Usage

rejectSeqsWithTooManyns.pl fastaFileName proportionNs

The final argument is the proportion of N characters allowed. Any sequence containing a higher proportion will be rejected. This number needs to be between 0 and 1.

Three output files are created:

fastaFileName.rejected   contains the sequences with too many n characters.
fastaFileName.accepted   contains the sequences that do not contain too many n characters
summary.txt              contains a listing of the sequence file name, and the sequence ids in that file containing too many n characters,
                         how many n characters it contains and what percentage of the total length of the sequence this is.
                        
The number of sequences with too many N characte are written to screen.

The script has been written so that if you use a for loop, or a foreach loop, to run it on a number of sequence files, each will get its
own .rejected and .accepted file, but the summary information will be writen to the same summary.txt file.

=head1 Changing the values

The character being searched for can easily be changed. The lines you are looking for to do this are:

my $count = ($seqString =~ tr/N//)      

Change the "n" to the character you are interested in.

=cut

my $Usage = "\n$0 <fastafile> <proportionNs>\n\n";

my $infile = shift or die $Usage;
my $propNs = shift or die $Usage;

unless (-e $infile) {die "Can't find file $infile: $!\n\n";}

if (($propNs < 0) or ($propNs >1)) {die "The proportion of Ns should be a number between 0 and 1";}

my $in  = Bio::SeqIO->new(-file => $infile , '-format' => 'Fasta');
my $rejects = $infile . "_gt_" . $propNs . "_rejected";
my $okaySeqs = $infile . "_gt_" . $propNs . "_accepted";
my $summary = "summaryFile.txt";

open (REJECT,">$rejects") or warn "couldn't open $rejects: $!\n";
open (ACCEPT,">$okaySeqs") or warn "couldn't open $okaySeqs: $!\n";
open (SUMMARY, ">>$summary") or warn "couldn't open $summary: $!\n";
#open (OUT25,">$outfile25") or warn "couldn't open $outfile25: $!\n";

print SUMMARY "$infile\n";
my $rejectCount = 0;

while ( my $seq = $in->next_seq() ) {
    #print "Sequence ",$seq->id," first 10 bases ",$seq->subseq(1,10),"\n";

    my $seqString  = $seq->seq();
    my $length = length($seqString);
    my $count = ($seqString =~ tr/N//); #are there any n characters? If so, count 'em

    if (($length >0) && ($count/$length > $propNs))
    {
        print SUMMARY "Sequence " . $seq->id . " has $count n's and this is " . floor($count*100/$length) . " percent of the sequence. Rejected.\n";
        print REJECT ">" . $seq->id . "\n";
        print REJECT $seq->seq() . "\n";
        $rejectCount++;
    }
    elsif ($length == 0)
    {
    print SUMMARY "Sequence " . $seq->id . " has no bases remaining. Rejected.\n";
                print REJECT ">" . $seq->id . "\n";
                print REJECT $seq->seq() . "\n";
                $rejectCount++;

    }

    else
    {
        print ACCEPT ">" . $seq->id . "\n";
        print ACCEPT $seq->seq() . "\n";
    }
}    
my $percentNs = $propNs * 100;
print "$rejectCount sequences in $infile had more than " . $percentNs . "N's.\n";

close REJECT;
close ACCEPT;
close SUMMARY;