shortenID.pl

10/12/2014 14:33

This Perl script shortens FASTA sequence names by removing the multiple identifiers from each header.

#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;


=head1 Name

shortenID.pl

=head1 Usage

shortenID.pl fastaFileName

=head1 Synopsis

This script takes a FASTA file and converts IDs that contain pipe symbols and accession numbers
into IDs that contain only the gi number.

It assumes you have ID information in the header line of the form:

>gi|120419786|gb|EH270482.2|EH270482 Gp_mxAA_21G01_M13R mxA Gammarus pulex cDNA clone Gp_mxAA_21G01 5', mRNA sequence

A new FASTA file, named after the input file with the suffix .shIDs appended, is written out containing the sequence(s) with new headers of the form:

>gi.120419786 Gp_mxAA_21G01_M13R mxA Gammarus pulex cDNA clone Gp_mxAA_21G01 5', mRNA sequence

This removes the pipe symbols, which cause problems for some programs, and shortens the ID. Most programs treat the ID as the text immediately after the > symbol and before the first space.

=cut

unless (@ARGV == 1) { die "Usage:  shortenID.pl  fastaFileName\n"; }

my $origFile = shift;
my $newFile  = $origFile . ".shIDs";

my $seq_in = Bio::SeqIO->new( -format => 'fasta',
                              -file   => $origFile );

my $seq;
my $seq_out = Bio::SeqIO->new( -format => 'fasta',
                               -file   => ">$newFile" );

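# Rewrite the ID of each record and write the record to the new file
while ( $seq = $seq_in->next_seq() )
{
    my $seqName = $seq->id;
    $seqName =~ s/\|/\./g;              # replace each pipe with a dot
    $seqName =~ s/(gi\.\w*)\..*/$1/;    # keep only "gi.<number>", drop the rest

    $seq->id($seqName);
    $seq_out->write_seq($seq);
}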

print "Your sequences have been renamed and are in the file $newFile\n\n";
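
The two substitutions used in the loop can be tried on a single ID string without BioPerl installed. The following is only a minimal sketch: shorten_id is a hypothetical helper name that is not part of shortenID.pl, and contig_42 is a made-up ID used to show that an ID without a gi field passes through unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of shortenID.pl): applies the same two
# substitutions as the script to one ID string and returns the result.
sub shorten_id {
    my ($id) = @_;
    $id =~ s/\|/\./g;               # replace each pipe with a dot
    $id =~ s/(gi\.\w*)\..*/$1/;     # keep only "gi.<number>", drop the rest
    return $id;
}

# The ID is the part of the header between ">" and the first space.
print shorten_id('gi|120419786|gb|EH270482.2|EH270482'), "\n";   # prints gi.120419786

# Made-up ID with no pipes and no gi field: neither substitution matches.
print shorten_id('contig_42'), "\n";                              # prints contig_42
```

Note that the second pattern is not anchored to the start of the ID, so it relies on the header beginning with the gi field, as in the example above.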