shortenID.pl
This Perl script shortens the names of FASTA sequences by removing the multiple identifiers from each header.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

=head1 Name

shortenID.pl

=head1 Usage

shortenID.pl fastaFileName

=head1 Synopsis

This script takes a fasta file and converts the ids from a form including pipe symbols and accession numbers
to one including only gi numbers.

It assumes you have ID information in the header line of the form:

  >gi|120419786|gb|EH270482.2|EH270482 Gp_mxAA_21G01_M13R mxA Gammarus pulex cDNA clone Gp_mxAA_21G01 5', mRNA sequence

A new fasta file, named after the input file with the suffix .shIDs, will be written out containing the sequence(s) with new headers of the form:

  >gi.120419786 Gp_mxAA_21G01_M13R mxA Gammarus pulex cDNA clone Gp_mxAA_21G01 5', mRNA sequence

The effect is to remove pipe symbols, which pose problems for some programs, and to shorten the ID. Most programs treat the ID as everything between the > symbol and the first space.

=cut

# Require exactly one argument: the input fasta file.
unless (@ARGV == 1) { die "Usage: shortenID.pl fastaFileName\n"; }
my $origFile = shift;
my $newFile  = $origFile . ".shIDs";    # output is written next to the input

my $seq_in = Bio::SeqIO->new(
    -format => 'fasta',
    -file   => $origFile,
);
my $seq_out = Bio::SeqIO->new(
    -format => 'fasta',
    -file   => ">$newFile",
);

# Rewrite the ID of every record; description and sequence are unchanged.
while ( my $seq = $seq_in->next_seq() ) {
    my $seqName = $seq->id;
    $seqName =~ s/\|/./g;               # replace every pipe with a dot
    $seqName =~ s/(gi\.\w*)\..*/$1/;    # keep only the "gi.<number>" part
    $seq->id($seqName);
    $seq_out->write_seq($seq);
}
print "Your sequences have been renamed and are in the file $newFile\n\n";
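
To check what the two substitutions do without installing BioPerl, the following minimal sketch applies them to plain strings. The helper name shorten_id and the second sample ID are assumptions made for illustration; neither is part of shortenID.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in shortenID.pl): applies the same two
# substitutions as the script to a single ID string.
sub shorten_id {
    my ($id) = @_;
    $id =~ s/\|/./g;               # replace every pipe with a dot
    $id =~ s/(gi\.\w*)\..*/$1/;    # keep only the "gi.<number>" part
    return $id;
}

# The first ID comes from the script's documentation; the second is an
# assumed non-gi ID, included to show that only the pipes are replaced.
for my $id ('gi|120419786|gb|EH270482.2|EH270482', 'sp|P12345|NAME_HUMAN') {
    print "$id -> ", shorten_id($id), "\n";
}
# gi|120419786|gb|EH270482.2|EH270482 -> gi.120419786
# sp|P12345|NAME_HUMAN -> sp.P12345.NAME_HUMAN
```

IDs that do not contain "gi." keep all their fields and only lose their pipe symbols, so the shortening step applies to gi-style headers only.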