shortenID2.pl

10/12/2014 14:35

This Perl script shortens the names of FASTA sequences, keeping only the GI number as the identifier.


#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;


=head1 Name

shortenID.pl

=head1 Usage

shortenID.pl fastaFileName

=head1 Synopsis

This script takes a FASTA file and converts the IDs from a form including pipe symbols and accession numbers
to one including only the GI number, prefixed with "gi".

It assumes you have ID information in the header line of the form:

>gi|120419786|gb|EH270482.2|EH270482 Gp_mxAA_21G01_M13R mxA Gammarus pulex cDNA clone Gp_mxAA_21G01 5', mRNA sequence

A new FASTA file, named after the input file with the suffix .shIDs, will be written out containing the sequence(s) with new headers of the form:

>gi120419786

=cut

unless (@ARGV == 1) { die "Usage: $0 fastaFileName\n"; }

my $origFile = shift;
my $newFile  = $origFile . ".shIDs";

# Input stream: read the original FASTA file.
my $seq_in = Bio::SeqIO->new(-file   => $origFile,
                             -format => 'fasta');

# Output stream: write the renamed sequences to <input>.shIDs.
my $seq_out = Bio::SeqIO->new(-file   => ">$newFile",
                              -format => 'fasta');

while (my $seq = $seq_in->next_seq()) {
    my $seqName = $seq->id;
    $seqName =~ s/\|/./g;                 # replace pipes with dots
    $seqName =~ s/(gi)\.(\w*)\..*/$1$2/;  # keep only "gi" plus the GI number
    $seq->id($seqName);
    $seq->description("");                # drop the rest of the header line

    $seq_out->write_seq($seq);
}

print "Your sequences have been renamed and are in the file $newFile\n\n";
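
To check the renaming rule without BioPerl or an input file, the two substitutions can be tried on a plain ID string. This is only a minimal sketch: the helper name shorten_id is hypothetical and not part of the script, and it assumes the ID has already been separated from the description, as Bio::SeqIO does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): applies the same two
# substitutions to a bare sequence ID and returns the result.
sub shorten_id {
    my ($id) = @_;
    $id =~ s/\|/./g;                  # replace pipes with dots
    $id =~ s/(gi)\.(\w*)\..*/$1$2/;   # keep only "gi" plus the GI number
    return $id;
}

# ID taken from the example header in the Synopsis.
print shorten_id('gi|120419786|gb|EH270482.2|EH270482'), "\n";  # prints gi120419786

# An ID without a "gi." prefix does not match and is left unchanged.
print shorten_id('contig_42'), "\n";                            # prints contig_42
```

Note that the second substitution requires a dot after the GI number, so an ID such as gi|123 with no further fields would only have its pipe replaced (gi.123) and would not be shortened.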