shortenID2.pl

10/12/2014 14:35

This Perl script shortens the sequence names in a FASTA file, keeping only the GI number (the NCBI GenInfo Identifier) of each sequence.

#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;

=head1 Name

shortenID2.pl

=head1 Usage

shortenID2.pl fastaFileName

=head1 Synopsis

This script takes a fasta file and converts the ids from a form containing pipe symbols and accession numbers
to one containing only the gi number.

It assumes you have ID information in the header line of the form:

>gi|120419786|gb|EH270482.2|EH270482 Gp_mxAA_21G01_M13R mxA Gammarus pulex cDNA clone Gp_mxAA_21G01 5', mRNA sequence

A new fasta file, named after the input file with the suffix .shIDs, will be written out containing the sequence(s) with new headers of the form:

>gi120419786

=cut

unless (@ARGV == 1) { die "Usage: shortenID2.pl fastaFileName\n"; }

my $origFile = shift;
my $newFile=$origFile . ".shIDs";

my $seq_in = Bio::SeqIO->new( -format => 'fasta',
                              -file   => $origFile );

my $seq;
my $seq_out = Bio::SeqIO->new( -format => 'fasta',
                               -file   => ">$newFile" );

while ( $seq = $seq_in->next_seq() )
{
    my $seqName = $seq->id;
    $seqName =~ s/\|/\./g;                  # replace each pipe with a dot
    $seqName =~ s/(gi)\.(\w*)\..*/$1$2/;    # keep only "gi" followed by the GI number
    $seq->id($seqName);
    $seq->description("");                  # drop the free-text description

    $seq_out->write_seq($seq);
}

print "Your sequences have been renamed and are in the file $newFile\n\n";
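
The renaming step can be tried on its own, without BioPerl or an input file. The following is a minimal sketch that applies the same two substitutions as the loop above to a plain string. The helper name `shorten_id` and the second test ID (`sp|P12345|EXAMPLE_ID`) are assumptions made for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): applies the same two
# substitutions as the main loop to a single ID string.
sub shorten_id {
    my ($id) = @_;
    $id =~ s/\|/\./g;                  # replace each pipe with a dot
    $id =~ s/(gi)\.(\w*)\..*/$1$2/;    # keep "gi" plus the number that follows it
    return $id;
}

# The first ID comes from the example header in the script's documentation;
# the second is a made-up ID without a "gi" field.
for my $id ('gi|120419786|gb|EH270482.2|EH270482', 'sp|P12345|EXAMPLE_ID') {
    printf "%s -> %s\n", $id, shorten_id($id);
}
# prints:
# gi|120419786|gb|EH270482.2|EH270482 -> gi120419786
# sp|P12345|EXAMPLE_ID -> sp.P12345.EXAMPLE_ID
```

As the second line of output suggests, an ID that does not contain a `gi` field is not shortened: only its pipes are replaced with dots. Input files that mix header styles may therefore need a check before the renamed IDs are relied on.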
