FASTA sequence format

From BioPerl

Description

FASTA is one of the oldest and simplest sequence formats.

This file format can be parsed by the Bio::SeqIO system using the Bio::SeqIO::fasta module.

Examples

A sequence database with two protein sequences in FASTA format. The description line after the ">" is completely free-form, although applications often assume that the first string after the ">" symbol is a sequence identifier of some sort. Traditionally the sequence lines are wrapped at a fixed width, often 60 characters; the examples on this page are wrapped at 70.

>CATH_RAT
MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
VLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV
>CATL_HUMAN
MNPTLILAAFCLGIASATLTFDHSLEAQWTKWKAMHNRLYGMNEEGWRRAVWEKNMKMIELHNQEYREGK
HSFTMAMNAFGDMTSEEFRQVMNGFQNRKPRKGKVFQEPLFYEAPRSVDWREKGYVTPVKNQGQCGSCWA
FSATGALEGQMFRKTGRLISLSEQNLVDCSGPQGNEGCNGGLMDYAFQYVQDNGGLDSEESYPYEATEES
CKYNPKYSVANDTGFVDIPKQEKALMKAVATVGPISVAIDAGHESFLFYKEGIYFEPDCSSEDMDHGVLV
VGYGFESTESDNNKYWLVKNSWGEEWGMGGYVKMAKDRRNHCGIASAASYPTV
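
The layout above is simple enough to read with a few lines of code. The following is a minimal sketch in plain Python, not the Bio::SeqIO::fasta implementation; the function names parse_fasta and format_fasta are hypothetical and chosen only for illustration. It applies the common convention (not a rule of the format) that the first whitespace-delimited token after ">" is the identifier, and it wraps output at 60 characters by default.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (identifier, description, sequence)


def _make_record(header: str, chunks: List[str]) -> Record:
    # Convention only: the first whitespace-delimited token is treated as
    # the identifier and the rest of the line as a free-form description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one (identifier, description, sequence) tuple per record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        yield _make_record(header, chunks)


def format_fasta(identifier: str, description: str, sequence: str,
                 width: int = 60) -> str:
    """Return one record as FASTA text, wrapping the sequence at `width`."""
    header = ">" + identifier + (" " + description if description else "")
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + body) + "\n"
```

Calling list(parse_fasta(open_file)) on the two-record database above would yield one tuple for CATH_RAT and one for CATL_HUMAN, each with an empty description, since neither header contains whitespace.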

An NCBI-formatted sequence header, which includes the GenBank identifier (GI) number 142864, the accession number M10040.1, and the locus name BACDNAE. The gb prefix before the accession number indicates that the sequence was first submitted to the GenBank database. Other abbreviations include emb for the EMBL database and pdb for the PDB database.

>gi|142864|gb|M10040.1|BACDNAE B.subtilis dnaE gene encoding DNA primase, complete cds
GTACGACGGAGTGTTATAAGATGGGAAATCGGATACCAGATGAAATTGTGGATCAGGTGCAAAAGTCGGC
AGATATCGTTGAAGTCATAGGTGATTATGTTCAATTAAAGAAGCAAGGCCGAAACTACTTTGGACTCTGT
CCTTTTCATGGAGAAAGCACACCTTCGTTTTCCGTATCGCCCGACAAACAGATTTTTCATTGCTTTGGCT
GCGGAGCGGGCGGCAATGTTTTCTCTTTTTTAAGGCAGATGGAAGGCTATTCTTTTGCCGAGTCGGTTTC
TCACCTTGCTGACAAATACCAAATTGATTTTCCAGATGATATAACAGTCCATTCCGGAGCCCGGCCAGAG
TCTTCTGGAGAACAAAAAATGGCTGAGGCACATGAGCTCCTGAAGAAATTTTACCATCATTTGTTAATAA
ATACAAAAGAAGGTCAAGAGGCACTGGATTATCTGCTTTCTAGGGGCTTTACGAAAGAGCTGATTAATGA
ATTTCAGATTGGCTATGCTCTTGATTCTTGGGACTTTATCACGAAATTCCTTGTAAAGAGGGGATTTAGT
GAGGCGCAAATGGAAAAAGCGGGTCTCCTGATCAGACGCGAAGACGGAAGCGGATATTTCGACCGCTTCA
GAAACCGTGTCATGTTTCCGATCCATGATCATCACGGGGCTGTTGTTGCTTTCTCAGGCAGGGCTCTTGG
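
The pipe-delimited identifier in this header can be pulled apart by position. The sketch below is an illustration only, under the assumption that the identifier follows the gi|number|database|accession|locus layout shown above; parse_ncbi_id and the dictionary keys are hypothetical names, and the database table lists only the three prefixes mentioned on this page. The meaning of the field after the accession differs between databases, so the "locus" key should be read as a guess that fits GenBank-style headers.

```python
from typing import Dict

# Only the prefixes named in the text above; NCBI uses others as well.
DATABASE_NAMES = {
    "gb": "GenBank",
    "emb": "EMBL",
    "pdb": "PDB",
}


def parse_ncbi_id(identifier: str) -> Dict[str, str]:
    """Split an identifier such as gi|142864|gb|M10040.1|BACDNAE.

    Returns an empty dict when the identifier is not pipe-delimited,
    because nothing in the format guarantees this layout.
    """
    fields = identifier.split("|")
    result: Dict[str, str] = {}

    # Optional leading "gi|<number>" pair.
    if len(fields) >= 2 and fields[0] == "gi":
        result["gi"] = fields[1]
        fields = fields[2:]

    # Assumed layout for the remainder: <database>|<accession>|<locus>.
    if len(fields) >= 2:
        result["database"] = fields[0]
        result["database_name"] = DATABASE_NAMES.get(fields[0], "unknown")
        result["accession"] = fields[1]
        if len(fields) >= 3 and fields[2]:
            result["locus"] = fields[2]
    return result
```

For the header above, the identifier is the text between ">" and the first space, so the function would be given gi|142864|gb|M10040.1|BACDNAE. A plain identifier such as CATL_HUMAN produces an empty dictionary, which a caller can treat as "layout not recognised".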

Note

It is important to realise that there is no formal definition for the header line, so >CATL_HUMAN and >gi|7733636|ref|NP_887744 Gadget protein are both valid. NCBI has one format, Swissprot another, and so on. BioPerl therefore has no guaranteed way of knowing where names, accessions, and other identifiers sit in the header line. Some code does try to guess accession numbers from these headers when parsing BLAST reports (see each_accession_number in Bio::Search::Hit::GenericHit).

File Extensions

There are no standard file extensions for FASTA-formatted files, but .fa and .fsa are common. NCBI distributes its genomic data in FASTA format using four different extensions: .fna for whole genomic DNA sequences, .faa for protein coding sequences (CDS), .ffn for the untranslated nucleotide sequences for each CDS, and .frn for nucleotide sequences of RNA related features.
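
A script that handles mixed downloads might use these extensions as a hint about file content. The sketch below simply encodes the list above as a lookup table; describe_fasta_extension is a hypothetical helper, and because the extensions are conventions rather than a standard, the result should be treated as a hint and not as proof of what a file contains.

```python
import os
from typing import Optional

# NCBI conventions as described in the text above.
NCBI_EXTENSIONS = {
    ".fna": "whole genomic DNA sequences",
    ".faa": "protein coding sequences (CDS)",
    ".ffn": "untranslated nucleotide sequences for each CDS",
    ".frn": "nucleotide sequences of RNA related features",
}

# Common generic extensions that say nothing about the sequence type.
GENERIC_EXTENSIONS = {".fa", ".fsa"}


def describe_fasta_extension(filename: str) -> Optional[str]:
    """Return a description for a known FASTA extension, else None."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in NCBI_EXTENSIONS:
        return NCBI_EXTENSIONS[ext]
    if ext in GENERIC_EXTENSIONS:
        return "FASTA, sequence type not implied by the extension"
    return None
```

Returning None for unknown extensions leaves the decision to the caller, which can then fall back to inspecting the file for a leading ">" line.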
