EMBL sequence format

Description

The EMBL flat-file format is a rich text format for storing sequences together with their metadata, feature coordinates, and annotations. It shares many details with the GenBank sequence format, including the feature keys, locations, and qualifiers used in the feature table.

This file format can be parsed by the Bio::SeqIO system using the Bio::SeqIO::embl module.

Example

ID   SC10H5 standard; DNA; PRO; 4870 BP.
XX
AC   AL031232;
XX
DE   Streptomyces coelicolor cosmid 10H5.
XX
KW   integral membrane protein.
XX
OS   Streptomyces coelicolor
OC   Eubacteria; Firmicutes; Actinomycetes; Streptomycetes;
OC   Streptomycetaceae; Streptomyces.
XX
RN   [1]
RP   1-4870
RA   Oliver K., Harris D.;
RT   ;
RL   Unpublished.
XX
RN   [2]
RP   1-4870
RA   Parkhill J., Barrell B.G., Rajandream M.A.;
RT   ;
RL   Submitted (10-AUG-1998) to the EMBL/GenBank/DDBJ databases.
RL   Streptomyces coelicolor sequencing project,
RL   Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA
RL   E-mail: barrell@sanger.ac.uk
RL   Cosmids supplied by Prof. David A. Hopwood, [3]
RL   John Innes Centre, Norwich Research Park, Colney,
RL   Norwich, Norfolk NR4 7UH, UK.
XX
RN   [3]
RP   1-4870
RA   Redenbach M., Kieser H.M., Denapaite D., Eichner A.,
RA   Cullum J., Kinashi H., Hopwood D.A.;
RT   "A set of ordered cosmids and a detailed genetic and physical
RT   map for the 8 Mb Streptomyces coelicolor A3(2) chromosome.";
RL   Mol. Microbiol. 21(1):77-96(1996).
XX
CC   Notes:
CC
CC   Streptomyces coelicolor sequencing at The Sanger Centre is funded 
CC   by the BBSRC.
CC
CC   Details of S. coelicolor sequencing at the Sanger Centre 
CC   are available on the World Wide Web. 
CC   (URL; http://www.sanger.ac.uk/Projects/S_coelicolor/)
CC
CC   CDS are numbered using the following system eg SC7B7.01c. 
CC   SC (S. coelicolor), 7B7 (cosmid name), .01 (first CDS), 
CC   c (complementary strand).
CC
CC   The more significant matches with motifs in the PROSITE
CC   database are also included but some of these may be fortuitous.
CC
CC   The length in codons is given for each CDS.
CC
CC   Usually the highest scoring match found by fasta -o is given for
CC   CDS which show significant similarity to other CDS in the database.
CC   The position of possible ribosome binding site sequences are
CC   given where these have been used to deduce the initiation codon.
CC   
CC   Gene prediction is based on positional base preference in codons 
CC   using a specially developed Hidden Markov Model (Krogh et al., 
CC   Nucleic Acids Research, 22(22):4768-4778(1994)) and the FramePlot 
CC   program of Bibb et al., Gene 30:157-66(1984) as implemented at 
CC   http://www.nih.go.jp/~jun/cgi-bin/frameplot.pl. CAUTION:  We may  
CC   not have predicted the correct initiation codon.  Where possible 
CC   we choose an initiation codon (atg, gtg, ttg or (att)) which is 
CC   preceded by an upstream ribosome binding site sequence (optimally 
CC   5-13bp before the initiation codon).  If this cannot be identified
CC   we choose the most upstream initiation codon.
CC     
CC   IMPORTANT: This sequence MAY NOT be the entire insert of
CC   the sequenced clone.  It may be shorter because we only
CC   sequence overlapping sections once, or longer, because we
CC   arrange for a small overlap between neighbouring submissions.
CC
CC   Cosmid 10H5 lies to the right of 3A7 on the AseI-B genomic restriction 
CC   fragment.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..4870
FT                   /organism="Streptomyces coelicolor"
FT                   /strain="A3(2)"
FT                   /clone="cosmid 10H5"
FT   CDS             complement(<1..327)
FT                   /note="SC10H5.01c, unknown, partial CDS, len >109 aa;
FT                   possible integral membrane protein"
FT                   /gene="SC10H5.01c"
FT                   /product="hypothetical protein SC10H5.01c"
FT   CDS             complement(350..805)
FT                   /note="SC10H5.02c, probable integral membrane protein, len:
FT                   151 aa; similar to S. coelicolor hypothetical protein
FT                   TR:O54194 (EMBL:AL021411) SC7H1.35 (155 aa), fasta scores;
FT                   opt: 431 z-score: 749.8 E(): 0, 53.5% identity in 114 aa
FT                   overlap."
FT                   /product="putative integral membrane protein"
FT                   /gene="SC10H5.02c"
FT   RBS             complement(812..815)
FT                   /note="possible RBS upstream of SC10H5.02c"
FT   CDS             complement(837..1301)
FT                   /note="SC10H5.03c, probable integral membrane protein, len:
FT                   154 aa"
FT                   /product="putative integral membrane protein"
FT                   /gene="SC10H5.03c"
FT   RBS             complement(1308..1312)
FT                   /note="possible RBS upstream of SC10H5.03c"
FT   CDS             complement(1427..1735)
FT                   /note="SC10H5.04c, unknown, len: 103 aa; possible membrane"
FT                   /gene="SC10H5.04c"
FT                   /product="hypothetical protein SC10H5.04c"
FT   RBS             complement(1738..1741)
FT                   /note="possible RBS upstream of SC10H5.05c"
FT   misc_feature    1800^1801
FT                   /note="Zero-length feature added to test Bioperl parsing"
FT   CDS             1933..2022
FT                   /note="SC10H5.05, questionable ORF, len: 29 aa"
FT                   /gene="SC10H5.05"
FT                   /product="hypothetical protein SC10H5.05"
FT   CDS             2019..2642
FT                   /note="SC10H5.06, probable membrane protein, len: 207 aa;
FT                   similar to S. coelicolor TR:O54192 SC7H1.33c (191 aa),
FT                   fasta scores; opt: 312 z-score: 355.2 E(): 1.6e-12, 36.8%
FT                   identity in 182 aa overlap"
FT                   /product="putative membrane protein"
FT                   /gene="SC10H5.06"
FT   RBS             2627..2631
FT                   /note="possible RBS upstream of SC10H5.07"
FT   CDS             2639..4048
FT                   /note="SC10H5.07, unknown, len: 469 aa"
FT                   /gene="SC10H5.07"
FT                   /product="hypothetical protein SC10H5.07"
FT   CDS             complement(4100..4297)
FT                   /note="SC10H5.08c, unknown, len: 65 aa"
FT                   /gene="SC10H5.08c"
FT                   /product="hypothetical protein SC10H5.08c"
FT   RBS             complement(4314..4319)
FT                   /note="possible RBS upstream of SC10H5.08c"
FT   CDS             complement(4439..>4870)
FT                   /note="SC10H5.09c, probable integral membrane protein,
FT                   partial CDS len: >143 aa; some similarity in C-terminus to
FT                   S. coelicolor hypothetical protein TR:O54106
FT                   (EMBL:AL021529) SC10A5.15 (114 aa), fasta scores; opt: 145
FT                   z-score: 233.8 E(): 9.2e-06, 33.3% identity in 81 aa
FT                   overlap. Overlaps and extends SC3A7.01c"
FT                   /product="putative integral membrane protein"
FT                   /gene="SC10H5.09c"
FT   misc_feature    4769..4870
FT                   /note="overlap with cosmid 3A7 from 1 to 102"
XX
SQ   Sequence 4870 BP; 769 A; 1717 C; 1693 G; 691 T; 0 other;
     gatcagtaga cccagcgaca gcagggcggg gcccagcagg ccggccgtgg cgtagagcgc        60
     gaggacggcg accggcgtgg ccaccgacag gatggctgcg gcgacgcgga cgacaccgga       120
     gtgtgccagg gcccaccaca cgccgatggc cgcgagcgcg agtcccgcgc tgccgaacag       180
     ggcccacagc acactgcgca gaccggcggc cacgagtggc gccaggacgg tgcccagcag       240
     gagcagcagg gtgacgtggg cgcgcgctgc actgtggccg ccccgtccgc ccgacgcgcg       300
     cggctcgtca tctcgcggtc ccaccaccgg tcggccccat tactcgtcct caaccctgtg       360
     gcgactgacg ttccccggac aggtcgtacc gattgccgcc acgccccacc acgcacaggg       420
     cccagacgac gaagcctgac atggtgatca tgacgacgga ccacaccggg tagtacggca       480
     gcgagaggaa gttggcgatg atcaccagcc cggcgatggc gaccccggtg acacgtgccc       540
     acatcgccgt tttgagcagc ccggcgctga cgaccatggc gagcgcgccg agcgcgagat       600
     ggatccaccc ccacccggtg agatcgaact ggaaaacgta gttgggcgtg gtgacgaaga       660
     cgtcgtcctc ggcgatggcc atgatgcccc ggaagaggct gagcagcccg gcgaggaaga       720
     gcatcaccgc cgcgaaggcg gtaaggcccg tcgcccattc ctgcctcgcg gtgtgtgccg       780
     ggtggtgggt atgtgacgtg gtcatctcgg acctcgtttc gtggaatgcg gatgcttcag       840
     cgagcggagg cgccggtgcc cgccgcgccc gtgtgccctg ccgggccgtg accggacagg       900
     accaattcct tcgccttgcg gaactcctcg tccgtgatgg caccccggtc tcggatctcg       960
     gagagccggg ccagctcgtc gacgctgctg gacccgccgc ccacggtctt cctgatgtag      1020
     gcgtcgaact cctcctgctg agcccgtgcc cgcgttgtct cccggctgcc catgttcttg      1080
     ccgcgagcga tcacgtagac gaaaacgccc aggaagggca ggaggatgca gaacaccaac      1140
     cagccggcct tcgcccagcc actcagtccg tcgtcccgga agatgtcggt gacgacgcgg      1200
     aagagcagga cgaaccacat gatccacagg aagatcatca gcatcgtcca gaaggcaccc      1260
     agcagtgggt agtcgtacgc caggtaggtc tgtgcactca tgtccgtcct ccgtcctccg      1320
     gggcgcggcc cggcggccct cgttccgtac tgacatcagg gtggtcacgg gtcccaccgg      1380
     tcggcatcac ccggcacggg tgagtggggc gccgaggccg tcgtggtcag gcccgggaca      1440
     ccggtgtgac cctggtggaa ggacgcgtcc cgtggggcac gcaccgccgg ccgagggcga      1500
     ccaccgcctc ggtcagtccg agcaggccca gccacaggcc gagaagtcgg gtcagggcac      1560
     gggccgactc ggcgggcagc gcgaggacga cgattccggc gacgtcgacg gccagcgggt      1620
     tgcgcaggcc cagcactccg gccggggcgc ccggcaccag cgtggcgagg gccgatgcca      1680
     tgagccaggt ccaggaaccc ccaagcctgg cgaggacgtg cgccggatcg ctcaatgctc      1740
     cggtgaccgc cccgcccgac ccgtctccct tgtcggcagg ttccgccgca tcacgcggaa      1800
     cggagatggc tcccctgtgg atcgggcggc cgctgcgggg ccgcccggtt ggtcggtcgg      1860
     tgagcgccgg actccccctt cagctcttcc agggtcgggg tcgacaccga ggtcctggat      1920
     cacccgtcag gggtgatccg ggcatgccgt cgtggcggtg aggtgggata cgggaacgat      1980
     cggcccacgg gggaccggac gagacgaaga gacgtgagat gagcgatacg aactcgggcg      2040
     gcgggcgcca ggccgcttcc ggaccggccc cacgtggccg actccctttc cgccggcgcg      2100
     tggccctggt cgctgtcgca cgtcccctga tcgtcacggt cggtctcgtc accgcctact      2160
     acctgcttcc cctggacgag agactcagcg ccggcaccct ggtgtcgctg gtgtgcggac      2220
     tgctcgcagt ccttctggtg ttctgctggg aggtgcgggc catcacgcgc tccccgcatc      2280
     cgcgtctgag agcgatcgag ggcctggccg ccacgctggt gctgttcctg gtcctcttcg      2340
     ccggctccta ctacctgctg ggtcgctccg cgcccggctc cttcagcgag ccgctgaaca      2400
     ggacggacgc gctgtacttc actctgacca cgttcgccac cgtcggcttc ggggacatca      2460
     ccgcacgctc cgagaccggg cggatcctca cgatggcgca gatgacggga gggctactgc      2520
     tcgtcggagt cgccgcccgg gtgctggcga gcgcagtgca ggcggggctg caccgacagg      2580
     gccggggacc ggcggcatcg ccacgctccg gtgctgcgga ggagccggag gccggaccat      2640
     gaccgtaccc ggtggcttca ccgcctccct gccgccggcc gagcgagccg cgtacggcag      2700
     gaaggcccgt aaaagggcct cacgttcgtg ccacggctgg tacgagccgg ggcagcggcg      2760
     gcctgacccc gtcgacctgc tggagcgcca gtccggcgag cgtgtcccgg cactcgtgcc      2820
     catccgctac ggtcgcatgc tggagtcgcc gttccgcttc taccgcggtg cggcagcgat      2880
     catggcggcg gacctggcac ccctgcccag cagcggactc caggtgcaat tgtgcgggga      2940
     cgcgcacccg ttgaacttcc ggctcctggc ctcaccggag cgccggctgg tcttcgacat      3000
     caacgacttc gacgagacgc tgcccggccc cttcgagtgg gacgtcaaac ggctggcggc      3060
     cggattcgtg atcgcggccc ggtcgaacgg cttctcgtcc aaggaacaga accgcaccgt      3120
     tcgggcctgt gtgcgggcct accgggagcg catgagggag ttcgccgtca tgccgaccct      3180
     ggacatctgg tacgcccagg acgacgccga ccacgtacgg caactgctgg ctacggaggc      3240
     cagaggagaa gctgagcagc ggctcaggga cgcggctgcg aaggcccgca cacgcaccca      3300
     catgagggcg ttcgcgaagc tcacccgcgt cacggccgag ggccggcgca tcacccccga      3360
     cccgccgctg atcaccccac tcggcgatct gctcaccgac ccggccgaag ccggccggga      3420
     ggaggaactg cggtccgtcg tgaacggcta cgcacggtcc ctgccgcccg agcgccggca      3480
     cctgctgcgt cactaccggc ttgtggacat ggcgcgcaag gtggtcggcg tcggcagtgt      3540
     cggcacccgc tgctgggtac tgcttctgct cggcagggac gacgacgatc ctctgctgct      3600
     ccaggccaag gaagcctcgg aatcggtgct ggcggcccac acgggcggcg aacgctacga      3660
     ccatcagggc cgcagggtcg tggccggcca gcgtctgatc cagaccaccg gtgacatctt      3720
     tctcggctgg gcgcgcgtca ccggcttcga cggaaaggcc cgggacttct acgtgcgtca      3780
     actgtgggac tggaagggcg tcgcgcggcc ggaaaccatg gggcccgacc tgctctccct      3840
     cttcgcccgg ctgtgcggtg cctgcctggc gagggcccac gcccgttccg gtgaccccgt      3900
     cgcgctcgcc gcgtacctgg gcggcagcga ccgcttcgac ggcgcgctca ccgagttcgc      3960
     ccagtcctac gccgatcaga atgaacgcga ccacgaagct ctgctggcgg cctgccgctc      4020
     cggcagggtc acggccgccc gtttgtgagg ccgacccggg aacggccggc gggctggcac      4080
     acaccgccgc cggtcggcgt cattccggaa gctgccgcat ctccaggacg cgcaggccca      4140
     gcgactggca gcgggtgagc aacccgtaca gatgggcctc gtcgatcacc gtgccgaaca      4200
     gcacggtctg gccggacatg acgacgtgct ccagctccgg gaacgcgttg gccagcgtcc      4260
     gtgacaggtg tccctcgacg cggatctcgt agcgcacgag cggtcctttc accgtaggag      4320
     ctcgggacac cgcccggggc tccgggtcgg acggtgctct tggtgacgag cctgcgcctc      4380
     gtcgccctcc ggtgccctca cccagcacag gtgactccaa ccgcagtgtc agtgcctttc      4440
     agtgcgtcac tgtgatcttg acgacgacga tcaccaggcc gagcagtacg ttgaccgtcg      4500
     cggtgacggc caccagtcgt cgcgaggcgc ccgcgcggtg cgccgcggcg acggaccagc      4560
     ccacctgacc ggcgacggcg acggacagcg ccagccacag ggtgcccggg acgtccagcc      4620
     ccagtacggg gctgacggcg atggccgcgg ccggaggcac ggcggccttg acgatcggcc      4680
     actcctcgcg gcacacacgc agaatcaccc gccggtccgg agtgtgccgc gcgagacgcg      4740
     ctccgaacag ttcggcgtgg acgtgagcga tccagaacac caagctggtg agcaacagca      4800
     gaagaaccag ttcggcgcgg gggaacgagc ccagggtgcc ggcgccgatc acgacggagg      4860
     ctgcgagcat                                                             4870
//
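
The record above is line-oriented: each line starts with a two-letter code (ID, AC, DE, FT, SQ and so on), XX lines are spacers, sequence lines start with blanks, and // ends the record. The following is a minimal Python sketch of that layout, using only the standard library. It is not how Bio::SeqIO::embl works internally, and all function names here are hypothetical. It assumes the older-style ID line used in this record (newer EMBL releases lay out the ID line differently) and single-line feature locations. For real work, Bio::SeqIO remains the supported route.

```python
import re
from collections import defaultdict

# Short excerpt of the record above, embedded so the sketch runs on its own.
# Only the first sequence line is included, so the sequence is truncated.
RECORD = """\
ID   SC10H5 standard; DNA; PRO; 4870 BP.
XX
AC   AL031232;
XX
DE   Streptomyces coelicolor cosmid 10H5.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..4870
FT                   /organism="Streptomyces coelicolor"
FT   CDS             complement(350..805)
FT                   /gene="SC10H5.02c"
FT   CDS             1933..2022
FT                   /gene="SC10H5.05"
XX
SQ   Sequence 4870 BP; 769 A; 1717 C; 1693 G; 691 T; 0 other;
     gatcagtaga cccagcgaca gcagggcggg gcccagcagg ccggccgtgg cgtagagcgc        60
//
"""


def group_by_line_code(text):
    """Collect line bodies under their two-letter code (sequence lines under 'SEQ')."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # record terminator
            break
        code, body = line[:2], line[5:].rstrip()
        if code == "XX":  # spacer line, carries no data
            continue
        if code == "  ":  # sequence data lines start with blanks
            code = "SEQ"
        fields[code].append(body)
    return fields


def parse_id_line(body):
    """Split the older-style ID line: name and class; molecule; division; length."""
    parts = [p.strip() for p in body.rstrip(".").split(";")]
    name, data_class = parts[0].split()
    return {
        "name": name,
        "class": data_class,
        "molecule": parts[1],
        "division": parts[2],
        "length": int(parts[3].split()[0]),
    }


def feature_table(ft_bodies):
    """Return (key, location) pairs; qualifier lines, which start with blanks, are skipped."""
    features = []
    for body in ft_bodies:
        if body and not body[0].isspace():
            key, location = body.split(None, 1)
            features.append((key, location))
    return features


def sequence(seq_bodies):
    """Join sequence lines, dropping spaces and the trailing position counter."""
    return "".join(re.sub(r"[\s\d]", "", body) for body in seq_bodies)


if __name__ == "__main__":
    fields = group_by_line_code(RECORD)
    entry = parse_id_line(fields["ID"][0])
    print(entry["name"], entry["molecule"], entry["length"])
    for key, location in feature_table(fields["FT"]):
        print(key, location)
    print(len(sequence(fields["SEQ"])), "bases in the excerpt")
```

Grouping by line code first keeps each field parser small and independent. A fuller parser would also need to handle qualifiers and locations that wrap across several FT lines, as several /note qualifiers in the record above do.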