GFF code audit

From BioPerl

Note: This is a tracker page and a stub for now.

Introduction

We are planning a code audit of the various ways GFF output is generated by BioPerl classes. Feel free to modify this page as needed, but please keep discussion on the Discussion Page or the mailing list.

Current classes which generate GFF format

Maybe a table here, with class/GFF format supported/output

Classes

Below is a list of classes that either read or write GFF.

Scripts

  • scripts/Bio-DB-GFF/bulk_load_gff.PLS
  • scripts/Bio-DB-GFF/fast_load_gff.PLS
  • scripts/Bio-DB-GFF/genbank2gff.PLS
  • scripts/Bio-DB-GFF/genbank2gff3.PLS
  • scripts/Bio-DB-GFF/generate_histogram.PLS
  • scripts/Bio-DB-GFF/load_gff.PLS
  • scripts/Bio-DB-GFF/meta_gff.PLS
  • scripts/Bio-DB-GFF/process_gadfly.PLS
  • scripts/Bio-DB-GFF/process_sgd.PLS
  • scripts/Bio-DB-GFF/process_wormbase.PLS
  • scripts/Bio-SeqFeature-Store/bp_seqfeature_gff3.PLS
  • scripts/Bio-SeqFeature-Store/bp_seqfeature_load.PLS
  • scripts/graphics/feature_draw.PLS
  • scripts/graphics/frend.PLS
  • scripts/seq/unflatten_seq.PLS
  • scripts/utilities/search2BSML.PLS
  • scripts/utilities/search2gff.PLS

Examples

  • examples/Bio-DB-GFF/load_ucsc.pl
  • examples/biographics/feature_data.gff
  • examples/searchio/waba2gff.pl
  • examples/searchio/waba2gff3.pl
  • examples/tools/gb_to_gff.pl
  • examples/tools/gff2ps.pl

Problems with current output

  • Features generally have to be built inside the parser objects to come out GTF/GFF2 or GFF3 compatible.
    • GFF3 uses a two- or three-level hierarchy (gene -> mRNA -> CDS).
    • The last column of key/value pairs differs between GTF and GFF3. For GFF3, ID/Parent/Name have to be set consistently on features; GTF needs different keys (for example transcript_id and gene_id).
  • CDS-typed features require a phase component, which currently isn't mapped into SeqFeatures via SeqIO or bp_genbank2gff3 (see Issue #2322).
  • Consistency/flexibility when generating GFF3 from Bio::SearchIO-generated data.
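
The missing phase value can be derived from the CDS segment coordinates alone. The following is a minimal sketch in Python, not BioPerl code: the function name cds_phases is hypothetical, and it assumes the first CDS segment in transcription order starts on a complete codon.

```python
def cds_phases(segments, strand="+"):
    """Return a phase (0, 1, or 2) for each CDS segment of one transcript.

    segments: list of (start, end) pairs, 1-based inclusive, as in GFF.
    Assumption: the first segment in transcription order begins a codon.
    """
    # Walk segments in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(segments, reverse=(strand == "-"))
    phase_of = {}
    phase = 0
    for start, end in ordered:
        phase_of[(start, end)] = phase
        length = end - start + 1
        # Bases left over after the leading `phase` bases and the whole codons
        # must be completed at the start of the next segment.
        phase = (3 - (length - phase) % 3) % 3
    # Report phases in the order the caller supplied the segments.
    return [phase_of[seg] for seg in segments]


# CDS coordinates taken from the SNAP example on this page.
print(cds_phases([(505, 673), (730, 1446), (1472, 3447)]))  # [0, 2, 2]
```

The result agrees with the phase column of the CDS rows in the example data on this page.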

Examples

Here is some gene feature data in GTF and GFF3. GTF does not require exon and CDS lines to be interleaved; that is simply how the results look when sorted by start position * strand. GFF3 does not require the gene feature to precede the mRNA, but GBrowse (Bio::DB::SeqFeature and Bio::DB::GFF at least) takes this shortcut when parsing, so it is best to keep them in this order.

GFF3

Chrom1  SNAP    gene    505     3447    .       +       .       ID=gene000002;Name=Chrom1.0-snap.1
Chrom1  SNAP    mRNA    505     3447    .       +       .       ID=mRNA000002;Parent=gene000002;Name=Chrom1.0-snapCCIN.1.1
Chrom1  SNAP    exon    505     673     21.624  +       .       ID=exon000013;Parent=mRNA000002
Chrom1  SNAP    exon    730     1446    46.298  +       .       ID=exon000014;Parent=mRNA000002
Chrom1  SNAP    exon    1472    3447    147.456 +       .       ID=exon000015;Parent=mRNA000002
Chrom1  SNAP    CDS     505     673     21.624  +       0       ID=cds000013;Parent=mRNA000002
Chrom1  SNAP    CDS     730     1446    46.298  +       2       ID=cds000014;Parent=mRNA000002
Chrom1  SNAP    CDS     1472    3447    147.456 +       2       ID=cds000015;Parent=mRNA000002
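
Because some parsers assume a parent appears before its children, it can help to check the ordering before loading a file. The sketch below is Python rather than BioPerl; the function names are hypothetical, it assumes tab-separated columns, and it skips URL-unescaping of attribute values for brevity.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-unescaping)."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value
    return attrs


def parents_not_yet_seen(lines):
    """Return (line_number, parent_id) for each Parent used before its ID."""
    seen, problems = set(), []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        attrs = parse_gff3_attributes(line.split("\t")[8])
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in seen:
                problems.append((number, parent))
        if "ID" in attrs:
            seen.add(attrs["ID"])
    return problems


# Two rows from the example above, rebuilt with tab separators.
mrna = "\t".join(["Chrom1", "SNAP", "mRNA", "505", "3447", ".", "+", ".",
                  "ID=mRNA000002;Name=Chrom1.0-snapCCIN.1.1"])
exon = "\t".join(["Chrom1", "SNAP", "exon", "505", "673", "21.624", "+", ".",
                  "ID=exon000013;Parent=mRNA000002"])

print(parents_not_yet_seen([mrna, exon]))  # parent first: no problems
print(parents_not_yet_seen([exon, mrna]))  # child first: one problem reported
```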

GTF

Chrom1  SNAP    start_codon     505     507     .       +       .       transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1  SNAP    CDS     505     673     21.624  +       0       exontype "initial"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1  SNAP    exon    505     673     21.624  +       .       exontype "initial"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1  SNAP    CDS     730     1446    46.298  +       2       exontype "internal"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1  SNAP    exon    730     1446    46.298  +       .       exontype "internal"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1  SNAP    CDS     1472    3447    147.456 +       2       exontype "terminal"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1  SNAP    exon    1472    3447    147.456 +       .       exontype "terminal"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1  SNAP    stop_codon      3445    3447    .       +       .       transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
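
The difference in the last column between the two formats can be made concrete with a small sketch. This is Python, not BioPerl, and both function names are hypothetical; it does not escape reserved characters, which a real writer would need to do.

```python
def format_gff3_attributes(feature_id, parent_id=None, name=None):
    """Build a GFF3 column 9 string from ID, optional Parent and Name."""
    parts = [f"ID={feature_id}"]
    if parent_id:
        parts.append(f"Parent={parent_id}")
    if name:
        parts.append(f"Name={name}")
    return ";".join(parts)


def format_gtf_attributes(transcript_id, gene_id, **extra):
    """Build a GTF column 9 string: extra keys first, then the two required IDs."""
    pairs = list(extra.items())
    pairs += [("transcript_id", transcript_id), ("gene_id", gene_id)]
    return " ".join(f'{key} "{value}";' for key, value in pairs)


# The same first exon, written for each format.
print(format_gff3_attributes("exon000013", parent_id="mRNA000002"))
print(format_gtf_attributes("Chrom1.0-snapCCIN.1.1", "Chrom1.0-snap.1",
                            exontype="initial"))
```

The two printed strings reproduce the last column of the first exon row in the GFF3 and GTF examples respectively.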

Proposals

GMOD Discussion

  • Build hierarchical features (see Bio::SeqFeature::Slim in CVS on the lightweight_feature_branch branch)
    • These will have explicit PARENT and ID semantic fields
    • Map GTF and GFF to this hierarchy
    • SO (Sequence Ontology) compliance and validation can be done on top of this, but is not explicitly coded in, to keep the object lightweight.
    • Configurable filters that define which fields from GTF or GFF3 supply the Group/Parent and the ID
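
One way to picture the proposal is a small feature object with explicit ID and parent fields, plus a per-format filter that says where those fields come from. The sketch below is Python and purely illustrative: SlimFeature, gff3_ids, gtf_ids, and build_tree are hypothetical names and do not reflect the actual Bio::SeqFeature::Slim interface. It assumes GTF rows have no ID of their own, so one is synthesized, and that a transcript missing from the input is implied from its children.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SlimFeature:
    """Hypothetical lightweight feature with explicit ID and parent fields."""
    feature_id: str
    ftype: str
    start: int
    end: int
    parent_id: Optional[str] = None
    children: List["SlimFeature"] = field(default_factory=list)


def gff3_ids(ftype, attrs):
    """Filter for GFF3: ID and Parent come straight from the attributes."""
    return attrs.get("ID"), attrs.get("Parent")


def gtf_ids(ftype, attrs):
    """Filter for GTF: no per-row ID; the group is the transcript."""
    return None, attrs.get("transcript_id")


def build_tree(rows, id_filter):
    """rows: (ftype, start, end, attrs) tuples. Returns top-level features."""
    by_id, order = {}, []
    for n, (ftype, start, end, attrs) in enumerate(rows, start=1):
        fid, parent_id = id_filter(ftype, attrs)
        fid = fid or f"{ftype}{n:06d}"  # synthesize an ID when the format has none
        by_id[fid] = SlimFeature(fid, ftype, start, end, parent_id)
        order.append(fid)
    roots, implied = [], set()
    for fid in order:
        feat = by_id[fid]
        if feat.parent_id is None:
            roots.append(feat)
            continue
        parent = by_id.get(feat.parent_id)
        if parent is None:
            # Parent never appeared as a row (the GTF case): imply it.
            parent = SlimFeature(feat.parent_id, "transcript", feat.start, feat.end)
            by_id[parent.feature_id] = parent
            implied.add(parent.feature_id)
            roots.append(parent)
        if parent.feature_id in implied:
            # An implied parent spans all of its children.
            parent.start = min(parent.start, feat.start)
            parent.end = max(parent.end, feat.end)
        parent.children.append(feat)
    return roots


gff3_rows = [
    ("mRNA", 505, 3447, {"ID": "mRNA000002", "Name": "Chrom1.0-snapCCIN.1.1"}),
    ("exon", 505, 673, {"ID": "exon000013", "Parent": "mRNA000002"}),
    ("CDS", 505, 673, {"ID": "cds000013", "Parent": "mRNA000002"}),
]
gtf_rows = [
    ("CDS", 505, 673, {"transcript_id": "Chrom1.0-snapCCIN.1.1", "gene_id": "Chrom1.0-snap.1"}),
    ("exon", 730, 1446, {"transcript_id": "Chrom1.0-snapCCIN.1.1", "gene_id": "Chrom1.0-snap.1"}),
]

for rows, id_filter in ((gff3_rows, gff3_ids), (gtf_rows, gtf_ids)):
    for root in build_tree(rows, id_filter):
        print(root.feature_id, root.ftype, root.start, root.end,
              [child.ftype for child in root.children])
```

Both inputs end up as one transcript-level feature with its children attached, which is the kind of common hierarchy the proposal describes. Validation against SO is deliberately left out of the object, in line with the lightweight goal.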

Lightweight SF objects
