Create a bacterial GFF from a genbank file for BCFtools/csq

August 14 2018

Consequence calling

Consequence calling is the computational prediction of functional consequences from nucleotide sequence variants such as SNPs and indels. A typical effect is a single amino acid being altered inside a coding sequence. There are a number of programs for making predictions such as snpEff and VEP from ensembl. These two programs are somewhat difficult to get started with. bcftools/csq is relatively easy to use with a single command. The inputs requires are the vcf file obtained from variant calling, a reference sequence and annotation file in GFF3 format.

GFF format problems

However bcftools/csq has quite specific requirements for the gff file format. For bacterial genomes, gff files can have varying formats due to how they are generated both by ensembl and the NCBI. See this issue on the bcftools github page for more details.

In short the format required looks like this:

NC_002945.4	feature	gene	1	1524	.	+	.	ID=gene:BQ2027_MB0001;Name=dnaA;biotype=protein_coding;
NC_002945.4	feature	mRNA	1	1524	.	+	.	ID=transcript:BQ2027_MB0001;Parent=gene:BQ2027_MB0001;biotype=protein_coding
NC_002945.4	feature	CDS	1	1524	.	+	0	ID=CDS:BQ2027_MB0001;Parent=transcript:BQ2027_MB0001;biotype=protein_coding

In attempting to use the M.Bovis annotation, I found that the gff file provided on genbank would not work as it is nothing like the format above. The easiest solution was to make a new gff file from the genbank file using a Python script. The code is given below and may be of use to others using non-standard bacterial genomes. Though it is possible this issue will only affect a small number of older annotations. Also the code might need to be tweaked to suit your genbank input. For example the M.bovis file has very long ‘note’ qualifiers that are probably not in most annotations.

Running bcftools/csq

With the proper gff file, you can then run the calling using this command:

bcftools csq -f Mbovis_AF212297.fa -g Mbovis_csq_format.gff sample.vcf -Ot -o sample.csq.tsv


To run this you need the biopython and bcbio-gff packages.

def GFF_bcftools_format(in_file, out_file):
    """Convert a bacterial genbank file from NCBI to a GFF3 format that can be used in bcftools csq.
        in_file: genbank file
        out_file: name of GFF file

    from BCBio import GFF
    in_handle = open(in_file)
    out_handle = open(out_file, "w")
    from Bio.SeqFeature import SeqFeature
    from Bio.SeqFeature import FeatureLocation
    from copy import copy

    for record in SeqIO.parse(in_handle, "genbank"):
        #make a copy of the record as we will be changing it during the loop
        new = copy(record)
        new.features = []
        #loop over all features
        for i in range(0,len(record.features)):          
            feat = record.features[i]
            q = feat.qualifiers
            #remove some unecessary qualifiers
            for label in ['note','translation','product','experiment']:
                if label in q:
                    del q[label]
            if(feat.type == "CDS"):
                #use the CDS feature to create the new lines
                tag = q['locus_tag'][0]
                q['ID'] = 'CDS:%s' %tag
                q['Parent'] = 'transcript:%s' %tag
                q['biotype'] = 'protein_coding'

                #create mRNA feature
                m = SeqFeature(feat.location,type='mRNA',strand=feat.strand)
                q = m.qualifiers
                q['ID'] = 'transcript:%s' %tag
                q['Parent'] = 'gene:%s' %tag
                q['biotype'] = 'protein_coding'

            elif(record.features[i].type == "gene"):
                #edit the gene feature
                q['ID'] = 'gene:%s' %q['locus_tag'][0]
                q['biotype'] = 'protein_coding'
                if 'gene' in q:
                    q['Name'] = q['gene']
        #write the new features to a GFF                                      
        GFF.write([new], out_handle)