M. bovis spoligotyping from WGS reads

January 15 2021


Spoligotyping (spacer oligonucleotide typing) is a widely used genotyping method for Mtb (Mycobacterium tuberculosis complex species, which include M. bovis). It exploits the genetic diversity of the direct repeat (DR) locus in the Mtb genome. Each DR region consists of several copies of the 36 bp DR sequence, interspersed with non-repetitive spacers of 34 bp to 41 bp. A set of 43 unique spacer sequences is used to classify Mtb strains based on their presence or absence. This is a molecular method, traditionally carried out with a PCR-based or similar laboratory assay. Whole genome sequencing is a far more sensitive way of identifying strains phylogenetically and makes other typing methods redundant once you have the whole sequence. In some circumstances, however, it may be useful to relate the sequence back to the spoligotype. It is not hard to calculate the spoligotype by analysis of the raw sequence reads (or an assembly).


The method is simple and consists of blasting the spacer sequences against a database made from the raw reads. The reads can be assembled or concatenated together first, but the method will work regardless. The results are parsed, and low coverage hits and those with more than one mismatch are removed. The remaining hits are counted per spacer, and a threshold on that count determines whether a spacer is present or not. This leaves a 43 digit presence/absence string, which is a binary code with 1 denoting the presence and 0 denoting the absence of each spacer. This can be translated into octal or hexadecimal code and looked up in a database of known types that correspond to that code. This is basically the same method used by the SpoTyping tool. SpoTyping uses a threshold of 5 hits by default. The value may vary according to the input reads but should be at least 2 to guard against spurious hits.
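As an illustration of the translation step, here is a minimal sketch of converting the 43 digit binary string to the 15 digit octal code. It assumes the usual octal convention for spoligotypes: the first 42 spacers are read as 14 groups of three binary digits and the last spacer is appended unchanged. The function name binary_to_octal is made up for this example and is not part of snpgenie or SpoTyping.

```python
def binary_to_octal(binary_str):
    """Convert a 43 character presence/absence string to a 15 digit octal code."""
    if len(binary_str) != 43 or set(binary_str) - {'0', '1'}:
        raise ValueError('expected a string of 43 characters, each 0 or 1')
    # the first 42 spacers are read as 14 groups of three binary digits
    digits = [str(int(binary_str[i:i + 3], 2)) for i in range(0, 42, 3)]
    # the last spacer is appended unchanged as the 15th digit
    digits.append(binary_str[42])
    return ''.join(digits)

# a pattern with all 43 spacers present
print(binary_to_octal('1' * 43))  # prints 777777777777771
```

The octal form is shorter to write down, but lookups against a table of binary patterns should still use the binary string.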

Scheme for detecting spoligotype from short reads.


The DR spacers are put into a fasta file, each named with a number. The numbering follows the spacer order that defines the binary code used for the SB number. You can download this file here. The file looks like this; each sequence is 25 nucleotides long:


We then just make the blast database from the reads by converting the fastq file into fasta format and using makeblastdb. The following method does this and retrieves the hits. It puts the results into a pandas DataFrame and filters them. They are then aggregated to get hits per spacer, and this is converted to the binary code. The methods make_blast_database and blast_fasta are imported from the tools module in the snpgenie package. They can be copied from the repository if needed separately. This method uses the reads directly, without assembly or concatenation. It limits the search to the first 500000 reads for efficiency, but this could be changed. If you have paired end reads it will probably work using just one of the files.

import numpy as np
from snpgenie import tools

def get_spoligotype(filename, reads_limit=500000, threshold=2):
    """Get spoligotype from reads"""

    ref = 'dr_spacers.fa'
    #convert reads to fasta
    tools.fastq_to_fasta(filename, 'temp.fa', reads_limit)
    #make blast db from reads
    tools.make_blast_database('temp.fa')
    #blast spacers to db
    bl = tools.blast_fasta('temp.fa', ref, evalue=0.1,
                           maxseqs=100000, show_cmd=False)
    #filter hits: keep near full length matches with at most one mismatch
    bl=bl[(bl.qcovs>95) & (bl.mismatch<2)]
    #group resulting table to get hits per spacer sequence
    x = bl.groupby('qseqid').agg({'pident':np.size}).reset_index()
    x = x[x.pident>=threshold]
    found = list(x.qseqid)
    #convert hits to binary code
    #spacer ids are the numbers 1 to 43 used in the fasta file
    s=[]
    for i in range(1,44):
        if i in found:
            s.append('1')
        else:
            s.append('0')
    s =''.join(s)
    return s
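
If you want to run the conversion step without the snpgenie package, the fastq to fasta conversion can be sketched as below. This is not the snpgenie implementation. It is a simplified stand-in that assumes each read occupies exactly four lines in the fastq file, and the function name fastq_to_fasta_sketch is made up for this example.

```python
import gzip
from itertools import islice

def fastq_to_fasta_sketch(fastq_file, fasta_file, limit=500000):
    """Write the first `limit` reads of a four-line-per-record fastq file as fasta.
    Returns the number of reads written."""
    # plain or gzipped input is chosen from the file extension
    opener = gzip.open if fastq_file.endswith('.gz') else open
    n = 0
    with opener(fastq_file, 'rt') as fin, open(fasta_file, 'w') as fout:
        # each record occupies four lines, so read at most 4 * limit lines
        for i, line in enumerate(islice(fin, 4 * limit)):
            if i % 4 == 0:
                # header line: replace the leading '@' with '>'
                fout.write('>' + line[1:])
            elif i % 4 == 1:
                # sequence line; the '+' and quality lines are skipped
                fout.write(line)
                n += 1
    return n
```

Taking the first reads in the file is the cheapest option. It should be adequate as long as the reads are not sorted by genome position.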

We can then look up the binary code in the Mbovis.org database using a table downloaded from the website. There may be a more comprehensive source. The table looks like this:

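import pandas as pd

def get_sb_number(binary_str):
    """Get SB number from binary pattern using database reference"""

    #read the binary column as text so the 43 digit codes are not parsed as numbers
    df = pd.read_csv('Mbovis.org_db.csv', dtype={'binary': str})
    x = df[df['binary'] == str(binary_str)]
    #return None if the pattern is not in the table
    if len(x) == 0:
        return None
    return x.iloc[0].SB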

Use as follows:

b = get_spoligotype('test.fastq')
print (get_sb_number(b))

This method appears to work on test data with known types but hasn’t been rigorously benchmarked. It’s not likely to be 100% reliable and results should be checked for errors. Note that a single missed spacer changes the binary code and so produces a different type entirely.
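
One possible way to check a result is to list the known patterns that differ from it at only one spacer, since these are what a single missed or spurious hit would be confused with. The sketch below is an assumption about how such a check could be done and is not part of snpgenie. The function name near_matches and the toy patterns are made up for illustration and are not real SB types.

```python
def near_matches(binary_str, known, max_diff=1):
    """Return (name, differing spacer numbers) for known patterns close to binary_str."""
    results = []
    for name, pattern in known.items():
        if len(pattern) != len(binary_str):
            continue
        # spacer numbers (1-based) where the two patterns disagree
        diff = [i + 1 for i, (a, b) in enumerate(zip(binary_str, pattern)) if a != b]
        if len(diff) <= max_diff:
            results.append((name, diff))
    # closest patterns first
    return sorted(results, key=lambda r: len(r[1]))

# toy reference patterns, not real database entries
known = {'typeA': '1' * 43, 'typeB': '1' * 42 + '0'}
query = '1' * 20 + '0' + '1' * 22
print(near_matches(query, known))  # prints [('typeA', [21])]
```

If a near match turns up, the hit count for the differing spacer is the first thing to inspect, as it may sit just above or below the threshold.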