Bacterial species identification from WGS using 16S genes

June 04 2024

Background

The 16S rRNA gene is present in all bacteria and contains both highly conserved and hypervariable regions. The conserved regions allow for the design of universal primers that can amplify the gene across a wide range of bacteria and archaea. The hypervariable regions (e.g., V1-V9) provide species-specific sequences that can be used to distinguish between different taxa. This is a widely used means of species identification. With WGS we have the whole genome and relying on 16S alone is not necessary. However it is a useful first step in checking your sequences are what you think. The only snag for this method is that you need to assemble the reads first.

Basically this method just takes the assembly and does a local blast to the 16S ncbi sequences. The results are then sorted by precent identity and coverage. You can download the 16S sequences from here. Note that the tools module from snipgenie is imported here for some of the functions.

import numpy as np
import pandas as pd
import pylab as plt
from Bio import AlignIO, SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from snipgenie import tools

def get_blast_coverage(bl, fasta):
    """Get alignment coverage of blast results from original sequence lengths"""
    
    df=tools.fasta_to_dataframe('16S_ncbi.fa')
    df=df.rename(columns={'length':'sslen'})
    bl=bl.merge(df,left_on='sseqid',right_on='name',how='left')
    bl['perc_cov'] = bl.apply(lambda x: round(x.length/x.sslen*100,2),1)
    return bl

def blast_16S(filename, hits=100, pident=99.5):
    """Perform blast to 16S genes"""

    tools.make_blast_database('16S_ncbi.fa')
    bl=tools.blast_fasta('16S_ncbi.fa',filename,maxseqs=hits)
    bl = get_blast_coverage(bl, '16S_ncbi.fa')
    #bl['genus'] = bl.stitle.apply(lambda x: x.split()[1])
    bl['species'] = bl.stitle.apply(lambda x: ' '.join(x.split()[1:3]))
    cols = ['sseqid','sslen','length','perc_cov','pident','stitle','species']
    bl = bl[(bl.pident>=pident) & (bl.perc_cov>=80)].sort_values('pident',ascending=False)  
    return bl

def extract_sequences_by_ids(input_fasta, output_fasta, ids_to_extract):
    """Extract sequences from fasta file with given ids"""
    
    sequences = SeqIO.parse(input_fasta, "fasta")    
    # Filter sequences that match the given IDs
    filtered = (seq for seq in sequences if seq.id in ids_to_extract)   
    SeqIO.write(filtered, output_fasta, "fasta")
    return

def append_sequences_to_fasta(fasta_file, seqs):
    """Append SeqRecords to a FASTA file, overwriting the old file."""

    existing_seqs = list(SeqIO.parse(fasta_file, "fasta"))
    if type(seqs) is not list:
        seqs = [seqs]
    existing_seqs.extend(seqs)    
    #with open(fasta_file, "w") as output_handle:
    SeqIO.write(existing_seqs, fasta_file, "fasta")
    return

def get_tree(fasta_file):
    """Get phylo tree from fasta_file with fasttree"""
    
    out = 'temp.newick'
    cmd=f'mafft {fasta_file} > temp.aln'
    tmp = subprocess.check_output(cmd, shell=True, stderr=subprocess.PIPE)
    cmd=f'fasttree temp.aln > {out}'
    tmp = subprocess.check_output(cmd, shell=True, stderr=subprocess.PIPE)
    return out

Usage

Here is how the functions above are utilised to check an assembly. For this example I used an assembly of a known Mycoplasma alkalescens strain, selected at random. Note that you can just use the blast table to get the closest match. The other code is to make a phylogeny from the alignment which might give a less ambigiuous result for some cases.

#run local blast
tools.make_blast_database('16S_ncbi.fa')
bl = blast_16S('M.alkalescens_NCTC10135.fa',pident=90,hits=20)
#the following parts are if we want to make a tree
#add target sequence into results
targetseq = SeqRecord(Seq(bl.iloc[0].qseq), id='NCTC10135')
m=bl.set_index('sseqid')[:15]
names = list(m.index)
#extract the original sequences for our alignment
extract_sequences_by_ids('16S_ncbi.fa', 'temp.fa', names)
append_sequences_to_fasta('temp.fa',targetseq)
#get a phylogeny
treefile = get_tree('temp.fa')

Here’s the blast result table:

sseqid species pident
NR_025984.1 Metamycoplasma alkalescens 99.932
NR_026035.1 Metamycoplasma auris 98.973
NR_025988.1 Metamycoplasma canadense 98.704
NR_029180.1 Metamycoplasma gateae 98.545
NR_041743.1 Mycoplasmopsis arginini 98.494

Here’s what the tree looks like with our assembly being the black tip. It is correctly identified as Mycoplasma alkalescens. Obviously results will vary depending on your sample. A mixed sample with DNA from two strains will need to be treated carefully.

There is a Jupyter notebook here with this code. This method is also implemented as plugin in the SNiPgenie GUI.

  • https://en.wikipedia.org/wiki/16S_ribosomal_RNA