Create a fasta alignment from a multi sample VCF

April 07 2020

UPDATE: This method is not recommended, use the function in the follow up post instead.

Background

A typical endpoint of microbial whole genome sequencing analysis is to construct a MSA (multiple sequence alignment) of the variable sites, most commonly the SNVs (ignoring indels). This is then used to infer a phylogeny. The input is a vcf with all sample sites. To produce a multi sample vcf, you can either call the variants for each sample merge all the single vcfs together or call all samples at once. (Merging many vcfs appears to be memory intensive so I would prefer the latter). The resulting file is then converted to a fasta alignment. There are scripts available to do this but I was not able to find one written in Python that I could easily use. The following details a Python method to do this.

Note: vcf files should always be filtered to get high quality sites for this to produce a reliable phylogeny. False negatives will give erroneous results. This is not detailed here.

Requires these Python packages: pyvcf, pyfaidx, biopython and pandas. The programs tabix (htslib) is used to index the vcf.

Imports

import sys,os
import pandas as pd
from Bio import SeqIO
from pyfaidx import Fasta
from pyfaidx import FastaVariant
import vcf

Code

This particular function returns a list of SeqRecord objects and a matrix of the sites in the form of a pandas DataFrame. It iterates over each sample once to produce a set of all unique sites in all samples, using the FastaVariant object from pyfadix. The nucleotide at each sites is recorded for the reference genome. In a second loop over each sample we extract the variant nucleotide at each of the sites and store these in a sequence. The results is a list of sequences per sample. This can then be saved to a fasta file and used as an MSA for input into RAXml or other tree constuction programs. This code has not been fully compared to other algorithms of this kind so use it at your discretion.

def fasta_alignment_from_vcf(vcf_file, ref):
    """
    Get a fasta alignment for all snp sites in a multi
    sample vcf file, including the reference sequence.
    """

    #index vcf
    cmd = 'tabix -p vcf -f {i}'.format(i=vcf_file)
    tmp = subprocess.check_output(cmd,shell=True)
    #get samples from vcf
    vcf_reader = vcf.Reader(open(vcf_file, 'rb'))
    samples = vcf_reader.samples
    print ('%s samples' %len(samples))
    result = []

    #reference sequence
    reference = Fasta(ref)
    chrom = list(reference.keys())[0]

    #get the set of all sites first
    sites=[]
    for sample in samples:      
        variant = FastaVariant(ref, vcf_file,
                                 sample=sample, het=True, hom=True)
        pos = list(variant[chrom].variant_sites)
        sites.extend(pos)

    sites = sorted(set(sites))
    print ('using %s sites' %len(sites))
    #get reference sequence for site positions
    refseq=[]
    for p in sites:
        refseq.append(reference[chrom][p-1].seq)
    refseq = ''.join(refseq)
    #seqrecord for reference
    refrec = SeqRecord(Seq(refseq),id='ref')
    result.append(refrec)

    sites_matrix = {}
    #iterate over variants in each sample
    for sample in samples:        
        seq=[]
        variant = FastaVariant(ref, vcf_file,
                                 sample=sample, het=True, hom=True)     
        for p in sites:        
            rec = variant[chrom][p-1:p]    
            seq.append(rec.seq)
        seq = ''.join(seq)
        #make seqrecord for the samples sites  
        seqrec = SeqRecord(Seq(seq),id=sample)
        result.append(seqrec)
        sites_matrix[sample] = list(seqrec)
    df = pd.DataFrame(sites_matrix)
    df.index=sites    
    return result, df