SNiPgenie - a tool for SNP site detection from NGS data

May 12 2020

Background

Finding differences between a reference genome (often the known laboratory type strain) and another genome is a common tool to study variation of sampled isolates from a population. Samples are sequenced and the fastq files then aligned to the reference. A variant caller is used to then specify the differences (variants) such as SNPs and INDELs. These generally need to be further filtered to rule out false positives. Such a sequence of steps are done by so-called ‘pipelines’ that string together the commands for each step required in the right order. There are by now plenty of pipelines made by different groups. The problem with the term is that it can describe anything from a simple hard coded script, unusable elsewhere, to re-usable workflows that carry out extra tasks beyond just the steps mentioned above. Snippy is one such tool.

SNiPgenie is a tool for microbial variant calling and phylogenetic analysis from raw read data. It was primarily written to be used with bacterial isolates of M. Bovis but can be applied to other species. It could also be called a pipeline but unlike most other tools for this purpose it has both a command line and GUI component. The GUI is made using the Qt toolkit with PySide2. This software is written in Python. It was developed on Ubuntu linux but should also run on Windows 10 in the future.

Installation

pip install -e git+https://github.com/dmnfarrell/snipgenie.git#egg=snipgenie
sudo apt install bwa samtools bcftools tabix parallel

Command line usage

Run snipgenie for the cli or snipgenie-gui for the desktop version. You require a reference genome and reads in fastq format at minimum as input.

snipgenie -r reference.fa -g reference.gff -i data_files -o results

Add your own filters and provide number of threads:

snipgenie -r reference.fa -g reference.gff -i data_files -t 8 -o results` \
 -f 'QUAL>=40 && INFO/DP>=20 && MQ>40'

By default the program won’t overwrite intermediate files when re-run. So it can be interrupted and resumed. Make sure there are no old tmp.**.bam files in the mapped folder if an alignment got interrupted.

Inputs

Folders are searched recursively for inputs with extensions *.f*q.gz. So be careful you don’t have files in the folders you don’t want included. So the following file structure will load both sets of files if you provide the parent folder as input.

data/
├── ERR1588781
│   ├── ERR1588781_1.fq.gz
│   └── ERR1588781_2.fq.gz
└── ERR1588785
    ├── ERR1588785_1.fastq.gz
    └── ERR1588785_2.fastq.gz

Filenames are parsed and a sample name extracted for each pair (if paired end). This is simply done by splitting on the _ symbol. So a file called /path/13-11594_S85_L001-4_R1_001.fastq.gz will be given a sample name 13-11594. As long as the sample names are unique this is ok. If you had a file names like A_2_L001-4_R1_001, A_3_L001-4_R1_001 you should split on ‘-‘ instead. You can specify this in the labelsep option.

Use from Python

You can run a workflow from within Python by importing the snipgenie package and invoking the WorkFlow class. You need to provide the options in a dictionary with the same keywords as the command line. Notice in this example we are loading files from two folders.

import sngenie
args = {'threads':8, 'outdir': 'results', 'labelsep':'-',
        'input':['/my/folder/',
                 '/my/other/folder'],
        'reference': None, 'overwrite':False}
W = snipgenie.app.WorkFlow(**args)
W.setup()
W.run()

GUI

A desktop application version of this tool is in progress and should mostly work but the command line tool is recommended at present. It is intended to make the tool more accessible. Inputs are loaded into a table and the analysis steps can be run indivually. Additional tools like fastq qualities can be visualised as below.