Retrieving genome assemblies via Entrez with Python

August 31 2019

Background

GenBank provides access to information on all it’s assembled genomes via the Assembly database. It’s not that hard to download assemblies individually or in bulk via the website. Though you have to know what you are looking for when it comes to micro-organisms since there can be many assembled genomes for a species. A typical example of a search result would be this page for all M. tuberculosis assemblies. This page returns 6622 results and it shows a big button allowing you to download the assemblies all at once in multiple formats such as the fasta file or gff and so on.

The above is fine but you may want to do batch downloads without using the web pages. For this the NCBI provides programmatic access via the Entrez query and database system. These utilities can be used via the command line (esearch) but for assemblies I found Python was more flexible. The BioPython package is used to access the Entrez utilities. For the case of assemblies it seems the only way to download the fasta file is to first get the assembly ids and then find the ftp link to the RefSeq or GenBank sequence using Entrez.esummary. Then a url request can be used to download the fasta file. The code is presented below and may be adapted to download any of the other formats.

Code

This code requires pandas and biopython to run.

def get_assembly_summary(id):
    """Get esummary for an entrez id"""
    from Bio import Entrez
    esummary_handle = Entrez.esummary(db="assembly", id=id, report="full")
    esummary_record = Entrez.read(esummary_handle)
    return esummary_record

def get_assemblies(term, download=True, path='assemblies'):
    """Download genbank assemblies for a given search term.
    Args:
        term: search term, usually organism name
        download: whether to download the results
        path: folder to save to
    """

    from Bio import Entrez
    #provide your own mail here
    Entrez.email = "A.N.Other@example.com"
    handle = Entrez.esearch(db="assembly", term=term, retmax='200')
    record = Entrez.read(handle)
    ids = record['IdList']
    print (f'found {len(ids)} ids')
    links = []
    for id in ids:
        #get summary
        summary = get_assembly_summary(id)
        #get ftp link
        url = summary['DocumentSummarySet']['DocumentSummary'][0]['FtpPath_RefSeq']
        if url == '':
            continue
        label = os.path.basename(url)
        #get the fasta link - change this to get other formats
        link = os.path.join(url,label+'_genomic.fna.gz')
        print (link)
        links.append(link)
        if download == True:
            #download link
            urllib.request.urlretrieve(link, f'{label}.fna.gz')
    return links

We can then call the function as follows:

links = get_assemblies("mycobacterium tuberculosis", download=True)

Using the command line

To show how you can accomplish the same task as above using the command line you can try the code below. Unless you are really familiar with using bash commands this might be very hard to read! (This code is from a biostars post)

esearch -db assembly -query 'mycobacterium tuberculosis' \
    | esummary \
    | xtract -pattern DocumentSummary -element FtpPath_GenBank \
    | while read -r line ;
    do
        fname=$(echo $line | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/') ;
        wget "$line/$fname" ;
    done

The ncbi-genome-download tool

There are other tools to do this from the command line. These are very well covered in this blog. In particular the ncbi-genome-download tool is very convenient and flexible to use. However I am not sure if it can be used to accomplish the above for genome assemblies.