# Retrieving genome assemblies via Entrez with Python

August 31 2019

## Background

GenBank provides access to information on all it’s assembled genomes via the Assembly database. It’s not that hard to download assemblies individually or in bulk via the website. Though you have to know what you are looking for when it comes to micro-organisms since there can be many assembled genomes for a species. A typical example of a search result would be this page for all M. tuberculosis assemblies. This page returns 6622 results and it shows a big button allowing you to download the assemblies all at once in multiple formats such as the fasta file or gff and so on.

The above is fine but you may want to do batch downloads without using the web pages. For this the NCBI provides programmatic access via the Entrez query and database system. These utilities can be used via the command line (esearch) but for assemblies I found Python was more flexible. The BioPython package is used to access the Entrez utilities. For the case of assemblies it seems the only way to download the fasta file is to first get the assembly ids and then find the ftp link to the RefSeq or GenBank sequence using Entrez.esummary. Then a url request can be used to download the fasta file. The code is presented below and may be adapted to download any of the other formats.

## Code

This code requires pandas and biopython to run.

def get_assembly_summary(id):
"""Get esummary for an entrez id"""
from Bio import Entrez
esummary_handle = Entrez.esummary(db="assembly", id=id, report="full")
return esummary_record

Args:
term: search term, usually organism name
path: folder to save to
"""

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="assembly", term=term, retmax='200')
ids = record['IdList']
print (f'found {len(ids)} ids')
for id in ids:
#get summary
summary = get_assembly_summary(id)
url = summary['DocumentSummarySet']['DocumentSummary'][0]['FtpPath_RefSeq']
if url == '':
continue
label = os.path.basename(url)
#get the fasta link - change this to get other formats


We can then call the function as follows:

links = get_assemblies("mycobacterium tuberculosis", download=True)


## Using the command line

To show how you can accomplish the same task as above using the command line you can try the code below. Unless you are really familiar with using bash commands this might be very hard to read! (This code is from a biostars post)

esearch -db assembly -query 'mycobacterium tuberculosis' \
| esummary \
| xtract -pattern DocumentSummary -element FtpPath_GenBank \
| while read -r line ;
do
fname=$(echo$line | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/') ; wget "$line/\$fname" ;
done