Fetch assemblies and associated biosample data using Entrez tools with Biopython

January 19 2024

Background

This is a somewhat altered version of code from an old post for downloading assemblies. GenBank provides access to information on all it’s assembled genomes via the assembly database. You have several options: download assemblies individually or in bulk via the website or the Entrez database search system using command line tools without the website. The later is good for large automated searches/downloads. The BioPython package provides an interface to the entrez tools too.

  • esearch - searches an NCBI database for a query and finds the unique identifiers (UIDs) for all records that match.
  • esummary - returns document summaries (DocSums) for a list of input UIDs.

Method

In this case we wish to get meta data on a set of assemblies and also retrieve extra fields from the corresponding entry in the biosample database. We use the BioSampleId to perform a linked search on the biosample db and add it to our result. The final table will also contain links to download the assembly fasta file. Obviously this can be customised to use any other linked db like bioproject. In this case I wanted the geo_loc_name field for the biosample.

First the esearch is performed and the returned ids are used to get an esummary for each in turn. This contains assembly accession and biosample id which we put in a dictionary (called row). We then run another esearch using that biosampleid. This returns sample data, some of which must be parsed from the XML using BeautifulSoup. All the attributes are added to the dict. (Some of these attributes will be empty and many you likely don’t need.) The dict is added to a list which is used to create a pandas DataFrame at the end. Note that I use the FtpPath_GenBank to get the link to the assembly file as the RefSeq one is sometimes not present. At the end there is a short function to download the links in the table.

Code

You should first install these packages (on Ubuntu based systems):

sudo apt install ncbi-entrez-direct
import os,sys,glob,re
import numpy as np
import pandas as pd
from tqdm import tqdm
from Bio import Entrez

def get_assembly_summary(id, db="assembly"):
    """Get esummary for an entrez id"""

    esummary_handle = Entrez.esummary(db=db, id=id, report="full")
    esummary_record = Entrez.read(esummary_handle)
    return esummary_record

def get_assemblies(term, retmax='5000', prev=None):
    """Download genbank assembly meta data for a given search term.
    Args:
        term: search term, usually organism name
        retmax: max return results
        prev: previous results to avoid re-downloading, a pandas dataframe
    """

    if prev is not None:
        found = list(prev.id)
    else:
        found = []
    #provide your own mail here
    Entrez.email = "A.N.Other@example.com"
    handle = Entrez.esearch(db="assembly", term=term, retmax=retmax)
    record = Entrez.read(handle)
    ids = record['IdList']
    print (f'found {len(ids)} ids')
    links = []
    result = []
    for id in tqdm(ids):
        if id in found:
            continue
        row = {'id':id}
        #get summary
        rec = get_assembly_summary(id)

        asm_summ = rec['DocumentSummarySet']['DocumentSummary'][0]
        fields = ['AssemblyAccession','BioSampleAccn','BioSampleId','SubmitterOrganization']
        for key in fields:
            row[key] = asm_summ[key]
        row['GenbankAccession'] = asm_summ['Synonym']['Genbank']

        #biosample info is a separate request using the BioSampleId
        handle = Entrez.esummary(db="biosample", id=asm_summ['BioSampleId'], report="full")
        rec2 = Entrez.read(handle)
        sampledata = rec2['DocumentSummarySet']['DocumentSummary'][0]['SampleData']
        #parse xml in sampledata
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(sampledata)
        all_attr = soup.findAll('attribute')
        for attr in all_attr:
            #print (attr,attr['attribute_name'],attr.text)
            row[attr['attribute_name']] = attr.text
        #get url
        #url = asm_summ['FtpPath_RefSeq']
        url = asm_summ['FtpPath_GenBank']
        #print (url)
        if url != '':
            label = os.path.basename(url)
            #get the fasta link - change this to get other formats
            link = os.path.join(url,label+'.fna.gz')
            row['link'] = link
        result.append(row)
    result = pd.DataFrame(result)
    if prev is not None:
        result = pd.concat([prev, result])
    return result

def download_links(df, path):
    for i,r in df.iterrows():
        label = r.AssemblyAccession
        urllib.request.urlretrieve(link, os.path.join(path, f'{label}.fna.gz'))

We call the method like this:

res = get_assemblies('Mycoplasmopsis bovis')

The resulting table will look like this:

  id         AssemblyAccession BioSampleAccn BioSampleId  ..
0 19098541   GCF_032463445.1  SAMN37573123    37573123 ..
1 18700801   GCF_016452265.2  SAMN15246619    15246619 ..
2 18700791   GCF_016452245.2  SAMN15246620    15246620 ..
3 18486841   GCF_016452225.2  SAMN15246621    15246621 ..