# Import biopython
import Bio
import requests
# You must run this cell, but you can ignore its contents.
import hashlib
def ads_hash(ty):
    """Return a unique string for input"""
    ty_str = str(ty).encode()
    m = hashlib.sha256()
    m.update(ty_str)
    return m.hexdigest()[:10]
Bioinformatics with HTTP
Not only can we use the Star Wars API with HTTP, we can also access the NCBI's databases over HTTP. Here is more information from the NCBI. Note that in this exercise, we will be doing low-volume queries without using a specialized software library. Libraries and other software are available to do this for you automatically; for example, below we use the NCBIWWW module from biopython. Here we do it "the hard way" at a low level.
If you start using the NCBI web resources extensively, please read the NCBI's documentation about providing them with an email address to contact you.
def get_protein_fasta(accession):
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=%s&rettype=fasta&retmode=text"%(accession,)
    return requests.get(url).text
da1 = get_protein_fasta('NP_524481.2')
da1
Great, so we can get FASTA files directly from the NCBI using the accession.
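The NCBI asks heavier users to identify themselves, and the E-utilities accept optional `tool` and `email` query parameters for that purpose. Here is a minimal sketch that builds the same efetch URL with those parameters added, using only the standard library for URL encoding. The helper name `build_efetch_url`, the tool name, and the email address are placeholders invented for this illustration.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(accession, email=None, tool=None):
    """Build an efetch URL for a protein FASTA record (hypothetical helper)."""
    params = {
        "db": "protein",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    # Optional identification parameters; values here are placeholders.
    if tool is not None:
        params["tool"] = tool
    if email is not None:
        params["email"] = email
    # urlencode escapes characters such as "@" that are unsafe in a query string.
    return EFETCH_URL + "?" + urlencode(params)

url = build_efetch_url("NP_524481.2", email="you@example.org", tool="my_notebook")
print(url)
```

The resulting string can be passed to `requests.get(url).text` in the same way as in `get_protein_fasta`.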
Q1 Get the FASTA for accession NP_733001.1. Put the result in the variable da2, which should be a string.
# YOUR CODE HERE
raise NotImplementedError()
# This checks that the above worked
assert ads_hash(da2)=='16538bd802'
Using the biopython library for bioinformatics, including NCBI queries
import Bio
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import SeqIO
from io import StringIO
import os
We can work with FASTA sequences using the biopython library. A FASTA file may contain multiple sequences, so SeqIO.parse yields one record per sequence and we loop over them. Each record here is an instance of the SeqRecord class, and its seq attribute holds the sequence as a Seq object.
Let's copy the sequence to a raw Python string called da2_seq:
da2_seq = None
for record in SeqIO.parse(StringIO(da2), "fasta"):
    print(record)
    # Fail if the FASTA text contains more than one record.
    assert(da2_seq is None)
    da2_seq = str(record.seq)
da2_seq
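To make it clearer what the loop over records iterates, here is a rough plain-Python sketch of how FASTA text splits into records. It is an illustration only and not how biopython implements `SeqIO.parse`; the function name `split_fasta` and the toy sequences are made up for this example.

```python
def split_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Toy input with two records (not real protein sequences).
toy = ">seq1 first toy record\nMKV\nLLA\n>seq2 second toy record\nGGS\n"
records = list(split_fasta(toy))
print(records)
```

For real work, `SeqIO.parse` is the better choice because it also handles identifiers, descriptions, and many other file formats.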
In addition to "raw" HTTP requests using the requests library, biopython can also call the NCBI for you. It uses HTTP to perform the call, but this is hidden from you. Below, we do a BLAST search based on the sequence we just downloaded.
We can limit our search to just a few organisms using the NCBI taxon ID. The easiest way to find these is to start typing in the BLAST web search entry page and copy the taxon ID from there.
Here are a few taxon IDs for some insects and then some code to limit our NCBI query just to these taxa.
# Bombus terrestris 30195
# Apis mellifera 7460
# Locusta migratoria 7004
# Drosophila melanogaster 7227
# Tribolium castaneum 7070
taxids = (30195, 7460, 7004, 7227, 7070)
taxid_query = ' OR '.join(['txid%d[ORGN]'%taxid for taxid in taxids])
taxid_query
Now with our query limited to these specific groups, we are going to run a BLAST search. As with the web browser interface, this can take some time, so the code below is written to only run the web search when the output file is not present. Therefore, once you run the web search the first time, it will not run again unless you delete the file.
Furthermore, as mentioned in the biopython tutorial, we need to be careful with our result handle when we get it because it can be read only once. So, here we save the results of our search to a local file. Later, we can read this file as often as we want.
This may take some time as we are running a full BLAST search on the NCBI servers.
fname = "da2_blast.xml"
if not os.path.exists(fname):
    result_handle = NCBIWWW.qblast("blastp", "nr", da2_seq, entrez_query=taxid_query)
    with open(fname, "w") as out_handle:
        out_handle.write(result_handle.read())
else:
    print("not overwriting file %s"%fname)
# Read the saved results; the with block closes the file afterwards.
with open(fname) as in_handle:
    blast_record = NCBIXML.read(in_handle)
for alignment in blast_record.alignments:
    print(alignment)
Let's do another BLAST search for the first protein we had. Again, this can take a long time to run on the NCBI servers.
da1_seq = None
for record in SeqIO.parse(StringIO(da1), "fasta"):
    da1_seq = str(record.seq)
fname = "da1_blast.xml"
if not os.path.exists(fname):
    result_handle = NCBIWWW.qblast("blastp", "nr", da1_seq, entrez_query=taxid_query)
    with open(fname, "w") as out_handle:
        out_handle.write(result_handle.read())
else:
    print("not overwriting file %s"%fname)
In the results, each alignment contains a list of HSPs ("high-scoring segment pairs").
blast_record = NCBIXML.read(open(fname))
for alignment in blast_record.alignments:
    print(alignment)
    print("%d HSPs"%len(alignment.hsps))
    for hsp in alignment.hsps:
        print(hsp)
        print()
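Printing every HSP in full is verbose. As a sketch of a more compact report, the function below keeps only HSPs under an E-value cutoff and computes percent identity. It relies on the `title` and `hsps` attributes of alignments and the `expect`, `identities`, and `align_length` attributes of biopython HSP objects. The function name `summarize_hits`, the cutoff value, and the toy record built from `SimpleNamespace` are assumptions for illustration, so that the example runs without a real BLAST result.

```python
from types import SimpleNamespace

def summarize_hits(blast_record, max_expect=1e-5):
    """Return (title, E-value, percent identity) for HSPs under the cutoff."""
    rows = []
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect <= max_expect:
                pct = 100.0 * hsp.identities / hsp.align_length
                rows.append((alignment.title, hsp.expect, pct))
    return rows

# Toy stand-in for a parsed BLAST record (made-up numbers, not real results).
toy_record = SimpleNamespace(alignments=[
    SimpleNamespace(title="toy hit A", hsps=[
        SimpleNamespace(expect=1e-50, identities=90, align_length=100),
        SimpleNamespace(expect=0.5, identities=10, align_length=40),
    ]),
    SimpleNamespace(title="toy hit B", hsps=[
        SimpleNamespace(expect=1e-8, identities=30, align_length=60),
    ]),
])

rows = summarize_hits(toy_record)
for title, expect, pct in rows:
    print("%s  E=%g  identity=%.1f%%" % (title, expect, pct))
```

Calling `summarize_hits(blast_record)` on the record parsed from the XML file should work the same way, since it only reads those attributes.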