Bioinformatics introduction
In this exercise, you will perform some bioinformatic analysis. There is no single correct answer here, but rather a series of tasks to complete. You are encouraged to use online resources and discussions with colleagues to help you at all stages.
Most of the work in this exercise consists of interacting manually with websites, understanding what types of data they use, and taking first steps with the online tools. The Python here mainly serves as a path that guides you through the steps one by one.
Pick a protein-coding gene that interests you. In case you cannot think of anything, I suggest: ACE2, TP53, TGFbeta1, Drosophila DopEcR, TAS1R2, Drosophila nAChRalpha1. Find the gene at NCBI Gene.
Task 1: What is the full name of this gene? What is the NCBI GeneID?
YOUR ANSWER HERE
Find the complete sequence for one protein isoform using the NCBI Protein database website. Hint: it may be easier to first find the gene at NCBI Gene, go to the "RefSeq" sequences, and then click on the protein sequence, whose accession will probably start with NP_.
Download a FASTA file with the sequence of this protein isoform.
Task 2: Make a variable named original_fasta which is a string containing the FASTA format protein sequence
You can create a multi-line string in Python with triple quotes like this:
my_string = """line 1
line 2
line 3"""
Or like this:
my_string = '''line 1
line 2
line 3'''
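For orientation, a FASTA record is a header line starting with `>` followed by one or more lines of sequence. The sketch below uses a made-up placeholder record (not a real protein) and a hypothetical helper, `split_fasta`, to show that structure. Your own `original_fasta` should instead contain the real record you downloaded.

```python
# Placeholder record for illustration only; this is not a real protein.
toy_fasta = """>toy_id hypothetical example protein
MACDEFGHIK
LMNPQRSTVW"""


def split_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    # Drop surrounding whitespace and any empty lines.
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA data must begin with a '>' header line")
    header = lines[0][1:]           # header text without the leading '>'
    sequence = "".join(lines[1:])   # sequence lines joined into one string
    return header, sequence


header, sequence = split_fasta(toy_fasta)
print(header)
print(sequence)
```

The sequence may be wrapped over several lines, as it is here; the line breaks are not part of the sequence itself.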
# YOUR CODE HERE
raise NotImplementedError()
# Ensure that we have the biopython package installed
!pip install biopython
# This is a test of the above, do not change this code.
import io
import Bio.SeqIO
records = [record for record in Bio.SeqIO.parse(io.StringIO(original_fasta), "fasta")]
assert len(records)==1
assert isinstance(records[0], Bio.SeqRecord.SeqRecord)
Now perform a protein BLAST search for homologous sequences using the NCBI BLAST website.
Collect the FASTA data for at least 5 of the resulting sequences. Do not take other isoforms of the same gene in the same species. Take either: A) other genes in the same species or B) potentially homologous genes in other species. Do not take both A and B. You may limit your search to specific species to fulfill these criteria or to satisfy your own curiosity.
Task 3: Make a variable named others which is a list of strings containing the FASTA format protein sequences
# YOUR CODE HERE
raise NotImplementedError()
# This is a test of the above, do not change this code.
assert len(others)>=5
seen = [original_fasta]
for this_fasta in others:
    assert type(this_fasta)==str
    assert this_fasta not in seen
    records = [record for record in Bio.SeqIO.parse(io.StringIO(this_fasta), "fasta")]
    assert len(records)==1
Now, let's join your original fasta data and the data you found with BLAST all together in one big multi-sequence FASTA data string.
FASTA files can have multiple sequences in one file just by concatenating (or "joining" or "adding") them together.
all_list = [original_fasta] + others
all_string = '\n'.join(all_list)
print(all_string)
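As a quick sanity check on a joined string like this, you can count the header lines, since each record contributes exactly one line beginning with `>`. This is a minimal sketch using a hypothetical helper, `count_fasta_records`, and made-up placeholder records; it does not validate the sequences themselves.

```python
def count_fasta_records(text):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


# Placeholder records for illustration only; these are not real proteins.
demo_list = [">toy_a\nMACDEF", ">toy_b\nMACDEY", ">toy_c\nMACDEW"]
demo_string = "\n".join(demo_list)
print(count_fasta_records(demo_string))
```

Applied to your own data, `count_fasta_records(all_string)` should equal `len(all_list)`.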
Task 4: Perform multiple sequence alignment using Clustal Omega at the EBI website.
This website runs multiple sequence alignment software. You can directly upload the multiple sequence FASTA file you generated above and let their computer do the alignment.
Copy and paste the multiple sequence FASTA above into the Clustal Omega entry page at the EBI website. Keep all parameters at their default values (Protein, Output format ClustalW with character counts).
Enter the multiple sequence alignment below as a multi-line string called msa.
# YOUR CODE HERE
raise NotImplementedError()
# This is a test of the above, do not change this code.
records = [record for record in Bio.SeqIO.parse(io.StringIO(msa), "clustal")]
assert len(records)>=6
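To see what the Clustal format contains, here is a rough standard-library sketch of how such text is laid out: a header line starting with `CLUSTAL`, then blocks in which each line holds a sequence name, a segment of the aligned sequence, and (with character counts enabled) a running residue count. The alignment below is a made-up toy, not real Clustal Omega output, and `parse_clustal` is a hypothetical helper rather than a substitute for the Biopython parser used in the test.

```python
# Made-up toy alignment for illustration only.
toy_msa = """CLUSTAL O(1.2.4) multiple sequence alignment

toy_a      MACDEFGHIK 10
toy_b      MACDE-GHIK 9
           ***** ****

toy_a      LMNPQ 15
toy_b      LMNPQ 14
           *****
"""


def parse_clustal(text):
    """Return a dict mapping each sequence name to its full aligned sequence."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("Clustal data must begin with a 'CLUSTAL' header line")
    aligned = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (which start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        name, segment = fields[0], fields[1]  # any trailing count is ignored
        aligned[name] = aligned.get(name, "") + segment
    return aligned


alignment = parse_clustal(toy_msa)
for name, seq in alignment.items():
    print(name, seq)
```

Every aligned sequence has the same length, with `-` marking gaps; that shared length is what distinguishes an alignment from the plain FASTA input.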