Bioinformatics introduction
In this exercise, you will perform some bioinformatic analysis. There is no single correct answer here, but rather a series of tasks to complete. You are encouraged to use online resources and discussions with colleagues to help you at all stages.
Most of the work in this exercise consists of interacting manually with websites, understanding what types of data they use, and taking first steps with the online tools. The Python here mainly serves as a path for you to follow through the steps one by one.
Pick a protein-coding gene that interests you. If you cannot think of one, I suggest: ACE2, TP53, TGFbeta1, Drosophila DopEcR, TAS1R2, Drosophila nAChRalpha1. Find the gene at NCBI Gene.
Task 1: What is the full name of this gene? What is the NCBI GeneID?
Drosophila nAChRalpha1, GeneID:42918
Find the complete sequence for one protein isoform using the NCBI Protein database website. Hint: it may be easier to find the protein sequence by first finding the gene at NCBI Gene, going to the "RefSeq" sequences, and then clicking on the protein sequence, whose accession will probably start with NP_.
Download a FASTA file with the sequence of this protein isoform.
Task 2: Make a variable named original_fasta which is a string containing the FASTA format protein sequence
You can create a multi-line string in Python with triple quotes like this:
my_string = """line 1
line 2
line 3"""
Or like this:
my_string = '''line 1
line 2
line 3'''
original_fasta = """>NP_001262917.1 nicotinic acetylcholine receptor alpha1, isoform C [Drosophila melanogaster]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAADLSPTFEKPYAREMEKTIEGSRFIAQH
VKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKFELLK
MGSENTL"""
# Ensure that we have the biopython package installed
!pip install biopython
# This is a test of the above, do not change this code.
import io
import Bio.SeqIO
records = [record for record in Bio.SeqIO.parse(io.StringIO(original_fasta), "fasta")]
assert len(records)==1
assert isinstance(records[0], Bio.SeqRecord.SeqRecord)
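To make the structure of FASTA data concrete, here is a minimal sketch of a pure-Python parser that does not depend on Biopython. The function name parse_fasta and the toy record are hypothetical, invented for illustration; for real work, Bio.SeqIO.parse as used in the test above is the better choice.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are wrapped; collect them for joining later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record, invented for illustration (not a real protein).
toy_fasta = """>toy_id example protein [Example species]
MGSVLF
AAVF"""

toy_records = parse_fasta(toy_fasta)
header, sequence = toy_records[0]
accession = header.split()[0]  # the first word of the header is the identifier
print(accession, len(sequence))
```

The header line starts with ">" and everything up to the first space is treated as the identifier; the remaining lines are the sequence, wrapped across lines purely for readability.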
Now perform a protein BLAST search for homologous sequences using the NCBI BLAST website.
View the FASTA data with these sequences for at least 5 total sequences. Do not take other isoforms of the same gene in the same species. Take either: A) other genes in the same species or B) potentially homologous genes in other species. Do not take both A and B. You may limit your search to specific species to fulfill these criteria or your own curiosity.
Task 3: Make a variable named others which is a list of strings containing the FASTA format protein sequences
# Type your answer here and then run this and the following cell.
others = [
""">XP_001981981.3 acetylcholine receptor subunit alpha-like 1 [Drosophila erecta]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAAAAADLSPTFEKPYAREMEKTIEG
SRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAK
KKFELLKMGSENTL""",
""">XP_017020195.1 PREDICTED: acetylcholine receptor subunit alpha-like 1 isoform X1 [Drosophila kikkawai]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FGEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRFI
AQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKFE
LLKMGSENTL""",
""">XP_037710630.1 acetylcholine receptor subunit alpha-like 1 [Drosophila subpulchrella]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRFI
AQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKFE
LLKMGSDNTL""",
""">XP_002032513.1 acetylcholine receptor subunit alpha-like 1 [Drosophila sechellia]
MGSVLFTAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRF
IAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKF
ELLKMGSENTL""",
""">XP_017009134.1 PREDICTED: acetylcholine receptor subunit alpha-like 1 [Drosophila takahashii]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRF
IAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKF
ELLKMGSDNTL"""
]
# This is a test of the above, do not change this code.
assert len(others)>=5
seen = [original_fasta]
for this_fasta in others:
    assert type(this_fasta)==str
    assert this_fasta not in seen
    records = [record for record in Bio.SeqIO.parse(io.StringIO(this_fasta), "fasta")]
    assert len(records)==1
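The test above does not check the "either A or B, not both" rule for the BLAST hits. A rough way to check it yourself is to read the species name from the square brackets at the end of each FASTA header. The sketch below assumes NCBI-style headers ending in "[Genus species]"; the helper name species_of is hypothetical, and the sequences are truncated to four residues to keep the example short.

```python
import re


def species_of(fasta_text):
    """Return the text inside the last [...] of the FASTA header line, or None."""
    header = fasta_text.strip().splitlines()[0]
    matches = re.findall(r"\[([^\]]+)\]", header)
    return matches[-1] if matches else None


# Headers as they appear in this notebook; sequences truncated for illustration.
query = ">NP_001262917.1 nicotinic acetylcholine receptor alpha1, isoform C [Drosophila melanogaster]\nMGSV"
hits = [
    ">XP_001981981.3 acetylcholine receptor subunit alpha-like 1 [Drosophila erecta]\nMGSV",
    ">XP_002032513.1 acetylcholine receptor subunit alpha-like 1 [Drosophila sechellia]\nMGSV",
]

query_species = species_of(query)
hit_species = [species_of(hit) for hit in hits]
same = [species == query_species for species in hit_species]

# Option A: every hit is from the query species; option B: none of them is.
if all(same):
    choice = "A"
elif not any(same):
    choice = "B"
else:
    choice = "mixed"
print(choice)
```

A result of "mixed" would mean the list combines sequences from the query species with sequences from other species, which the task asks you to avoid.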
Now, let's join your original FASTA data and the data you found with BLAST into one multi-sequence FASTA data string.
FASTA files can have multiple sequences in one file just by concatenating (or "joining" or "adding") them together.
all_list = [original_fasta] + others
all_string = '\n'.join(all_list)
print(all_string)
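The newline used as the separator matters: each header line must start on its own line, otherwise it gets glued onto the end of the previous sequence. A small sketch with two toy records (invented for illustration) shows the difference; count_headers is a hypothetical helper, not part of Biopython.

```python
# Toy records, invented for illustration.
rec_a = ">seq_a\nMGSV"
rec_b = ">seq_b\nMGTV"

good = "\n".join([rec_a, rec_b])  # each header starts on its own line
bad = "".join([rec_a, rec_b])     # ">seq_b" ends up on the same line as "MGSV"


def count_headers(text):
    """Count the lines that start a FASTA record."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


print(count_headers(good))
print(count_headers(bad))
```

Counting header lines in the joined string and comparing the result with the length of the list is a quick sanity check before pasting the data into another tool.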
Task 4: Perform multi-species alignment using Clustal Omega at the EBI website.
This website runs multiple sequence alignment software. You can directly upload the multiple sequence FASTA file you generated above and let their computer do the alignment.
Copy and paste the multiple sequence FASTA above into the Clustal Omega entry page at the EBI website. Keep all parameters at their default values (Protein, Output format ClustalW with character counts).
Enter the multiple sequence alignment below as a multi-line string called msa.
# Type your answer here and then run this and the following cell.
msa = """CLUSTAL O(1.2.4) multiple sequence alignment
XP_017020195.1 MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ 60
XP_037710630.1 MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ 60
XP_002032513.1 MGSVLFTAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ 60
XP_017009134.1 MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ 60
NP_001262917.1 MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ 60
XP_001981981.3 MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ 60
******:*****************************************************
XP_017020195.1 LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN 120
XP_037710630.1 LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN 120
XP_002032513.1 LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN 120
XP_017009134.1 LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN 120
NP_001262917.1 LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN 120
XP_001981981.3 LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN 120
************************************************************
XP_017020195.1 YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR 180
XP_037710630.1 YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR 180
XP_002032513.1 YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR 180
XP_017009134.1 YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR 180
NP_001262917.1 YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR 180
XP_001981981.3 YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR 180
************************************************************
XP_017020195.1 HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT 240
XP_037710630.1 HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT 240
XP_002032513.1 HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT 240
XP_017009134.1 HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT 240
NP_001262917.1 HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT 240
XP_001981981.3 HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT 240
************************************************************
XP_017020195.1 LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV 300
XP_037710630.1 LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV 300
XP_002032513.1 LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV 300
XP_017009134.1 LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV 300
NP_001262917.1 LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV 300
XP_001981981.3 LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV 300
************************************************************
XP_017020195.1 PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK 360
XP_037710630.1 PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK 360
XP_002032513.1 PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK 360
XP_017009134.1 PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK 360
NP_001262917.1 PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK 360
XP_001981981.3 PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK 360
************************************************************
XP_017020195.1 EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC 420
XP_037710630.1 EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC 420
XP_002032513.1 EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC 420
XP_017009134.1 EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC 420
NP_001262917.1 EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC 420
XP_001981981.3 EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC 420
************************************************************
XP_017020195.1 FGEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAA----AAAAADLSPTFEKPY 476
XP_037710630.1 FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAA----AAAAADLSPTFEKPY 476
XP_002032513.1 FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAA---AAAAADLSPTFEKPY 477
XP_017009134.1 FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAA---AAAAADLSPTFEKPY 477
NP_001262917.1 FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAA-------AAAADLSPTFEKPY 473
XP_001981981.3 FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAAAAADLSPTFEKPY 480
*.************************************* **************
XP_017020195.1 AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ 536
XP_037710630.1 AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ 536
XP_002032513.1 AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ 537
XP_017009134.1 AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ 537
NP_001262917.1 AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ 533
XP_001981981.3 AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ 540
************************************************************
XP_017020195.1 APSLYDQSQPIDILYSKIAKKKFELLKMGSENTL 570
XP_037710630.1 APSLYDQSQPIDILYSKIAKKKFELLKMGSDNTL 570
XP_002032513.1 APSLYDQSQPIDILYSKIAKKKFELLKMGSENTL 571
XP_017009134.1 APSLYDQSQPIDILYSKIAKKKFELLKMGSDNTL 571
NP_001262917.1 APSLYDQSQPIDILYSKIAKKKFELLKMGSENTL 567
XP_001981981.3 APSLYDQSQPIDILYSKIAKKKFELLKMGSENTL 574
******************************:***
"""
# This is a test of the above, do not change this code.
records = [record for record in Bio.SeqIO.parse(io.StringIO(msa), "clustal")]
assert len(records)>=6
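Once the alignment is available, one simple thing to do with it is to count the columns in which all sequences carry the same residue (the columns marked "*" in the Clustal output). The following is a minimal pure-Python sketch under the assumption that each sequence line has the form "identifier, aligned chunk, optional running count"; parse_clustal and identical_columns are hypothetical helper names, and the toy alignment is invented for illustration.

```python
def parse_clustal(text):
    """Collect the aligned sequence for each identifier from Clustal-formatted text."""
    aligned = {}
    for line in text.splitlines()[1:]:  # skip the "CLUSTAL ..." header line
        parts = line.split()
        # Skip blank lines and conservation lines (made only of '*', ':' and '.').
        if len(parts) < 2 or line[0].isspace() or set(parts[0]) <= set("*:."):
            continue
        aligned.setdefault(parts[0], []).append(parts[1])
    return {name: "".join(chunks) for name, chunks in aligned.items()}


def identical_columns(aligned):
    """Count alignment columns where every sequence has the same residue and no gap."""
    columns = zip(*aligned.values())
    return sum(1 for column in columns if len(set(column)) == 1 and "-" not in column)


# Toy alignment, invented for illustration.
toy_msa = """CLUSTAL O(1.2.4) multiple sequence alignment

seq_a      MGSVLF-A 7
seq_b      MGTVLFAA 8
           **:*** *

seq_a      KL 9
seq_b      KI 10
           *:
"""

aligned = parse_clustal(toy_msa)
n_identical = identical_columns(aligned)
alignment_length = len(aligned["seq_a"])
print(n_identical, "of", alignment_length, "columns are identical")
```

Dividing the number of identical columns by the alignment length gives a rough measure of how conserved the protein is across the chosen sequences.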