pm21-dragon/exercises/source/exercise-11/1__Bioinformatics_introduction.ipynb
2025-01-17 08:33:38 +01:00


<html> <head> </head>

Bioinformatics introduction

In this exercise, you will perform some bioinformatic analysis. There is no single correct answer here; instead, there is a series of tasks to complete. You are encouraged to use online resources and discussions with colleagues to help you at all stages.

Most of the work for this exercise consists of manually interacting with websites, understanding what types of data they use, and taking first steps with the online tools. The Python here mainly serves as a path for you to follow through the steps one by one.

Pick a protein-coding gene that interests you. If you cannot think of one, I suggest: ACE2, TP53, TGFbeta1, Drosophila DopEcR, TAS1R2, Drosophila nAChRalpha1. Find the gene at NCBI Gene.

Task 1: What is the full name of this gene? What is the NCBI GeneID?

Drosophila nAChRalpha1, GeneID:42918

Find the complete sequence for one protein isoform using the NCBI Protein database website. Hint: it may be easier to first find the gene at NCBI Gene, go to the "RefSeq" sequences, and then click on the protein sequence, whose accession will probably start with NP_.

Download a FASTA file with the sequence of this protein isoform.

Task 2: Make a variable named original_fasta which is a string containing the FASTA format protein sequence

You can create a multi-line string in Python with triple quotes like this:

my_string = """line 1
line 2
line 3"""

Or like this:

my_string = '''line 1
line 2
line 3'''
In [5]:
original_fasta = """>NP_001262917.1 nicotinic acetylcholine receptor alpha1, isoform C [Drosophila melanogaster]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAADLSPTFEKPYAREMEKTIEGSRFIAQH
VKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKFELLK
MGSENTL"""
In [6]:
# Ensure that we have the biopython package installed
!pip install biopython
Requirement already satisfied: biopython in /Users/andrew/anaconda3/envs/wm01-dragon/lib/python3.11/site-packages (1.84)
Requirement already satisfied: numpy in /Users/andrew/anaconda3/envs/wm01-dragon/lib/python3.11/site-packages (from biopython) (1.26.4)
In [7]:
# This is a test of the above, do not change this code.
import io
import Bio.SeqIO
records = [record for record in Bio.SeqIO.parse(io.StringIO(original_fasta), "fasta")]
assert len(records)==1
assert isinstance(records[0], Bio.SeqRecord.SeqRecord)
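To see what a FASTA record actually contains, here is a minimal sketch that splits a single-record FASTA string into its header and sequence using only the standard library. The function name `parse_single_fasta` and the toy record are made up for illustration; for real work, Biopython's parser used in the test above is the safer choice.

```python
def parse_single_fasta(fasta_text):
    """Split a one-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in fasta_text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA data must start with a '>' header line")
    header = lines[0][1:]           # drop the leading '>'
    sequence = "".join(lines[1:])   # sequence lines are simply concatenated
    return header, sequence


# Hypothetical toy record, not a real database entry.
example_fasta = """>demo_id example protein [Example species]
MKTAYIAKQR
QISFVKSH"""

header, sequence = parse_single_fasta(example_fasta)
accession = header.split()[0]       # first whitespace-delimited token of the header
print(accession, len(sequence))
```

The line breaks inside the sequence carry no meaning; they only keep the file readable, which is why the sketch joins the lines back together.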

Now perform a protein BLAST search for homologous sequences using the NCBI BLAST website.

Collect the FASTA data for at least 5 of the sequences found. Do not take other isoforms of the same gene in the same species. Take either: A) other genes in the same species or B) potentially homologous genes in other species. Do not take both A and B. You may limit your search to specific species to meet these criteria or to follow your own curiosity.
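If you choose option B, one way to double-check your selection is to compare the species names in the FASTA headers. The sketch below assumes NCBI-style headers in which the species appears in square brackets at the end of the header line; the helper names `species_of` and `different_species_only` and the toy records are hypothetical.

```python
import re


def species_of(fasta_text):
    """Return the text inside the last [...] of the header line, or None."""
    header = fasta_text.strip().splitlines()[0]
    matches = re.findall(r"\[([^\[\]]+)\]", header)
    return matches[-1] if matches else None


def different_species_only(query_fasta, hit_fastas):
    """Keep only hits whose species differs from the species of the query."""
    query_species = species_of(query_fasta)
    return [hit for hit in hit_fastas if species_of(hit) != query_species]


# Hypothetical toy records, not real database entries.
toy_query = ">q1 demo protein [Species one]\nMKT"
toy_hits = [
    ">h1 demo protein isoform B [Species one]\nMKT",
    ">h2 demo protein [Species two]\nMKS",
]

kept = different_species_only(toy_query, toy_hits)
print([species_of(hit) for hit in kept])
```

This only checks the species label; whether a hit is truly homologous is still a judgment you make from the BLAST results.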

Task 3: Make a variable named others which is a list of strings containing the FASTA format protein sequences

In [3]:
# Type your answer here and then run this and the following cell.
others = [
    """>XP_001981981.3 acetylcholine receptor subunit alpha-like 1 [Drosophila erecta]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAAAAADLSPTFEKPYAREMEKTIEG
SRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAK
KKFELLKMGSENTL""",
    """>XP_017020195.1 PREDICTED: acetylcholine receptor subunit alpha-like 1 isoform X1 [Drosophila kikkawai]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FGEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRFI
AQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKFE
LLKMGSENTL""",
    """>XP_037710630.1 acetylcholine receptor subunit alpha-like 1 [Drosophila subpulchrella]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRFI
AQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKFE
LLKMGSDNTL""",
    """>XP_002032513.1 acetylcholine receptor subunit alpha-like 1 [Drosophila sechellia]
MGSVLFTAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRF
IAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKF
ELLKMGSENTL""",
    """>XP_017009134.1 PREDICTED: acetylcholine receptor subunit alpha-like 1 [Drosophila takahashii]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRF
IAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKF
ELLKMGSDNTL"""
]
In [4]:
# This is a test of the above, do not change this code.
assert len(others)>=5
seen = [original_fasta]
for this_fasta in others:
    assert type(this_fasta)==str
    assert this_fasta not in seen
    seen.append(this_fasta)  # remember this entry so later duplicates are caught
    records = [record for record in Bio.SeqIO.parse(io.StringIO(this_fasta), "fasta")]
    assert len(records)==1

Now, let's join your original FASTA data and the data you found with BLAST together in one big multi-sequence FASTA data string.

FASTA files can have multiple sequences in one file just by concatenating (or "joining" or "adding") them together.
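As a small illustration of why simple concatenation works, the sketch below joins a few toy records and counts them by their `>` header lines. The helper names `join_fasta` and `count_fasta_records` are made up for this example, and stripping each record first is an assumption to avoid stray blank lines between records.

```python
def join_fasta(records):
    """Join FASTA strings with exactly one newline between records."""
    return "\n".join(record.strip() for record in records)


def count_fasta_records(multi_fasta):
    """Count records by counting header lines, which start with '>'."""
    return sum(1 for line in multi_fasta.splitlines() if line.startswith(">"))


# Hypothetical toy records with inconsistent trailing newlines.
toy_records = [">a\nMKT\n", ">b\nMKS", ">c\nMRT\n\n"]

combined = join_fasta(toy_records)
print(combined)
print(count_fasta_records(combined))
```

Because every record begins with its own `>` line, a parser can always tell where one sequence ends and the next begins.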

In [5]:
all_list = [original_fasta] + others
all_string = '\n'.join(all_list)
print(all_string)
>NP_001262917.1 nicotinic acetylcholine receptor alpha1, isoform C [Drosophila melanogaster]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAADLSPTFEKPYAREMEKTIEGSRFIAQH
VKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKFELLK
MGSENTL
>XP_001981981.3 acetylcholine receptor subunit alpha-like 1 [Drosophila erecta]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAAAAADLSPTFEKPYAREMEKTIEG
SRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAK
KKFELLKMGSENTL
>XP_017020195.1 PREDICTED: acetylcholine receptor subunit alpha-like 1 isoform X1 [Drosophila kikkawai]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FGEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRFI
AQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKFE
LLKMGSENTL
>XP_037710630.1 acetylcholine receptor subunit alpha-like 1 [Drosophila subpulchrella]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRFI
AQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKFE
LLKMGSDNTL
>XP_002032513.1 acetylcholine receptor subunit alpha-like 1 [Drosophila sechellia]
MGSVLFTAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRF
IAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKF
ELLKMGSENTL
>XP_017009134.1 PREDICTED: acetylcholine receptor subunit alpha-like 1 [Drosophila takahashii]
MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQLIDVNLKNQI
MTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGNYEVTIMTKAILHHTGKVVWK
PPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLRHLKQTADSDNIEVGIDLQDYYISVEWDIMR
VPAVRNEKFYSCCEEPYLDIVFNLTLRRKTLFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILL
SLTVFFLLLAEIIPPTSLTVPLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILP
KLLCIERPKKEEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC
FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAADLSPTFEKPYAREMEKTIEGSRF
IAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQAPSLYDQSQPIDILYSKIAKKKF
ELLKMGSDNTL

Task 4: Perform multi-species alignment using Clustal Omega at the EBI website.

This website runs multiple sequence alignment software. You can directly upload the multiple sequence FASTA file you generated above and let their computer do the alignment.

Copy and paste the multiple sequence FASTA above into the Clustal Omega entry page at the EBI website. Keep all parameters at their default values (Protein, Output format ClustalW with character counts).
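If you would rather upload a file than paste text, a multi-sequence FASTA string can be written to disk first. This is a minimal sketch with a toy string and a temporary directory; the file name `all_sequences.fasta` is an arbitrary choice, and in the notebook you would write `all_string` instead of the toy data.

```python
import tempfile
from pathlib import Path

# Hypothetical toy data standing in for the real multi-sequence FASTA string.
toy_multi_fasta = ">a\nMKT\n>b\nMKS\n"

# Write into a fresh temporary directory so nothing existing is overwritten.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "all_sequences.fasta"
out_path.write_text(toy_multi_fasta)

print(out_path)
```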

Enter the multi-sequence alignment here below as a multi-line string called msa.

In [8]:
# Type your answer here and then run this and the following cell.
msa = """CLUSTAL O(1.2.4) multiple sequence alignment


XP_017020195.1      MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ	60
XP_037710630.1      MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ	60
XP_002032513.1      MGSVLFTAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ	60
XP_017009134.1      MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ	60
NP_001262917.1      MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ	60
XP_001981981.3      MGSVLFAAVFIALHFATGGLANPDAKRLYDDLLSNYNRLIRPVGNNSDRLTVKMGLRLSQ	60
                    ******:*****************************************************

XP_017020195.1      LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN	120
XP_037710630.1      LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN	120
XP_002032513.1      LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN	120
XP_017009134.1      LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN	120
NP_001262917.1      LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN	120
XP_001981981.3      LIDVNLKNQIMTTNVWVEQEWNDYKLKWNPDDYGGVDTLHVPSEHIWLPDIVLYNNADGN	120
                    ************************************************************

XP_017020195.1      YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR	180
XP_037710630.1      YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR	180
XP_002032513.1      YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR	180
XP_017009134.1      YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR	180
NP_001262917.1      YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR	180
XP_001981981.3      YEVTIMTKAILHHTGKVVWKPPAIYKSFCEIDVEYFPFDEQTCFMKFGSWTYDGYMVDLR	180
                    ************************************************************

XP_017020195.1      HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT	240
XP_037710630.1      HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT	240
XP_002032513.1      HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT	240
XP_017009134.1      HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT	240
NP_001262917.1      HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT	240
XP_001981981.3      HLKQTADSDNIEVGIDLQDYYISVEWDIMRVPAVRNEKFYSCCEEPYLDIVFNLTLRRKT	240
                    ************************************************************

XP_017020195.1      LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV	300
XP_037710630.1      LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV	300
XP_002032513.1      LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV	300
XP_017009134.1      LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV	300
NP_001262917.1      LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV	300
XP_001981981.3      LFYTVNLIIPCVGISFLSVLVFYLPSDSGEKISLCISILLSLTVFFLLLAEIIPPTSLTV	300
                    ************************************************************

XP_017020195.1      PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK	360
XP_037710630.1      PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK	360
XP_002032513.1      PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK	360
XP_017009134.1      PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK	360
NP_001262917.1      PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK	360
XP_001981981.3      PLLGKYLLFTMMLVTLSVVVTIAVLNVNFRSPVTHRMAPWVQRLFIQILPKLLCIERPKK	360
                    ************************************************************

XP_017020195.1      EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC	420
XP_037710630.1      EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC	420
XP_002032513.1      EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC	420
XP_017009134.1      EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC	420
NP_001262917.1      EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC	420
XP_001981981.3      EEPEEDQPPEVLTDVYHLPPDVDKFVNYDSKRFSGDYGIPALPASHRFDLAAAGGISAHC	420
                    ************************************************************

XP_017020195.1      FGEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAA----AAAAADLSPTFEKPY	476
XP_037710630.1      FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAA----AAAAADLSPTFEKPY	476
XP_002032513.1      FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAA---AAAAADLSPTFEKPY	477
XP_017009134.1      FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAA---AAAAADLSPTFEKPY	477
NP_001262917.1      FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAA-------AAAADLSPTFEKPY	473
XP_001981981.3      FAEPPLPSSLPLPGADDDLFSPSGLNGDISPGCCPAAAAAAAAAAAAAAADLSPTFEKPY	480
                    *.*************************************       **************

XP_017020195.1      AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ	536
XP_037710630.1      AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ	536
XP_002032513.1      AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ	537
XP_017009134.1      AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ	537
NP_001262917.1      AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ	533
XP_001981981.3      AREMEKTIEGSRFIAQHVKNKDKFESVEEDWKYVAMVLDRMFLWIFAIACVVGTALIILQ	540
                    ************************************************************

XP_017020195.1      APSLYDQSQPIDILYSKIAKKKFELLKMGSENTL	570
XP_037710630.1      APSLYDQSQPIDILYSKIAKKKFELLKMGSDNTL	570
XP_002032513.1      APSLYDQSQPIDILYSKIAKKKFELLKMGSENTL	571
XP_017009134.1      APSLYDQSQPIDILYSKIAKKKFELLKMGSDNTL	571
NP_001262917.1      APSLYDQSQPIDILYSKIAKKKFELLKMGSENTL	567
XP_001981981.3      APSLYDQSQPIDILYSKIAKKKFELLKMGSENTL	574
                    ******************************:***
"""
In [9]:
# This is a test of the above, do not change this code.
records = [record for record in Bio.SeqIO.parse(io.StringIO(msa), "clustal")]
assert len(records)>=6
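In the Clustal output, the line under each block marks conservation: `*` for a column where all sequences have the same residue, `:` for strongly similar residues, and `.` for weakly similar ones. The sketch below recomputes only the fully conserved (`*`) columns from a toy alignment; the function name `conserved_columns` and the toy sequences are hypothetical, and gaps (`-`) are treated as not conserved.

```python
def conserved_columns(aligned):
    """Return indices of columns where every sequence has the same residue.

    `aligned` maps sequence IDs to gapped sequences of equal length.
    Columns containing a gap ('-') are never reported as conserved.
    """
    rows = list(aligned.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    return [
        i
        for i in range(length)
        if rows[0][i] != "-" and len({row[i] for row in rows}) == 1
    ]


# Hypothetical toy alignment, not taken from the Clustal Omega result above.
toy_alignment = {
    "seq1": "MKA-TL",
    "seq2": "MKAATL",
    "seq3": "MRA-TL",
}

cols = conserved_columns(toy_alignment)
fraction = len(cols) / len(next(iter(toy_alignment.values())))
print(cols, fraction)
```

Applied to a real alignment, a high fraction of conserved columns, as in the Drosophila sequences above, suggests the proteins are closely related.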
</html>