{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 1,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# You must run this cell, but you can ignore its contents.\n",
|
|||
|
"\n",
|
|||
|
"import hashlib\n",
|
|||
|
"\n",
|
|||
|
"def ads_hash(ty):\n",
|
|||
|
" \"\"\"Return a unique string for input\"\"\"\n",
|
|||
|
" ty_str = str(ty).encode()\n",
|
|||
|
" m = hashlib.sha256()\n",
|
|||
|
" m.update(ty_str)\n",
|
|||
|
" return m.hexdigest()[:10]"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Transcriptome clustering analysis\n",
|
|||
|
"\n",
|
|||
|
"In this exercise, you will analyze the results of an experiment in which RNA was sequenced (a \"transcriptome\" was made) for many individual cells from a cell culture. Although many cells were sequenced, we expect the total number of cell types to be rather limited.\n",
|
|||
|
"\n",
|
|||
|
"The data here is fake, but the analysis methods are real: they are in heavy use across many labs and can be applied to many kinds of problems beyond RNA sequencing data."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 2,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.decomposition import PCA\n",
|
|||
|
"import numpy as np\n",
|
|||
|
"\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"import seaborn as sns"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We are given a dataset in which the RNA expression levels of 50 genes were quantified for each of many cells. The data is in the file `RNAseq_data_50genes.csv`. Let's read this into a pandas DataFrame."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 3,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"df = pd.read_csv('RNAseq_data_50genes.csv')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Now let's have a first look at this data."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>gene 0</th>\n",
|
|||
|
" <th>gene 1</th>\n",
|
|||
|
" <th>gene 2</th>\n",
|
|||
|
" <th>gene 3</th>\n",
|
|||
|
" <th>gene 4</th>\n",
|
|||
|
" <th>gene 5</th>\n",
|
|||
|
" <th>gene 6</th>\n",
|
|||
|
" <th>gene 7</th>\n",
|
|||
|
" <th>gene 8</th>\n",
|
|||
|
" <th>gene 9</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>gene 40</th>\n",
|
|||
|
" <th>gene 41</th>\n",
|
|||
|
" <th>gene 42</th>\n",
|
|||
|
" <th>gene 43</th>\n",
|
|||
|
" <th>gene 44</th>\n",
|
|||
|
" <th>gene 45</th>\n",
|
|||
|
" <th>gene 46</th>\n",
|
|||
|
" <th>gene 47</th>\n",
|
|||
|
" <th>gene 48</th>\n",
|
|||
|
" <th>gene 49</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>2377</td>\n",
|
|||
|
" <td>2886</td>\n",
|
|||
|
" <td>1524</td>\n",
|
|||
|
" <td>2235</td>\n",
|
|||
|
" <td>2472</td>\n",
|
|||
|
" <td>1256</td>\n",
|
|||
|
" <td>1006</td>\n",
|
|||
|
" <td>1902</td>\n",
|
|||
|
" <td>911</td>\n",
|
|||
|
" <td>2285</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1115</td>\n",
|
|||
|
" <td>1381</td>\n",
|
|||
|
" <td>2226</td>\n",
|
|||
|
" <td>1810</td>\n",
|
|||
|
" <td>2124</td>\n",
|
|||
|
" <td>1479</td>\n",
|
|||
|
" <td>719</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>529</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>1251</td>\n",
|
|||
|
" <td>948</td>\n",
|
|||
|
" <td>3038</td>\n",
|
|||
|
" <td>3857</td>\n",
|
|||
|
" <td>1971</td>\n",
|
|||
|
" <td>1761</td>\n",
|
|||
|
" <td>2371</td>\n",
|
|||
|
" <td>632</td>\n",
|
|||
|
" <td>1705</td>\n",
|
|||
|
" <td>2251</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>954</td>\n",
|
|||
|
" <td>1874</td>\n",
|
|||
|
" <td>527</td>\n",
|
|||
|
" <td>1783</td>\n",
|
|||
|
" <td>1922</td>\n",
|
|||
|
" <td>1029</td>\n",
|
|||
|
" <td>173</td>\n",
|
|||
|
" <td>2267</td>\n",
|
|||
|
" <td>1078</td>\n",
|
|||
|
" <td>1343</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>2650</td>\n",
|
|||
|
" <td>1643</td>\n",
|
|||
|
" <td>1560</td>\n",
|
|||
|
" <td>2545</td>\n",
|
|||
|
" <td>1689</td>\n",
|
|||
|
" <td>1072</td>\n",
|
|||
|
" <td>1999</td>\n",
|
|||
|
" <td>1707</td>\n",
|
|||
|
" <td>579</td>\n",
|
|||
|
" <td>1655</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>589</td>\n",
|
|||
|
" <td>409</td>\n",
|
|||
|
" <td>967</td>\n",
|
|||
|
" <td>1762</td>\n",
|
|||
|
" <td>1789</td>\n",
|
|||
|
" <td>2424</td>\n",
|
|||
|
" <td>494</td>\n",
|
|||
|
" <td>1680</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1283</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>1622</td>\n",
|
|||
|
" <td>1581</td>\n",
|
|||
|
" <td>1333</td>\n",
|
|||
|
" <td>2218</td>\n",
|
|||
|
" <td>2346</td>\n",
|
|||
|
" <td>342</td>\n",
|
|||
|
" <td>1534</td>\n",
|
|||
|
" <td>1571</td>\n",
|
|||
|
" <td>456</td>\n",
|
|||
|
" <td>187</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>2535</td>\n",
|
|||
|
" <td>1524</td>\n",
|
|||
|
" <td>236</td>\n",
|
|||
|
" <td>1343</td>\n",
|
|||
|
" <td>1089</td>\n",
|
|||
|
" <td>1429</td>\n",
|
|||
|
" <td>930</td>\n",
|
|||
|
" <td>1415</td>\n",
|
|||
|
" <td>173</td>\n",
|
|||
|
" <td>1153</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>1863</td>\n",
|
|||
|
" <td>993</td>\n",
|
|||
|
" <td>1225</td>\n",
|
|||
|
" <td>1318</td>\n",
|
|||
|
" <td>1854</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1461</td>\n",
|
|||
|
" <td>634</td>\n",
|
|||
|
" <td>663</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>1882</td>\n",
|
|||
|
" <td>1072</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>3040</td>\n",
|
|||
|
" <td>1031</td>\n",
|
|||
|
" <td>553</td>\n",
|
|||
|
" <td>1468</td>\n",
|
|||
|
" <td>2100</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1001</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>5 rows × 50 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" gene 0 gene 1 gene 2 gene 3 gene 4 gene 5 gene 6 gene 7 gene 8 \\\n",
|
|||
|
"0 2377 2886 1524 2235 2472 1256 1006 1902 911 \n",
|
|||
|
"1 1251 948 3038 3857 1971 1761 2371 632 1705 \n",
|
|||
|
"2 2650 1643 1560 2545 1689 1072 1999 1707 579 \n",
|
|||
|
"3 1622 1581 1333 2218 2346 342 1534 1571 456 \n",
|
|||
|
"4 1863 993 1225 1318 1854 0 1461 634 663 \n",
|
|||
|
"\n",
|
|||
|
" gene 9 ... gene 40 gene 41 gene 42 gene 43 gene 44 gene 45 gene 46 \\\n",
|
|||
|
"0 2285 ... 0 1115 1381 2226 1810 2124 1479 \n",
|
|||
|
"1 2251 ... 954 1874 527 1783 1922 1029 173 \n",
|
|||
|
"2 1655 ... 589 409 967 1762 1789 2424 494 \n",
|
|||
|
"3 187 ... 2535 1524 236 1343 1089 1429 930 \n",
|
|||
|
"4 0 ... 1882 1072 0 3040 1031 553 1468 \n",
|
|||
|
"\n",
|
|||
|
" gene 47 gene 48 gene 49 \n",
|
|||
|
"0 719 0 529 \n",
|
|||
|
"1 2267 1078 1343 \n",
|
|||
|
"2 1680 0 1283 \n",
|
|||
|
"3 1415 173 1153 \n",
|
|||
|
"4 2100 0 1001 \n",
|
|||
|
"\n",
|
|||
|
"[5 rows x 50 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"df.head()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Q1 Understanding the raw data.\n",
|
|||
|
"\n",
|
|||
|
"In the first row (with index 0), how many reads were made of gene 0? Put the answer in the variable `n_reads_sample0_gene0`."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {
|
|||
|
"nbgrader": {
|
|||
|
"grade": false,
|
|||
|
"grade_id": "cell-867448dc531b9577",
|
|||
|
"locked": false,
|
|||
|
"schema_version": 3,
|
|||
|
"solution": true,
|
|||
|
"task": false
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Type your answer here and then run this and the following cell.\n",
|
|||
|
"n_reads_sample0_gene0 = 2377"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {
|
|||
|
"nbgrader": {
|
|||
|
"grade": true,
|
|||
|
"grade_id": "cell-927c0add61a309ae",
|
|||
|
"locked": true,
|
|||
|
"points": 1,
|
|||
|
"schema_version": 3,
|
|||
|
"solution": false,
|
|||
|
"task": false
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# This is a test of the above, do not change this code.\n",
|
|||
|
"assert type(n_reads_sample0_gene0)==int\n",
|
|||
|
"assert ads_hash(n_reads_sample0_gene0)=='1a5de96b83'"
|
|||
|
]
|
|||
|
},
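The grading cells in this notebook check both the type of an answer and a truncated hash of it. A minimal sketch of why both checks are useful, assuming the same `ads_hash` helper as in the first cell (the values below are hypothetical and unrelated to the graded answers):

```python
import hashlib

def ads_hash(ty):
    """Same idea as the helper in the first cell: first 10 hex digits of SHA-256 of str(ty)."""
    return hashlib.sha256(str(ty).encode()).hexdigest()[:10]

# Hypothetical values, chosen only for illustration.
as_int = ads_hash(7)
as_str = ads_hash("7")
as_float = ads_hash(7.0)

# str(7) and str("7") are both "7", so the hash alone cannot tell them apart.
print(as_int == as_str)

# str(7.0) is "7.0", so a float answer hashes differently from the int.
print(as_int == as_float)
```

Because the hash only sees `str(...)` of the answer, the separate `type(...)==int` assertion is what rules out a string answer.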
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"How many total cells were sequenced? This is the number of rows in the dataframe. Put this in the variable `n_cells_sequenced`."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {
|
|||
|
"nbgrader": {
|
|||
|
"grade": false,
|
|||
|
"grade_id": "cell-190def2eb393c7e4",
|
|||
|
"locked": false,
|
|||
|
"schema_version": 3,
|
|||
|
"solution": true,
|
|||
|
"task": false
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Type your answer here and then run this and the following cell.\n",
|
|||
|
"n_cells_sequenced = len(df)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {
|
|||
|
"nbgrader": {
|
|||
|
"grade": true,
|
|||
|
"grade_id": "cell-c04f0d4317e6abfe",
|
|||
|
"locked": true,
|
|||
|
"points": 1,
|
|||
|
"schema_version": 3,
|
|||
|
"solution": false,
|
|||
|
"task": false
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# This is a test of the above, do not change this code.\n",
|
|||
|
"assert type(n_cells_sequenced)==int\n",
|
|||
|
"assert ads_hash(n_cells_sequenced)=='284b7e6d78'"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"How many genes do we have in our dataset? This is the dimensionality of our dataset. We are counting reads for each of these genes, so if we have N genes, we have an N-dimensional dataset. Put your answer in the variable `n_dim`."
|
|||
|
]
|
|||
|
},
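The questions in this section all come down to reading values and sizes off a DataFrame. Here is a small sketch on a hypothetical toy table (the name `toy` and its numbers are made up for illustration, not taken from the real data):

```python
import pandas as pd

# Hypothetical toy table: 3 cells (rows) by 2 genes (columns).
toy = pd.DataFrame({"gene 0": [10, 20, 30], "gene 1": [5, 0, 7]})

# shape is (number of rows, number of columns), given as plain Python ints.
n_rows, n_cols = toy.shape

# A single entry comes back as a NumPy integer, so convert it with int()
# if a plain Python int is required.
first_value = int(toy.loc[0, "gene 0"])

print(n_rows, n_cols, first_value)
```

The `int(...)` conversion matters here because the grading cells test `type(...)==int`, which a NumPy integer would not pass.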
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {
|
|||
|
"nbgrader": {
|
|||
|
"grade": false,
|
|||
|
"grade_id": "cell-866e97e764ad0ed7",
|
|||
|
"locked": false,
|
|||
|
"schema_version": 3,
|
|||
|
"solution": true,
|
|||
|
"task": false
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Type your answer here and then run this and the following cell.\n",
|
|||
|
"n_dim = 50"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {
|
|||
|
"nbgrader": {
|
|||
|
"grade": true,
|
|||
|
"grade_id": "cell-b28c7b470e97c1de",
|
|||
|
"locked": true,
|
|||
|
"points": 1,
|
|||
|
"schema_version": 3,
|
|||
|
"solution": false,
|
|||
|
"task": false
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# This is a test of the above, do not change this code.\n",
|
|||
|
"assert type(n_dim)==int\n",
|
|||
|
"assert ads_hash(n_dim)=='1a6562590e'"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Quickly plotting pandas DataFrames with seaborn\n",
|
|||
|
"\n",
|
|||
|
"Several lectures ago, we discussed seaborn. We are going to use it below to make a plot of our transcriptomic data. Let's first practice with a simple dataset:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>column 1</th>\n",
|
|||
|
" <th>column 2</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" column 1 column 2\n",
|
|||
|
"0 1 1\n",
|
|||
|
"1 2 5\n",
|
|||
|
"2 3 5\n",
|
|||
|
"3 4 2\n",
|
|||
|
"4 5 2"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded image data omitted)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"simple_df = pd.DataFrame(data={'column 1':[1,2,3,4,5], 'column 2':[1,5,5,2,2]})\n",
|
|||
|
"display(simple_df)\n",
|
|||
|
"sns.scatterplot(data=simple_df, x='column 1', y='column 2');"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Q2 Visualizing the raw data\n",
|
|||
|
"\n",
|
|||
|
"Now let's use seaborn to make a quick plot of the data (stored as a pandas DataFrame in the variable `df`). Use the seaborn `scatterplot` function to make a plot like the following. Your plot should include the X and Y axis labels. You need only a single line of code for this.\n",
|
|||
|
"\n",
|
|||
|
"."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {
|
|||
|
"nbgrader": {
|
|||
|
"grade": true,
|
|||
|
"grade_id": "cell-3eb44d7a59e553d8",
|
|||
|
"locked": false,
|
|||
|
"points": 1,
|
|||
|
"schema_version": 3,
|
|||
|
"solution": true,
|
|||
|
"task": false
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"<Axes: xlabel='gene 0', ylabel='gene 1'>"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded image data omitted)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Type your answer here. It should make a plot just like above.\n",
|
|||
|
"sns.scatterplot(data=df, x='gene 0', y='gene 1')\n",
|
|||
|
"# plt.savefig('scatterplot.png')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# First impressions of the data\n",
|
|||
|
"\n",
|
|||
|
"So, how does our data look? At first glance it looks... like a bunch of random numbers with no real structure! But could there be some structure? For example, above we learned that although many cells have been sequenced, we expect these are from only a very limited number of cell types.\n",
|
|||
|
"\n",
|
|||
|
"How can we figure out something about these cell types?\n",
|
|||
|
"\n",
|
|||
|
"Let's make use of principal component analysis (PCA) and clustering from scikit-learn as some of the first tools in our toolkit."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Converting to plain numpy\n",
|
|||
|
"\n",
|
|||
|
"While pandas is very convenient for many things, scikit-learn works with plain numpy arrays and generally works best when the datatype is a floating point number rather than an integer. Let's do this conversion now and call our data `X`. (We also `copy()` this to a new numpy array to ensure it is contiguous in memory.)"
|
|||
|
]
|
|||
|
},
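To see what this conversion produces, here is a small sketch on a hypothetical toy DataFrame (`toy` and `X_toy` are illustrative names, not part of the exercise):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table of integer read counts.
toy = pd.DataFrame({"gene 0": [10, 20, 30], "gene 1": [5, 0, 7]})

# Convert to a floating point numpy array and copy it into fresh memory.
X_toy = toy.to_numpy(dtype=np.float64).copy()

print(X_toy.dtype)                   # datatype of the array elements
print(X_toy.shape)                   # (rows, columns), same layout as the DataFrame
print(X_toy.flags["C_CONTIGUOUS"])   # whether the array is row-major contiguous
```

The rows still correspond to cells and the columns to genes. Only the container and the datatype have changed.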
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 13,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"X = df.to_numpy(dtype=np.float64).copy()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## PCA\n",
|
|||
|
"\n",
|
|||
|
"Let's first run PCA on our data."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 14,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.decomposition import PCA\n",
|
|||
|
"pca = PCA().fit(X)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"The results of our analysis are stored in the variable `pca`. We can use this to project our original 50-dimensional data onto its principal components and plot just the first two dimensions in this principal component space."
|
|||
|
]
|
|||
|
},
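It can help to see what "projecting onto principal components" means numerically. The sketch below uses only numpy on hypothetical random data (`X_demo` is made up for illustration). As far as we understand it, this matches what scikit-learn's `PCA` does with its default settings, up to the sign of each component, but treat it as an illustration rather than the library's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples in 5 dimensions, stretched along the first axis.
X_demo = rng.normal(size=(200, 5))
X_demo[:, 0] *= 10.0

# Step 1: center each column on its mean.
X_centered = X_demo - X_demo.mean(axis=0)

# Step 2: singular value decomposition of the centered data.
# The rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: project the centered data onto the principal directions.
projected_demo = X_centered @ Vt.T

# The first two columns are what a PC1 versus PC2 scatter plot shows.
pc1, pc2 = projected_demo[:, 0], projected_demo[:, 1]
print(projected_demo.shape)
```

The projected data has the same number of columns as the original. Plotting only the first two keeps the directions along which the data varies the most.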
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"projected = pca.transform(X)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded image data omitted)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"plt.plot(projected[:,0], projected[:,1],'.')\n",
|
|||
|
"plt.xlabel('PC1')\n",
|
|||
|
"plt.ylabel('PC2');"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Structure in the PCA space\n",
|
|||
|
"\n",
|
|||
|
"Now, what do you notice about the data in this PCA space?\n",
|
|||
|
"\n",
|
|||
|
"Instead of looking like a structure-free blob, the data now seems to have some structure. What kind of structure is it?\n",
|
|||
|
"\n",
|
|||
|
"\"3 clusters\", I hope you are thinking. Biologically speaking, we are now guessing that there were this many cell types in our original sample.\n",
|
|||
|
"\n",
|
|||
|
"## PCA explained variance\n",
|
|||
|
"\n",
|
|||
|
"One of the first questions about PCA is how much of the variance in our data are \"explained\" by the first N components of the projected data. Let's plot this."
|
|||
|
]
|
|||
|
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "(base64-encoded PNG omitted)",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.plot(np.cumsum(pca.explained_variance_ratio_), '.-')\n",
"plt.xlabel('number of components')\n",
"plt.ylabel('cumulative explained variance');"
]
},
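{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough follow-up to the plot above, we can read off how many components are needed to reach a given fraction of the variance. This is only a minimal sketch: it assumes `pca` is the fitted PCA object used above, and the 90% cutoff (`threshold`) is an arbitrary value chosen for illustration, not something prescribed by this exercise."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch; `threshold` is an assumed cutoff, not part of the exercise.\n",
"threshold = 0.9\n",
"cumulative = np.cumsum(pca.explained_variance_ratio_)\n",
"\n",
"if cumulative[-1] < threshold:\n",
"    print('The fitted components explain only %.1f%% of the variance.' % (100 * cumulative[-1]))\n",
"else:\n",
"    # argmax returns the first index where the condition holds; add 1 to convert it to a count.\n",
"    n_needed = int(np.argmax(cumulative >= threshold)) + 1\n",
"    print('%d components explain at least %d%% of the variance.' % (n_needed, 100 * threshold))"
]
},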
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mini-batch K-Means\n",
"\n",
"Since our data appear to contain three clusters, let's find them using mini-batch K-Means."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import MiniBatchKMeans\n",
"from sklearn.metrics.pairwise import pairwise_distances_argmin"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q3. Specifying *k*, the number of clusters\n",
"\n",
"As always with a K-Means-type algorithm, we must specify the number of clusters before running it. Based on your observations above, store this number in the variable `n_clusters`."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-956d77de44aaa49b",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [],
"source": [
"# Type your answer here and then run this and the following cell.\n",
"n_clusters = 3"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "cell-b05818d9386980f7",
"locked": true,
"points": 1,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# This is a test of the above, do not change this code.\n",
"assert type(n_clusters)==int\n",
"assert ads_hash(n_clusters)=='4e07408562'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will actually run the algorithm, using the implementation from scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"mbk = MiniBatchKMeans(n_clusters=n_clusters, batch_size=6, random_state=0, n_init='auto').fit(X);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting the clustering results in the original \"number of reads\" space\n",
"\n",
"Let's first plot our raw read data in a scatter plot like the one above, but with each point colored according to its cluster label. We will also plot the cluster centers."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"mbk_means_cluster_centers = mbk.cluster_centers_\n",
"mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)"
]
},
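{
"cell_type": "markdown",
"metadata": {},
"source": [
"One quick sanity check, sketched here under the assumption that `mbk`, `X`, `n_clusters`, and `mbk_means_labels` are defined as in the cells above: count how many cells fall into each cluster, and compare the nearest-center labels with those returned by the model's own `predict` method. The two labelings should normally agree, although ties or floating-point rounding could in principle cause small differences."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: cluster sizes and a consistency check against mbk.predict.\n",
"cluster_sizes = np.bincount(mbk_means_labels, minlength=n_clusters)\n",
"for k, size in enumerate(cluster_sizes):\n",
"    print('cluster %d: %d cells' % (k, size))\n",
"\n",
"# Count the cells whose nearest-center label differs from the model's prediction.\n",
"n_differ = int(np.sum(mbk_means_labels != mbk.predict(X)))\n",
"print('labels differing from mbk.predict(X): %d' % n_differ)"
]
},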
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "(base64-encoded PNG omitted)",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots(nrows=1, ncols=1)\n",
"\n",
"x_gene_idx = 0\n",
"y_gene_idx = 1\n",
"\n",
"for k in range(n_clusters):\n",
"    my_members = mbk_means_labels == k\n",
"    cluster_center = mbk_means_cluster_centers[k]\n",
"    # Plot the cells assigned to cluster k, then its center in the same color.\n",
"    line, = ax.plot(X[my_members, x_gene_idx], X[my_members, y_gene_idx], '.', markersize=5)\n",
"    ax.plot(cluster_center[x_gene_idx], cluster_center[y_gene_idx], 'o', markersize=10,\n",
"            markeredgecolor='black', markerfacecolor=line.get_color())\n",
"ax.set_xlabel('gene %d' % x_gene_idx)\n",
"ax.set_ylabel('gene %d' % y_gene_idx);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting the clustering results in PCA space\n",
"\n",
"Hmm, the plot above was not very informative. It does not show obvious clusters in the data, and points from different clusters look heavily interwoven, at least for these two genes.\n",
"\n",
"Let's re-plot our cluster assignments, but this time using the projection into PCA space."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "(base64-encoded PNG omitted)",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots(nrows=1, ncols=1)\n",
"\n",
"projected_centers = pca.transform(mbk.cluster_centers_)\n",
"\n",
"for k in range(n_clusters):\n",
"    my_members = mbk_means_labels == k\n",
"    projected_cluster_center = projected_centers[k]\n",
"    line, = ax.plot(projected[my_members, 0], projected[my_members, 1], '.',\n",
"                    markersize=1.8)\n",
"    ax.plot(projected_cluster_center[0], projected_cluster_center[1], 'o',\n",
"            markersize=10, markeredgecolor='black', markerfacecolor=line.get_color())\n",
"ax.set_xlabel('PC 1')\n",
"ax.set_ylabel('PC 2');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Silhouette analysis with K-Means\n",
"\n",
"OK, that last plot (in PCA space) looks better. The automatically detected clusters seem to agree with the impression we got from just looking at the data. Let's now use a silhouette analysis as one way to check whether this was a particularly good number of clusters for these data."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import silhouette_samples, silhouette_score\n",
"from sklearn.cluster import KMeans\n",
"import matplotlib.cm as cm"
]
},
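{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before scanning over several values of *k*, a single number can already summarize the clustering we have. As a minimal sketch (assuming `X`, `n_clusters`, and `mbk_means_labels` from the cells above), we can compute the mean silhouette coefficient, which ranges from -1 to 1; higher values indicate tighter, better-separated clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: mean silhouette coefficient for the mini-batch K-Means labels.\n",
"mbk_silhouette = silhouette_score(X, mbk_means_labels)\n",
"print('Mean silhouette score for n_clusters = %d: %.3f' % (n_clusters, mbk_silhouette))"
]
},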
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"For n_clusters = 2 The average silhouette_score is : 0.10616550010654774\n",
|
|||
|
"For n_clusters = 3 The average silhouette_score is : 0.13476859870312252\n",
|
|||
|
"For n_clusters = 4 The average silhouette_score is : 0.10793693499115094\n",
|
|||
|
"For n_clusters = 5 The average silhouette_score is : 0.0668711029107827\n",
|
|||
|
"For n_clusters = 6 The average silhouette_score is : 0.050878820272807025\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAABZoAAAKgCAYAAAAS1si3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd1gUV9sG8Hthd+kgvQtiwYLYRWyIYO89sZeYGH01tiSaxBqjSUyMpmjim8Qeu0aNSmJPjCVq7C3GXlBRpEtZ9nx/8DEvCwsssDCU+3ddXMzOnDnzzOzM7syzZ84ohBACRERERERERERERESFZCJ3AERERERERERERERUtjHRTERERERERERERERFwkQzERERERERERERERUJE81EREREREREREREVCRMNBMRERERERERERFRkTDRTERERERERERERERFwkQzERERERERERERERUJE81EREREREREREREVCRMNBMRERERERERERFRkTDRTESl1sqVK6FQKKS/rNq0aSONHz58uDT+zp07OvMcPny4ZIOmUufw4cM6+8SdO3dKbNmzZ8+Wluvr61tsy4mPj8dbb70FHx8fqFQqaZkrV64stmVS2SbncVEa8LuC9Cmpz+z8DB8+XIqjTZs2ssVRGhR1W5TnbcnPMSIiKo2YaCaiEiGEwE8//YT27dvDxcUFKpUKlSpVgp+fH9q2bYspU6bg999/lzvMUs2Qi6WKnjyqqMaMGYMvv/wS9+7dg0ajkTscADn3xexJ7+joaDRq1Eiabmpqih9//FHvvAqFAgMGDNC7nB9++CFH2dmzZxfz2pEx5LV/EJHxlJYEurGV5yRyWeHr68vv3nxERETgnXfeQcuWLeHr6wsLCwtYWVkhICAAU6dOxZMnT+QOkYjIqJRyB0BEFcOQIUOwbt06nXGxsbGIjY3F7du3cejQIcTGxqJ169bS9CZNmmDhwoUlHSqR0bRv3x7W1tYAADs7u2JZRlpaGjZv3iy9btWqFbp06QJTU1M0adKkWJZZVFFRUQgPD8eFCxcAAKampli1ahUGDRqU6zzbtm3Dw4cP4enpqTP+66+/LtZYqXxycHDQ+X6pWrWqjNEQUW5eeeUVBAQEAAC8vb1ljoao4Hr27ImUlJQc4y9fvozLly9j1apVOHTokLSfExGVdUw0E1Gx27t3r06SOSgoCOHh4TAzM8P9+/dx/fp1HD9+PMd8derUQZ06dUoyVCKjat68OZo3b16sy4iMjERaWpr0etasWQgLCyvWZSYnJ8PU1BQqlarA80ZGRiIsLAxXr14FAKhUKqxbtw79+vXLcz6NRoNvv/0WH374oTTu6NGjOHfuXIFjoIorc9+1tbXF1KlT5Q6HiPLRsWNHdOzYUe4wqBTRaDRIS0uDhYWF3KEYzNTUFG3atEFwcDCEENi2bZt0HvTs2TOMGTMGR48elTlKIiLjYNcZRFTs9u3bJw1Xr14dx44dw7x58zBjxgwsX74cR44cwZMnT/D666/rzJdXH80FsXnzZjRt2hQWFhZwcnLC8OHDER0drbfs/v370adPH3h6ekKtVsPOzg5BQUH4+OOPER8fr1M2v77x8rud8OzZsxgxYgT8/Pxgbm4OGxsbNGnSBIsWLUJycnKO7bBq1Spp3JEjR3IsW6FQIDQ0VGcZVapU0duXdUGWn5/k5GS8//776NixI/z8/GBnZweVSgUnJye0bt0aX3/9dY7uHPRtu7Vr16Jx48Z5vk9Pnz7F22+/jbZt28LHxwc2NjZQq9VwdXVF+/btsXbtWgghDIq7bdu20vKHDRuWY/qSJUuk6a6urlIy9+LFixg8eDB8fX1hZmYGCwsLVK5cGW3btsX06dPx8OFDqY68ble+e/cu3njjDVSvXh0WFhYwNzeHp6cnWrRogcmTJ0sXIHnx9fWFj4+Pzrjw8HC9Xadcv34dY8aMkZZnZWWFmjVrYsKECXq7WMneD/rZs2fRuXNn2N
vbw8LCQmc9DfXgwQOEhIRI66ZWq7F58+Z8k8wmJhmnK8uXL9dpFfTVV1/pTM/Lv//+i3HjxqFmzZqwtLSEpaUl6tati1mzZiE2NjZH+c2bN2PQoEEICAiAi4sL1Go1rK2tUadOHYwfP96gbXb9+nX069cPDg4OsLCwQHBwsN4+NAuyT+UnLS0N//3vfxEeHg5nZ2eo1Wq4uLigRYsWBt8hklsf+IBx9unM+rMaMWJErvVGRkZi2rRpCAwMhI2NDczNzVGjRg1MnjwZjx8/zjd+fftuXp/f2dcxJiYGkyZNgre3N8zMzODv749ly5bp3XYXL15Et27dYGtrC1tbW3To0AFnzpwpUtcFRT12Dd0P87Jz50507NgRrq6uUKlUsLW1RdWqVdGzZ08sWLAAWq1WKrt//36MHDkSDRo0gJubG8zMzGBpaYnq1atj5MiRuHjxYo76s3eB8M8//6BXr16ws7ODg4MDBg4cKN1efujQIbRq1QqWlpZwdnbGqFGj8OLFC536sp8/JCcnY9asWahatSrMzMxQtWpVzJs3T+dHOkPExMRg3rx5aNKkCezs7GBmZgZfX1+MHj0a//77b4HqAoDff/8dbdq0gZWVFRwcHNCvXz/cvHkzz3m+++479OvXDzVr1oSTk5P0fjRo0ADTpk3Ds2fPpLKZ5wZz5syRxt29e1dvtzX//vsv3nrrLbRs2RLe3t6wsrKCmZkZvLy80L17d/zyyy8Gr9fixYul+v39/XWmNWjQQJqW9fwwazdIrq6u0ne5vu4xDD0v0ufp06d444034ObmBnNzc9SrVw9bt241eN0Ke/5SEEIIbNy4EV27doW7uzvUajUcHR3RpEkTvPvuuwbVkVe3InmdXz979gxTp05FnTp1YGVlBbVaDTc3NzRt2hT/+c9/cOLECZ367969K807Z86cXOst6LGTPf5bt25hwIABcHJyglqtxsmTJwEY9/uzuIwYMQI3b97E/v378eGHH2LevHk4e/asTmOaP//8M8d1BhFRmSWIiIrZ+PHjBQABQDg6Oorr168bNN+KFSuk+bJ/XIWEhEjjhw0bJo2/ffu2zjzt27fXeZ3516JFixzLmzx5st6ymX/Vq1cXd+/ezXVZhw4d0qnPx8dHmjZr1iydaV999ZUwNTXNdVlNmjQRMTExereDvr9Dhw7lWybrdirI8vMTFRWV77LDw8OFRqPJddu1aNHCoPfp1KlT+S5rxIgROvNk3za3b98WQgixdetWaZyFhUWO9W3evLk0ffLkyUIIIS5fviwsLS3zXP7evXulOmbNmiWN9/HxkcY/efJEODs751nPsmXL8t32WfcxfX+Z67px40Zhbm6eazkbGxvx66+/6tSd9Rhr0KBBjvXOrDs32bf7rFmzRJUqVaTX5ubmYs+ePQbN26NHD2l49erVQgghHj58KJRKpQAgevbsmWNZWW3dulVYWFjkuv5Vq1bVObaFEKJLly55bltbW1tx4cKFXLdZYGCgsLa2zjGfWq0Wly5dkuYp6D6Vl6ioKNGoUaNc68m6D+Z2XGRfj6yfG0IYZ5/OWn9+cR49elQ4ODjkWtbFxUWcPXs21/cht303r8/vrOvo6OgoatasqXfZy5cv11nuqVOn9L7n5ubmIjw8XO/65aeox66h+2FeDPkOevnypVR+3LhxeZZVq9Vi3759OssYNmyYNL1KlSrC3t4+x3z+/v5i7dq1wsTEJMe01q1b5xlz27Zt9cbSq1cvnfly27+FEOLatWuicuXKua6XlZVVjvciL7/88ov0GZb1z8HBQQQHB0uvQ0JCdOarU6dOntvX09NTPHz4UAiR8zjX97dixQohhBCbN2/Ot+ycOXMMWrfz58/rzPf48WMhhBBxcXE65x4zZsyQ5sm6D/Tv31/v+MxtYeh5Ufb5/f399X5vKhQKg9+7wp6/GCopKUl07Ngxz3XLLZasn2P6tlum3M6vX758Kfz9/fNc9rvvvpuj/vziLMyxk7X+6t
WrCxcXlxzraszvT0P2qax/2bdpYUyZMkWnzmfPnhW5TiKi0oBdZxBRsatfv740/Pz5c9SsWROBgYFo0qQJmjRpgvD
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1800x700 with 2 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAABZoAAAKgCAYAAAAS1si3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3gU1foH8O9sS2+bnhCS0EINvfdepYiAUqSJIlwQEK9YKSKoqNcK6k8vKCAdRKQoIKBIEZAqRaSXJIT0nuzu+f2RmzGbbJJNnU3y/TxPnuzOnDnzzuzM7sw7Z85IQggBIiIiIiIiIiIiIqISUikdABERERERERERERFVbkw0ExEREREREREREVGpMNFMRERERERERERERKXCRDMRERERERERERERlQoTzURERERERERERERUKkw0ExEREREREREREVGpMNFMRERERERERERERKXCRDMRERERERERERERlQoTzURERERERERERERUKkw0E5HNWrVqFSRJkv9y69atmzx8woQJ8vCbN2+aTXPw4MGKDZpszsGDB822iZs3b1bYvBcsWCDPNyQkpNzmk5SUhOeeew7BwcHQarXyPFetWlVu86TKTcn9whbwt4Isqajv7KJMmDBBjqNbt26KxWELSrsuqvK65PcYERHZIiaaiahCCCHw7bffok+fPvDx8YFWq4W7uztq1aqFHj164Pnnn8cvv/yidJg2zZqTpeqePKqupk6dio8++gi3b9+GwWBQOhwA+bfFvEnv2NhYtGzZUh6vVqvx3//+1+K0kiRh1KhRFufz1Vdf5Su7YMGCcl46KguFbR9EVHZsJYFe1qpyErmyCAkJ4W9vEfbt24cJEyagadOm8PX1hVarhaOjI+rUqYMxY8bw/IeIqhyN0gEQUfUwbtw4rF271mxYQkICEhIScOPGDRw4cAAJCQno0qWLPL5169ZYtmxZRYdKVGb69OkDZ2dnAICbm1u5zCMrKwubNm2S33fu3BkDBw6EWq1G69aty2WepRUdHY1evXrh3LlzAAC1Wo2vv/4aY8aMKXCarVu34t69ewgMDDQb/sknn5RrrFQ16fV6s9+X2rVrKxgNERXk8ccfR+PGjQEAQUFBCkdDVHzfffcdvv76a7NhBoMB165dw7Vr1/Dtt9/i448/xr/+9S+FIiQiKltMNBNRudu9e7dZkrlt27bo1asX7OzscOfOHVy5cgVHjx7NN12jRo3QqFGjigyVqEx16NABHTp0KNd5REREICsrS34/f/589OzZs1znmZ6eDrVaDa1WW+xpIyIi0LNnT1y6dAkAoNVqsXbtWowYMaLQ6QwGAz777DO88cYb8rDDhw/jzJkzxY6Bqq+cbdfV1RVz585VOhwiKkK/fv3Qr18/pcMgG2IwGJCVlQUHBwelQ7GKnZ0d2rVrh+bNm8PHxwcqlQpnzpzBd999ByEEAOCVV17BM888U6LjKiIiW8OuM4io3O3du1d+XbduXRw5cgSLFy/Ga6+9hi+++AKHDh1CVFQUnn76abPpCuujuTg2bdqENm3awMHBAV5eXpgwYQJiY2Mtlt23bx+GDx+OwMBA6HQ6uLm5oW3btnjrrbeQlJRkVraovvGKup3w9OnTmDhxImrVqgV7e3u4uLigdevWeP/995Genp5vPeRuDXHo0KF885YkCd27dzebR2hoqMW+rIsz/6Kkp6fjlVdeQb9+/VCrVi24ublBq9XCy8sLXbp0wSeffJKvOwdL627NmjVo1apVoZ/TgwcP8MILL6BHjx4IDg6Gi4sLdDodfH190adPH6xZs0Y+aC9Kjx495PmPHz8+3/gPP/xQHu/r6ysnc8+fP4+xY8ciJCQEdnZ2cHBwQM2aNdGjRw+89NJLuHfvnlxHYbcr37p1C8888wzq1q0LBwcH2NvbIzAwEB07dsScOXPkRGxhQkJCEBwcbDasV69eFrtOuXLlCqZOnSrPz8nJCfXr18fMmTMtdr
GStx/006dPY8CAAfDw8ICDg4PZclrr7t276Nq1q7xsOp0OmzZtKjLJrFJlH6588cUXyMjIkId//PHHZuML8/fff2P69OmoX78+HB0d4ejoiCZNmmD+/PlISEjIV37Tpk0YM2YMGjduDB8fH+h0Ojg7O6NRo0aYMWOGVevsypUrGDFiBPR6PRwcHNC+fXuLfWgWZ5sqSlZWFv7v//4PvXr1gre3N3Q6HXx8fNCxY0er7xApqA98oGy26Zz6c5s4cWKB9UZERGDevHkIDw+Hi4sL7O3tUa9ePcyZMweRkZFFxm9p2y3s+zvvMsbHx2P27NkICgqCnZ0dwsLCsGLFCovr7vz583jkkUfg6uoKV1dX9O3bF6dOnSpV1wWl3Xet3Q4L8/3336Nfv37ybd+urq6oXbs2hg4diqVLl8JkMsll9+3bh0mTJqF58+bw8/ODnZ0dHB0dUbduXUyaNAnnz5/PV3/eLhD++usvDBs2DG5ubtDr9Rg9ejSioqIAAAcOHEDnzp3h6OgIb29vTJ48GXFxcWb15T1+SE9Px/z581G7dm3Y2dmhdu3aWLx4sdlFOmvEx8dj8eLFaN26Ndzc3GBnZ4eQkBBMmTIFf//9d7HqAoBffvkF3bp1g5OTE/R6PUaMGIFr164VOs3nn3+OESNGoH79+vDy8pI/j+bNm2PevHl4+PChXDbn2GDhwoXysFu3blnstubvv//Gc889h06dOiEoKAhOTk6ws7NDjRo1MHjwYPzwww9WL9cHH3wg1x8WFmY2rnnz5vK43MeHubtB8vX1lX/LLXWPYe1xkSUPHjzAM888Az8/P9jb26Np06bYsmWL1ctW0uOX4hBCYMOGDRg0aBD8/f2h0+ng6emJ1q1b48UXX7SqjsK6FSns+Prhw4eYO3cuGjVqBCcnJ+h0Ovj5+aFNmzb417/+hWPHjpnVf+vWLXnahQsXFlhvcfedvPFfv34do0aNgpeXF3Q6HY4fPw6gbH8/y8t7772Ho0ePYvny5ViwYAFef/11bN26FU899ZRcJjExsVTbDBGRTRFEROVsxowZAoAAIDw9PcWVK1esmm7lypXydHm/rrp27SoPHz9+vDz8xo0bZtP06dPH7H3OX8eOHfPNb86cORbL5vzVrVtX3Lp1q8B5HThwwKy+4OBgedz8+fPNxn388cdCrVYXOK/WrVuL+Ph4i+vB0t+BAweKLJN7PRVn/kWJjo4uct69evUSBoOhwHXXsWNHqz6nEydOFDmviRMnmk2Td93cuHFDCCHEli1b5GEODg75lrdDhw7y+Dlz5gghhPjzzz+Fo6NjofPfvXu3XMf8+fPl4cHBwfLwqKgo4e3tXWg9K1asKHLd597GLP3lLOuGDRuEvb19geVcXFzEjz/+aFZ37n2sefPm+ZY7p+6C5F3v8+fPF6GhofJ7e3t7sWvXLqumHTJkiPz6m2++EUIIce/ePaHRaAQAMXTo0Hzzym3Lli3CwcGhwOWvXbu22b4thBADBw4sdN26urqKc+fOFbjOwsPDhbOzc77pdDqduHDhgjxNcbepwkRHR4uWLVsWWE/ubbCg/SLvcuT+3hCibLbp3PUXFefhw4eFXq8vsKyPj484ffp0gZ9DQdtuYd/fuZfR09NT1K9f3+K8v/jiC7P5njhxwuJnbm9vL3r16mVx+YpS2n3X2u2wMNb8BqWlpcnlp0+fXmhZnU4n9u7dazaP8ePHy+NDQ0OFh4dHvunCwsLEmjVrhEqlyjeuS5cuhcbco0cPi7EMGzbMbLqCtm8hhLh8+bKoWbNmgcvl5OSU77MozA8//CB/h+X+0+v1on379vL7rl27mk3XqFGjQtdvYGCguHfvnhAi/35u6W/lypVCCCE2bdpUZNmFCxdatWxnz541my4yMlIIIURiYqLZscdrr70mT5N7Gxg5cqTF4TnrwtrjorzTh4WFWfzdlCTJ6s+upMcv1kpNTRX9+vUrdNkKiiX395il9ZajoOPrtLQ0ERYWVu
i8X3zxxXz1FxVnSfad3PXXrVtX+Pj45FvWsvz9tGabyv2Xd50WR3Jysvjtt99E3bp15fq8vb2FyWQqcZ1ERLaEXWc
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1800x700 with 2 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAABZoAAAKgCAYAAAAS1si3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3gUVdsG8Hs3W7Kb3juBUEINvYTee0dAKdJEEV8QEF/FRhFBRVEsoH4qqCAdBASigIAiRUB6E+mdAOl9d8/3x74Zsskm2SSbTMr9u669MjtzZuaZ2ZnszLNnzlEIIQSIiIiIiIiIiIiIiApJKXcARERERERERERERFS2MdFMREREREREREREREXCRDMRERERERERERERFQkTzURERERERERERERUJEw0ExEREREREREREVGRMNFMREREREREREREREXCRDMRERERERERERERFQkTzURERERERERERERUJEw0ExEREREREREREVGRMNFMRKXWsmXLoFAopFdW7du3l8aPHj1aGn/16lWLefbs2VOyQVOps2fPHotj4urVqyW27lmzZknrrVy5crGtJyEhAS+++CJCQ0OhVquldS5btqzY1kllm5znRWnA7wqypqT+Z+dn9OjRUhzt27eXLY7SoKj7ojzvS/4fIyKi0oiJZiIqEUII/Pjjj+jatSt8fX2hVqvh7u6OsLAwdOzYES+99BJ+//13ucMs1Wy5WaroyaOKasKECfjkk09w/fp1GAwGucMBkPNYzJ70fvToERo3bixNd3BwwLfffmt1XoVCgaFDh1pdzzfffJOj7KxZs4p568ge8jo+iMh+SksC3d7KcxK5rKhcuTK/ewtow4YNOa5beL1OROWJSu4AiKhiGDlyJFasWGExLi4uDnFxcbhy5Qp2796NuLg4tG3bVpretGlTLFiwoKRDJbKbrl27wtnZGQDg5uZWLOvIyMjA2rVrpfdt2rRBr1694ODggKZNmxbLOosqOjoanTt3xsmTJwEADg4O+O677zB8+PBc59mwYQNu3bqFoKAgi/GfffZZscZK5ZOnp6fF90vVqlVljIaIcvPkk0+ibt26AICQkBCZoyEqmocPH+L555+XOwwiomLFRDMRFbvt27dbJJmbN2+Ozp07Q6vV4saNG7hw4QIOHDiQY746deqgTp06JRkqkV21bNkSLVu2LNZ13LlzBxkZGdL7mTNnolOnTsW6ztTUVDg4OECtVhd43jt37qBTp044d+4cAECtVmPFihUYPHhwnvMZDAZ88cUXePvtt6Vx+/btw/HjxwscA1Vcmceuq6srpk+fLnc4RJSP7t27o3v37nKHQaWIwWBARkYGdDqd3KEU2MSJE3H//n25wyAiKlZsOoOIit2OHTuk4erVq2P//v2YO3cu3nzzTXz11VfYu3cv7t27h2effdZivrzaaC6ItWvXolmzZtDpdPD29sbo0aPx6NEjq2V37tyJQYMGISgoCBqNBm5ubmjevDneffddJCQkWJTNr228/B4nPHbsGMaMGYOwsDA4OjrCxcUFTZs2xcKFC5GamppjP3z33XfSuL179+ZYt0KhQIcOHSzWUaVKFattWRdk/flJTU3F66+/ju7duyMsLAxubm5Qq9Xw9vZG27Zt8dlnn+VozsHavlu+fDmaNGmS5+d0//59vPzyy+jYsSNCQ0Ph4uICjUYDPz8/dO3aFcuXL4cQwqa4O3bsKK1/1KhROaYvWrRImu7n5yclc0+dOoURI0agcuXK0Gq10Ol0qFSpEjp27IgZM2bg1q1b0jLyelz52rVreO6551C9enXodDo4OjoiKCgIrVq1wrRp06REbF4qV66M0NBQi3GdO3e2+ijmhQsXMGHCBGl9Tk5OqFmzJiZPnmz1kc3s7aAfO3YMPXv2hIeHB3Q6ncV22urmzZto166dtG0ajQZr167NN8msVJovV7766iukpaVJ4z/99FOL6Xn5999/8cILL6BmzZ
"text/plain": [
"<Figure size 1800x700 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 1800x700 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 1800x700 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# From https://scikit-learn.org/0.21/auto_examples/cluster/plot_kmeans_silhouette_analysis.html\n",
"range_n_clusters = [2, 3, 4, 5, 6]\n",
"\n",
"for n_clusters in range_n_clusters:\n",
"    # Create a subplot with 1 row and 2 columns\n",
"    fig, (ax1, ax2) = plt.subplots(1, 2)\n",
"    fig.set_size_inches(18, 7)\n",
"\n",
"    # The 1st subplot is the silhouette plot\n",
"    # The silhouette coefficient can range from -1, 1 but in this example all\n",
"    # lie within [-0.1, 1]\n",
"    ax1.set_xlim([-0.1, 1])\n",
"    # The (n_clusters+1)*10 is for inserting blank space between silhouette\n",
"    # plots of individual clusters, to demarcate them clearly.\n",
"    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])\n",
"\n",
"    # Initialize the clusterer with n_clusters value and a random generator\n",
"    # seed of 10 for reproducibility.\n",
"    clusterer = KMeans(n_clusters=n_clusters, random_state=10, n_init='auto')\n",
"    cluster_labels = clusterer.fit_predict(X)\n",
"\n",
"    # The silhouette_score gives the average value for all the samples.\n",
"    # This gives a perspective into the density and separation of the formed\n",
"    # clusters\n",
"    silhouette_avg = silhouette_score(X, cluster_labels)\n",
"    print(\"For n_clusters =\", n_clusters,\n",
"          \"The average silhouette_score is :\", silhouette_avg)\n",
"\n",
"    # Compute the silhouette scores for each sample\n",
"    sample_silhouette_values = silhouette_samples(X, cluster_labels)\n",
"\n",
"    y_lower = 10\n",
"    for i in range(n_clusters):\n",
"        # Aggregate the silhouette scores for samples belonging to\n",
"        # cluster i, and sort them\n",
"        ith_cluster_silhouette_values = \\\n",
"            sample_silhouette_values[cluster_labels == i]\n",
"\n",
"        ith_cluster_silhouette_values.sort()\n",
"\n",
"        size_cluster_i = ith_cluster_silhouette_values.shape[0]\n",
"        y_upper = y_lower + size_cluster_i\n",
"\n",
"        color = cm.nipy_spectral(float(i) / n_clusters)\n",
"        ax1.fill_betweenx(np.arange(y_lower, y_upper),\n",
"                          0, ith_cluster_silhouette_values,\n",
"                          facecolor=color, edgecolor=color, alpha=0.7)\n",
"\n",
"        # Label the silhouette plots with their cluster numbers at the middle\n",
"        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))\n",
"\n",
"        # Compute the new y_lower for next plot\n",
"        y_lower = y_upper + 10  # 10 for the 0 samples\n",
"\n",
"    ax1.set_title(\"The silhouette plot for the various clusters.\")\n",
"    ax1.set_xlabel(\"The silhouette coefficient values\")\n",
"    ax1.set_ylabel(\"Cluster label\")\n",
"\n",
"    # The vertical line for average silhouette score of all the values\n",
"    ax1.axvline(x=silhouette_avg, color=\"red\", linestyle=\"--\")\n",
"\n",
"    ax1.set_yticks([])  # Clear the yaxis labels / ticks\n",
"    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])\n",
"\n",
"    # 2nd Plot showing the actual clusters formed\n",
"    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)\n",
"    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,\n",
"                c=colors, edgecolor='k')\n",
"\n",
"    # Labeling the clusters\n",
"    centers = clusterer.cluster_centers_\n",
"    # Draw white circles at cluster centers\n",
"    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',\n",
"                c=\"white\", alpha=1, s=200, edgecolor='k')\n",
"\n",
"    for i, c in enumerate(centers):\n",
"        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,\n",
"                    s=50, edgecolor='k')\n",
"\n",
"    ax2.set_title(\"The visualization of the clustered data.\")\n",
"    ax2.set_xlabel(\"Feature space for the 1st feature\")\n",
"    ax2.set_ylabel(\"Feature space for the 2nd feature\")\n",
"\n",
"    plt.suptitle((\"Silhouette analysis for KMeans clustering on sample data \"\n",
"                  \"with n_clusters = %d\" % n_clusters),\n",
"                 fontsize=14, fontweight='bold')"
]
},
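To make the silhouette analysis above less of a black box, here is a minimal sketch of the quantity that `silhouette_samples` and `silhouette_score` report. It is written in plain Python so the arithmetic is visible. The helper names (`euclidean`, `silhouette_values`) and the six toy points are made up for illustration and are not part of the exercise data; for real work the scikit-learn functions should be preferred.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_values(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every point.

    a: mean distance from point i to the other members of its own cluster.
    b: smallest mean distance from point i to the members of any other cluster.
    Points in single-member clusters get s = 0 by convention.
    Assumes at least two distinct labels and no duplicated points.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


# Hypothetical toy data: two tight, well-separated groups of three points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the obvious grouping
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the two groups

good_avg = sum(silhouette_values(toy_points, good_labels)) / len(toy_points)
bad_avg = sum(silhouette_values(toy_points, bad_labels)) / len(toy_points)
print("good labelling:", round(good_avg, 3))
print("bad labelling: ", round(bad_avg, 3))
```

The average for the sensible labelling should come out close to 1 and the average for the mixed labelling much lower, which is the same signal the loop over `range_n_clusters` uses to compare different numbers of clusters.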
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The average silhouette score was highest for 3 clusters, so this analysis agrees with our \"by eye\" analysis.\n",
"\n",
"## agglomerative clustering\n",
"\n",
"Now let's try a different form of clustering in which we do not need to set the number of clusters in advance.\n",
"\n",
"This is *hierarchical agglomerative clustering*."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import AgglomerativeClustering"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"from scipy.cluster.hierarchy import dendrogram"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"aggclust = AgglomerativeClustering(distance_threshold=0, n_clusters=None)\n",
"aggclust.fit(X);"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 2000x1000 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Based on example at https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html\n",
"def plot_dendrogram(model, **kwargs):\n",
"    # Create linkage matrix and then plot the dendrogram\n",
"\n",
"    # create the counts of samples under each node\n",
"    counts = np.zeros(model.children_.shape[0])\n",
"    n_samples = len(model.labels_)\n",
"    for i, merge in enumerate(model.children_):\n",
"        current_count = 0\n",
"        for child_idx in merge:\n",
"            if child_idx < n_samples:\n",
"                current_count += 1  # leaf node\n",
"            else:\n",
"                current_count += counts[child_idx - n_samples]\n",
"        counts[i] = current_count\n",
"\n",
"    linkage_matrix = np.column_stack([model.children_, model.distances_,\n",
"                                      counts]).astype(float)\n",
"\n",
"    # Plot the corresponding dendrogram\n",
"    dendrogram(linkage_matrix, **kwargs)\n",
"\n",
"plt.figure(figsize=(20,10))\n",
"plt.title('Hierarchical Clustering Dendrogram')\n",
"plot_dendrogram(aggclust)#, truncate_mode='level', p=3)\n",
"plt.xlabel(\"Number of points in node (or index of point if no parenthesis).\");"
]
},
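The dendrogram above draws a sequence of merges: every point starts as its own cluster, and the two closest clusters are joined repeatedly. The following deliberately naive sketch shows that loop in plain Python and stops once a chosen number of clusters remains, which is one way to turn the tree into flat labels. It uses average linkage for simplicity, which is an assumption of this sketch and not the criterion used above (`AgglomerativeClustering` defaults to Ward linkage). The names `average_linkage` and `naive_agglomerative` and the toy points are hypothetical.

```python
import math
from itertools import combinations


def average_linkage(points, cluster_a, cluster_b):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(math.dist(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def naive_agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain.

    Returns (labels, merges). merges lists (members_a, members_b, distance)
    in the order the merges happened, i.e. the information a dendrogram draws.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")

    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(points,
                                             clusters[pair[0]],
                                             clusters[pair[1]]),
        )
        dist = average_linkage(points, clusters[a], clusters[b])
        merges.append((sorted(clusters[a]), sorted(clusters[b]), dist))
        merged = clusters[a] + clusters[b]
        # Delete b first (b > a) so that index a stays valid.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)

    labels = [None] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels, merges


# Hypothetical toy data: three visually obvious groups.
toy_points = [(0, 0), (0, 1), (1, 0),
              (10, 10), (10, 11), (11, 10),
              (20, 0), (20, 1)]
labels, merges = naive_agglomerative(toy_points, n_clusters=3)
print(labels)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

This brute-force version recomputes every pairwise linkage at each step, so it is only suitable for a handful of points; library implementations are far more efficient.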
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These dendrograms are often used in combination with a display of the gene expression data in a \"heatmap\".\n",
"\n",
"Here is a single example of approximately zillions in the literature. The rows in the heatmap matrix are 1259 genes, and the columns are different strains of bacteria.\n",
"\n",
"\n",
"\n",
"Image downloaded from [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4512144/), direct link [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4512144/bin/ofv09303.jpg).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Further thoughts and questions\n",
"\n",
"- Does clustering work better before PCA or after PCA?\n",
"- How would you judge this?\n",
"- As discussed in the lecture, PCA is a parameter-free, linear, very fast dimensionality reduction technique and is almost always the first one you should try when you want to look at high-dimensional data. It also gives a good first intuition for dimensionality reduction as a broad class of techniques, which is why we use it here. If PCA doesn't perform well, there are many other techniques. Others in widespread use include ICA (independent component analysis, which is also linear) and the non-linear methods t-SNE (t-distributed stochastic neighbor embedding) and UMAP (Uniform Manifold Approximation and Projection)."
]
}
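One possible way to approach the first two questions above is to compare labelings numerically. If two clustering runs (say, one on the raw expression matrix and one on the PCA-reduced data) each assign a label to the same cells, the Rand index is the fraction of cell pairs on which the two runs agree about "same cluster" versus "different cluster". The sketch below is a plain-Python illustration; the function name `rand_index` and the example label lists are hypothetical. scikit-learn offers `adjusted_rand_score`, which additionally corrects for chance agreement.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs that the two labelings treat the same way.

    A pair counts as an agreement if both labelings put the two points in the
    same cluster, or both put them in different clusters. The label values
    themselves do not matter, only the grouping.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical labelings of six cells from different clustering runs.
before_pca = [0, 0, 0, 1, 1, 2]
after_pca = [2, 2, 2, 0, 0, 1]    # same grouping, different label names
disagreeing = [0, 0, 1, 1, 2, 2]  # a different grouping

print(rand_index(before_pca, after_pca))
print(rand_index(before_pca, disagreeing))
```

A value of 1 means the two runs produce identical groupings up to renaming of the labels. When true cell-type labels are available, the same comparison against the truth gives a direct answer to which pipeline works better.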
],
"metadata": {
"celltoolbar": "Create Assignment",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}