799 lines
25 KiB
Plaintext
799 lines
25 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# You must run this cell, but you can ignore its contents.\n",
|
|
"\n",
|
|
"import hashlib\n",
|
|
"\n",
|
|
"def ads_hash(ty):\n",
|
|
" \"\"\"Return a unique string for input\"\"\"\n",
|
|
" ty_str = str(ty).encode()\n",
|
|
" m = hashlib.sha256()\n",
|
|
" m.update(ty_str)\n",
|
|
" return m.hexdigest()[:10]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# transcriptome clustering analysis\n",
|
|
"\n",
|
|
"In this exercise, you are going to analyze the results of an experiment in which the RNA was sequenced (a \"transcriptome\" was made) for many cells in cell culture. We expect that the total number of cell types is rather limited, although we sequenced many individual cells.\n",
|
|
"\n",
|
|
"The data here is fake, but the analysis methods are real and are in heavy use across lots of different labs and can be applied to many other types of problems beyond RNA sequencing data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pandas as pd\n",
|
|
"from sklearn.decomposition import PCA\n",
|
|
"import numpy as np\n",
|
|
"\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"import seaborn as sns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We are given a dataset where the RNA expression levels of 50 genes from each of many cells was quantified. The data is in the file `RNAseq_data_50genes.csv`. Let's read this into a pandas DataFrame."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.read_csv('RNAseq_data_50genes.csv')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now let's have a first look at this data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Q1 Understanding the raw data.\n",
|
|
"\n",
|
|
"In the first row (with index 0), how many reads were made of gene 0? Put the answer in the variable `n_reads_sample0_gene0`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "ad6d233c220d011a8bd8b0fa12801731",
|
|
"grade": false,
|
|
"grade_id": "cell-867448dc531b9577",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": true,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE\n",
|
|
"raise NotImplementedError()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "743a3540105b97b337d7700a282cd341",
|
|
"grade": true,
|
|
"grade_id": "cell-927c0add61a309ae",
|
|
"locked": true,
|
|
"points": 1,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# This is a test of the above, do not change this code.\n",
|
|
"assert type(n_reads_sample0_gene0)==int\n",
|
|
"assert ads_hash(n_reads_sample0_gene0)=='1a5de96b83'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"How many total cells were sequenced? This is the number of rows in the dataframe. Put this in the variable `n_cells_sequenced`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "a1c0b029cd32314f47d53b102bddbf00",
|
|
"grade": false,
|
|
"grade_id": "cell-190def2eb393c7e4",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": true,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE\n",
|
|
"raise NotImplementedError()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "615ee2939745eee7613193c187d2233e",
|
|
"grade": true,
|
|
"grade_id": "cell-c04f0d4317e6abfe",
|
|
"locked": true,
|
|
"points": 1,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# This is a test of the above, do not change this code.\n",
|
|
"assert type(n_cells_sequenced)==int\n",
|
|
"assert ads_hash(n_cells_sequenced)=='284b7e6d78'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"How many genes do we have in our dataset? This is actually the dimensionality of our dataset. We are counting reads for each of these genes, so if we have N genes, we have an N dimensional dataset. Put your answer in the variable `n_dim`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "71a387cc487a75a453e896365930054f",
|
|
"grade": false,
|
|
"grade_id": "cell-866e97e764ad0ed7",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": true,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE\n",
|
|
"raise NotImplementedError()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "bcf3f56610fbee8e98eaa72fff2c864d",
|
|
"grade": true,
|
|
"grade_id": "cell-b28c7b470e97c1de",
|
|
"locked": true,
|
|
"points": 1,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# This is a test of the above, do not change this code.\n",
|
|
"assert type(n_dim)==int\n",
|
|
"assert ads_hash(n_dim)=='1a6562590e'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Quickly plotting pandas DataFrames with seaborn\n",
|
|
"\n",
|
|
"Several lectures ago, we discussed seaborn we are are going to use it below to make a plot with our transcriptomic data. Let's first practice with a simple dataset:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"simple_df = pd.DataFrame(data={'column 1':[1,2,3,4,5], 'column 2':[1,5,5,2,2]})\n",
|
|
"display(simple_df)\n",
|
|
"sns.scatterplot(data=simple_df, x='column 1', y='column 2');"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Q2 Visualizing the raw data\n",
|
|
"\n",
|
|
"Now let's use seaborn to make a quick plot of the data (stored as a Pandas DataFrame in the variable `df`). Use the seaborn `scatterplot` function to make a plot like the following. Your plot should include the X and Y axes labels. You need only a single line of code for this.\n",
|
|
"\n",
|
|
"![scatterplot.png](scatterplot.png)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "9838c85a0ccbabda35f9eaa499854446",
|
|
"grade": true,
|
|
"grade_id": "cell-3eb44d7a59e553d8",
|
|
"locked": false,
|
|
"points": 1,
|
|
"schema_version": 3,
|
|
"solution": true,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE\n",
|
|
"raise NotImplementedError()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# first impressions of the data\n",
|
|
"\n",
|
|
"So, how does our data look? At first glance it looks... like a bunch of random numbers with no real structure! But could there be some structure? For example, above we learned that although many cells have been sequenced, we expect these are from only a very limited number of cell types.\n",
|
|
"\n",
|
|
"How can we figure out something about these cell types?\n",
|
|
"\n",
|
|
"Let's make use of principal component analysis (PCA) and clustering from scikit-learn as some of the first tools in our toolkit."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Converting to plain numpy\n",
|
|
"\n",
|
|
"While Pandas is very convenient for many things, scikit learn uses plain numpy arrays and generally works best when the datatype is a floating point number rather than an integer. Let's do this conversion now and call our data `X`. (We also `copy()` this to a new numpy array to ensure it is contiguous in memory.)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X = df.to_numpy(dtype=np.float64).copy()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## PCA\n",
|
|
"\n",
|
|
"Let's first run PCA on our data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.decomposition import PCA\n",
|
|
"pca = PCA().fit(X)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The results of our analysis are stored in the variable `pca`. We can use this to project our original 50 dimensional data into its principle components and plot just the first two dimensions in this principle component space."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"projected = pca.transform(X)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"plt.plot(projected[:,0], projected[:,1],'.')\n",
|
|
"plt.xlabel('PC1')\n",
|
|
"plt.ylabel('PC2');"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Structure in the PCA space\n",
|
|
"\n",
|
|
"Now, what do you notice about the data in this PCA space?\n",
|
|
"\n",
|
|
"Now, instead of looking like a structure-free blob, we seem to have some structure. What kind of structure do we have?\n",
|
|
"\n",
|
|
"\"3 clusters\", I hope you are thinking. Biologically speaking, we are now guessing that there were this many cell types in our original sample.\n",
|
|
"\n",
|
|
"## PCA explained variance\n",
|
|
"\n",
|
|
"One of the first questions about PCA is how much of the variance in our data are \"explained\" by the first N components of the projected data. Let's plot this."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"plt.plot(np.cumsum(pca.explained_variance_ratio_),'.-')\n",
|
|
"plt.xlabel('number of components')\n",
|
|
"plt.ylabel('cumulative explained variance');"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## mini batch K-Means\n",
|
|
"\n",
|
|
"Given that it looks like our data may have three clusters, let's find these clusters using mini batch K-Means."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.cluster import MiniBatchKMeans\n",
|
|
"from sklearn.metrics.pairwise import pairwise_distances_argmin"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Q3. Specifying *k*, the number of clusters\n",
|
|
"\n",
|
|
"As always, with a K-Means type algorithm, we must specify the number of clusters before running the algorithm. Use your thoughts from above and put this in the variable `n_clusters`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "ab3ee864e65ed95602e13840e4151fcd",
|
|
"grade": false,
|
|
"grade_id": "cell-956d77de44aaa49b",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": true,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE\n",
|
|
"raise NotImplementedError()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "0c8e37abea8f9b3b8e824e2481bde6ac",
|
|
"grade": true,
|
|
"grade_id": "cell-b05818d9386980f7",
|
|
"locked": true,
|
|
"points": 1,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# This is a test of the above, do not change this code.\n",
|
|
"assert type(n_clusters)==int\n",
|
|
"assert ads_hash(n_clusters)=='4e07408562'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we are going to actually run the algorithm from scikit learn."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"mbk = MiniBatchKMeans(n_clusters=n_clusters, batch_size=6, random_state=0, n_init='auto').fit(X);"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## plotting the clustering results in the original \"number of reads\" space\n",
|
|
"\n",
|
|
"Let's first plot the our raw read data in a scatter plot like above, but colored according to our cluster label. We will also plot our cluster centers here."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"mbk_means_cluster_centers = mbk.cluster_centers_\n",
|
|
"mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"fig, ax = plt.subplots(nrows=1, ncols=1)\n",
|
|
"\n",
|
|
"x_gene_idx = 0\n",
|
|
"y_gene_idx = 1\n",
|
|
"\n",
|
|
"for k in range(n_clusters):\n",
|
|
" my_members = mbk_means_labels == k\n",
|
|
" cluster_center = mbk_means_cluster_centers[k]\n",
|
|
" line, = ax.plot(X[my_members, x_gene_idx], X[my_members, 1], '.', markersize=5)\n",
|
|
" ax.plot(cluster_center[x_gene_idx], cluster_center[1], 'o', markersize=10, markeredgecolor='black', markerfacecolor=line.get_color())\n",
|
|
"ax.set_xlabel('gene %d' % x_gene_idx)\n",
|
|
"ax.set_ylabel('gene %d' % y_gene_idx);"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## plotting the clustering results in PCA space\n",
|
|
"\n",
|
|
"Hmm, the plot above was not too informative. It does not seem to show obvious clusters in the data, and the points look very interwoven with others, at least for these two genes.\n",
|
|
"\n",
|
|
"Let's re-plot our cluster assignments, but this time using the projection into PCA space."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"fig, ax = plt.subplots(nrows=1, ncols=1)\n",
|
|
"\n",
|
|
"projected_centers = pca.transform(mbk.cluster_centers_)\n",
|
|
"\n",
|
|
"for k in range(n_clusters):\n",
|
|
" my_members = mbk_means_labels == k\n",
|
|
" projected_cluster_center = projected_centers[k]\n",
|
|
" line, = ax.plot(projected[my_members, 0], projected[my_members, 1], '.',\n",
|
|
" markersize=1.8)\n",
|
|
" ax.plot(projected_cluster_center[0], projected_cluster_center[1], 'o',\n",
|
|
" markersize=10, markeredgecolor='black', markerfacecolor=line.get_color())\n",
|
|
"ax.set_xlabel('PC 1')\n",
|
|
"ax.set_ylabel('PC 2');"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## silhouette analysis with K-means\n",
|
|
"\n",
|
|
"OK, so that last plot (in PCA space) is looking better. The automatically detected clusters seem to agree with the idea we had from just looking at the data. Let's now use a silhouette analysis as one way to check whether this was a particularly good number of clusters for this data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.metrics import silhouette_samples, silhouette_score\n",
|
|
"from sklearn.cluster import KMeans\n",
|
|
"import matplotlib.cm as cm"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# From https://scikit-learn.org/0.21/auto_examples/cluster/plot_kmeans_silhouette_analysis.html\n",
|
|
"range_n_clusters = [2, 3, 4, 5, 6]\n",
|
|
"\n",
|
|
"for n_clusters in range_n_clusters:\n",
|
|
" # Create a subplot with 1 row and 2 columns\n",
|
|
" fig, (ax1, ax2) = plt.subplots(1, 2)\n",
|
|
" fig.set_size_inches(18, 7)\n",
|
|
"\n",
|
|
" # The 1st subplot is the silhouette plot\n",
|
|
" # The silhouette coefficient can range from -1, 1 but in this example all\n",
|
|
" # lie within [-0.1, 1]\n",
|
|
" ax1.set_xlim([-0.1, 1])\n",
|
|
" # The (n_clusters+1)*10 is for inserting blank space between silhouette\n",
|
|
" # plots of individual clusters, to demarcate them clearly.\n",
|
|
" ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])\n",
|
|
"\n",
|
|
" # Initialize the clusterer with n_clusters value and a random generator\n",
|
|
" # seed of 10 for reproducibility.\n",
|
|
" clusterer = KMeans(n_clusters=n_clusters, random_state=10, n_init='auto')\n",
|
|
" cluster_labels = clusterer.fit_predict(X)\n",
|
|
"\n",
|
|
" # The silhouette_score gives the average value for all the samples.\n",
|
|
" # This gives a perspective into the density and separation of the formed\n",
|
|
" # clusters\n",
|
|
" silhouette_avg = silhouette_score(X, cluster_labels)\n",
|
|
" print(\"For n_clusters =\", n_clusters,\n",
|
|
" \"The average silhouette_score is :\", silhouette_avg)\n",
|
|
"\n",
|
|
" # Compute the silhouette scores for each sample\n",
|
|
" sample_silhouette_values = silhouette_samples(X, cluster_labels)\n",
|
|
"\n",
|
|
" y_lower = 10\n",
|
|
" for i in range(n_clusters):\n",
|
|
" # Aggregate the silhouette scores for samples belonging to\n",
|
|
" # cluster i, and sort them\n",
|
|
" ith_cluster_silhouette_values = \\\n",
|
|
" sample_silhouette_values[cluster_labels == i]\n",
|
|
"\n",
|
|
" ith_cluster_silhouette_values.sort()\n",
|
|
"\n",
|
|
" size_cluster_i = ith_cluster_silhouette_values.shape[0]\n",
|
|
" y_upper = y_lower + size_cluster_i\n",
|
|
"\n",
|
|
" color = cm.nipy_spectral(float(i) / n_clusters)\n",
|
|
" ax1.fill_betweenx(np.arange(y_lower, y_upper),\n",
|
|
" 0, ith_cluster_silhouette_values,\n",
|
|
" facecolor=color, edgecolor=color, alpha=0.7)\n",
|
|
"\n",
|
|
" # Label the silhouette plots with their cluster numbers at the middle\n",
|
|
" ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))\n",
|
|
"\n",
|
|
" # Compute the new y_lower for next plot\n",
|
|
" y_lower = y_upper + 10 # 10 for the 0 samples\n",
|
|
"\n",
|
|
" ax1.set_title(\"The silhouette plot for the various clusters.\")\n",
|
|
" ax1.set_xlabel(\"The silhouette coefficient values\")\n",
|
|
" ax1.set_ylabel(\"Cluster label\")\n",
|
|
"\n",
|
|
" # The vertical line for average silhouette score of all the values\n",
|
|
" ax1.axvline(x=silhouette_avg, color=\"red\", linestyle=\"--\")\n",
|
|
"\n",
|
|
" ax1.set_yticks([]) # Clear the yaxis labels / ticks\n",
|
|
" ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])\n",
|
|
"\n",
|
|
" # 2nd Plot showing the actual clusters formed\n",
|
|
" colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)\n",
|
|
" ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,\n",
|
|
" c=colors, edgecolor='k')\n",
|
|
"\n",
|
|
" # Labeling the clusters\n",
|
|
" centers = clusterer.cluster_centers_\n",
|
|
" # Draw white circles at cluster centers\n",
|
|
" ax2.scatter(centers[:, 0], centers[:, 1], marker='o',\n",
|
|
" c=\"white\", alpha=1, s=200, edgecolor='k')\n",
|
|
"\n",
|
|
" for i, c in enumerate(centers):\n",
|
|
" ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,\n",
|
|
" s=50, edgecolor='k')\n",
|
|
"\n",
|
|
" ax2.set_title(\"The visualization of the clustered data.\")\n",
|
|
" ax2.set_xlabel(\"Feature space for the 1st feature\")\n",
|
|
" ax2.set_ylabel(\"Feature space for the 2nd feature\")\n",
|
|
"\n",
|
|
" plt.suptitle((\"Silhouette analysis for KMeans clustering on sample data \"\n",
|
|
" \"with n_clusters = %d\" % n_clusters),\n",
|
|
" fontsize=14, fontweight='bold')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"So, the average silhouette score was highest for 3 clusters. So it looks like our \"by eye\" analysis also agrees with this analysis.\n",
|
|
"\n",
|
|
"## agglomerative clustering\n",
|
|
"\n",
|
|
"Now let's try a different form of clustering in which we do not need to set, in advance, the number of clusters.\n",
|
|
"\n",
|
|
"This is *hierarchical agglomerative clustering*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.cluster import AgglomerativeClustering"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from scipy.cluster.hierarchy import dendrogram"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"aggclust = AgglomerativeClustering(distance_threshold=0, n_clusters=None)\n",
|
|
"aggclust.fit(X);"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Based on example at https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html\n",
|
|
"def plot_dendrogram(model, **kwargs):\n",
|
|
" # Create linkage matrix and then plot the dendrogram\n",
|
|
"\n",
|
|
" # create the counts of samples under each node\n",
|
|
" counts = np.zeros(model.children_.shape[0])\n",
|
|
" n_samples = len(model.labels_)\n",
|
|
" for i, merge in enumerate(model.children_):\n",
|
|
" current_count = 0\n",
|
|
" for child_idx in merge:\n",
|
|
" if child_idx < n_samples:\n",
|
|
" current_count += 1 # leaf node\n",
|
|
" else:\n",
|
|
" current_count += counts[child_idx - n_samples]\n",
|
|
" counts[i] = current_count\n",
|
|
"\n",
|
|
" linkage_matrix = np.column_stack([model.children_, model.distances_,\n",
|
|
" counts]).astype(float)\n",
|
|
"\n",
|
|
" # Plot the corresponding dendrogram\n",
|
|
" dendrogram(linkage_matrix, **kwargs)\n",
|
|
"\n",
|
|
"plt.figure(figsize=(20,10))\n",
|
|
"plt.title('Hierarchical Clustering Dendrogram')\n",
|
|
"plot_dendrogram(aggclust)#, truncate_mode='level', p=3)\n",
|
|
"plt.xlabel(\"Number of points in node (or index of point if no parenthesis).\");"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"These dendrograms are often used in combination with a display of the gene expression data in a \"heatmap\".\n",
|
|
"\n",
|
|
"Here is a single example of approximately zillions in the literature. The rows in the heatmap matrix are 1259 genes, the columns are different strains of bacteria.\n",
|
|
"\n",
|
|
"![ofv09303.jpg](ofv09303.jpg)\n",
|
|
"\n",
|
|
"Downloaded from [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4512144/) [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4512144/bin/ofv09303.jpg).\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Further thoughts and questions\n",
|
|
"\n",
|
|
"- Does clustering before PCA work better or clustering after PCA?\n",
|
|
"- How would you judge this?\n",
|
|
"- As discussed in the lecture, PCA is a parameter-free, linear, very fast dimensionality reduction technique and is almost always the first dimensionality reduction technique you should try if you want to look at high dimensional data. If PCA doesn't perform well, there are many other techniques. PCA can give a good first intuition into dimensionality reduction as a broad class of techniques, which is why we do it here. Non-linear techniques in widespread use include ICA (independent component analysis), T-SNE (T stochastic neighbor embedding), and UMAP (Uniform Manifold Approximation and Projection)."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.11.10"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|