{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# You must run this cell, but you can ignore its contents.\n", "\n", "import hashlib\n", "\n", "def ads_hash(ty):\n", "    \"\"\"Return a short hash string (first 10 hex digits of SHA-256) for the input\"\"\"\n", "    ty_str = str(ty).encode()\n", "    m = hashlib.sha256()\n", "    m.update(ty_str)\n", "    return m.hexdigest()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Transcriptome clustering analysis\n", "\n", "In this exercise, you are going to analyze the results of an experiment in which the RNA was sequenced (a \"transcriptome\" was made) for many cells in cell culture. Although we sequenced many individual cells, we expect the total number of cell types to be rather limited.\n", "\n", "The data here is fake, but the analysis methods are real. They are in heavy use across many labs and can be applied to many other types of problems beyond RNA sequencing data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.decomposition import PCA\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are given a dataset in which the RNA expression levels of 50 genes were quantified for each of many cells. The data is in the file `RNAseq_data_50genes.csv`. Let's read this into a pandas DataFrame." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('RNAseq_data_50genes.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's have a first look at this data." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | gene 0 | gene 1 | gene 2 | gene 3 | gene 4 | gene 5 | gene 6 | gene 7 | gene 8 | gene 9 | ... | gene 40 | gene 41 | gene 42 | gene 43 | gene 44 | gene 45 | gene 46 | gene 47 | gene 48 | gene 49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2377 | 2886 | 1524 | 2235 | 2472 | 1256 | 1006 | 1902 | 911 | 2285 | ... | 0 | 1115 | 1381 | 2226 | 1810 | 2124 | 1479 | 719 | 0 | 529 |
| 1 | 1251 | 948 | 3038 | 3857 | 1971 | 1761 | 2371 | 632 | 1705 | 2251 | ... | 954 | 1874 | 527 | 1783 | 1922 | 1029 | 173 | 2267 | 1078 | 1343 |
| 2 | 2650 | 1643 | 1560 | 2545 | 1689 | 1072 | 1999 | 1707 | 579 | 1655 | ... | 589 | 409 | 967 | 1762 | 1789 | 2424 | 494 | 1680 | 0 | 1283 |
| 3 | 1622 | 1581 | 1333 | 2218 | 2346 | 342 | 1534 | 1571 | 456 | 187 | ... | 2535 | 1524 | 236 | 1343 | 1089 | 1429 | 930 | 1415 | 173 | 1153 |
| 4 | 1863 | 993 | 1225 | 1318 | 1854 | 0 | 1461 | 634 | 663 | 0 | ... | 1882 | 1072 | 0 | 3040 | 1031 | 553 | 1468 | 2100 | 0 | 1001 |

5 rows × 50 columns

|   | column 1 | column 2 |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 5 |
| 2 | 3 | 5 |
| 3 | 4 | 2 |
| 4 | 5 | 2 |