{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# In Depth: Principal Component Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we explore what is perhaps one of the most broadly used unsupervised algorithms, principal component analysis (PCA).\n",
"PCA is fundamentally a dimensionality reduction algorithm, but it can also be useful as a tool for visualization, for noise filtering, for feature extraction and engineering, and much more.\n",
"After a brief conceptual discussion of the PCA algorithm, we will see a couple of examples of these further applications.\n",
"\n",
"We begin with the standard imports:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns; sns.set()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introducing Principal Component Analysis\n",
"\n",
"Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data, which we saw briefly in [Introducing Scikit-Learn](05.02-Introducing-Scikit-Learn.ipynb).\n",
"Its behavior is easiest to visualize by looking at a two-dimensional dataset.\n",
"Consider the following 200 points:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"rng = np.random.RandomState(1)\n",
"X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T\n",
"plt.plot(X[:, 0], X[:, 1],'.')\n",
"plt.axis('equal');"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(200, 2)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By eye, it is clear that there is a nearly linear relationship between the x and y variables.\n",
"This is reminiscent of the linear regression data we explored in [In Depth: Linear Regression](05.06-Linear-Regression.ipynb), but the problem setting here is slightly different: rather than attempting to *predict* the y values from the x values, the unsupervised learning problem attempts to learn about the *relationship* between the x and y values.\n",
"\n",
"In principal component analysis, this relationship is quantified by finding a list of the *principal axes* in the data, and using those axes to describe the dataset.\n",
"Using Scikit-Learn's ``PCA`` estimator, we can compute this as follows:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PCA()"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sklearn.decomposition\n",
"PCA = sklearn.decomposition.PCA\n",
"pca = PCA()\n",
"pca.fit(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fit learns some quantities from the data, most importantly the \"components\" and \"explained variance\":"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.94446029 0.32862557]\n",
" [-0.32862557 0.94446029]]\n"
]
}
],
"source": [
"print(pca.components_)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.7625315 0.0184779]\n"
]
}
],
"source": [
"print(pca.explained_variance_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what these numbers mean, let's visualize them as vectors over the input data, using the \"components\" to define the direction of each vector, and the \"explained variance\" to define its squared length (up to a constant factor chosen for visibility):"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def draw_vector(v0, v1, ax=None):\n",
"    ax = ax or plt.gca()\n",
"    arrowprops = dict(arrowstyle='->',\n",
"                      linewidth=2,\n",
"                      color='black',\n",
"                      shrinkA=0, shrinkB=0)\n",
"    ax.annotate('', v1, v0, arrowprops=arrowprops)\n",
"\n",
"# plot data\n",
"plt.scatter(X[:, 0], X[:, 1], alpha=0.2)\n",
"for length, vector in zip(pca.explained_variance_, pca.components_):\n",
"    v = vector * 3 * np.sqrt(length)\n",
"    draw_vector(pca.mean_, pca.mean_ + v)\n",
"plt.axis('equal');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These vectors represent the *principal axes* of the data, and the length of the vector is an indication of how \"important\" that axis is in describing the distribution of the data—more precisely, it is a measure of the variance of the data when projected onto that axis.\n",
"The projections of the data points onto the principal axes are the \"principal components\" of the data.\n",
"\n",
"If we plot these principal components beside the original data, we see the plots shown here:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1600x600 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "fig, ax = plt.subplots(1, 2, figsize=(16, 6))\n",
    "fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)\n",
    "\n",
    "# plot data\n",
    "ax[0].scatter(X[:, 0], X[:, 1], alpha=0.2)\n",
    "for length, vector in zip(pca.explained_variance_, pca.components_):\n",
    "    v = vector * 3 * np.sqrt(length)\n",
    "    draw_vector(pca.mean_, pca.mean_ + v, ax=ax[0])\n",
    "ax[0].axis('equal');\n",
    "ax[0].set(xlabel='x', ylabel='y', title='input')\n",
    "\n",
    "# plot principal components\n",
    "X_pca = pca.transform(X)\n",
    "ax[1].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.2)\n",
    "draw_vector([0, 0], [0, 3], ax=ax[1])\n",
    "draw_vector([0, 0], [3, 0], ax=ax[1])\n",
    "ax[1].axis('equal')\n",
    "ax[1].set(xlabel='component 1', ylabel='component 2',\n",
    "          title='principal components',\n",
    "          xlim=(-5, 5), ylim=(-3, 3.1));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This transformation from data axes to principal axes is an *affine transformation*, which basically means it is composed of a translation, rotation, and uniform scaling.\n",
"\n",
"While this algorithm to find principal components may seem like just a mathematical curiosity, it turns out to have very far-reaching applications in the world of machine learning and data exploration."
]
},
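  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the translation and rotation described above, the projection can be reproduced by hand, assuming scikit-learn's default `PCA` settings (`whiten=False`): subtract the fitted mean, then project onto the rows of `components_`. With `whiten=True`, each component would additionally be rescaled. The synthetic data and the names `X_demo`, `pca_demo`, and `manual` are illustrative assumptions, not part of the original notebook.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "# illustrative synthetic 2D data (an assumption, not the notebook's X)\n",
    "rng = np.random.RandomState(1)\n",
    "X_demo = (rng.rand(2, 2) @ rng.randn(2, 200)).T   # shape (200, 2)\n",
    "\n",
    "pca_demo = PCA(n_components=2).fit(X_demo)\n",
    "\n",
    "# translation: subtract the mean; rotation: project onto the principal axes\n",
    "manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T\n",
    "\n",
    "# compare the hand-computed projection with scikit-learn's transform\n",
    "print(np.allclose(manual, pca_demo.transform(X_demo)))\n",
    "```\n",
    "\n",
    "The comparison uses `np.allclose` rather than exact equality because the two computations may differ by floating-point rounding."
   ]
  },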
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PCA as dimensionality reduction\n",
"\n",
"Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.\n",
"\n",
"Here is an example of using PCA as a dimensionality reduction transform:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"original shape: (200, 2)\n",
"transformed shape: (200, 1)\n"
]
}
],
"source": [
"pca = PCA(n_components=1)\n",
"pca.fit(X)\n",
"X_pca = pca.transform(X)\n",
"print(\"original shape: \", X.shape)\n",
"print(\"transformed shape:\", X_pca.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The transformed data has been reduced to a single dimension.\n",
"To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
    {
     "data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"X_new = pca.inverse_transform(X_pca)\n",
"plt.scatter(X[:, 0], X[:, 1], alpha=0.2)\n",
"plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)\n",
"plt.axis('equal');"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The light points are the original data, while the dark points are the projected version.\n",
    "This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance.\n",
    "The fraction of variance that is cut out (proportional to the spread of points about the line formed in this figure) is roughly a measure of how much \"information\" is discarded in this reduction of dimensionality.\n",
    "\n",
    "This reduced-dimension dataset is in some senses \"good enough\" to encode the most important relationships between the points: despite reducing the dimension of the data by 50%, the overall relationships between the data points are mostly preserved."
]
},
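  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The link between discarded variance and reconstruction error can be checked numerically. The sketch below is only an illustration on synthetic data (the names `X_demo`, `pca_demo`, `X_back`, and `residual` are assumptions, not the notebook's variables): it projects onto one component, maps back, and compares the variance of the residual with the variance ratio that `PCA` reports as explained.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "# illustrative synthetic 2D data (an assumption, not the notebook's X)\n",
    "rng = np.random.RandomState(1)\n",
    "X_demo = (rng.rand(2, 2) @ rng.randn(2, 200)).T\n",
    "\n",
    "pca_demo = PCA(n_components=1).fit(X_demo)\n",
    "X_back = pca_demo.inverse_transform(pca_demo.transform(X_demo))\n",
    "\n",
    "# variance of what the projection threw away, relative to the total variance\n",
    "residual = X_demo - X_back\n",
    "discarded = residual.var(axis=0, ddof=1).sum() / X_demo.var(axis=0, ddof=1).sum()\n",
    "\n",
    "print('kept fraction     :', pca_demo.explained_variance_ratio_[0])\n",
    "print('discarded fraction:', discarded)\n",
    "```\n",
    "\n",
    "The two printed fractions should add up to one, up to rounding, because the residual lies entirely along the principal axis that was dropped."
   ]
  },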
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### PCA for visualization: Hand-written digits\n",
    "\n",
    "The usefulness of dimensionality reduction may not be entirely apparent in only two dimensions, but it becomes much clearer when looking at high-dimensional data.\n",
"To see this, let's take a quick look at the application of PCA to the digits data we saw in [In-Depth: Decision Trees and Random Forests](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html#Example:-Random-Forest-for-Classifying-Digits).\n",
"\n",
"We start by loading the data:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1797, 64)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.datasets import load_digits\n",
"digits = load_digits()\n",
"digits.data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall that the data consists of 8×8 pixel images, meaning that they are 64-dimensional.\n",
"To gain some intuition into the relationships between these points, we can use PCA to project them to a more manageable number of dimensions, say two:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1797, 64)\n",
"(1797, 2)\n"
]
}
],
"source": [
"pca = PCA(2) # project from 64 to 2 dimensions\n",
"projected = pca.fit_transform(digits.data)\n",
"print(digits.data.shape)\n",
"print(projected.shape)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.unique(digits.target)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1797,)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"digits.target.shape"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0., 0., 5., 13., 9.],\n",
" [ 0., 0., 0., 12., 13.],\n",
" [ 0., 0., 0., 4., 15.],\n",
" [ 0., 0., 7., 15., 13.],\n",
" [ 0., 0., 0., 1., 11.],\n",
" [ 0., 0., 12., 10., 0.],\n",
" [ 0., 0., 0., 12., 13.],\n",
" [ 0., 0., 7., 8., 13.],\n",
" [ 0., 0., 9., 14., 8.],\n",
" [ 0., 0., 11., 12., 0.]])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"digits.data[:10,:5]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"digits.target[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now plot the first two principal components of each point to learn about the data:"
]
},
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "digits_df = pd.DataFrame({'pca1': projected[:, 0], 'pca2': projected[:, 1], 'digit': digits.target})\n",
    "# mean position of each digit class in the projected plane\n",
    "center = {}\n",
    "for (digit, gdf) in digits_df.groupby('digit'):\n",
    "    center[digit] = (gdf['pca1'].mean(), gdf['pca2'].mean())"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
    {
     "data": {
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "cmap = matplotlib.colormaps.get_cmap('nipy_spectral').resampled(10)\n",
    "plt.scatter(projected[:, 0], projected[:, 1],\n",
    "            c=digits.target, edgecolor='none', alpha=0.5,\n",
    "            cmap=cmap, clim=(0,10))\n",
    "plt.xlabel('component 1')\n",
    "plt.ylabel('component 2')\n",
    "plt.colorbar(ticks=list(range(11)))\n",
    "# label the mean position of each digit class with its digit\n",
    "for digit in center:\n",
    "    xy = center[digit]\n",
    "    plt.plot([xy[0]], [xy[1]], 'wo', markersize=12)\n",
    "    plt.text(xy[0], xy[1], digit, horizontalalignment='center', verticalalignment='center')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall what these components mean: the full data is a 64-dimensional point cloud, and these points are the projection of each data point along the directions with the largest variance.\n",
"Essentially, we have found the optimal stretch and rotation in 64-dimensional space that allows us to see the layout of the digits in two dimensions, and have done this in an unsupervised manner—that is, without reference to the labels."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What do the components mean?\n",
"\n",
"We can go a bit further here, and begin to ask what the reduced dimensions *mean*.\n",
"This meaning can be understood in terms of combinations of basis vectors.\n",
"For example, each image in the training set is defined by a collection of 64 pixel values, which we will call the vector $x$:\n",
"\n",
"$$\n",
"x = [x_1, x_2, x_3 \\cdots x_{64}]\n",
"$$\n",
"\n",
"One way we can think about this is in terms of a pixel basis.\n",
"That is, to construct the image, we multiply each element of the vector by the pixel it describes, and then add the results together to build the image:\n",
"\n",
"$$\n",
"{\\rm image}(x) = x_1 \\cdot{\\rm (pixel~1)} + x_2 \\cdot{\\rm (pixel~2)} + x_3 \\cdot{\\rm (pixel~3)} \\cdots x_{64} \\cdot{\\rm (pixel~64)}\n",
"$$\n",
"\n",
"One way we might imagine reducing the dimension of this data is to zero out all but a few of these basis vectors.\n",
"For example, if we use only the first eight pixels, we get an eight-dimensional projection of the data, but it is not very reflective of the whole image: we've thrown out nearly 90% of the pixels!"
]
},
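  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the pixel-basis truncation concrete, here is a small sketch, assuming the same `load_digits` dataset used above; the names `x` and `x_truncated` are illustrative. It keeps the coefficients of the first eight pixel-basis vectors and zeroes out the rest.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.datasets import load_digits\n",
    "\n",
    "x = load_digits().data[0]   # one image as a 64-element vector\n",
    "\n",
    "# keep the coefficients of the first eight pixel-basis vectors, zero the rest\n",
    "x_truncated = np.zeros_like(x)\n",
    "x_truncated[:8] = x[:8]\n",
    "\n",
    "print('fraction of pixels kept:', 8 / x.size)\n",
    "print('fraction of total intensity kept:', x_truncated.sum() / x.sum())\n",
    "```\n",
    "\n",
    "Only the top row of the 8×8 image survives, which is why this truncated basis says so little about the digit as a whole."
   ]
  },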
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](figures/05.09-digits-pixel-components.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Digits-Pixel-Components)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The upper row of panels shows the individual pixels, and the lower row shows the cumulative contribution of these pixels to the construction of the image.\n",
"Using only eight of the pixel-basis components, we can only construct a small portion of the 64-pixel image.\n",
"Were we to continue this sequence and use all 64 pixels, we would recover the original image."
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But the pixel-wise representation is not the only choice of basis. We can also use other basis functions, which each contain some pre-defined contribution from each pixel, and write something like\n",
    "\n",
    "$$\n",
    "{\\rm image}(x) = {\\rm mean} + x_1 \\cdot{\\rm (basis~1)} + x_2 \\cdot{\\rm (basis~2)} + x_3 \\cdot{\\rm (basis~3)} \\cdots\n",
"$$\n",
"\n",
"PCA can be thought of as a process of choosing optimal basis functions, such that adding together just the first few of them is enough to suitably reconstruct the bulk of the elements in the dataset.\n",
"The principal components, which act as the low-dimensional representation of our data, are simply the coefficients that multiply each of the elements in this series.\n",
"This figure shows a similar depiction of reconstructing this digit using the mean plus the first eight PCA basis functions:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](figures/05.09-digits-pca-components.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Digits-PCA-Components)"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unlike the pixel basis, the PCA basis allows us to recover the salient features of the input image with just a mean plus eight components!\n",
    "The amount of each pixel in each component is the analogue of the orientation of the vector in our two-dimensional example.\n",
"This is the sense in which PCA provides a low-dimensional representation of the data: it discovers a set of basis functions that are more efficient than the native pixel-basis of the input data."
]
},
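  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mean-plus-basis expansion can be written out directly. The following is a hedged sketch, assuming scikit-learn's default `PCA` (no whitening) and the `load_digits` data; the names `data`, `pca_demo`, `coeffs`, and `approx` are illustrative. It rebuilds one digit from the mean and its first eight coefficients and compares the result with `inverse_transform`.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.datasets import load_digits\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "data = load_digits().data\n",
    "pca_demo = PCA(n_components=8).fit(data)\n",
    "\n",
    "x = data[:1]                     # first image, shape (1, 64)\n",
    "coeffs = pca_demo.transform(x)   # its eight coefficients, shape (1, 8)\n",
    "\n",
    "# image(x) ~ mean + sum_i coeffs_i * (basis i)\n",
    "approx = pca_demo.mean_ + coeffs @ pca_demo.components_\n",
    "\n",
    "# compare the hand-built expansion with scikit-learn's inverse transform\n",
    "print(np.allclose(approx, pca_demo.inverse_transform(coeffs)))\n",
    "```\n",
    "\n",
    "Each row of `components_` plays the role of one basis image, and `coeffs` holds the low-dimensional representation of the digit."
   ]
  },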
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Choosing the number of components\n",
"\n",
"A vital part of using PCA in practice is the ability to estimate how many components are needed to describe the data.\n",
"This can be determined by looking at the cumulative *explained variance ratio* as a function of the number of components:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
    {
     "data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pca = PCA().fit(digits.data)\n",
"plt.plot(np.cumsum(pca.explained_variance_ratio_),'.-')\n",
"plt.xlabel('number of components')\n",
"plt.ylabel('cumulative explained variance');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This curve quantifies how much of the total, 64-dimensional variance is contained within the first $N$ components.\n",
"For example, we see that with the digits the first 10 components contain approximately 75% of the variance, while you need around 50 components to describe close to 100% of the variance.\n",
"\n",
"Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations."
]
},
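  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading a threshold off the curve can also be done programmatically. This is one possible sketch rather than the only approach; the 90% target and the names `data`, `cumulative`, `target`, `n_needed`, and `pca_90` are illustrative assumptions. It finds the smallest number of components whose cumulative explained variance ratio reaches the target, and then lets scikit-learn make the same selection by passing a float between 0 and 1 as `n_components` (with `svd_solver='full'` set explicitly).\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.datasets import load_digits\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "data = load_digits().data\n",
    "cumulative = np.cumsum(PCA().fit(data).explained_variance_ratio_)\n",
    "\n",
    "# smallest number of components whose cumulative ratio reaches the target\n",
    "target = 0.90\n",
    "n_needed = int(np.argmax(cumulative >= target)) + 1\n",
    "print(n_needed)\n",
    "\n",
    "# scikit-learn can pick the count itself when n_components is a float in (0, 1)\n",
    "pca_90 = PCA(n_components=target, svd_solver='full').fit(data)\n",
    "print(pca_90.n_components_)\n",
    "```\n",
    "\n",
    "The fitted `n_components_` attribute reports how many components were actually retained."
   ]
  },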
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PCA as Noise Filtering\n",
    "\n",
    "PCA can also be used as a filtering approach for noisy data.\n",
    "The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise.\n",
    "So if you reconstruct the data using just the largest subset of principal components, you should be preferentially keeping the signal and throwing out the noise.\n",
    "\n",
    "Let's see how this looks with the digits data.\n",
    "First we will plot several of the noise-free input digits:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
    {
     "data": {
"text/plain": [
"<Figure size 1000x400 with 40 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "def plot_digits(data):\n",
    "    # show the first 40 rows of data as 8x8 grayscale images\n",
    "    fig, axes = plt.subplots(4, 10, figsize=(10, 4),\n",
    "                             subplot_kw={'xticks':[], 'yticks':[]},\n",
    "                             gridspec_kw=dict(hspace=0.1, wspace=0.1))\n",
    "    for i, ax in enumerate(axes.flat):\n",
    "        ax.imshow(data[i].reshape(8, 8),\n",
    "                  cmap='binary', interpolation='nearest',\n",
    "                  clim=(0, 16))\n",
    "\n",
    "plot_digits(digits.data)"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's add some random noise to create a noisy dataset, and re-plot it:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAxoAAAFHCAYAAAAmzzpSAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA/E0lEQVR4nO3dZ5hUZbr2/bvpAE3TdDc5SRQQJRgwgCJZDCiCIgbUMW4TZsQ0bEXHPKOAYRQDwYQKouCICRUdDCiCooIgIElyajp3Vb0f5qn9cjwH6z6voe5m+8zx/31dJ3etrlpr1bq6OdaZlkgkEg4AAAAAAqr2v70DAAAAAP7zMGgAAAAACI5BAwAAAEBwDBoAAAAAgmPQAAAAABAcgwYAAACA4Bg0AAAAAATHoAEAAAAgOAYNAAAAAMFlWIOJRMJVVlZ6M5mZmXKd7du3B8nk5+fLTEFBgcwUFhbKTPXq1SO3ZWVluUQi4apV2/vMlkgkXHl5uXf9iooKuQ+1atWSmRUrVshMenq6zDRu3Fhmon7ePZWUlERuy8nJcfF43GVk7P0QtBTWp6WlyYxFcXGxzFiOk7KyMplp3ry5zOzatcu7vVatWt73P5FIyPfP8vmFOldr164tM5bz2XLspnJMxONx7zHr3L+OW0VdJ52znfOW48nyvlU1y7lq+Xkt59jOnTtlpnXr1jJjYbku+I5J9d1gEY/HZcby/v/222/7vA97CvXe+o6HjIyMlN83C8uxVFRUJDNZWVkyo67p1nWaNWsmM/uD5b3bunWrzFjuaxo0aCAzlnPAdx9muY9Tr2G5xlm+GzZu3CgzLVq0kBnL96W6N3XOf+wWFBS4eDwu7/3Ng0ZlZaVbv369N2P54f/+97/LzO233y4zo0ePlpm77rpLZqZOnSozXbt2jdx28MEHO+eih5Hy8nL3ww8/eNe3DAhnnXWWzLRp00ZmGjVqJDOffvqpzOTl5cnMO++8E7lt6NChzjnncnNzIzPqxLVcmC3mzZsnM0899ZTMLFiwQGZWrlwpM6+//rp3+9ChQ70374lEQn5B+t73pCeeeEJm/vznP8vMFVdcITOW89nyhWO5uEYpKSlx77//vjczePBguc6GDRtkZtmyZTLz7bffyszNN98sM/uDOldXrVol15gwYYLMPPLIIzJjuemw+OSTT2TGdz095JBDnHP+X1QplmFz9+7dMmP5brAI9d767iWaNGninLP9MiQVU6ZMkZnp06fLTOfOnWVm7NixMnPooYfKzHfffScz+4PlvRsxYoTMXHjhhTIzceJEmbHc5P/888+R2zp06OCciz5XE4mE/EWI5T7Och0cMmSIzGzatElm6tevLzOWffbd+yS//+vVq+ddg/86BQAAACA4Bg0AAAAAwTFoAAAAAAiOQQMAAABAcAwaAAAAAIJj0AAAAAAQnPnxtpmZmfLxtZZHZX3xxRcy06lTJ5n5/vvvZcbyKMSRI0fKTCqqV6/ufTyuc84tXLgwyGs1bdpUZtatWyczr732msyMGjVKZi6++GKZiZKWlhbk8bXPPvuszMycOVNm3n777ZT3xerss8/2bs/OzvZuTyQSsg/iqquukvvx4osvyszw4cNlxsLy6No1a9ak9BrJ8yPqmd85OTmmx9cqs2bNkpkzzzxTZizX09LSUpl54403ZCaVz7GiokI+ttnyWFTLfh5zzDEyY+lSOeWUU2TmlltukZnDDz9cZlLx+OOPy8wvv/wiM5b+HsujUy3fDZZHsfu+85KPyrT0cqWiZcuWMmN55PJll10mM6eeeqrMWOoB1CNWc3JyUn4ssOW689NPP6X0GkmWPg7L9XTgwIEy47uvtLxnUZ1f/84a1113ncwcddRRMmP5vrz88stlxvIIe8ujmRX+ogEAAAAgOAYNAAAAAMExaAAAAAAIjkEDAAAAQHAMGgAAAACCY9AAAAAAEByDBgAAAIDgGDQAAAAABGcu7LOwFL1UVlbKjK
WMb8qUKTLz4Ycfysydd94pM/fee6/MpOK0006TmRtvvFFm6tevLzMNGzaUmR49esjM/PnzZaZt27aR2+rUqeOccy49PX2v22OxmCwnshxvqizROec2bNggM/n5+TIzefJkmdmxY4fMlJeXe7fXqFHDuz0ej7vi4mJvxlLGZ/mMjzzySJmxuPbaa2WmQ4cOMuMrAmvcuLH33xYXF7tPPvnEmzn55JPlPlxxxRUykzz+fcaMGSMz6lhwzlZyl4qsrCzXvn37lNcpKiqSmS1btshMv379ZMZSBmcpUKtqlkLZQw89VGZWr14tM5ZSxS5dusiMxaBBg4Kskwrf91OS+g5yzrk//elPMmMphR0/frzMpMryvdq6dWu5juW79+uvv5YZy/fHBRdcIDOWkru6devKTJR4PO62bt3qzWzatEmuYymd7du3r8wce+yxMqP21zldQuic//4oWa4pywzlqwAAAADAv4lBAwAAAEBwDBoAAAAAgmPQAAAAABAcgwYAAACA4Bg0AAAAAATHoAEAAAAgOAYNAAAAAMEFLeyLxWIy06tXL5mxFKidf/75MmMpR0sWjvg89NBDkdsuv/xy55yt0C3KfffdJzNPP/20zJSWlspMx44dZcZScpeTkyMzqUhPT5fv6YEHHijXsRR8WUrYjjjiCJmxFGdZjpOffvpJrhFVdOjcv8rIWrZs6V3DUs5lkZ2dLTMlJSUyYynjW7x4scxYPoMo2dnZsixJfTbOOXfmmWfKjKWkc+3atTJjKWWaPn26zAwePDhyW/JYspTcRdm1a5fMPPXUUzIzbdo0mbEcJ+ecc47M1KpVS2Z8JZ3J9zQ3N3ev2xOJhCzn/P333+U+LFq0SGZGjRolM5ZCs7lz58qMpYzSUi6biiuvvFJmLOfFbbfdJjM33HCDaZ+UtLQ0mZk9e7Z3e48ePVzNmjUjt1u+Vy1OPPFEmTn33HNlZt26dTJz4YUXysyPP/6Y0mslCySjztVYLOa2bdvmXb9///5yHyy6desmM4888ojMLFu2TGYmTZokM40aNZIZhb9oAAAAAAiOQQMAAABAcAwaAAAAAIJj0AAAAAAQHIMGAAAAgOAYNAAAAAAEx6ABAAAAIDgGDQAAAADBpSVCNXc553bv3i0zZ599tszMmjVLZgoLC2WmoqJCZizlgM2bN4/clizHqVZt7zNbLBZzO3fu9K4f9W//XaqkzTlbOd3YsWNlpqoLlyzmzJkjM7feeqvMdOnSRWaeffZZmbGUBhUXF8vMBx984N1+xRVXuIKCArlOKq/hnHO9e/eWmYwM3fl57bXXyswrr7wiM5s3b5YZ33UhWTLpO1dVsdzGjRvlPhx11FEyc+ONN8rMXXfdJTOWkq8HH3xQZm655RaZiVJaWuq+//57b8bynoQybtw4mbG8t1dddZXM3HvvvZZd2qtdu3a5119/3ZsZPny4XKdGjRoyc+yxx8qMpfjvkksukZnHHntMZnzFf0ceeaRzzlYGmop3331XZk455RSZ6dmzp8x8/PHHpn36f4Xl+9BSZGi5Z7Tcolru9VIpHLWwlN8lC559KisrZcZyXVBloM45t2nTJpl55513IrdlZWU55/Q9LH/RAAAAABAcgwYAAACA4Bg0AAAAAATHoAEAAAAgOAYNAAAAAMExaAAAAAAIjkEDAAAAQHAMGgAAAACC021b/0dZWZlbsmSJN7Nu3Tq5Tq1atawv6ZWbmysz1113ncxYynZ+//33yG3du3d3zkWXC6Wnp7s6ders8/pJlhI9VQzonK0YzVLqZymMqWpr1qyRmfT0dJlZuHChzMycOVNmLCWGloI7JVkSGSWRSMiSo/79+8vXGTVqlMxcdNFFMmMpULOcA6rczDnnWrRoEbmtc+fOzrnogrOKigp5jVu/fr3cB0uZ6N133y0zltdq3bq1zBx99NEys3Tp0shtrVq1cs79/+VM/7f09HR57D///PNyHy6++GKZsRwnlgLU6tWry4zl/U9FjR
o1/uf7I8qMGTPkOiNGjJCZ8ePHW3fLy1IQaXH88cfv87/duXOnmzJlijfTtWtXuc5JJ50kM5ayuKuvvlpmLF544QW
"text/plain": [
"<Figure size 1000x400 with 40 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"np.random.seed(42)\n",
"noisy = np.random.normal(digits.data, 4)\n",
"plot_digits(noisy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's clear by eye that the images are noisy, and contain spurious pixels.\n",
"Let's train a PCA on the noisy data, requesting that the projection preserve 50% of the variance:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pca = PCA(0.50).fit(noisy)\n",
"pca.n_components_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here 50% of the variance amounts to 12 principal components.\n",
"Now we compute these components, and then use the inverse of the transform to reconstruct the filtered digits:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAxoAAAFHCAYAAAAmzzpSAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA4pElEQVR4nO3daZRV9Z318XNvMUMV81SlDKKiCAjKICKzI2JAUFRUmlYTUXGeWo1i1EiroJK02g5RWyJGUUFtExXTIg4RnHFCVJBBKMZippjueV6k7/Nk9Xru2VvPvwid9f28vZtT5546089y/XcmjuM4AgAAAICAsn/vHQAAAADwj4dBAwAAAEBwDBoAAAAAgmPQAAAAABAcgwYAAACA4Bg0AAAAAATHoAEAAAAgOAYNAAAAAMExaAAAAAAIrpobdArEnczq1atlZsuWLTJTp04dmWnQoIHM1KhRQ2YymcxPzjjHxNl+LpeTmcrKSpmpqKiQmeLi4iAZ53sVEuq4OdasWSMzGzdulJlGjRrJjHNO7gnO+eQcl23btslMzZo1ZcY5ds61mkaoe9yuXbtkZseOHTKzdetWmWnWrJnMVLVQx23VqlUy4xyTUNdh2u+Vvz/tDc+GtWvXyoyzP02aNJGZbFb/98s03935tw7n3uW8izgZ55g0btxYZpz3nrTPxVDPhs2bN8uMcz6VlJTIjGNvOOecd7R169bJTFFRkcy0aNHC2iclxH0qE5tHMI5jeQI6D4Fzzz1XZqZNmyYzJ554osyMHz9eZjp06CAzaS7uUA8T56b4xhtvyMzNN98sMz//+c9lZsyYMTJTvXp1mSkkjmP5wpZm+3/rsssuk5nJkyfLzB133CEz11xzjbNLVc45n5zz4LXXXpOZI488UmZuvfVWmenUqZPMpBHHcbR79+7EzM6dO+V2vv32W5n5+uuvZeb555+XmalTp8pMVYvjWN7nnJexM888U2ZeeuklmZk0aZLMXHHFFTKjzoUoSn5xqF27dhRFhV8ynWeq80LhPHede5zzEvTQQw/JTK1atWQm6TqqVu2v//0z6ZmqzjfnmTpz5kyZcd5Fnn32WZlp2LChzDjn7UknnZT4eVFRUepBw/mPapdcconM/Md//IfMPPDAAzIzduxYmXEkXWvOfxRQ9wPnuM+aNUtmbrjhBpkpLS2VGef54Uj63vl7m/ru/K9TAAAAAIJj0AAAAAAQHIMGAAAAgOAYNAAAAAAEx6ABAAAAIDgGDQAAAADB2T0aUaSXsFqwYIHchrNc3MiRI2WmrKxMZv7whz/IzNVXXy0z+eX2/n+ctf3VMq0rV65MvY0oiqJrr71WZj7//HOZufDCC2XGWWs7jUwmE2T5Wucc+OMf/5j650SRd/xPO+00mWndunWq/cjlcnIN8zfffFNuZ+HChTIzdOhQmXn66adlxlnSz5HUT5E/n5KWMFTLsDpLXTvL8D711FMy4yyTG0qaZbgzmYx8NmzYsEFu/+2335YZp//itttuk5kDDzxQZpxlmZOOm1rmNZPJWMvXKv/+7/8uMw8//LDM/PrXv5aZpUuXysy+++4rM84SuIXEcSyX5160aJHczty5c2Xm0EMPlZl69erJjPMccpYcD3G+KE7HzyuvvCIzAwYMkJlWrVrJzMSJE2XmvPPOk5mq7rByOsouvvhimWnXrp3MOMuFP/jggzJz/vnny0yIvjL+ogEAAAAgOAYNAAAAAMExaAAAAAAIjkEDAAAAQHAMGgAAAACCY9AAAAAAEByDBgAAAIDgGDQAAAAABGcX9jmlTE6Jm1PGd99998mMU1hy7733yoxT7NOiRQuZSaKO25IlS+Q2pk+fLjNOGV9S+WCeU1JUs2ZNmdm5c6fcj0LHJpfLRdu3b0/c/tatW+U+OOeJU6LnlAZVVlbKTNoyPkcmk5HFck2aNJHbcYobX3vtNZnZtGmTzHTp0kVmHE6BZiHZbDaqX79+kP1QnMKr9u3by4
xzzq1Zs0ZmnALUNJo1ayYzXbt2lRmnuNQ53959912Z6dixo8y0adNGZtJwCvKuvPJKmTn22GNlZuzYsTLz/fffy8y6detkJumZmn8mJBVEquvcKZ3r37+/zDgFnatWrZKZAw44QGbatm0rM+qZWKNGjdQFa86zwTmfDj74YJlZsWKFzDjPmMGDB8tMcXFxwc+y2b/+N/ekY6eO65dffin34auvvpKZYcOGyYzzrvfss8/KjPPuE6LokL9oAAAAAAiOQQMAAABAcAwaAAAAAIJj0AAAAAAQHIMGAAAAgOAYNAAAAAAEx6ABAAAAIDgGDQAAAADB2YV9DlUmE0VR1KlTJ5mpXbu2zDilMrVq1ZKZTz75RGZ69eolM4XEcSxLtZxSoPfff/8n78Pfco6/UxTmqF69eqp/n1T4F0VR1LhxY7mNww8/XGackjbnZ51wwgkyU1FRITMNGzaUmSSZTEYWMx500EFyO4sXL5aZKVOmyIxzHTpFYE7pVVX74IMPZOamm26SmR49esjMqFGjZOaJJ56QmaKiIpk599xzZaaqOYVxzvedP3++zMyaNUtmnMLKbdu2FfwsX2qaLwP7n+I4jnbv3p24fee7JBWR5Y0bN05mNmzYIDOlpaUy41zvhY6JS5Wn1atXT27jP//zP2XGOU8WLlwoM8791rkHpnkXCenoo4+WmcmTJ8uM845w5plnykxJSYnMJJV9qgLITCYj76MLFiyQ++B4/vnnZebrr78O8rPefvttmRkyZEjqn8NfNAAAAAAEx6ABAAAAIDgGDQAAAADBMWgAAAAACI5BAwAAAEBwDBoAAAAAgmPQAAAAABAcgwYAAACA4OzCvjiOoziOEzPLly+X23EKy+rWrevuVurtbNmyJcjPKiSbzcryoLZt28rtHHvssTJTXl4uM+p3GEVegaBj69atBT/LlzoVKm7KZrOyhGf9+vVyH5yiQ6cgx3HMMcfITFLBV16+6KuQWrVqpS68ckoBe/bsKTMvv/yyzKhyrSiKohtvvFFm/umf/klmjj/++CD7U8jUqVNl5k9/+pPMXHLJJTLjXM/Tp0+XmeHDh8tMGnEcRzt27EjMOMf8pJNOkplTTz1VZm699VaZuf3222Xm3XfflZmRI0fKTCFOsabzTHXuKaNHj5YZ51n43HPPyUzfvn1lJqnENl+eVuge5xw35/6minSjKIpWrFghM849Z/DgwTLjlOmm5Tz/Fy1aJDMzZsyQmY8++khmmjdvLjNOgbBTWqmeq2k5RZXdunWTmf79+8vMkUceKTOPPfaYzDgFtBT2AQAAANgrMWgAAAAACI5BAwAAAEBwDBoAAAAAgmPQAAAAABAcgwYAAACA4Bg0AAAAAATHoAEAAAAgOLuwL5PJyNKlli1byu18/PHHMrNz506ZqV69uswsXLhQZpwCqI0bNxb8LF/GV6hcKI7jKJfLJW7fKUpySvS+/vprmXE4v6OuXbvKTJrivziO5XmwZMkSuR2nvGvp0qUyc9xxx8lM06ZNZcYpLktbxudwrrGVK1fKTLt27WTGKVxyiucefvhhmTnhhBNkppBcLievxVatWv3k7f+tsrIymfnwww9lxilr23///VNtJ190lVSgpsqwNmzYIPfBuTacUqzDDz9cZpzisqouc43jONq9e3diZr/99pPb2bVrl8yE+i6zZ8+WmcMOO0xmateuXfAzdR7EcSy/s/PsGTBggMw4z8KkYtq8n/3sZzKzp6hzzrk23nrrLZmZMmWKzDjP3ieeeEJm2rRpIzNJ51z+fbLQuecUVnfp0kXuw9VXXy0zvXv3lhmnSNJ5pjrnbgj8RQMAAABAcAwaAAAAAIJj0AAAAAAQHIMGAAAAgOAYNAAAAAAEx6ABAAAAIDgGDQAAAADBMWgAAAAACM4u7HM45V1vvPGGzNxxxx0ys2rVKplZvHixzDj7XFJSIjOFZDKZqKioKDGTL/1L0qNHD5np3r27zDjHbdGiRTLToEEDmWnWrF
nBz/LlOUklYDVq1EjcfocOHeQ+jBgxQmbuvfdemXF+1oIFC2TGOZcaNWqU+LlTaLVjx47EzJo1a+R+3HnnnTLTtm1
"text/plain": [
"<Figure size 1000x400 with 40 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"components = pca.transform(noisy)\n",
"filtered = pca.inverse_transform(components)\n",
"plot_digits(filtered)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This signal preserving/noise filtering property makes PCA a very useful feature selection routine—for example, rather than training a classifier on very high-dimensional data, you might instead train the classifier on the lower-dimensional representation, which will automatically serve to filter out random noise in the inputs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Principal Component Analysis Summary\n",
"\n",
"In this section we have discussed the use of principal component analysis for dimensionality reduction, for visualization of high-dimensional data, for noise filtering, and for feature selection within high-dimensional data.\n",
"Because of the versatility and interpretability of PCA, it has been shown to be effective in a wide variety of contexts and disciplines.\n",
"Given any high-dimensional dataset, I tend to start with PCA in order to visualize the relationship between points (as we did with the digits), to understand the main variance in the data (as we did with the eigenfaces), and to understand the intrinsic dimensionality (by plotting the explained variance ratio).\n",
"Certainly PCA is not useful for every high-dimensional dataset, but it offers a straightforward and efficient path to gaining insight into high-dimensional data.\n",
"\n",
"PCA's main weakness is that it tends to be highly affected by outliers in the data.\n",
"For this reason, many robust variants of PCA have been developed, many of which act to iteratively discard data points that are poorly described by the initial components.\n",
"Scikit-Learn contains a couple interesting variants on PCA, including ``RandomizedPCA`` and ``SparsePCA``, both also in the ``sklearn.decomposition`` submodule.\n",
"``RandomizedPCA``, which we saw earlier, uses a non-deterministic method to quickly approximate the first few principal components in very high-dimensional data, while ``SparsePCA`` introduces a regularization term (see [In Depth: Linear Regression](05.06-Linear-Regression.ipynb)) that serves to enforce sparsity of the components."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}