Iris Series: Visualize Math -- From Arithmetic Basics to Machine Learning 5adb9e44a7 Add files via upload
2025-02-01 17:08:33 +08:00

{
"cells": [
{
"cell_type": "markdown",
"id": "73bd968b-d970-4a05-94ef-4e7abf990827",
"metadata": {},
"source": [
"Chapter 24\n",
"\n",
"# Data Decomposition\n",
"Book_4 《矩阵力量》 (Power of Matrix) | Iris Series: From Arithmetic Basics to Machine Learning (2nd Edition)"
]
},
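{
"cell_type": "markdown",
"id": "84c7e223-f9df-4575-a7d8-ab27812fea5c",
"metadata": {},
"source": [
"This notebook applies a series of matrix operations and decompositions to the feature matrix $X$ of the iris dataset in order to analyze the structure of the data. It first computes the **Gram matrix** $G = X^T X$, whose entries are the inner products between the feature columns of $X$. From $G$ it builds the **cosine similarity matrix** $C$ by dividing each entry of $G$ by the norms of the two corresponding columns, so that every entry of $C$ is the cosine similarity between two features. It then computes the **centroid** (mean vector) $E(X)$ and the **centered data matrix** $X_c = X - E(X)$, which removes the effect of the column means.\n",
"\n",
"The notebook next computes the **covariance matrix** $\\Sigma = \\frac{1}{N-1} X_c^T X_c$ (the sample covariance returned by `DataFrame.cov`, where $N$ is the number of samples) and the **correlation matrix** $\\rho$. These hold the covariances between features and their standardized counterparts, the correlations.\n",
"\n",
"Several matrix decompositions follow:\n",
"\n",
"1. **QR decomposition**: factors the original matrix $X$ into a matrix $Q$ with orthonormal columns and an upper triangular matrix $R$ such that $X = QR$.\n",
" \n",
"2. **Cholesky decomposition**: factors the Gram matrix $G$ and the covariance matrix $\\Sigma$ into a lower triangular matrix times its transpose, $G = L_G L_G^T$ and $\\Sigma = L_\\Sigma L_\\Sigma^T$.\n",
" \n",
"3. **Eigendecomposition**: decomposes the Gram matrix $G$, the covariance matrix $\\Sigma$, and the correlation matrix $\\rho$ into eigenvalues (the diagonal matrix $\\Lambda$) and an eigenvector matrix $V$, satisfying $G = V \\Lambda V^T$, $\\Sigma = V \\Lambda V^T$, and $\\rho = V \\Lambda V^T$, with a different $V$ and $\\Lambda$ for each matrix.\n",
"\n",
"4. **Singular value decomposition (SVD)**: decomposes the original data $X$, the centered data $X_c$, and the standardized data $Z_X$, each in the form $X = U S V^T$, where the columns of $U$ and $V$ are the left and right singular vectors and $S$ is the diagonal matrix of singular values. The SVD gives the principal directions of the data and their magnitudes, which can be used later for principal component analysis (PCA) or feature extraction.\n",
"\n",
"Together, these operations and decompositions characterize the similarity, covariance structure, correlation, and principal directions of the data, and help reveal its underlying structure."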
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f95b3fca-b77f-4e10-a84d-62f6876de01a",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np # numerical computing\n",
"import matplotlib.pyplot as plt # plotting\n",
"import pandas as pd # tabular data handling\n",
"from sklearn.datasets import load_iris # loader for the iris dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c3ab92a3-7a2b-4655-b39f-41014f0bf0c3",
"metadata": {},
"outputs": [],
"source": [
"from numpy.linalg import inv # matrix inverse\n",
"from scipy.stats import zscore # column-wise standardization\n",
"from numpy.linalg import qr # QR decomposition\n",
"from numpy.linalg import cholesky as chol # Cholesky decomposition\n",
"from numpy.linalg import eig # eigendecomposition\n",
"from numpy.linalg import svd # singular value decomposition"
]
},
{
"cell_type": "markdown",
"id": "c401295c-eabb-4713-94aa-2311ec8a972d",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2397142d-b32e-465f-834f-65b06c127cea",
"metadata": {},
"outputs": [],
"source": [
"iris = load_iris() # load the iris dataset from sklearn"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "313bf88d-fcfd-4947-beb3-bcb2d7bc56e3",
"metadata": {},
"outputs": [],
"source": [
"X = iris.data # feature matrix X\n",
"y = iris.target # target labels y"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "91f49a78-e515-4201-9eb3-d8993b32471d",
"metadata": {},
"outputs": [],
"source": [
"feature_names = ['Sepal length, x1', 'Sepal width, x2',\n",
"                 'Petal length, x3', 'Petal width, x4'] # feature names"
]
},
{
"cell_type": "markdown",
"id": "730bcf6c-1e64-4bda-85fe-80a14077e339",
"metadata": {},
"source": [
"## Convert the feature matrix X to a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c86246d1-a83b-4495-a982-36e715eeae6e",
"metadata": {},
"outputs": [],
"source": [
"X_df = pd.DataFrame(X, columns=feature_names)"
]
},
{
"cell_type": "markdown",
"id": "54cef963-570f-47c3-b5dd-222fdc550707",
"metadata": {},
"source": [
"## Original data X"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "802ac1d9-148c-4e3b-b00f-9f1ee45648d1",
"metadata": {},
"outputs": [],
"source": [
"X = X_df.to_numpy() # convert the DataFrame back to a numpy array"
]
},
{
"cell_type": "markdown",
"id": "877eeaab-7719-4b83-b829-280143f81c5a",
"metadata": {},
"source": [
"## Gram matrix G"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "775d5366-beec-4349-bae2-7726b9735f6e",
"metadata": {},
"outputs": [],
"source": [
"G = X.T @ X # Gram matrix: G = X^T X"
]
},
{
"cell_type": "markdown",
"id": "967a5f97-96e3-43c5-9527-76d2879e8f03",
"metadata": {},
"source": [
"## Cosine similarity matrix C"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3639f25d-1ad5-4646-bfd5-675cf79fd512",
"metadata": {},
"outputs": [],
"source": [
"# Compute the similarity from the norms of the feature columns\n",
"S_norm = np.diag(np.sqrt(np.diag(G))) # scaling matrix whose diagonal holds the norm of each feature column"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2052f952-8ee5-4914-9bb9-33bce33b92b5",
"metadata": {},
"outputs": [],
"source": [
"C = inv(S_norm) @ G @ inv(S_norm) # cosine similarity matrix C"
]
},
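{
"cell_type": "markdown",
"id": "added-cosine-similarity-entries",
"metadata": {},
"source": [
"To make the scaling step concrete, the cosine similarity matrix can be written entry by entry. This is a short worked form rather than part of the original code; it assumes $x_i$ denotes the $i$-th column of $X$ (notation introduced here) and $S$ stands for the diagonal matrix `S_norm`.\n",
"\n",
"$$C = S^{-1} G S^{-1}, \\qquad S = \\mathrm{diag}\\left(\\lVert x_1 \\rVert, \\ldots, \\lVert x_4 \\rVert\\right)$$\n",
"\n",
"$$c_{ij} = \\frac{x_i^T x_j}{\\lVert x_i \\rVert \\, \\lVert x_j \\rVert}$$\n",
"\n",
"Each diagonal entry of $C$ equals 1. Because every iris measurement is positive, all entries of $C$ lie in $(0, 1]$."
]
},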
{
"cell_type": "markdown",
"id": "fa9f5362-0b41-4a6c-8ca1-2324672bd552",
"metadata": {},
"source": [
"## Centroid of the data matrix E(X)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "7e51dd81-5415-46ee-be03-9bbfba20be8a",
"metadata": {},
"outputs": [],
"source": [
"E_X = X_df.mean().to_frame().T # column means of X as a single-row DataFrame"
]
},
{
"cell_type": "markdown",
"id": "83005aa2-ba44-4ee6-8bcd-c70f86d0c366",
"metadata": {},
"source": [
"## Centered data X_c"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e98077f7-6426-427d-a6f4-8a85bef517c4",
"metadata": {},
"outputs": [],
"source": [
"X_c = X_df.sub(X_df.mean()) # subtract each column's mean to get the centered matrix X_c"
]
},
{
"cell_type": "markdown",
"id": "3f96ef51-d242-4f3e-a95d-c4174d394d63",
"metadata": {},
"source": [
"## Covariance matrix Sigma"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "8aae2ef4-1803-4522-9d72-0749936c3bab",
"metadata": {},
"outputs": [],
"source": [
"SIGMA = X_df.cov() # sample covariance matrix SIGMA (pandas divides by N - 1 by default)"
]
},
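{
"cell_type": "markdown",
"id": "added-covariance-vs-gram",
"metadata": {},
"source": [
"The covariance matrix can be related back to the Gram matrix. As a sketch, assume $N$ is the number of rows of $X$, $\\mathbf{1}$ is the all-ones column vector of length $N$, and the centroid $E(X)$ is written as a row vector $\\mu$ (symbols introduced here for illustration). `DataFrame.cov` uses the $N-1$ normalization by default.\n",
"\n",
"$$\\Sigma = \\frac{X_c^T X_c}{N-1}, \\qquad X_c = X - \\mathbf{1}\\mu$$\n",
"\n",
"$$X_c^T X_c = X^T X - N \\mu^T \\mu = G - N \\mu^T \\mu$$\n",
"\n",
"So $G$ and $(N-1)\\Sigma$ differ by the rank-one term $N \\mu^T \\mu$, which is why decompositions of $G$ and $\\Sigma$ generally give different directions."
]
},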
{
"cell_type": "markdown",
"id": "2aee8ac6-dd46-446f-82d7-4f953a3b03aa",
"metadata": {},
"source": [
"## Correlation matrix P"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b79c796e-be1e-4ace-be31-87af4c1daf84",
"metadata": {},
"outputs": [],
"source": [
"RHO = X_df.corr() # correlation matrix RHO"
]
},
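{
"cell_type": "markdown",
"id": "added-correlation-from-covariance",
"metadata": {},
"source": [
"The correlation matrix is the covariance matrix rescaled by the standard deviations. A short worked form, assuming $D_\\sigma$ (a symbol introduced here) denotes the diagonal matrix of the column standard deviations $\\sigma_j = \\sqrt{\\Sigma_{jj}}$:\n",
"\n",
"$$\\rho = D_\\sigma^{-1} \\Sigma D_\\sigma^{-1}, \\qquad \\rho_{ij} = \\frac{\\Sigma_{ij}}{\\sigma_i \\sigma_j}$$\n",
"\n",
"This mirrors how $C$ is built from $G$: $\\rho$ is the cosine similarity between the columns of the centered data $X_c$."
]
},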
{
"cell_type": "markdown",
"id": "b087fa60-77ef-4280-8e20-d4efa134ba69",
"metadata": {},
"source": [
"## Standardized data Z_X"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8b473969-650c-4c52-ae8d-f89ce7b9dce0",
"metadata": {},
"outputs": [],
"source": [
"Z_X = zscore(X_df) # standardize each column of X to mean 0 and standard deviation 1"
]
},
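{
"cell_type": "markdown",
"id": "added-zscore-vs-correlation",
"metadata": {},
"source": [
"The standardized data connects directly to the correlation matrix. As a sketch, assume `zscore` is called with its default `ddof=0` (population standard deviation), $N$ is the number of samples, and $D_0$ (a symbol introduced here) is the diagonal matrix of those population standard deviations $s_j$:\n",
"\n",
"$$Z_X = X_c D_0^{-1}, \\qquad s_j^2 = \\frac{1}{N} \\sum_{k=1}^{N} (X_c)_{kj}^2$$\n",
"\n",
"$$\\frac{1}{N} Z_X^T Z_X = \\rho$$\n",
"\n",
"The normalization constant cancels in a correlation, so this agrees with `X_df.corr()` even though `DataFrame.cov` divides by $N-1$."
]
},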
{
"cell_type": "markdown",
"id": "a70b1f17-e178-4f52-9346-e330986b22b0",
"metadata": {},
"source": [
"## QR decomposition"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "fc15ec6d-931a-4472-b1d0-7d1aaa1285dd",
"metadata": {},
"outputs": [],
"source": [
"Q, R = qr(X_df, mode='reduced') # QR decomposition of X; mode='reduced' returns the smallest factors, Q (150 x 4) and R (4 x 4)"
]
},
{
"cell_type": "markdown",
"id": "d1977ba7-04d1-4d11-b9a4-dc5b0e1ff5c8",
"metadata": {},
"source": [
"## Cholesky decomposition"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "7f129525-f4e7-48b7-ba19-bc49ad4515e3",
"metadata": {},
"outputs": [],
"source": [
"L_G = chol(G) # Cholesky decomposition of the Gram matrix G, giving the lower triangular matrix L_G"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "293cfc24-8d6c-48b5-8b55-0fd2173f10de",
"metadata": {},
"outputs": [],
"source": [
"R_G = L_G.T # the upper triangular matrix R_G is the transpose of L_G"
]
},
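{
"cell_type": "markdown",
"id": "added-qr-cholesky-link",
"metadata": {},
"source": [
"The QR and Cholesky results are two views of the same triangular factor. A brief derivation, assuming $Q$ has orthonormal columns ($Q^T Q = I$), as is the case for the reduced QR decomposition:\n",
"\n",
"$$G = X^T X = (QR)^T (QR) = R^T Q^T Q R = R^T R$$\n",
"\n",
"$$G = L_G L_G^T = R_G^T R_G$$\n",
"\n",
"The Cholesky factor has a positive diagonal, while the $R$ returned by `qr` may have negative diagonal entries, so $R$ and $R_G$ agree up to the sign of each row."
]
},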
{
"cell_type": "markdown",
"id": "44091d77-46f4-416f-8305-d66438fc9d62",
"metadata": {},
"source": [
"## Cholesky decomposition of the covariance matrix Sigma"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "959d25e3-0ab4-40c1-b79d-f6388712b381",
"metadata": {},
"outputs": [],
"source": [
"L_Sigma = chol(SIGMA) # Cholesky decomposition of the covariance matrix SIGMA"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "94d8ce95-d99d-43a1-b244-402cf2e07f28",
"metadata": {},
"outputs": [],
"source": [
"R_Sigma = L_Sigma.T # the upper triangular matrix R_Sigma is the transpose of L_Sigma"
]
},
{
"cell_type": "markdown",
"id": "4f02b402-a8bc-464a-9006-4dca94b9b219",
"metadata": {},
"source": [
"## Eigendecomposition of the Gram matrix G"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "b63e3321-afb5-4b9e-ac92-a74c39537169",
"metadata": {},
"outputs": [],
"source": [
"Lambs_G, V_G = eig(G) # eigendecomposition of G: eigenvalues Lambs_G and eigenvectors V_G (as columns)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "40135be7-b292-4765-8a8b-866b4f22418f",
"metadata": {},
"outputs": [],
"source": [
"Lambs_G = np.diag(Lambs_G) # place the eigenvalues on the diagonal of a matrix"
]
},
{
"cell_type": "markdown",
"id": "559af984-f9b8-4f34-8c69-af3035dfb589",
"metadata": {},
"source": [
"## Eigendecomposition of the covariance matrix Sigma"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "0280dc12-1245-4b5d-8099-8fe3abc94ccc",
"metadata": {},
"outputs": [],
"source": [
"Lambs_sigma, V_sigma = eig(SIGMA) # eigendecomposition of SIGMA"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "7941e036-5e24-4a15-8612-20dfd3d60779",
"metadata": {},
"outputs": [],
"source": [
"Lambs_sigma = np.diag(Lambs_sigma) # place the eigenvalues on the diagonal of a matrix"
]
},
{
"cell_type": "markdown",
"id": "d387720b-43f9-4c24-a918-48eb686acb8a",
"metadata": {},
"source": [
"## Eigendecomposition of the correlation matrix P"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "7227d988-df95-44f1-9986-bd3f0922d7da",
"metadata": {},
"outputs": [],
"source": [
"Lambs_P, V_P = eig(RHO) # eigendecomposition of the correlation matrix RHO"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "a8a40ce6-cdde-4fea-be3e-1d34367b9aac",
"metadata": {},
"outputs": [],
"source": [
"Lambs_P = np.diag(Lambs_P) # place the eigenvalues on the diagonal of a matrix"
]
},
{
"cell_type": "markdown",
"id": "2891d64a-1faa-47a6-bcba-5c56611e96cf",
"metadata": {},
"source": [
"## SVD of the original data X"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "37ffec3a-f1ea-45dc-9fa0-be3e27bc15e7",
"metadata": {},
"outputs": [],
"source": [
"U_X, S_X_, V_X = svd(X_df, full_matrices=False) # SVD of X; the third output is V^T"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "472ff2d6-e2fe-4ba5-b341-cc13ebf416fe",
"metadata": {},
"outputs": [],
"source": [
"V_X = V_X.T # transpose to obtain V_X"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "fd6103a7-12b4-431f-9ad5-1d7d4a3c6564",
"metadata": {},
"outputs": [],
"source": [
"S_X = np.diag(S_X_) # place the singular values on the diagonal of a matrix"
]
},
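{
"cell_type": "markdown",
"id": "added-svd-vs-eig-gram",
"metadata": {},
"source": [
"The SVD of $X$ and the eigendecomposition of $G$ describe the same directions. A short derivation, using $U^T U = I$ for the reduced SVD and writing $s_j$ for the $j$-th singular value (notation introduced here):\n",
"\n",
"$$G = X^T X = (U S V^T)^T (U S V^T) = V S U^T U S V^T = V S^2 V^T$$\n",
"\n",
"$$\\lambda_j(G) = s_j^2$$\n",
"\n",
"When the eigenvalues are distinct, the columns of `V_X` should therefore match the eigenvectors in `V_G` up to sign, and possibly up to ordering: `svd` returns singular values in descending order, while `eig` does not guarantee any ordering of the eigenvalues."
]
},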
{
"cell_type": "markdown",
"id": "93dc741b-c4b5-48ae-a32e-8b80bafa2ae6",
"metadata": {},
"source": [
"## SVD of the centered data X_c"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "58e1f3b8-6838-410e-bbfa-1b538b8713c1",
"metadata": {},
"outputs": [],
"source": [
"U_Xc, S_Xc, V_Xc = svd(X_c, full_matrices=False) # SVD of the centered data X_c; the third output is V^T"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "98a5eb4a-8112-4021-9339-27779343347c",
"metadata": {},
"outputs": [],
"source": [
"V_Xc = V_Xc.T # transpose to obtain V_Xc"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "fb80d669-ea06-4ee3-a74e-1e3b8a00e954",
"metadata": {},
"outputs": [],
"source": [
"S_Xc = np.diag(S_Xc) # place the singular values on the diagonal of a matrix"
]
},
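{
"cell_type": "markdown",
"id": "added-svd-centered-vs-covariance",
"metadata": {},
"source": [
"The SVD of the centered data is the route from this notebook to PCA. As a sketch, write $X_c = U_c S_c V_c^T$ for the factors computed above and $s_{c,j}$ for the $j$-th singular value (notation introduced here), and assume the sample covariance with the $N-1$ normalization used by `DataFrame.cov`:\n",
"\n",
"$$\\Sigma = \\frac{X_c^T X_c}{N-1} = V_c \\, \\frac{S_c^2}{N-1} \\, V_c^T$$\n",
"\n",
"$$\\lambda_j(\\Sigma) = \\frac{s_{c,j}^2}{N-1}$$\n",
"\n",
"The columns of `V_Xc` are thus the principal directions, and $\\lambda_j / \\sum_k \\lambda_k$ is the share of the total variance explained by the $j$-th direction."
]
},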
{
"cell_type": "markdown",
"id": "9b015450-0229-445f-bf61-487d53fa3efe",
"metadata": {},
"source": [
"## SVD of the standardized data Z_X"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "689bb4cf-2b59-45ca-b8e4-acaf31d8e139",
"metadata": {},
"outputs": [],
"source": [
"U_Z, S_Z, V_Z = svd(Z_X, full_matrices=False) # SVD of the standardized data Z_X; the third output is V^T"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "a296884a-2df5-4b34-b1ad-ffec488c2f37",
"metadata": {},
"outputs": [],
"source": [
"V_Z = V_Z.T # transpose to obtain V_Z"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "a3bd573a-de20-4d7c-81a2-f88c2bcf0263",
"metadata": {},
"outputs": [],
"source": [
"S_Z = np.diag(S_Z) # place the singular values on the diagonal of a matrix"
]
},
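{
"cell_type": "markdown",
"id": "added-svd-standardized-vs-correlation",
"metadata": {},
"source": [
"The SVD of the standardized data plays the same role for the correlation matrix. As a sketch, assume `zscore` used its default `ddof=0`, so that $Z_X^T Z_X = N \\rho$ with $N$ the number of samples, and write $s_{Z,j}$ for the $j$-th singular value of $Z_X$ (notation introduced here):\n",
"\n",
"$$\\rho = \\frac{Z_X^T Z_X}{N} = V_Z \\, \\frac{S_Z^2}{N} \\, V_Z^T, \\qquad \\lambda_j(\\rho) = \\frac{s_{Z,j}^2}{N}$$\n",
"\n",
"$$\\sum_j \\lambda_j(\\rho) = \\mathrm{tr}(\\rho) = 4$$\n",
"\n",
"The eigenvalues of $\\rho$ sum to the number of features (4 for iris), because every diagonal entry of a correlation matrix equals 1."
]
},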
{
"cell_type": "code",
"execution_count": null,
"id": "85a80909-2aac-49ed-bb7a-f8cc6b80ee7d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "ecd322f4-f919-4be2-adc3-69d28ef25e69",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}