mirror of
https://github.com/Visualize-ML/Book4_Power-of-Matrix.git
synced 2026-02-03 02:24:03 +08:00
588 lines
15 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "73bd968b-d970-4a05-94ef-4e7abf990827",
   "metadata": {},
   "source": [
    "Chapter 24\n",
    "\n",
    "# Data Decomposition\n",
    "Book_4 Power of Matrix (《矩阵力量》) | Iris Book (鸢尾花书): From Basic Arithmetic to Machine Learning (2nd Edition)"
   ]
  },
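  {
   "cell_type": "markdown",
   "id": "84c7e223-f9df-4575-a7d8-ab27812fea5c",
   "metadata": {},
   "source": [
    "This code applies several matrix operations and decompositions to the feature matrix $X$ of the iris dataset in order to analyze the structure of the data. First, it computes the **Gram matrix** $G = X^T X$, whose entries are the inner products between pairs of feature columns of $X$. Next, it builds the **cosine similarity matrix** $C$ from $G$ by dividing each entry by the norms of the two corresponding columns, so that each element of $C$ is the cosine similarity between two features. The code then computes the **centroid** (mean vector) $E(X)$ and forms the **demeaned data matrix** $X_c = X - E(X)$, which centers the data and removes the effect of the mean.\n",
    "\n",
    "The code then computes the **covariance matrix** $\\Sigma = \\frac{1}{N-1} X_c^T X_c$ (pandas `DataFrame.cov` uses the $N-1$ denominator by default) and the **correlation matrix** $\\rho$, which hold the covariances between features and their standardized counterparts, the correlation coefficients.\n",
    "\n",
    "Next, the code performs several matrix decompositions:\n",
    "\n",
    "1. **QR decomposition**: factors the raw data matrix $X$ into a matrix $Q$ with orthonormal columns and an upper triangular matrix $R$ such that $X = QR$.\n",
    "\n",
    "2. **Cholesky decomposition**: factors the Gram matrix $G$ and the covariance matrix $\\Sigma$ into a lower triangular matrix times its transpose, $G = L_G L_G^T$ and $\\Sigma = L_\\Sigma L_\\Sigma^T$.\n",
    "\n",
    "3. **Eigendecomposition**: applied to the Gram matrix $G$, the covariance matrix $\\Sigma$, and the correlation matrix $\\rho$. Each yields its own diagonal matrix of eigenvalues $\\Lambda$ and eigenvector matrix $V$, satisfying $G = V \\Lambda V^T$, $\\Sigma = V \\Lambda V^T$, and $\\rho = V \\Lambda V^T$ respectively.\n",
    "\n",
    "4. **Singular value decomposition (SVD)**: applied separately to the raw data $X$, the demeaned data $X_c$, and the standardized data $Z_X$, giving factorizations of the form $X = U S V^T$, where the columns of $U$ and $V$ are the left and right singular vectors and $S$ is a diagonal matrix of singular values. For the demeaned data, the right singular vectors give the principal directions and the singular values their magnitudes, which can be used in subsequent principal component analysis (PCA) or feature extraction.\n",
    "\n",
    "Together, these operations and decompositions describe the similarity, covariance structure, correlation, principal components, and characteristic directions of the data, and provide a useful reference for understanding its underlying structure."
   ]
  },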
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f95b3fca-b77f-4e10-a84d-62f6876de01a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np  # numerical computing\n",
    "import matplotlib.pyplot as plt  # plotting\n",
    "import pandas as pd  # data manipulation\n",
    "from sklearn.datasets import load_iris  # loader for the iris dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c3ab92a3-7a2b-4655-b39f-41014f0bf0c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.linalg import inv  # matrix inverse\n",
    "from scipy.stats import zscore  # z-score standardization\n",
    "from numpy.linalg import qr  # QR decomposition\n",
    "from numpy.linalg import cholesky as chol  # Cholesky decomposition\n",
    "from numpy.linalg import eig  # eigendecomposition\n",
    "from numpy.linalg import svd  # singular value decomposition"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c401295c-eabb-4713-94aa-2311ec8a972d",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2397142d-b32e-465f-834f-65b06c127cea",
   "metadata": {},
   "outputs": [],
   "source": [
    "iris = load_iris()  # load the iris dataset from sklearn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "313bf88d-fcfd-4947-beb3-bcb2d7bc56e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = iris.data  # feature matrix X\n",
    "y = iris.target  # target labels y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "91f49a78-e515-4201-9eb3-d8993b32471d",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_names = ['Sepal length, x1', 'Sepal width, x2',\n",
    "                 'Petal length, x3', 'Petal width, x4']  # feature names"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "730bcf6c-1e64-4bda-85fe-80a14077e339",
   "metadata": {},
   "source": [
    "## Convert the feature matrix X to a DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c86246d1-a83b-4495-a982-36e715eeae6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_df = pd.DataFrame(X, columns=feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54cef963-570f-47c3-b5dd-222fdc550707",
   "metadata": {},
   "source": [
    "## Raw data X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "802ac1d9-148c-4e3b-b00f-9f1ee45648d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = X_df.to_numpy()  # convert the DataFrame to a NumPy array"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "877eeaab-7719-4b83-b829-280143f81c5a",
   "metadata": {},
   "source": [
    "## Gram matrix G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "775d5366-beec-4349-bae2-7726b9735f6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "G = X.T @ X  # Gram matrix, G = X^T X"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "967a5f97-96e3-43c5-9527-76d2879e8f03",
   "metadata": {},
   "source": [
    "## Cosine similarity matrix C"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3639f25d-1ad5-4646-bfd5-675cf79fd512",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute similarity using the norms of the feature columns\n",
    "S_norm = np.diag(np.sqrt(np.diag(G)))  # scaling matrix whose diagonal holds the norm of each feature column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2052f952-8ee5-4914-9bb9-33bce33b92b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "C = inv(S_norm) @ G @ inv(S_norm)  # cosine similarity matrix C"
   ]
  },
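  {
   "cell_type": "markdown",
   "id": "note-cosine-similarity",
   "metadata": {},
   "source": [
    "As a sketch of what the two cells above compute, assuming $x_i$ denotes the $i$-th column of $X$ and $S$ stands for `S_norm` (notation introduced here for illustration only), the scaling can be written element by element:\n",
    "\n",
    "$$S = \\mathrm{diag}(\\|x_1\\|, \\ldots, \\|x_4\\|), \\qquad C = S^{-1} G S^{-1}, \\qquad c_{ij} = \\frac{x_i^T x_j}{\\|x_i\\| \\, \\|x_j\\|}$$\n",
    "\n",
    "Under this reading every diagonal entry of $C$ equals 1, and each off-diagonal entry is the cosine of the angle between two feature columns."
   ]
  },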
  {
   "cell_type": "markdown",
   "id": "fa9f5362-0b41-4a6c-8ca1-2324672bd552",
   "metadata": {},
   "source": [
    "## Centroid of the data matrix, E(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7e51dd81-5415-46ee-be03-9bbfba20be8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "E_X = X_df.mean().to_frame().T  # mean of each column of X, as a single-row DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83005aa2-ba44-4ee6-8bcd-c70f86d0c366",
   "metadata": {},
   "source": [
    "## Demeaned data X_c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e98077f7-6426-427d-a6f4-8a85bef517c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_c = X_df.sub(X_df.mean())  # subtract each column's mean to get the demeaned matrix X_c"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f96ef51-d242-4f3e-a95d-c4174d394d63",
   "metadata": {},
   "source": [
    "## Covariance matrix Sigma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8aae2ef4-1803-4522-9d72-0749936c3bab",
   "metadata": {},
   "outputs": [],
   "source": [
    "SIGMA = X_df.cov()  # covariance matrix SIGMA (pandas default: N-1 denominator)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2aee8ac6-dd46-446f-82d7-4f953a3b03aa",
   "metadata": {},
   "source": [
    "## Correlation matrix P"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b79c796e-be1e-4ace-be31-87af4c1daf84",
   "metadata": {},
   "outputs": [],
   "source": [
    "RHO = X_df.corr()  # correlation matrix RHO"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b087fa60-77ef-4280-8e20-d4efa134ba69",
   "metadata": {},
   "source": [
    "## Standardized data Z_X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8b473969-650c-4c52-ae8d-f89ce7b9dce0",
   "metadata": {},
   "outputs": [],
   "source": [
    "Z_X = zscore(X_df)  # standardize each column of X to mean 0 and standard deviation 1 (scipy default ddof=0)"
   ]
  },
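  {
   "cell_type": "markdown",
   "id": "note-cov-corr-zscore",
   "metadata": {},
   "source": [
    "The covariance matrix, the correlation matrix, and the standardized data are closely related. A sketch of the relationships, assuming $\\sigma_j$ denotes the standard deviation of the $j$-th feature and $D$ is a symbol introduced here for illustration:\n",
    "\n",
    "$$\\rho = D^{-1} \\Sigma D^{-1}, \\qquad D = \\mathrm{diag}(\\sigma_1, \\ldots, \\sigma_4)$$\n",
    "\n",
    "$$\\rho = \\frac{1}{N} Z_X^T Z_X$$\n",
    "\n",
    "The second identity assumes that `zscore` is called with its default `ddof=0`, so that each column of $Z_X$ is divided by its population standard deviation."
   ]
  },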
  {
   "cell_type": "markdown",
   "id": "a70b1f17-e178-4f52-9346-e330986b22b0",
   "metadata": {},
   "source": [
    "## QR decomposition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "fc15ec6d-931a-4472-b1d0-7d1aaa1285dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "Q, R = qr(X_df, mode='reduced')  # QR decomposition of X; mode='reduced' returns the economy-size factors"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1977ba7-04d1-4d11-b9a4-dc5b0e1ff5c8",
   "metadata": {},
   "source": [
    "## Cholesky decomposition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "7f129525-f4e7-48b7-ba19-bc49ad4515e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "L_G = chol(G)  # Cholesky decomposition of the Gram matrix G; returns the lower triangular factor L_G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "293cfc24-8d6c-48b5-8b55-0fd2173f10de",
   "metadata": {},
   "outputs": [],
   "source": [
    "R_G = L_G.T  # the upper triangular factor R_G is the transpose of L_G"
   ]
  },
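  {
   "cell_type": "markdown",
   "id": "note-qr-cholesky",
   "metadata": {},
   "source": [
    "The Cholesky factor of $G$ can be related to the QR decomposition of $X$. A short derivation, assuming $X$ has full column rank so that $G$ is positive definite:\n",
    "\n",
    "$$G = X^T X = (QR)^T (QR) = R^T Q^T Q R = R^T R$$\n",
    "\n",
    "Because the Cholesky factor with a positive diagonal is unique, $R_G$ should agree with $R$ up to the signs of its rows. The signs can differ because the QR routine does not promise a positive diagonal for $R$."
   ]
  },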
  {
   "cell_type": "markdown",
   "id": "44091d77-46f4-416f-8305-d66438fc9d62",
   "metadata": {},
   "source": [
    "## Cholesky decomposition of the covariance matrix Sigma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "959d25e3-0ab4-40c1-b79d-f6388712b381",
   "metadata": {},
   "outputs": [],
   "source": [
    "L_Sigma = chol(SIGMA)  # Cholesky decomposition of the covariance matrix SIGMA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "94d8ce95-d99d-43a1-b244-402cf2e07f28",
   "metadata": {},
   "outputs": [],
   "source": [
    "R_Sigma = L_Sigma.T  # the upper triangular factor R_Sigma is the transpose of L_Sigma"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f02b402-a8bc-464a-9006-4dca94b9b219",
   "metadata": {},
   "source": [
    "## Eigendecomposition of the Gram matrix G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "b63e3321-afb5-4b9e-ac92-a74c39537169",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_G, V_G = eig(G)  # eigendecomposition of G: eigenvalues Lambs_G and eigenvectors V_G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "40135be7-b292-4765-8a8b-866b4f22418f",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_G = np.diag(Lambs_G)  # arrange the eigenvalues as a diagonal matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "559af984-f9b8-4f34-8c69-af3035dfb589",
   "metadata": {},
   "source": [
    "## Eigendecomposition of the covariance matrix Sigma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "0280dc12-1245-4b5d-8099-8fe3abc94ccc",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_sigma, V_sigma = eig(SIGMA)  # eigendecomposition of SIGMA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "7941e036-5e24-4a15-8612-20dfd3d60779",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_sigma = np.diag(Lambs_sigma)  # arrange the eigenvalues as a diagonal matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d387720b-43f9-4c24-a918-48eb686acb8a",
   "metadata": {},
   "source": [
    "## Eigendecomposition of the correlation matrix P"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "7227d988-df95-44f1-9986-bd3f0922d7da",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_P, V_P = eig(RHO)  # eigendecomposition of the correlation matrix RHO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "a8a40ce6-cdde-4fea-be3e-1d34367b9aac",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_P = np.diag(Lambs_P)  # arrange the eigenvalues as a diagonal matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2891d64a-1faa-47a6-bcba-5c56611e96cf",
   "metadata": {},
   "source": [
    "## SVD of the raw data X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "37ffec3a-f1ea-45dc-9fa0-be3e27bc15e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "U_X, S_X_, V_X = svd(X_df, full_matrices=False)  # SVD of X; the third return value is V^T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "472ff2d6-e2fe-4ba5-b341-cc13ebf416fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "V_X = V_X.T  # transpose to obtain V_X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "fd6103a7-12b4-431f-9ad5-1d7d4a3c6564",
   "metadata": {},
   "outputs": [],
   "source": [
    "S_X = np.diag(S_X_)  # arrange the singular values as a diagonal matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93dc741b-c4b5-48ae-a32e-8b80bafa2ae6",
   "metadata": {},
   "source": [
    "## SVD of the demeaned data X_c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "58e1f3b8-6838-410e-bbfa-1b538b8713c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "U_Xc, S_Xc, V_Xc = svd(X_c, full_matrices=False)  # SVD of the demeaned data X_c; the third return value is V^T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "98a5eb4a-8112-4021-9339-27779343347c",
   "metadata": {},
   "outputs": [],
   "source": [
    "V_Xc = V_Xc.T  # transpose to obtain V_Xc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "fb80d669-ea06-4ee3-a74e-1e3b8a00e954",
   "metadata": {},
   "outputs": [],
   "source": [
    "S_Xc = np.diag(S_Xc)  # arrange the singular values as a diagonal matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b015450-0229-445f-bf61-487d53fa3efe",
   "metadata": {},
   "source": [
    "## SVD of the standardized data Z_X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "689bb4cf-2b59-45ca-b8e4-acaf31d8e139",
   "metadata": {},
   "outputs": [],
   "source": [
    "U_Z, S_Z, V_Z = svd(Z_X, full_matrices=False)  # SVD of the standardized data Z_X; the third return value is V^T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "a296884a-2df5-4b34-b1ad-ffec488c2f37",
   "metadata": {},
   "outputs": [],
   "source": [
    "V_Z = V_Z.T  # transpose to obtain V_Z"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "a3bd573a-de20-4d7c-81a2-f88c2bcf0263",
   "metadata": {},
   "outputs": [],
   "source": [
    "S_Z = np.diag(S_Z)  # arrange the singular values as a diagonal matrix"
   ]
  },
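  {
   "cell_type": "markdown",
   "id": "note-svd-eig-relation",
   "metadata": {},
   "source": [
    "The three SVDs connect back to the three eigendecompositions computed earlier. A sketch of the relationships, where $V_{X_c}$ and $S_{X_c}$ denote the SVD factors of $X_c$ and $V_Z$, $S_Z$ those of $Z_X$:\n",
    "\n",
    "$$G = X^T X = V_X S_X^2 V_X^T, \\qquad \\Sigma = \\frac{1}{N-1} V_{X_c} S_{X_c}^2 V_{X_c}^T, \\qquad \\rho = \\frac{1}{N} V_Z S_Z^2 V_Z^T$$\n",
    "\n",
    "The $N-1$ factor assumes the pandas default used by `X_df.cov()`, and the $N$ factor assumes the default `ddof=0` of `zscore`. Under these assumptions the eigenvalues are scaled squared singular values, and the eigenvectors returned by `eig` should match the right singular vectors up to sign and ordering, since `eig` does not sort its eigenvalues."
   ]
  },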
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85a80909-2aac-49ed-bb7a-f8cc6b80ee7d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecd322f4-f919-4be2-adc3-69d28ef25e69",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}