{ "cells": [ { "cell_type": "markdown", "id": "73bd968b-d970-4a05-94ef-4e7abf990827", "metadata": {}, "source": [ "Chapter 24\n", "\n", "# 数据分解\n", "Book_4《矩阵力量》 | 鸢尾花书：从加减乘除到机器学习 (第二版)" ] }, { "cell_type": "markdown", "id": "84c7e223-f9df-4575-a7d8-ab27812fea5c", "metadata": {}, "source": [ "这段代码对鸢尾花数据集的特征矩阵 $X$ 进行了多种矩阵操作和分解，以便分析数据的结构和特性。首先，代码计算了 $X$ 的 **Gram 矩阵** $G = X^T X$，展示了数据样本的内积关系。接着，基于 $G$ 构造了 **余弦相似度矩阵** $C$，通过对 $G$ 中的特征进行归一化处理，使得 $C$ 中的每个元素代表样本间的相似性。随后，代码计算数据的 **质心**（均值向量） $E(X)$ 并生成 **去均值数据矩阵** $X_c = X - E(X)$，使数据中心化，以便消除均值对数据的影响。\n", "\n", "代码进一步计算了 **协方差矩阵** $\\Sigma = \\frac{1}{N} X_c^T X_c$ 和 **相关矩阵** $\\rho$，分别表示特征间的协方差和标准化后的相关性。\n", "\n", "接下来，代码进行了多种矩阵分解，包括：\n", "\n", "1. **QR 分解**：将原始矩阵 $X$ 分解为一个正交矩阵 $Q$ 和一个上三角矩阵 $R$，满足 $X = QR$。\n", " \n", "2. **Cholesky 分解**：对 Gram 矩阵 $G$ 和协方差矩阵 $\\Sigma$ 进行分解，得到其对应的下三角矩阵 $L$，使得 $G = LL^T$ 和 $\\Sigma = LL^T$。\n", " \n", "3. **特征值分解**：对 Gram 矩阵 $G$、协方差矩阵 $\\Sigma$ 和相关矩阵 $\\rho$ 进行特征值分解，得到其特征值（对角化的 $\\Lambda$ 矩阵）和特征向量矩阵 $V$，满足 $G = V \\Lambda V^T$、$\\Sigma = V \\Lambda V^T$ 和 $\\rho = V \\Lambda V^T$。\n", "\n", "4. **奇异值分解（SVD）**：对原始数据 $X$、去均值数据 $X_c$ 和标准化数据 $Z_X$ 分别进行 SVD 分解，得到分解形式 $X = U S V^T$，其中 $U$ 和 $V$ 分别表示数据的左右奇异向量矩阵，$S$ 是包含奇异值的对角矩阵。SVD 提供了数据的主成分方向和大小，用于后续的主成分分析（PCA）或特征提取。\n", "\n", "通过这些分解和操作，代码全面分析了数据的相似性、协方差结构、相关性、主成分和特征方向，为理解数据的内在结构提供了重要参考。" ] }, { "cell_type": "code", "execution_count": 1, "id": "f95b3fca-b77f-4e10-a84d-62f6876de01a", "metadata": {}, "outputs": [], "source": [ "import numpy as np # 导入 numpy 进行数值计算\n", "import matplotlib.pyplot as plt # 导入 matplotlib 用于绘图\n", "import pandas as pd # 导入 pandas 进行数据操作\n", "from sklearn.datasets import load_iris # 从 sklearn 加载鸢尾花数据集" ] }, { "cell_type": "code", "execution_count": 2, "id": "c3ab92a3-7a2b-4655-b39f-41014f0bf0c3", "metadata": {}, "outputs": [], "source": [ "from numpy.linalg import inv # 导入 inv 函数用于矩阵求逆\n", "from scipy.stats import zscore # 导入 zscore 函数用于标准化\n", "from numpy.linalg import qr # 导入 qr 函数进行 QR 分解\n", "from numpy.linalg import cholesky as chol # 导入 cholesky 函数用于 Cholesky 分解\n", "from numpy.linalg import eig # 导入 eig 函数进行特征值分解\n", "from numpy.linalg import svd # 导入 svd 函数用于奇异值分解" ] }, { "cell_type": "markdown", "id": "c401295c-eabb-4713-94aa-2311ec8a972d", "metadata": {}, "source": [ "## 加载数据" ] }, { "cell_type": "code", "execution_count": 3, "id": "2397142d-b32e-465f-834f-65b06c127cea", "metadata": {}, "outputs": [], "source": [ "iris = load_iris() # 从 sklearn 加载鸢尾花数据集" ] }, { "cell_type": "code", "execution_count": 4, "id": "313bf88d-fcfd-4947-beb3-bcb2d7bc56e3", "metadata": {}, "outputs": [], "source": [ "X = iris.data # 特征矩阵 X\n", "y = iris.target # 目标标签 y" ] }, { "cell_type": "code", "execution_count": 5, "id": "91f49a78-e515-4201-9eb3-d8993b32471d", "metadata": {}, "outputs": [], "source": [ "feature_names = ['Sepal length, x1', 'Sepal width, x2',\n", " 'Petal length, x3', 'Petal width, x4'] # 特征名称" ] }, { "cell_type": "markdown", "id": "730bcf6c-1e64-4bda-85fe-80a14077e339", "metadata": {}, "source": [ "## 将特征矩阵 X 转换为 DataFrame 格式" ] }, { "cell_type": "code", "execution_count": 6, "id": "c86246d1-a83b-4495-a982-36e715eeae6e", "metadata": {}, "outputs": [], "source": [ "X_df = pd.DataFrame(X, columns=feature_names)" ] }, { "cell_type": "markdown", "id": "54cef963-570f-47c3-b5dd-222fdc550707", "metadata": {}, "source": [ "## 原始数据 X" ] }, { "cell_type": "code", "execution_count": 7, "id": "802ac1d9-148c-4e3b-b00f-9f1ee45648d1", "metadata": {}, "outputs": [], "source": [ "X = X_df.to_numpy() # 转换 DataFrame 为 numpy 数组" ] }, { "cell_type": "markdown", "id": "877eeaab-7719-4b83-b829-280143f81c5a", "metadata": {}, "source": [ "## Gram 矩阵 G" ] }, { "cell_type": "code", "execution_count": 8, "id": "775d5366-beec-4349-bae2-7726b9735f6e", "metadata": {}, "outputs": [], "source": [ "G = X.T @ X # 计算 Gram 矩阵，G = X^T X" ] }, { "cell_type": "markdown", "id": "967a5f97-96e3-43c5-9527-76d2879e8f03", "metadata": {}, "source": [ "## 余弦相似度矩阵 C" ] }, { "cell_type": "code", "execution_count": 9, "id": "3639f25d-1ad5-4646-bfd5-675cf79fd512", "metadata": {}, "outputs": [], "source": [ "# 使用特征范数计算相似度\n", "S_norm = np.diag(np.sqrt(np.diag(G))) # 生成缩放矩阵，对角线元素为每列特征的范数" ] }, { "cell_type": "code", "execution_count": 10, "id": "2052f952-8ee5-4914-9bb9-33bce33b92b5", "metadata": {}, "outputs": [], "source": [ "C = inv(S_norm) @ G @ inv(S_norm) # 计算余弦相似度矩阵 C" ] }, { "cell_type": "markdown", "id": "fa9f5362-0b41-4a6c-8ca1-2324672bd552", "metadata": {}, "source": [ "## 数据矩阵的质心 E(X)" ] }, { "cell_type": "code", "execution_count": 11, "id": "7e51dd81-5415-46ee-be03-9bbfba20be8a", "metadata": {}, "outputs": [], "source": [ "E_X = X_df.mean().to_frame().T # 计算 X 的均值，并转换为单行 DataFrame" ] }, { "cell_type": "markdown", "id": "83005aa2-ba44-4ee6-8bcd-c70f86d0c366", "metadata": {}, "source": [ "## 数据去均值处理 X_c" ] }, { "cell_type": "code", "execution_count": 12, "id": "e98077f7-6426-427d-a6f4-8a85bef517c4", "metadata": {}, "outputs": [], "source": [ "X_c = X_df.sub(X_df.mean()) # 每列减去其均值，得到去均值矩阵 X_c" ] }, { "cell_type": "markdown", "id": "3f96ef51-d242-4f3e-a95d-c4174d394d63", "metadata": {}, "source": [ "## 协方差矩阵 Sigma" ] }, { "cell_type": "code", "execution_count": 13, "id": "8aae2ef4-1803-4522-9d72-0749936c3bab", "metadata": {}, "outputs": [], "source": [ "SIGMA = X_df.cov() # 计算协方差矩阵 SIGMA" ] }, { "cell_type": "markdown", "id": "2aee8ac6-dd46-446f-82d7-4f953a3b03aa", "metadata": {}, "source": [ "## 相关矩阵 P" ] }, { "cell_type": "code", "execution_count": 14, "id": "b79c796e-be1e-4ace-be31-87af4c1daf84", "metadata": {}, "outputs": [], "source": [ "RHO = X_df.corr() # 计算相关矩阵 RHO" ] }, { "cell_type": "markdown", "id": "b087fa60-77ef-4280-8e20-d4efa134ba69", "metadata": {}, "source": [ "## 数据标准化 Z_X" ] }, { "cell_type": "code", "execution_count": 15, "id": "8b473969-650c-4c52-ae8d-f89ce7b9dce0", "metadata": {}, "outputs": [], "source": [ "Z_X = zscore(X_df) # 对 X 的每列标准化，使其均值为 0，标准差为 1" ] }, { "cell_type": "markdown", "id": "a70b1f17-e178-4f52-9346-e330986b22b0", "metadata": {}, "source": [ "## QR 分解" ] }, { "cell_type": "code", "execution_count": 16, "id": "fc15ec6d-931a-4472-b1d0-7d1aaa1285dd", "metadata": {}, "outputs": [], "source": [ "Q, R = qr(X_df, mode='reduced') # 对 X 进行 QR 分解，mode='reduced' 保留最小矩阵维度" ] }, { "cell_type": "markdown", "id": "d1977ba7-04d1-4d11-b9a4-dc5b0e1ff5c8", "metadata": {}, "source": [ "## Cholesky 分解" ] }, { "cell_type": "code", "execution_count": 17, "id": "7f129525-f4e7-48b7-ba19-bc49ad4515e3", "metadata": {}, "outputs": [], "source": [ "L_G = chol(G) # 对 Gram 矩阵 G 进行 Cholesky 分解，得到下三角矩阵 L_G" ] }, { "cell_type": "code", "execution_count": 18, "id": "293cfc24-8d6c-48b5-8b55-0fd2173f10de", "metadata": {}, "outputs": [], "source": [ "R_G = L_G.T # 上三角矩阵 R_G 为 L_G 的转置" ] }, { "cell_type": "markdown", "id": "44091d77-46f4-416f-8305-d66438fc9d62", "metadata": {}, "source": [ "## 协方差矩阵 Sigma 的 Cholesky 分解" ] }, { "cell_type": "code", "execution_count": 19, "id": "959d25e3-0ab4-40c1-b79d-f6388712b381", "metadata": {}, "outputs": [], "source": [ "L_Sigma = chol(SIGMA) # 对协方差矩阵 SIGMA 进行 Cholesky 分解" ] }, { "cell_type": "code", "execution_count": 20, "id": "94d8ce95-d99d-43a1-b244-402cf2e07f28", "metadata": {}, "outputs": [], "source": [ "R_Sigma = L_Sigma.T # 上三角矩阵 R_Sigma 为 L_Sigma 的转置" ] }, { "cell_type": "markdown", "id": "4f02b402-a8bc-464a-9006-4dca94b9b219", "metadata": {}, "source": [ "## Gram 矩阵 G 的特征值分解" ] }, { "cell_type": "code", "execution_count": 21, "id": "b63e3321-afb5-4b9e-ac92-a74c39537169", "metadata": {}, "outputs": [], "source": [ "Lambs_G, V_G = eig(G) # 对 G 进行特征值分解，得到特征值 Lambs_G 和特征向量 V_G" ] }, { "cell_type": "code", "execution_count": 22, "id": "40135be7-b292-4765-8a8b-866b4f22418f", "metadata": {}, "outputs": [], "source": [ "Lambs_G = np.diag(Lambs_G) # 将特征值转换为对角矩阵形式" ] }, { "cell_type": "markdown", "id": "559af984-f9b8-4f34-8c69-af3035dfb589", "metadata": {}, "source": [ "## 协方差矩阵 Sigma 的特征值分解" ] }, { "cell_type": "code", "execution_count": 23, "id": "0280dc12-1245-4b5d-8099-8fe3abc94ccc", "metadata": {}, "outputs": [], "source": [ "Lambs_sigma, V_sigma = eig(SIGMA) # 对 SIGMA 进行特征值分解" ] }, { "cell_type": "code", "execution_count": 24, "id": "7941e036-5e24-4a15-8612-20dfd3d60779", "metadata": {}, "outputs": [], "source": [ "Lambs_sigma = np.diag(Lambs_sigma) # 将特征值转换为对角矩阵形式" ] }, { "cell_type": "markdown", "id": "d387720b-43f9-4c24-a918-48eb686acb8a", "metadata": {}, "source": [ "## 相关矩阵 P 的特征值分解" ] }, { "cell_type": "code", "execution_count": 25, "id": "7227d988-df95-44f1-9986-bd3f0922d7da", "metadata": {}, "outputs": [], "source": [ "Lambs_P, V_P = eig(RHO) # 对相关矩阵 RHO 进行特征值分解" ] }, { "cell_type": "code", "execution_count": 26, "id": "a8a40ce6-cdde-4fea-be3e-1d34367b9aac", "metadata": {}, "outputs": [], "source": [ "Lambs_P = np.diag(Lambs_P) # 将特征值转换为对角矩阵形式" ] }, { "cell_type": "markdown", "id": "2891d64a-1faa-47a6-bcba-5c56611e96cf", "metadata": {}, "source": [ "## 原始数据 X 的 SVD 分解" ] }, { "cell_type": "code", "execution_count": 27, "id": "37ffec3a-f1ea-45dc-9fa0-be3e27bc15e7", "metadata": {}, "outputs": [], "source": [ "U_X, S_X_, V_X = svd(X_df, full_matrices=False) # 对 X 进行 SVD 分解" ] }, { "cell_type": "code", "execution_count": 28, "id": "472ff2d6-e2fe-4ba5-b341-cc13ebf416fe", "metadata": {}, "outputs": [], "source": [ "V_X = V_X.T # 转置 V_X" ] }, { "cell_type": "code", "execution_count": 29, "id": "fd6103a7-12b4-431f-9ad5-1d7d4a3c6564", "metadata": {}, "outputs": [], "source": [ "S_X = np.diag(S_X_) # 将奇异值转换为对角矩阵" ] }, { "cell_type": "markdown", "id": "93dc741b-c4b5-48ae-a32e-8b80bafa2ae6", "metadata": {}, "source": [ "## 去均值数据 X_c 的 SVD 分解" ] }, { "cell_type": "code", "execution_count": 30, "id": "58e1f3b8-6838-410e-bbfa-1b538b8713c1", "metadata": {}, "outputs": [], "source": [ "U_Xc, S_Xc, V_Xc = svd(X_c, full_matrices=False) # 对去均值后的数据 X_c 进行 SVD 分解" ] }, { "cell_type": "code", "execution_count": 31, "id": "98a5eb4a-8112-4021-9339-27779343347c", "metadata": {}, "outputs": [], "source": [ "V_Xc = V_Xc.T # 转置 V_Xc" ] }, { "cell_type": "code", "execution_count": 32, "id": "fb80d669-ea06-4ee3-a74e-1e3b8a00e954", "metadata": {}, "outputs": [], "source": [ "S_Xc = np.diag(S_Xc) # 将奇异值转换为对角矩阵" ] }, { "cell_type": "markdown", "id": "9b015450-0229-445f-bf61-487d53fa3efe", "metadata": {}, "source": [ "## 标准化数据 Z_X 的 SVD 分解" ] }, { "cell_type": "code", "execution_count": 33, "id": "689bb4cf-2b59-45ca-b8e4-acaf31d8e139", "metadata": {}, "outputs": [], "source": [ "U_Z, S_Z, V_Z = svd(Z_X, full_matrices=False) # 对标准化后的数据 Z_X 进行 SVD 分解" ] }, { "cell_type": "code", "execution_count": 34, "id": "a296884a-2df5-4b34-b1ad-ffec488c2f37", "metadata": {}, "outputs": [], "source": [ "V_Z = V_Z.T # 转置 V_Z" ] }, { "cell_type": "code", "execution_count": 35, "id": "a3bd573a-de20-4d7c-81a2-f88c2bcf0263", "metadata": {}, "outputs": [], "source": [ "S_Z = np.diag(S_Z) # 将奇异值转换为对角矩阵" ] }, { "cell_type": "code", "execution_count": null, "id": "85a80909-2aac-49ed-bb7a-f8cc6b80ee7d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ecd322f4-f919-4be2-adc3-69d28ef25e69", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }