Book4_Power-of-Matrix/Book4_Ch24_Python_Codes/Bk4_Ch24_01.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "73bd968b-d970-4a05-94ef-4e7abf990827",
   "metadata": {},
   "source": [
    "Chapter 24\n",
    "\n",
    "# 数据分解\n",
    "Book_4《矩阵力量》 | 鸢尾花书：从加减乘除到机器学习 (第二版)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84c7e223-f9df-4575-a7d8-ab27812fea5c",
   "metadata": {},
   "source": [
    "这段代码对鸢尾花数据集的特征矩阵 $X$ 进行了多种矩阵操作和分解，以便分析数据的结构和特性。首先，代码计算了 $X$ 的 **Gram 矩阵** $G = X^T X$，展示了数据样本的内积关系。接着，基于 $G$ 构造了 **余弦相似度矩阵** $C$，通过对 $G$ 中的特征进行归一化处理，使得 $C$ 中的每个元素代表样本间的相似性。随后，代码计算数据的 **质心**（均值向量） $E(X)$ 并生成 **去均值数据矩阵** $X_c = X - E(X)$，使数据中心化，以便消除均值对数据的影响。\n",
    "\n",
    "代码进一步计算了 **协方差矩阵** $\\Sigma = \\frac{1}{N} X_c^T X_c$ 和 **相关矩阵** $\\rho$，分别表示特征间的协方差和标准化后的相关性。\n",
    "\n",
    "接下来，代码进行了多种矩阵分解，包括：\n",
    "\n",
    "1. **QR 分解**：将原始矩阵 $X$ 分解为一个正交矩阵 $Q$ 和一个上三角矩阵 $R$，满足 $X = QR$。\n",
    "  \n",
    "2. **Cholesky 分解**：对 Gram 矩阵 $G$ 和协方差矩阵 $\\Sigma$ 进行分解，得到其对应的下三角矩阵 $L$，使得 $G = LL^T$ 和 $\\Sigma = LL^T$。\n",
    "  \n",
    "3. **特征值分解**：对 Gram 矩阵 $G$、协方差矩阵 $\\Sigma$ 和相关矩阵 $\\rho$ 进行特征值分解，得到其特征值（对角化的 $\\Lambda$ 矩阵）和特征向量矩阵 $V$，满足 $G = V \\Lambda V^T$、$\\Sigma = V \\Lambda V^T$ 和 $\\rho = V \\Lambda V^T$。\n",
    "\n",
    "4. **奇异值分解（SVD）**：对原始数据 $X$、去均值数据 $X_c$ 和标准化数据 $Z_X$ 分别进行 SVD 分解，得到分解形式 $X = U S V^T$，其中 $U$ 和 $V$ 分别表示数据的左右奇异向量矩阵，$S$ 是包含奇异值的对角矩阵。SVD 提供了数据的主成分方向和大小，用于后续的主成分分析（PCA）或特征提取。\n",
    "\n",
    "通过这些分解和操作，代码全面分析了数据的相似性、协方差结构、相关性、主成分和特征方向，为理解数据的内在结构提供了重要参考。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f95b3fca-b77f-4e10-a84d-62f6876de01a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np  # 导入 numpy 进行数值计算\n",
    "import matplotlib.pyplot as plt  # 导入 matplotlib 用于绘图\n",
    "import pandas as pd  # 导入 pandas 进行数据操作\n",
    "from sklearn.datasets import load_iris  # 从 sklearn 加载鸢尾花数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c3ab92a3-7a2b-4655-b39f-41014f0bf0c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.linalg import inv  # 导入 inv 函数用于矩阵求逆\n",
    "from scipy.stats import zscore  # 导入 zscore 函数用于标准化\n",
    "from numpy.linalg import qr  # 导入 qr 函数进行 QR 分解\n",
    "from numpy.linalg import cholesky as chol  # 导入 cholesky 函数用于 Cholesky 分解\n",
    "from numpy.linalg import eig  # 导入 eig 函数进行特征值分解\n",
    "from numpy.linalg import svd  # 导入 svd 函数用于奇异值分解"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c401295c-eabb-4713-94aa-2311ec8a972d",
   "metadata": {},
   "source": [
    "## 加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2397142d-b32e-465f-834f-65b06c127cea",
   "metadata": {},
   "outputs": [],
   "source": [
    "iris = load_iris()  # 从 sklearn 加载鸢尾花数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "313bf88d-fcfd-4947-beb3-bcb2d7bc56e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = iris.data  # 特征矩阵 X\n",
    "y = iris.target  # 目标标签 y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "91f49a78-e515-4201-9eb3-d8993b32471d",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_names = ['Sepal length, x1', 'Sepal width, x2',\n",
    "                 'Petal length, x3', 'Petal width, x4']  # 特征名称"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "730bcf6c-1e64-4bda-85fe-80a14077e339",
   "metadata": {},
   "source": [
    "## 将特征矩阵 X 转换为 DataFrame 格式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c86246d1-a83b-4495-a982-36e715eeae6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_df = pd.DataFrame(X, columns=feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54cef963-570f-47c3-b5dd-222fdc550707",
   "metadata": {},
   "source": [
    "## 原始数据 X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "802ac1d9-148c-4e3b-b00f-9f1ee45648d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = X_df.to_numpy()  # 转换 DataFrame 为 numpy 数组"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "877eeaab-7719-4b83-b829-280143f81c5a",
   "metadata": {},
   "source": [
    "## Gram 矩阵 G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "775d5366-beec-4349-bae2-7726b9735f6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "G = X.T @ X  # 计算 Gram 矩阵，G = X^T X"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "967a5f97-96e3-43c5-9527-76d2879e8f03",
   "metadata": {},
   "source": [
    "## 余弦相似度矩阵 C"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3639f25d-1ad5-4646-bfd5-675cf79fd512",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 使用特征范数计算相似度\n",
    "S_norm = np.diag(np.sqrt(np.diag(G)))  # 生成缩放矩阵，对角线元素为每列特征的范数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2052f952-8ee5-4914-9bb9-33bce33b92b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "C = inv(S_norm) @ G @ inv(S_norm)  # 计算余弦相似度矩阵 C"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa9f5362-0b41-4a6c-8ca1-2324672bd552",
   "metadata": {},
   "source": [
    "## 数据矩阵的质心 E(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7e51dd81-5415-46ee-be03-9bbfba20be8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "E_X = X_df.mean().to_frame().T  # 计算 X 的均值，并转换为单行 DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83005aa2-ba44-4ee6-8bcd-c70f86d0c366",
   "metadata": {},
   "source": [
    "## 数据去均值处理 X_c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e98077f7-6426-427d-a6f4-8a85bef517c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_c = X_df.sub(X_df.mean())  # 每列减去其均值，得到去均值矩阵 X_c"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f96ef51-d242-4f3e-a95d-c4174d394d63",
   "metadata": {},
   "source": [
    "## 协方差矩阵 Sigma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8aae2ef4-1803-4522-9d72-0749936c3bab",
   "metadata": {},
   "outputs": [],
   "source": [
    "SIGMA = X_df.cov()  # 计算协方差矩阵 SIGMA"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2aee8ac6-dd46-446f-82d7-4f953a3b03aa",
   "metadata": {},
   "source": [
    "## 相关矩阵 P"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b79c796e-be1e-4ace-be31-87af4c1daf84",
   "metadata": {},
   "outputs": [],
   "source": [
    "RHO = X_df.corr()  # 计算相关矩阵 RHO"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b087fa60-77ef-4280-8e20-d4efa134ba69",
   "metadata": {},
   "source": [
    "## 数据标准化 Z_X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8b473969-650c-4c52-ae8d-f89ce7b9dce0",
   "metadata": {},
   "outputs": [],
   "source": [
    "Z_X = zscore(X_df)  # 对 X 的每列标准化，使其均值为 0，标准差为 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a70b1f17-e178-4f52-9346-e330986b22b0",
   "metadata": {},
   "source": [
    "## QR 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "fc15ec6d-931a-4472-b1d0-7d1aaa1285dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "Q, R = qr(X_df, mode='reduced')  # 对 X 进行 QR 分解，mode='reduced' 保留最小矩阵维度"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1977ba7-04d1-4d11-b9a4-dc5b0e1ff5c8",
   "metadata": {},
   "source": [
    "## Cholesky 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "7f129525-f4e7-48b7-ba19-bc49ad4515e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "L_G = chol(G)  # 对 Gram 矩阵 G 进行 Cholesky 分解，得到下三角矩阵 L_G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "293cfc24-8d6c-48b5-8b55-0fd2173f10de",
   "metadata": {},
   "outputs": [],
   "source": [
    "R_G = L_G.T  # 上三角矩阵 R_G 为 L_G 的转置"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44091d77-46f4-416f-8305-d66438fc9d62",
   "metadata": {},
   "source": [
    "## 协方差矩阵 Sigma 的 Cholesky 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "959d25e3-0ab4-40c1-b79d-f6388712b381",
   "metadata": {},
   "outputs": [],
   "source": [
    "L_Sigma = chol(SIGMA)  # 对协方差矩阵 SIGMA 进行 Cholesky 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "94d8ce95-d99d-43a1-b244-402cf2e07f28",
   "metadata": {},
   "outputs": [],
   "source": [
    "R_Sigma = L_Sigma.T  # 上三角矩阵 R_Sigma 为 L_Sigma 的转置"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f02b402-a8bc-464a-9006-4dca94b9b219",
   "metadata": {},
   "source": [
    "## Gram 矩阵 G 的特征值分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "b63e3321-afb5-4b9e-ac92-a74c39537169",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_G, V_G = eig(G)  # 对 G 进行特征值分解，得到特征值 Lambs_G 和特征向量 V_G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "40135be7-b292-4765-8a8b-866b4f22418f",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_G = np.diag(Lambs_G)  # 将特征值转换为对角矩阵形式"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "559af984-f9b8-4f34-8c69-af3035dfb589",
   "metadata": {},
   "source": [
    "## 协方差矩阵 Sigma 的特征值分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "0280dc12-1245-4b5d-8099-8fe3abc94ccc",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_sigma, V_sigma = eig(SIGMA)  # 对 SIGMA 进行特征值分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "7941e036-5e24-4a15-8612-20dfd3d60779",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_sigma = np.diag(Lambs_sigma)  # 将特征值转换为对角矩阵形式"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d387720b-43f9-4c24-a918-48eb686acb8a",
   "metadata": {},
   "source": [
    "## 相关矩阵 P 的特征值分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "7227d988-df95-44f1-9986-bd3f0922d7da",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_P, V_P = eig(RHO)  # 对相关矩阵 RHO 进行特征值分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "a8a40ce6-cdde-4fea-be3e-1d34367b9aac",
   "metadata": {},
   "outputs": [],
   "source": [
    "Lambs_P = np.diag(Lambs_P)  # 将特征值转换为对角矩阵形式"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2891d64a-1faa-47a6-bcba-5c56611e96cf",
   "metadata": {},
   "source": [
    "## 原始数据 X 的 SVD 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "37ffec3a-f1ea-45dc-9fa0-be3e27bc15e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "U_X, S_X_, V_X = svd(X_df, full_matrices=False)  # 对 X 进行 SVD 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "472ff2d6-e2fe-4ba5-b341-cc13ebf416fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "V_X = V_X.T  # 转置 V_X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "fd6103a7-12b4-431f-9ad5-1d7d4a3c6564",
   "metadata": {},
   "outputs": [],
   "source": [
    "S_X = np.diag(S_X_)  # 将奇异值转换为对角矩阵"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93dc741b-c4b5-48ae-a32e-8b80bafa2ae6",
   "metadata": {},
   "source": [
    "## 去均值数据 X_c 的 SVD 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "58e1f3b8-6838-410e-bbfa-1b538b8713c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "U_Xc, S_Xc, V_Xc = svd(X_c, full_matrices=False)  # 对去均值后的数据 X_c 进行 SVD 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "98a5eb4a-8112-4021-9339-27779343347c",
   "metadata": {},
   "outputs": [],
   "source": [
    "V_Xc = V_Xc.T  # 转置 V_Xc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "fb80d669-ea06-4ee3-a74e-1e3b8a00e954",
   "metadata": {},
   "outputs": [],
   "source": [
    "S_Xc = np.diag(S_Xc)  # 将奇异值转换为对角矩阵"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b015450-0229-445f-bf61-487d53fa3efe",
   "metadata": {},
   "source": [
    "## 标准化数据 Z_X 的 SVD 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "689bb4cf-2b59-45ca-b8e4-acaf31d8e139",
   "metadata": {},
   "outputs": [],
   "source": [
    "U_Z, S_Z, V_Z = svd(Z_X, full_matrices=False)  # 对标准化后的数据 Z_X 进行 SVD 分解"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "a296884a-2df5-4b34-b1ad-ffec488c2f37",
   "metadata": {},
   "outputs": [],
   "source": [
    "V_Z = V_Z.T  # 转置 V_Z"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "a3bd573a-de20-4d7c-81a2-f88c2bcf0263",
   "metadata": {},
   "outputs": [],
   "source": [
    "S_Z = np.diag(S_Z)  # 将奇异值转换为对角矩阵"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85a80909-2aac-49ed-bb7a-f8cc6b80ee7d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecd322f4-f919-4be2-adc3-69d28ef25e69",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}