{ "cells": [ { "cell_type": "markdown", "id": "73bd968b-d970-4a05-94ef-4e7abf990827", "metadata": {}, "source": [ "Chapter 02\n", "\n", "# 余弦距离\n", "Book_4《矩阵力量》 | 鸢尾花书:从加减乘除到机器学习 (第二版)" ] }, { "cell_type": "markdown", "id": "a4d2d126-5db0-4038-a1dc-9a83708d2220", "metadata": {}, "source": [ "此代码加载了鸢尾花数据集,并提取了数据集中的几个特定数据点以计算它们之间的余弦距离和夹角余弦值。选取的特征包含所有特征,而数据点分别为第1、2、51、101个观测。\n", "\n", "### 计算公式\n", "代码使用余弦距离和余弦相似度来度量两个向量之间的相似性。\n", "\n", "1. 余弦距离公式:\n", "\n", "$$\n", "\\text{cosine\\_distance} = 1 - \\frac{x_1 \\cdot x_2}{\\|x_1\\| \\|x_2\\|}\n", "$$\n", "\n", "2. 向量间夹角的余弦值:\n", "\n", "$$\n", "\\cos \\theta = \\frac{x_1 \\cdot x_2}{\\|x_1\\| \\|x_2\\|}\n", "$$\n", "\n", "余弦距离反映了两个向量间的角度差异,而夹角余弦值则用于计算向量的相似性。" ] }, { "cell_type": "markdown", "id": "6aec7dfe-a3d4-4cb9-b921-b3a137455aca", "metadata": {}, "source": [ "## 导入所需库" ] }, { "cell_type": "code", "execution_count": 1, "id": "83ff9c7b-6adb-4fe1-91ef-4325850851b2", "metadata": {}, "outputs": [], "source": [ "from scipy.spatial import distance # 导入SciPy库,用于计算向量间的距离\n", "from sklearn import datasets # 导入sklearn数据集模块\n", "import numpy as np # 导入NumPy库,用于数值计算" ] }, { "cell_type": "markdown", "id": "4723d4a7-92b7-4c8d-8dcc-e678a7daded0", "metadata": {}, "source": [ "## 导入鸢尾花数据集" ] }, { "cell_type": "code", "execution_count": 2, "id": "f4654f70-41cf-40b9-ad1d-5e39c337a40a", "metadata": {}, "outputs": [], "source": [ "iris = datasets.load_iris() # 加载鸢尾花数据集" ] }, { "cell_type": "markdown", "id": "a7d90350-e6b8-4f35-a30f-f17e3042bbfa", "metadata": {}, "source": [ "## 使用前两个特征:萼片长度和萼片宽度" ] }, { "cell_type": "code", "execution_count": 3, "id": "111624e1-5071-43b5-ad7f-f91c1ea86f1f", "metadata": {}, "outputs": [], "source": [ "X = iris.data[:, :] # 提取所有特征数据" ] }, { "cell_type": "markdown", "id": "08b685c0-3e21-4c99-8f19-d15fc50ba309", "metadata": {}, "source": [ "## 提取4个数据点" ] }, { "cell_type": "code", "execution_count": 4, "id": "af961fce-c345-4d97-a8d8-dc1fd27f75e6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([5.1, 3.5, 1.4, 0.2])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1_data = X[0, :] # 第一个数据点\n", "x1_data" ] }, { "cell_type": "code", "execution_count": 5, "id": "98593e09-12b0-43a1-81c0-97751c3f44cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([4.9, 3. , 1.4, 0.2])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x2_data = X[1, :] # 第二个数据点\n", "x2_data" ] }, { "cell_type": "code", "execution_count": 6, "id": "261efd1f-06bc-4361-996b-62a2f621234a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([7. , 3.2, 4.7, 1.4])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x51_data = X[50, :] # 第51个数据点\n", "x51_data" ] }, { "cell_type": "code", "execution_count": 7, "id": "637692dd-8df7-472c-97e5-c1be1475c891", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([6.3, 3.3, 6. , 2.5])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x101_data = X[100, :] # 第101个数据点\n", "x101_data" ] }, { "cell_type": "markdown", "id": "a14578f6-7866-4fea-8354-2633c0a50df7", "metadata": {}, "source": [ "## 计算余弦距离和夹角余弦值" ] }, { "cell_type": "code", "execution_count": 8, "id": "d73204a7-6235-4449-a3ea-90d677c04fe6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0014208364959781283" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1_x2_cos_dist = distance.cosine(x1_data, x2_data) \n", "# 计算x1和x2的余弦距离\n", "x1_x2_cos_dist" ] }, { "cell_type": "code", "execution_count": 9, "id": "1b3df27e-df51-4564-a192-b3d0544fd820", "metadata": {}, "outputs": [], "source": [ "x1_norm = np.linalg.norm(x1_data) # 计算x1的L2范数" ] }, { "cell_type": "code", "execution_count": 10, "id": "efb9a6c3-ccfa-4000-8135-677fdd68c81d", "metadata": {}, "outputs": [], "source": [ "x2_norm = np.linalg.norm(x2_data) # 计算x2的L2范数" ] }, { "cell_type": "code", "execution_count": 11, "id": "06c48d91-451f-404c-8e11-6006943c4a38", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "37.489999999999995" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1_dot_x2 = x1_data.T @ x2_data # 计算x1和x2的点积\n", "x1_dot_x2" ] }, { "cell_type": "code", "execution_count": 12, "id": "5818da3f-985c-4c48-8fc9-159ec8067f2e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9985791635040218" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1_x2_cos = x1_dot_x2 / x1_norm / x2_norm # 计算x1和x2的夹角余弦值\n", "x1_x2_cos" ] }, { "cell_type": "code", "execution_count": 16, "id": "f3e0969d-ffbf-4c10-95ee-69898c57dc0c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0014208364959782394" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 - x1_x2_cos" ] }, { "cell_type": "code", "execution_count": 15, "id": "62c0e4a9-c455-4290-9389-a2ee56c2adc1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0014208364959781283" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1_x2_cos_dist = distance.cosine(x1_data, x2_data) # 计算x1和x2的余弦距离\n", "x1_x2_cos_dist" ] }, { "cell_type": "code", "execution_count": 13, "id": "46e0f1db-9989-4908-89de-ffe0f0a8f431", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.07161964128508791" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1_x51_cos_dist = distance.cosine(x1_data, x51_data) # 计算x1和x51的余弦距离\n", "x1_x51_cos_dist" ] }, { "cell_type": "code", "execution_count": 14, "id": "0f516143-46a8-42ba-8e31-fe4b08d428de", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1399186683412712" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1_x101_cos_dist = distance.cosine(x1_data, x101_data) # 计算x1和x101的余弦距离\n", "x1_x101_cos_dist" ] }, { "cell_type": "code", "execution_count": null, "id": "85a80909-2aac-49ed-bb7a-f8cc6b80ee7d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ecd322f4-f919-4be2-adc3-69d28ef25e69", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }