mirror of
https://github.com/apachecn/ailearning.git
synced 2026-04-15 18:59:55 +08:00
735 lines
10 KiB
Markdown
735 lines
10 KiB
Markdown
# 数组读写
|
||
|
||
## 从文本中读取数组
|
||
|
||
In [1]:
|
||
|
||
```py
|
||
import numpy as np
|
||
|
||
```
|
||
|
||
### 空格(制表符)分割的文本
|
||
|
||
假设我们有这样的一个空白分割的文件:
|
||
|
||
In [2]:
|
||
|
||
```py
|
||
%%writefile myfile.txt
|
||
2.1 2.3 3.2 1.3 3.1
|
||
6.1 3.1 4.2 2.3 1.8
|
||
|
||
```
|
||
|
||
```py
|
||
Writing myfile.txt
|
||
|
||
```
|
||
|
||
为了生成数组,我们首先将数据转化成一个列表组成的列表,再将这个列表转换为数组:
|
||
|
||
In [3]:
|
||
|
||
```py
|
||
data = []
|
||
|
||
with open('myfile.txt') as f:
|
||
# 每次读一行
|
||
for line in f:
|
||
fileds = line.split()
|
||
row_data = [float(x) for x in fileds]
|
||
data.append(row_data)
|
||
|
||
data = np.array(data)
|
||
|
||
```
|
||
|
||
In [4]:
|
||
|
||
```py
|
||
data
|
||
|
||
```
|
||
|
||
Out[4]:
|
||
|
||
```py
|
||
array([[ 2.1, 2.3, 3.2, 1.3, 3.1],
|
||
[ 6.1, 3.1, 4.2, 2.3, 1.8]])
|
||
```
|
||
|
||
不过,更简便的是使用 `loadtxt` 方法:
|
||
|
||
In [5]:
|
||
|
||
```py
|
||
data = np.loadtxt('myfile.txt')
|
||
data
|
||
|
||
```
|
||
|
||
Out[5]:
|
||
|
||
```py
|
||
array([[ 2.1, 2.3, 3.2, 1.3, 3.1],
|
||
[ 6.1, 3.1, 4.2, 2.3, 1.8]])
|
||
```
|
||
|
||
### 逗号分隔文件
|
||
|
||
In [6]:
|
||
|
||
```py
|
||
%%writefile myfile.txt
|
||
2.1, 2.3, 3.2, 1.3, 3.1
|
||
6.1, 3.1, 4.2, 2.3, 1.8
|
||
|
||
```
|
||
|
||
```py
|
||
Overwriting myfile.txt
|
||
|
||
```
|
||
|
||
对于逗号分隔的文件(通常为`.csv`格式),我们可以稍微修改之前繁琐的过程,将 `split` 的参数变成 `','`即可。
|
||
|
||
不过,`loadtxt` 函数也可以读这样的文件,只需要制定分割符的参数即可:
|
||
|
||
In [7]:
|
||
|
||
```py
|
||
data = np.loadtxt('myfile.txt', delimiter=',')
|
||
data
|
||
|
||
```
|
||
|
||
Out[7]:
|
||
|
||
```py
|
||
array([[ 2.1, 2.3, 3.2, 1.3, 3.1],
|
||
[ 6.1, 3.1, 4.2, 2.3, 1.8]])
|
||
```
|
||
|
||
### loadtxt 函数
|
||
|
||
```py
|
||
loadtxt(fname, dtype=<type 'float'>,
|
||
comments='#', delimiter=None,
|
||
converters=None, skiprows=0,
|
||
usecols=None, unpack=False, ndmin=0)
|
||
```
|
||
|
||
`loadtxt` 有很多可选参数,其中 `delimiter` 就是刚才用到的分隔符参数。
|
||
|
||
`skiprows` 参数表示忽略开头的行数,可以用来读写含有标题的文本
|
||
|
||
In [8]:
|
||
|
||
```py
|
||
%%writefile myfile.txt
|
||
X Y Z MAG ANG
|
||
2.1 2.3 3.2 1.3 3.1
|
||
6.1 3.1 4.2 2.3 1.8
|
||
|
||
```
|
||
|
||
```py
|
||
Overwriting myfile.txt
|
||
|
||
```
|
||
|
||
In [9]:
|
||
|
||
```py
|
||
np.loadtxt('myfile.txt', skiprows=1)
|
||
|
||
```
|
||
|
||
Out[9]:
|
||
|
||
```py
|
||
array([[ 2.1, 2.3, 3.2, 1.3, 3.1],
|
||
[ 6.1, 3.1, 4.2, 2.3, 1.8]])
|
||
```
|
||
|
||
此外,有一个功能更为全面的 `genfromtxt` 函数,能处理更多的情况,但相应的速度和效率会慢一些。
|
||
|
||
```py
|
||
genfromtxt(fname, dtype=<type 'float'>, comments='#', delimiter=None,
|
||
skiprows=0, skip_header=0, skip_footer=0, converters=None,
|
||
missing='', missing_values=None, filling_values=None, usecols=None,
|
||
names=None, excludelist=None, deletechars=None, replace_space='_',
|
||
autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None,
|
||
usemask=False, loose=True, invalid_raise=True)
|
||
```
|
||
|
||
### loadtxt 的更多特性
|
||
|
||
对于这样一个文件:
|
||
|
||
In [10]:
|
||
|
||
```py
|
||
%%writefile myfile.txt
|
||
-- BEGINNING OF THE FILE
|
||
% Day, Month, Year, Skip, Power
|
||
01, 01, 2000, x876, 13 % wow!
|
||
% we don't want have Jan 03rd
|
||
04, 01, 2000, xfed, 55
|
||
|
||
```
|
||
|
||
```py
|
||
Overwriting myfile.txt
|
||
|
||
```
|
||
|
||
In [11]:
|
||
|
||
```py
|
||
data = np.loadtxt('myfile.txt',
|
||
skiprows=1, #忽略第一行
|
||
dtype=np.int, #数组类型
|
||
delimiter=',', #逗号分割
|
||
usecols=(0,1,2,4), #指定使用哪几列数据
|
||
comments='%' #百分号为注释符
|
||
)
|
||
data
|
||
|
||
```
|
||
|
||
Out[11]:
|
||
|
||
```py
|
||
array([[ 1, 1, 2000, 13],
|
||
[ 4, 1, 2000, 55]])
|
||
```
|
||
|
||
### loadtxt 自定义转换方法
|
||
|
||
In [12]:
|
||
|
||
```py
|
||
%%writefile myfile.txt
|
||
2010-01-01 2.3 3.2
|
||
2011-01-01 6.1 3.1
|
||
|
||
```
|
||
|
||
```py
|
||
Overwriting myfile.txt
|
||
|
||
```
|
||
|
||
假设我们的文本包含日期,我们可以使用 `datetime` 在 `loadtxt` 中处理:
|
||
|
||
In [13]:
|
||
|
||
```py
|
||
import datetime
|
||
|
||
def date_converter(s):
|
||
return datetime.datetime.strptime(s, "%Y-%m-%d")
|
||
|
||
data = np.loadtxt('myfile.txt',
|
||
dtype=np.object, #数据类型为对象
|
||
converters={0:date_converter, #第一列使用自定义转换方法
|
||
1:float, #第二第三使用浮点数转换
|
||
2:float})
|
||
|
||
data
|
||
|
||
```
|
||
|
||
Out[13]:
|
||
|
||
```py
|
||
array([[datetime.datetime(2010, 1, 1, 0, 0), 2.3, 3.2],
|
||
[datetime.datetime(2011, 1, 1, 0, 0), 6.1, 3.1]], dtype=object)
|
||
```
|
||
|
||
移除 `myfile.txt`:
|
||
|
||
In [14]:
|
||
|
||
```py
|
||
import os
|
||
os.remove('myfile.txt')
|
||
|
||
```
|
||
|
||
### 读写各种格式的文件
|
||
|
||
如下表所示:
|
||
|
||
| 文件格式 | 使用的包 | 函数 |
|
||
| --- | --- | --- |
|
||
| txt | numpy | loadtxt, genfromtxt, fromfile, savetxt, tofile |
|
||
| csv | csv | reader, writer |
|
||
| Matlab | scipy.io | loadmat, savemat |
|
||
| hdf | pytables, h5py | |
|
||
| NetCDF | netCDF4, scipy.io.netcdf | netCDF4.Dataset, scipy.io.netcdf.netcdf_file |
|
||
| **文件格式** | **使用的包** | **备注** |
|
||
| wav | scipy.io.wavfile | 音频文件 |
|
||
| jpeg,png,... | PIL, scipy.misc.pilutil | 图像文件 |
|
||
| fits | pyfits | 天文图像 |
|
||
|
||
此外, `pandas` ——一个用来处理时间序列的包中包含处理各种文件的方法,具体可参见它的文档:
|
||
|
||
[http://pandas.pydata.org/pandas-docs/stable/io.html](http://pandas.pydata.org/pandas-docs/stable/io.html)
|
||
|
||
## 将数组写入文件
|
||
|
||
`savetxt` 可以将数组写入文件,默认使用科学计数法的形式保存:
|
||
|
||
In [15]:
|
||
|
||
```py
|
||
data = np.array([[1,2],
|
||
[3,4]])
|
||
|
||
np.savetxt('out.txt', data)
|
||
|
||
```
|
||
|
||
In [16]:
|
||
|
||
```py
|
||
with open('out.txt') as f:
|
||
for line in f:
|
||
print line,
|
||
|
||
```
|
||
|
||
```py
|
||
1.000000000000000000e+00 2.000000000000000000e+00
|
||
3.000000000000000000e+00 4.000000000000000000e+00
|
||
|
||
```
|
||
|
||
也可以使用类似**C**语言中 `printf` 的方式指定输出的格式:
|
||
|
||
In [17]:
|
||
|
||
```py
|
||
data = np.array([[1,2],
|
||
[3,4]])
|
||
|
||
np.savetxt('out.txt', data, fmt="%d") #保存为整数
|
||
|
||
```
|
||
|
||
In [18]:
|
||
|
||
```py
|
||
with open('out.txt') as f:
|
||
for line in f:
|
||
print line,
|
||
|
||
```
|
||
|
||
```py
|
||
1 2
|
||
3 4
|
||
|
||
```
|
||
|
||
逗号分隔的输出:
|
||
|
||
In [19]:
|
||
|
||
```py
|
||
data = np.array([[1,2],
|
||
[3,4]])
|
||
|
||
np.savetxt('out.txt', data, fmt="%.2f", delimiter=',') #保存为2位小数的浮点数,用逗号分隔
|
||
|
||
```
|
||
|
||
In [20]:
|
||
|
||
```py
|
||
with open('out.txt') as f:
|
||
for line in f:
|
||
print line,
|
||
|
||
```
|
||
|
||
```py
|
||
1.00,2.00
|
||
3.00,4.00
|
||
|
||
```
|
||
|
||
复数值默认会加上括号:
|
||
|
||
In [21]:
|
||
|
||
```py
|
||
data = np.array([[1+1j,2],
|
||
[3,4]])
|
||
|
||
np.savetxt('out.txt', data, fmt="%.2f", delimiter=',') #保存为2位小数的浮点数,用逗号分隔
|
||
|
||
```
|
||
|
||
In [22]:
|
||
|
||
```py
|
||
with open('out.txt') as f:
|
||
for line in f:
|
||
print line,
|
||
|
||
```
|
||
|
||
```py
|
||
(1.00+1.00j), (2.00+0.00j)
|
||
(3.00+0.00j), (4.00+0.00j)
|
||
|
||
```
|
||
|
||
更多参数:
|
||
|
||
```py
|
||
savetxt(fname,
|
||
X,
|
||
fmt='%.18e',
|
||
delimiter=' ',
|
||
newline='\n',
|
||
header='',
|
||
footer='',
|
||
comments='# ')
|
||
```
|
||
|
||
移除 `out.txt`:
|
||
|
||
In [23]:
|
||
|
||
```py
|
||
import os
|
||
os.remove('out.txt')
|
||
|
||
```
|
||
|
||
## Numpy 二进制格式
|
||
|
||
数组可以储存成二进制格式,单个的数组保存为 `.npy` 格式,多个数组保存为多个`.npy`文件组成的 `.npz` 格式,每个 `.npy` 文件包含一个数组。
|
||
|
||
与文本格式不同,二进制格式保存了数组的 `shape, dtype` 信息,以便完全重构出保存的数组。
|
||
|
||
保存的方法:
|
||
|
||
* `save(file, arr)` 保存单个数组,`.npy` 格式
|
||
* `savez(file, *args, **kwds)` 保存多个数组,无压缩的 `.npz` 格式
|
||
* `savez_compressed(file, *args, **kwds)` 保存多个数组,有压缩的 `.npz` 格式
|
||
|
||
读取的方法:
|
||
|
||
* `load(file, mmap_mode=None)` 对于 `.npy`,返回保存的数组,对于 `.npz`,返回一个名称-数组对组成的字典。
|
||
|
||
### 单个数组的读写
|
||
|
||
In [24]:
|
||
|
||
```py
|
||
a = np.array([[1.0,2.0], [3.0,4.0]])
|
||
|
||
fname = 'afile.npy'
|
||
np.save(fname, a)
|
||
|
||
```
|
||
|
||
In [25]:
|
||
|
||
```py
|
||
aa = np.load(fname)
|
||
aa
|
||
|
||
```
|
||
|
||
Out[25]:
|
||
|
||
```py
|
||
array([[ 1., 2.],
|
||
[ 3., 4.]])
|
||
```
|
||
|
||
删除生成的文件:
|
||
|
||
In [26]:
|
||
|
||
```py
|
||
import os
|
||
os.remove('afile.npy')
|
||
|
||
```
|
||
|
||
### 二进制与文本大小比较
|
||
|
||
In [27]:
|
||
|
||
```py
|
||
a = np.arange(10000.)
|
||
|
||
```
|
||
|
||
保存为文本:
|
||
|
||
In [28]:
|
||
|
||
```py
|
||
np.savetxt('a.txt', a)
|
||
|
||
```
|
||
|
||
查看大小:
|
||
|
||
In [29]:
|
||
|
||
```py
|
||
import os
|
||
os.stat('a.txt').st_size
|
||
|
||
```
|
||
|
||
Out[29]:
|
||
|
||
```py
|
||
260000L
|
||
```
|
||
|
||
保存为二进制:
|
||
|
||
In [30]:
|
||
|
||
```py
|
||
np.save('a.npy', a)
|
||
|
||
```
|
||
|
||
查看大小:
|
||
|
||
In [31]:
|
||
|
||
```py
|
||
os.stat('a.npy').st_size
|
||
|
||
```
|
||
|
||
Out[31]:
|
||
|
||
```py
|
||
80080L
|
||
```
|
||
|
||
删除生成的文件:
|
||
|
||
In [32]:
|
||
|
||
```py
|
||
os.remove('a.npy')
|
||
os.remove('a.txt')
|
||
|
||
```
|
||
|
||
可以看到,二进制文件大约是文本文件的三分之一。
|
||
|
||
### 保存多个数组
|
||
|
||
In [33]:
|
||
|
||
```py
|
||
a = np.array([[1.0,2.0],
|
||
[3.0,4.0]])
|
||
b = np.arange(1000)
|
||
|
||
```
|
||
|
||
保存多个数组:
|
||
|
||
In [34]:
|
||
|
||
```py
|
||
np.savez('data.npz', a=a, b=b)
|
||
|
||
```
|
||
|
||
查看里面包含的文件:
|
||
|
||
In [35]:
|
||
|
||
```py
|
||
!unzip -l data.npz
|
||
|
||
```
|
||
|
||
```py
|
||
Archive: data.npz
|
||
Length Date Time Name
|
||
--------- ---------- ----- ----
|
||
112 2015/08/10 00:46 a.npy
|
||
4080 2015/08/10 00:46 b.npy
|
||
--------- -------
|
||
4192 2 files
|
||
|
||
```
|
||
|
||
载入数据:
|
||
|
||
In [36]:
|
||
|
||
```py
|
||
data = np.load('data.npz')
|
||
|
||
```
|
||
|
||
载入后可以像字典一样进行操作:
|
||
|
||
In [37]:
|
||
|
||
```py
|
||
data.keys()
|
||
|
||
```
|
||
|
||
Out[37]:
|
||
|
||
```py
|
||
['a', 'b']
|
||
```
|
||
|
||
In [38]:
|
||
|
||
```py
|
||
data['a']
|
||
|
||
```
|
||
|
||
Out[38]:
|
||
|
||
```py
|
||
array([[ 1., 2.],
|
||
[ 3., 4.]])
|
||
```
|
||
|
||
In [39]:
|
||
|
||
```py
|
||
data['b'].shape
|
||
|
||
```
|
||
|
||
Out[39]:
|
||
|
||
```py
|
||
(1000L,)
|
||
```
|
||
|
||
删除文件:
|
||
|
||
In [40]:
|
||
|
||
```py
|
||
# 要先删除 data,否则删除时会报错
|
||
del data
|
||
|
||
os.remove('data.npz')
|
||
|
||
```
|
||
|
||
### 压缩文件
|
||
|
||
当数据比较整齐时:
|
||
|
||
In [41]:
|
||
|
||
```py
|
||
a = np.arange(20000.)
|
||
|
||
```
|
||
|
||
无压缩大小:
|
||
|
||
In [42]:
|
||
|
||
```py
|
||
np.savez('a.npz', a=a)
|
||
os.stat('a.npz').st_size
|
||
|
||
```
|
||
|
||
Out[42]:
|
||
|
||
```py
|
||
160188L
|
||
```
|
||
|
||
有压缩大小:
|
||
|
||
In [43]:
|
||
|
||
```py
|
||
np.savez_compressed('a2.npz', a=a)
|
||
os.stat('a2.npz').st_size
|
||
|
||
```
|
||
|
||
Out[43]:
|
||
|
||
```py
|
||
26885L
|
||
```
|
||
|
||
大约有 6x 的压缩效果。
|
||
|
||
当数据比较混乱时:
|
||
|
||
In [44]:
|
||
|
||
```py
|
||
a = np.random.rand(20000.)
|
||
|
||
```
|
||
|
||
无压缩大小:
|
||
|
||
In [45]:
|
||
|
||
```py
|
||
np.savez('a.npz', a=a)
|
||
os.stat('a.npz').st_size
|
||
|
||
```
|
||
|
||
Out[45]:
|
||
|
||
```py
|
||
160188L
|
||
```
|
||
|
||
有压缩大小:
|
||
|
||
In [46]:
|
||
|
||
```py
|
||
np.savez_compressed('a2.npz', a=a)
|
||
os.stat('a2.npz').st_size
|
||
|
||
```
|
||
|
||
Out[46]:
|
||
|
||
```py
|
||
151105L
|
||
```
|
||
|
||
只有大约 1.06x 的压缩效果。
|
||
|
||
In [47]:
|
||
|
||
```py
|
||
os.remove('a.npz')
|
||
os.remove('a2.npz')
|
||
|
||
``` |