# 正则表达式和 re 模块

## 正则表达式

[正则表达式](http://baike.baidu.com/view/94238.htm)是用来匹配字符串或者子串的一种模式，匹配的字符串可以很具体，也可以很一般化。

`Python` 标准库提供了 `re` 模块。

In [1]:

```py
import re

```

## re.match & re.search

在 `re` 模块中， `re.match` 和 `re.search` 是常用的两个方法：

```py
re.match(pattern, string[, flags])
re.search(pattern, string[, flags]) 
```

两者都寻找第一个匹配成功的部分，成功则返回一个 `match` 对象，不成功则返回 `None`，不同之处在于 `re.match` 只匹配字符串的开头部分，而 `re.search` 匹配的则是整个字符串中的子串。

## re.findall & re.finditer

`re.findall(pattern, string)` 返回所有匹配的对象， `re.finditer` 则返回一个迭代器。

## re.split

`re.split(pattern, string[, maxsplit])` 按照 `pattern` 指定的内容对字符串进行分割。

## re.sub

`re.sub(pattern, repl, string[, count])` 将 `pattern` 匹配的内容进行替换。

## re.compile

`re.compile(pattern)` 生成一个 `pattern` 对象，这个对象有匹配，替换，分割字符串的方法。

## 正则表达式规则

正则表达式由一些普通字符和一些元字符（metacharacters）组成。普通字符包括大小写的字母和数字，而元字符则具有特殊的含义：

| 子表达式 | 匹配内容 |
| --- | --- |
| `.` | 匹配除了换行符之外的内容 |
| `\w` | 匹配所有字母和数字字符 |
| `\d` | 匹配所有数字，相当于 `[0-9]` |
| `\s` | 匹配空白，相当于 `[\t\n\t\f\v]` |
| `\W,\D,\S` | 匹配对应小写字母形式的补 |
| `[...]` | 表示可以匹配的集合，支持范围表示如 `a-z`, `0-9` 等 |
| `(...)` | 表示作为一个整体进行匹配 |
| ¦ | 表示逻辑或 |
| `^` | 表示匹配后面的子表达式的补 |
| `*` | 表示匹配前面的子表达式 0 次或更多次 |
| `+` | 表示匹配前面的子表达式 1 次或更多次 |
| `?` | 表示匹配前面的子表达式 0 次或 1 次 |
| `{m}` | 表示匹配前面的子表达式 m 次 |
| `{m,}` | 表示匹配前面的子表达式至少 m 次 |
| `{m,n}` | 表示匹配前面的子表达式至少 m 次，至多 n 次 |

例如：

*   `ca*t 匹配： ct, cat, caaaat, ...`
*   `ab\d|ac\d 匹配： ab1, ac9, ...`
*   `([^a-q]bd) 匹配： rbd, 5bd, ...`

## 例子

假设我们要匹配这样的字符串：

In [2]:

```py
string = 'hello world'
pattern = 'hello (\w+)'

match = re.match(pattern, string)
print match

```

```py
<_sre.SRE_Match object at 0x0000000003A5DA80>

```

一旦找到了符合条件的部分，我们便可以使用 `group` 方法查看匹配的部分：

In [3]:

```py
if match is not None:
    print match.group(0)

```

```py
hello world

```

In [4]:

```py
if match is not None:
    print match.group(1)

```

```py
world

```

我们可以改变 string 的内容：

In [5]:

```py
string = 'hello there'
pattern = 'hello (\w+)'

match = re.match(pattern, string)
if match is not None:
    print match.group(0)
    print match.group(1)

```

```py
hello there
there

```

通常，`match.group(0)` 匹配整个返回的内容，之后的 `1,2,3,...` 返回规则中每个括号（按照括号的位置排序）匹配的部分。

如果某个 `pattern` 需要反复使用，那么我们可以将它预先编译：

In [6]:

```py
pattern1 = re.compile('hello (\w+)')

match = pattern1.match(string)
if match is not None:
    print match.group(1)

```

```py
there

```

由于元字符的存在，所以对于一些特殊字符，我们需要使用 `'\'` 进行逃逸字符的处理，使用表达式 `'\\'` 来匹配 `'\'` 。

但事实上，`Python` 本身对逃逸字符也是这样处理的：

In [7]:

```py
pattern = '\\'
print pattern

```

```py
\

```

因为逃逸字符的问题，我们需要使用四个 `'\\\\'` 来匹配一个单独的 `'\'`：

In [8]:

```py
pattern = '\\\\'
path = "C:\\foo\\bar\\baz.txt"
print re.split(pattern, path)

```

```py
['C:', 'foo', 'bar', 'baz.txt']

```

这样看起来十分麻烦，好在 `Python` 提供了 `raw string` 来忽略对逃逸字符串的处理，从而可以这样进行匹配：

In [9]:

```py
pattern = r'\\'
path = r"C:\foo\bar\baz.txt"
print re.split(pattern, path)

```

```py
['C:', 'foo', 'bar', 'baz.txt']

```

如果规则太多复杂，正则表达式不一定是个好选择。

## Numpy 的 fromregex()

In [10]:

```py
%%file test.dat 
1312 foo
1534    bar
444  qux

```

```py
Writing test.dat

```

```py
fromregex(file, pattern, dtype)
```

`dtype` 中的内容与 `pattern` 的括号一一对应：

In [11]:

```py
pattern = "(\d+)\s+(...)"
dt = [('num', 'int64'), ('key', 'S3')]

from numpy import fromregex
output = fromregex('test.dat', pattern, dt)
print output

```

```py
[(1312L, 'foo') (1534L, 'bar') (444L, 'qux')]

```

显示 `num` 项：

In [12]:

```py
print output['num']

```

```py
[1312 1534  444]

```

In [13]:

```py
import os
os.remove('test.dat')

```