Content-Type: text/x-zim-wiki
Wiki-Format: zim 0.4
Creation-Date: 2012-11-29T14:31:19+08:00

====== re ======
Created Thursday 29 November 2012

http://docs.python.org/2/library/re.html

7.2. re: Regular expression operations

This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings to be searched can be **Unicode strings** (which means re can match Chinese text) as well as **8-bit strings**.

Regular expressions use the backslash character **('\')** to indicate special forms or to allow special characters to be used without invoking their special meaning. This __collides__ with Python's usage of the same character for the same purpose in __string literals__; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.

Python processes a string literal first, giving its escape sequences their special meaning, and then passes __the processed result__ to whatever function receives it.

In [11]: print "ab\bc"
ac    # Python turns \b into a backspace character before print sees the string, so the terminal shows 'ac'
In [12]: print r"ab\bc"
ab\bc    # Python does no escape processing on a raw string
In [13]: print r"ab\\c"
ab\\c
In [14]: print "ab\\c"
ab\c
In [15]: print "ab\\\\c"
ab\\c

The solution is to use __Python's raw string__ notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is **a two-character string** containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.

It is important to note that most regular expression operations are available as __module-level functions__ and __RegexObject methods__. The functions are shortcuts that don't require you to compile a regex object first, but miss some fine-tuning parameters.

In other words, most operations exist both as module-level functions and as methods of a RegexObject, which is what re.compile() returns after compiling a pattern.

===== 7.2.1. Regular Expression Syntax =====

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be __concatenated__ to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. The simplest RE is a single character; concatenating characters builds new, more complex regular expressions.

This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book referenced above, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO.

Regular expressions can contain both __special and ordinary__ characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'. (In the rest of this section, we'll write RE's in this special style, usually without quotes, and strings to be matched 'in single quotes'.)

Some characters, like __'|' or '('__, are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. Regular expression pattern strings may not contain null bytes, but can specify the null byte using the **\number** notation, e.g., '\x00'.

The special characters are:

'.' (Dot.)
In the default mode, this matches any character __except a newline__. If the __DOTALL flag__ has been specified, this matches any character including a newline.

'^' (Caret.) Matches the start of the __string__, and in __MULTILINE mode__ also matches immediately after each newline.

'$' Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.

By default ^ and $ only match at the start and end of the whole string; with the MULTILINE flag they also match at the start and end of each line.

The metacharacters below specify how many times the preceding RE is repeated; the RE together with its repetition metacharacter forms one pattern.

'*' Causes the resulting RE to match __0 or more__ repetitions of the **preceding RE**, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.
The preceding RE can be as simple as a single character. Note that a* also matches the __empty string__; to require at least one a, use aa* (or a+).

'+' Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.

***?, +?, ??** (the non-greedy versions)
For example, with msg = '<H1>dsfsdf</H1><H1>dksjfskldf</H1>', the pattern r'<H1>.*</H1>' matches the whole string, while r'<H1>.*?</H1>' matches only up to the first '</H1>'.

The '*', '+', and '?' qualifiers are __all greedy__; **they match as much text as possible.**
Note: greed has limits. A greedy repetition is made as long as possible only on the condition that the rest of the pattern can still match. Non-greedy repetition works the same way: it is made as short as possible, provided the rest of the pattern can still match.

Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in **non-greedy or minimal fashion**; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.

{m} Specifies that __exactly m copies__ of the previous RE should be matched; **fewer** matches cause the entire RE not to match. (A string containing more than m repetitions can still be matched when searching, because the RE simply matches m of them.) For example, a{6} will match **exactly** six 'a' characters, but not five.

{m,n} Causes the resulting RE to match **from m to n** repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of **zero**, and omitting n specifies an **infinite upper** bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab. The comma may not be omitted or the modifier would be confused with the previously described form.
In short: either m or n may be omitted; a missing m means 0 and a missing n means no upper limit. This form always matches as many repetitions as possible, but never more than n.

{m,n}? Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.

__Note:__ the repetition forms above (*, ?, +, {m} and so on) all __repeat the smallest RE that precedes them__. For example, ab* repeats only the character b, not ab. If the preceding RE is a group, the whole group is repeated.

'\' Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals **a special sequence**; special sequences are discussed below.
That is, a backslash either gives a special character its literal meaning or introduces a __special sequence__.

If you're not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn't recognized by Python's parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it's highly recommended that you use raw strings for all but the simplest expressions.
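The greedy, non-greedy and backslash rules above can be sketched as follows. This is only an illustration: it is written in Python 3 syntax (unlike the Python 2 transcripts in these notes), and the variable names and the sample string `path` are made up for the example.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs as far as it can while still letting the final '>' match.
greedy = re.search(r"<.*>", html).group()
# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.search(r"<.*?>", html).group()

# {m,n} is greedy as well; {m,n}? takes the minimum number of repetitions.
most = re.match(r"a{3,5}", "aaaaaa").group()
fewest = re.match(r"a{3,5}?", "aaaaaa").group()

# A literal backslash: the regex itself must be \\ , which is written
# r"\\" as a raw string or "\\\\" as an ordinary string literal.
path = "C:\\temp"  # hypothetical sample: the 7-character string C:\temp
raw_hit = re.search(r"\\", path)
plain_hit = re.search("\\\\", path)

print(greedy)                              # <H1>title</H1>
print(lazy)                                # <H1>
print(most, fewest)                        # aaaaa aaa
print(raw_hit.start(), plain_hit.start())  # 2 2
```

Both spellings of the backslash pattern compile to the same regular expression; the raw string is simply easier to read.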
In short: use raw strings for every non-trivial regular expression.

[] Used to indicate __a set of characters__. In a set:
* Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
* Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it's placed as the first or last character (e.g. [a-]), it will match a literal '-'.
* Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'. **Inside [] special characters are literal, with the exception of ']'.**
* __Character classes__ such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether LOCALE or UNICODE mode is in force.
* Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it's not the first character in the set.
* To match a literal ']' inside a set, precede it with a backslash, or place it __at the beginning__ of the set. For example, both [()[\]{}] and []()[{}] will match a parenthesis.

'|' A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. (**A and B are not limited in length: each extends to the boundary of the RE or to the next '|', except that inside a group they stop at the group's parentheses.**) An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].

(...) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the __\number__ special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

A pattern group __acts as a single regular expression. When a repetition follows it, the pattern inside the group (without the parentheses) is repeated.__ For example, in (a[0-9]c){2}(d[0-9]f) the __group's pattern__ a[0-9]c is repeated twice, so the whole expression matches the same text as a[0-9]ca[0-9]c followed by (d[0-9]f). There are therefore __only 2 groups, not 3__.

In [49]: re.search('a(b[0-9]c){2}(d[0-9]e)', 'ab1cb2cd1e').group()    # same as group(0): returns only the __text matched by the pattern__, which is not necessarily the whole string
Out[49]: 'ab1cb2cd1e'
In [51]: re.search('a(b[0-9]c){2}(d[0-9]e)', 'ab1cb2cd1eeee').group(0,1,2)
Out[51]: ('ab1cb2cd1e', 'b2c', 'd1e')    # the last three characters of the searched string are **not** part of the match
In [48]: re.search('a(b[0-9]c){2}(d[0-9]e)', 'ab1cb2cd1e').groups()
Out[48]: ('b2c', 'd1e')    # a 2-tuple, so the pattern has only two groups; __a repeated group keeps the text of its last repetition__ (hence 'b2c', not 'b1c')
In [50]: re.search('a(b[0-9]c){2}(d[0-9]e)', 'ab1cb2cd1e').group(0,1,2)
Out[50]: ('ab1cb2cd1e', 'b2c', 'd1e')    # the whole match, then the text of group 1 and group 2; groups are __numbered from 1__
In [46]: re.search('(a(b[0-9]c){2}(d[0-9]e))', 'ab1cb2cd1e').group()    # note the extra outer parentheses
Out[46]: 'ab1cb2cd1e'
In [45]: re.search('(a(b[0-9]c){2}(d[0-9]e))', 'ab1cb2cd1e').groups()    # the text matched by every group
Out[45]: ('ab1cb2cd1e', 'b2c', 'd1e')    # three groups; __groups are numbered by the position of their left parenthesis, starting at 1__
In [47]: re.search('(a(b[0-9]c){2}(d[0-9]e))', 'ab1cb2cd1e').group(0,1,2,3)
Out[47]: ('ab1cb2cd1e', 'ab1cb2cd1e', 'b2c', 'd1e')
In [56]: re.search('(a(b[0-9]c(b[0-9]c){2})(d[0-9]e))', 'ab1cb2cb3cd1e').groups()    # the second group **nests** another group
Out[56]: ('ab1cb2cb3cd1e', 'b1cb2cb3c', 'b3c', 'd1e')
In [58]: re.search('(a(b[0-9]c(b[0-9]c)\3)(d[0-9]e))', 'ab1cb2cb2cd1e').groups()    # fails: in an ordinary string literal Python turns \3 into the character with code 3, so search() finds nothing and returns None
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
In [59]: re.search('(a(b[0-9]c(b[0-9]c))\3(d[0-9]e))', 'ab1cb2cb2cd1e').groups()    # fails for the same reason: the pattern is an ordinary string literal containing the escape \3, so __write \\3 or use a raw string__
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
In [61]: re.search(r"(a(b[0-9]c(b[0-9]c)\3)(d[0-9]e))", 'ab1cb2cb2cd1e').groups()    # works: \3 refers to __the text matched by__ the third group
Out[61]: ('ab1cb2cb2cd1e', 'b1cb2cb2c', 'b2c', 'd1e')

(?...) This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines what the meaning and further syntax of the construct is. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. Following are the currently supported extensions.
In other words, __(?...) is an extended form of (...), and its actual meaning depends on the first character after the '?'__:

(?iLmsux) (**One or more** letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: **re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose)**, for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
The benefit of this form is that the flags are set inside the pattern itself, so they do not have to be passed to the function.
Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.

(?:...) A non-capturing version of regular parentheses (non-capturing means the group does not take up a group number). Matches whatever regular expression is inside the parentheses, but the substring matched by the group __cannot be retrieved__ after performing a match or referenced later in the pattern.
The pattern inside the parentheses still takes part in matching, but __its matched text cannot be referenced afterwards__ (neither through \number nor through a group id on the MatchObject). This form is used by constructs that appear later in these notes.

(?P<name>...) __Gives the group an additional name__, by which the text it matched can be referenced later. The group number (counted from 1) still works as well.
Similar to regular parentheses, but the substring matched by the group is accessible within the rest of the regular expression __via the symbolic group name name__. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is __also a numbered group__, just as if the group were not named. So the **group named id** in the example below can also be referenced as the numbered group 1.
For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in arguments to methods of __match objects__, such as **m.group('id') or m.end('id')**, and also by name in the regular expression itself (using **(?P=id)**) and replacement text given to **.sub() (using \g<id>)**.

(?P=name) Matches whatever text was matched by the earlier group named name. It works __like \number__ and does not take up a group number itself.

(?#...) A comment; the contents of the parentheses are simply ignored.

(?=...) Matches if ... matches next, but doesn't consume any of the string. This is called **a lookahead assertion**. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
__Put differently: the pattern in front of the parentheses matches only at a position where the text that follows matches the pattern inside the parentheses. That following text is checked but is not part of the returned match.__
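The group constructs described above can be tried out with a small sketch like the one below. It uses Python 3 syntax, and the sample strings and variable names are invented for illustration; only the patterns themselves come from the text.

```python
import re

# Named group: reachable both by name and by number.
m = re.search(r"(?P<id>[a-zA-Z_]\w*)", "  total_42 = 1")
by_name = m.group("id")
by_number = m.group(1)

# (?P=name) matches the same text the named group matched earlier.
doubled = re.search(r"(?P<word>\w+) (?P=word)", "say the the end")

# A repeated group keeps only the text of its last repetition.
repeated = re.search(r"a(b[0-9]c){2}(d[0-9]e)", "ab1cb2cd1e")

# A non-capturing group takes part in matching but gets no group number.
non_capturing = re.search(r"(?:ab)+(c)", "ababc")

# Lookahead: 'Asimov' must follow, but it is not consumed.
look = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")

print(by_name, by_number)                            # total_42 total_42
print(doubled.group())                               # the the
print(repeated.groups())                             # ('b2c', 'd1e')
print(non_capturing.group(), non_capturing.groups()) # ababc ('c',)
print(repr(look.group()))                            # 'Isaac '
```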
In practice, for Isaac (?=Asimov) searched in 'Isaac Asimov', the engine first matches 'Isaac ' and then checks, without advancing, that 'Asimov' comes next; only 'Isaac ' is returned. For example:

>>> msg = "zhang jun"
>>> import re
>>> m = re.search(r"zhang (?=jun)", msg)    # matches 'zhang ', then __looks ahead__ to check that 'jun' follows
>>> print m.group()    # the text matched by **(?=jun) is not returned**: it "doesn't consume any of the string"
zhang 
>>> m = re.search(r"(?<=zhang )jun", msg)    # matches 'jun' only where 'zhang ' comes immediately before it (__look behind__)
>>> print m.group()
jun

(?!...) The opposite of (?=...). Matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.

(?<=...) Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a **positive lookbehind assertion**. (?<=abc)def will find a match in abcdef, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some __fixed length__, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'

This example looks for a word following a hyphen:

>>> m = re.search('(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'

(?<!...) Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?(id/name)yes-pattern|no-pattern) Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn't. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com'.
Note: only (<) and (\w+@\w+(?:\.\w+)+) are numbered groups in this pattern; (?:\.\w+) and the conditional (?(1)>) do not take up a group number.
New in version 2.4.

The special sequences consist of '\' and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, __\$ matches the character '$'__.
__That is, for the form \c, if c is not one of the characters listed below, \c is equivalent to c.__

\number Matches __the contents__ of the group of the same number. Groups are numbered __starting from 1__. For example, (.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is __3 octal digits long__, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
As a reference to a group's matched text, number ranges over __1-99__. If number starts with 0 or is three octal digits long, Python treats it as the octal code of __a character__ instead. Inside [...] every \number form denotes a character.

\A Matches only at the start of the string. (The counterpart of \Z.)

\b (b for __boundary__) Matches the empty string, but only at the __beginning or end of a word__. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, __r'\bfoo\b'__ matches __'foo'__, 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'. Inside a character range (the [...] form), \b represents the backspace character, for compatibility with Python's string literals.
__A word in Python's re consists of letters, digits and underscores. \b matches a position at the edge of a word, not a character: it is zero-width, which is why r'\bfoo\b' also matches the bare string 'foo'.__

\B (**not a boundary**) Matches the empty string, but only when it is __not at the beginning or end of a word.__ This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so is also subject to the settings of LOCALE and UNICODE.
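The \number, \b and \B examples from the text can be checked with a short sketch. It is written for Python 3, where str patterns use Unicode word rules by default; for these ASCII samples that should give the same results as the Python 2 behaviour described above. The list names are made up for the example.

```python
import re

# \b: 'foo' must stand as a whole word.
candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
whole_word = [s for s in candidates if re.search(r"\bfoo\b", s)]

# \B: 'py' must be followed by another word character.
py_names = ["python", "py3", "py2", "py", "py.", "py!"]
inside_word = [s for s in py_names if re.search(r"py\B", s)]

# \1 repeats whatever group 1 matched. re.match anchors at the start,
# so 'the end' cannot sneak through via the inner 'e e'.
same_twice = re.match(r"(.+) \1", "the the")
different = re.match(r"(.+) \1", "the end")

print(whole_word)          # ['foo', 'foo.', '(foo)', 'bar foo baz']
print(inside_word)         # ['python', 'py3', 'py2']
print(same_twice.group())  # the the
print(different)           # None
```

Note that re.search(r"(.+) \1", "the end") would succeed on the substring 'e e', which is why the sketch uses re.match.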
__\B matches the empty string at any position that is not a word boundary, that is, between two word characters or between two non-word characters. It never matches at the beginning or end of a word.__

\d When the UNICODE flag is not specified, matches any decimal digit; this is equivalent to __the set [0-9]__. With UNICODE, it will match whatever is classified as a decimal digit in the Unicode character properties database.

\D When the UNICODE flag is not specified, matches any non-digit character; this is __equivalent to the set [^0-9]__. With UNICODE, it will match anything other than characters marked as digits in the Unicode character properties database.

\s When the UNICODE flag is not specified, it matches __any whitespace character__; this is equivalent to __the set [ \t\n\r\f\v]__. The LOCALE flag has no extra effect on matching of the space. If UNICODE is set, this will match the characters [ \t\n\r\f\v] plus whatever is classified as space in the Unicode character properties database.

\S When the UNICODE flag is not specified, matches __any non-whitespace character__; this is equivalent to the __set [^ \t\n\r\f\v]__. The LOCALE flag has no extra effect on non-whitespace match. If UNICODE is set, then any character not marked as space in the Unicode character properties database is matched.

\w When the LOCALE and UNICODE flags are not specified, matches __any alphanumeric character and the underscore__; this is equivalent to __the set [a-zA-Z0-9_]__. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

\W When the LOCALE and UNICODE flags are not specified, matches any __non-alphanumeric__ character; this is equivalent to the set __[^a-zA-Z0-9_]__. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

\Z Matches only **at the end** of the string.

If both LOCALE and UNICODE flags are included for a particular sequence, then LOCALE flag takes effect first followed by the UNICODE.

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a \b \f \n \r \t \v \x \\

(Note that \b is used to represent word boundaries, and means "backspace" __only inside__ character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

===== 7.2.2. Module Contents =====

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications __always use__ the compiled form.
__The module-level functions are also available as methods of the compiled regular expression object, and the compiled form is more efficient when a pattern is reused.__

**re.compile(pattern, flags=0)** (returns an RE object; the pattern is best written as a raw string)
Compile a regular expression pattern into a **regular expression object**, which can be used for matching using its match() and search() methods, described below.

The expression's behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using __bitwise OR__ (the | operator).

The sequence

prog = re.compile(pattern)    # prog is a regular expression object
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is **more efficient** when the expression will be used several times in a single program.
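As a rough sketch of the compile-and-reuse pattern, the snippet below combines two of the flags from this section with bitwise OR and compares the compiled object with the module-level shortcut. It uses Python 3 syntax, and the sample text and variable names are assumptions made for the example.

```python
import re

text = "foo1\nfoo2\n"

# Compile once, combining two flags with bitwise OR, then reuse the object.
prog = re.compile(r"^FOO.$", re.IGNORECASE | re.MULTILINE)
compiled_hits = prog.findall(text)

# The module-level shortcut accepts the same pattern and flags.
shortcut_hits = re.findall(r"^FOO.$", text, re.I | re.M)

# prog.match(string) and re.match(pattern, string, flags) are equivalent.
first_compiled = prog.match(text).group()
first_shortcut = re.match(r"^FOO.$", text, re.I | re.M).group()

# Without MULTILINE, $ matches only at the end of the string
# (or just before a final newline); with it, at the end of each line.
default_hit = re.search(r"foo.$", text).group()
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

print(compiled_hits)               # ['foo1', 'foo2']
print(first_compiled)              # foo1
print(default_hit, multiline_hit)  # foo2 foo1
```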
Note: The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.

**re.DEBUG**
Display debug information about compiled expression.

**re.I**
**re.IGNORECASE**
Perform __case-insensitive matching__; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.

re.L
re.LOCALE
Make \w, \W, \b, \B, \s and \S dependent on the current locale.

re.M
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string __and__ at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
By default ^ and $ only match at the start and end of the string. If the pattern is compiled with this flag, they __also__ match at the start and end of each line.

re.S
re.DOTALL
Make the '.' special character match any character at all, __including a newline__; without this flag, '.' will match anything except a newline.

re.U
re.UNICODE
Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.
New in version 2.0.

re.X
**re.VERBOSE**
This flag allows you to write regular expressions that __look nicer__. __Whitespace within the pattern is ignored__, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

That means that the two following regular expression objects that match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)    # all whitespace in the pattern is ignored unless preceded by an unescaped backslash
b = re.compile(r"\d+\.\d*")

==== re.search(pattern, string, flags=0) ====
Returns a MatchObject instance if there is a match, otherwise None.
__Note: search looks for a substring of string that matches pattern; the substring may be anywhere in string, and only the first match is returned.__

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding __MatchObject__ instance. Return **None** if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

==== re.match(pattern, string, flags=0) ====
If zero or more characters __at the beginning of string__ match the regular expression pattern, return a corresponding **MatchObject** instance. Return **None** if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will **only match** at the beginning of the string and not at the beginning of each line. (The string may contain several lines, but matching still starts at the beginning of the string.)
Note: re.match() __only matches at the start of the string, not at the start of each line.__

If you want to locate a **match anywhere** in string, use search() instead (see also search() vs. match()).

===== Examples: =====
In [4]: msg = "This dog is a yellow dog, it is 5 years old."
In [5]: import re
In [6]: reobj = re.compile(r"^(This).*(?P<dog>dog).*(yellow).*(?P=dog).*(old)")
In [7]: mobj = reobj.search(msg)
In [8]: mobj.group()    # with no argument, group() returns the **whole string** matched by the pattern
Out[8]: 'This dog is a yellow dog, it is 5 years old'
In [9]: mobj.group(0)    # group() and group(0) are __equivalent__
Out[9]: 'This dog is a yellow dog, it is 5 years old'
In [10]: mobj.group(1)    # with a single argument, group() returns one string
Out[10]: 'This'
In [11]: mobj.group(2)
Out[11]: 'dog'
In [13]: mobj.group(1,2,3,4)    # with several arguments, group() __returns a tuple__ with one value per argument
Out[13]: ('This', 'dog', 'yellow', 'old')
In [14]: mobj.groups()    # groups() returns a tuple with the text of every group in the pattern, much like group(1, 2, ...) over all group numbers
Out[14]: ('This', 'dog', 'yellow', 'old')
In [17]: mobj.group("dog")    # the argument may be a group number or a group name; a name is given as a __string__
Out[17]: 'dog'
In [12]: mobj.group(1,2,3,4,5)    # note that (?P=name) is not a named group and does not take up a group number
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
----> 1 mobj.group(1,2,3,4,5)
IndexError: no such group
In [15]: mobj.groupdict()    # the named groups that matched: the key is the group name, the value is the matched text
Out[15]: {'dog': 'dog'}
In [16]: mobj.start("dog")    # the start index of the text matched by a group
Out[16]: 5
In [19]: mobj.span("dog")    # the (start, end) span of the text matched by a group
Out[19]: (5, 8)
In [20]: mobj.lastindex    # the number of the last matched group; (?P=name) and (?:...) do not take up numbers
Out[20]: 4
In [21]: mobj.lastgroup    # no output: the last matched group, (old), has no name, so this is None
In [22]: mobj.string
Out[22]: 'This dog is a yellow dog, it is 5 years old.'
In [23]: mobj.re
Out[23]: re.compile(r'^(This).*(?P<dog>dog).*(yellow).*(?P=dog).*(old)', re.UNICODE)

In [24]: msg = 'This dog is a yellow dog, it is 5 years old.'
In [25]: reobj = re.compile(r'^(This).*(?P<dog>dog).*(?:yellow).*(?P=dog).*(old)', re.UNICODE)
In [26]: mobj = reobj.search(msg)
In [27]: mobj.groups()    # the (?:yellow) group takes up no number and cannot be referenced later
Out[27]: ('This', 'dog', 'old')
In [28]: mobj.group()    # but the text matched by (?:yellow) is still part of the overall match
Out[28]: 'This dog is a yellow dog, it is 5 years old'

In [30]: msg = "123456"
In [31]: reobj = re.compile(r"(..)+")    # the pattern contains __only one__ group, so whatever (..)+ matches is reachable through __a single__ group number
In [32]: mobj = reobj.search(msg)
In [33]: mobj.group()
Out[33]: '123456'
In [34]: mobj.groups()    # only one group, holding its last repetition
Out[34]: ('56',)
In [35]: mobj.group(1)
Out[35]: '56'
In [36]: mobj.group(2)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
----> 1 mobj.group(2)
IndexError: no such group

==== re.split(pattern, string, maxsplit=0, flags=0) ====
pattern describes the separator; the result is the list of strings left after splitting. If pattern contains groups, then whenever pattern matches, the text matched by each group also appears in the result list.

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

If there are **capturing groups** in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

>>> re.split('(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

That way, separator components are always found **at the same relative indices** within the result list (e.g., if there's one capturing group in the separator, the 0th, the 2nd and so forth).
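The "same relative indices" remark can be made concrete with a small sketch: with one capturing group in the separator, the fields sit at the even indices of the result and the separators at the odd ones. This uses Python 3 syntax and only splits on non-empty matches, so it does not touch the empty-match behaviour that differs between Python versions. The variable names are chosen for the example.

```python
import re

text = "...words, words..."

# No capturing group: only the fields between separators are returned.
plain = re.split(r"\W+", text)

# One capturing group: each separator is returned between the fields.
with_seps = re.split(r"(\W+)", text)
fields = with_seps[0::2]       # even indices: the fields
separators = with_seps[1::2]   # odd indices: the matched separators

# maxsplit limits the number of splits; the rest stays in the last element.
limited = re.split(r"\W+", "Words, words, words.", maxsplit=1)

print(plain)       # ['', 'words', 'words', '']
print(with_seps)   # ['', '...', 'words', ', ', 'words', '...', '']
print(separators)  # ['...', ', ', '...']
print(limited)     # ['Words', 'words, words.']
```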
* __In other words, if pattern contains no parentheses, string is split at the text matched by pattern and the list of pieces is returned.__
* __If pattern contains parentheses, the result list also contains the separator text matched by the groups.__

Note that split will never split a string on **an empty pattern match**. For example:

>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']

==== Examples ====
>>> re.split('\W+', 'Words, words, words.')    # no parentheses in the pattern: the result holds only the pieces
['Words', 'words', 'words', '']    # note the trailing empty string, produced because the string ends with a separator
>>> re.split('(\W+)', 'Words, words, words.')    # parentheses in the pattern: the result also holds the matched separators
['Words', ', ', 'words', ', ', 'words', '.', '']    # ', ', ', ' and '.' are the separators matched by the pattern
>>> re.split('\W+', 'Words, words, words.', 1)    # maxsplit sets the number of splits; the remainder is returned as is
['Words', 'words, words.']    # the last element is the unsplit remainder
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

In [37]: ms = 'sdfsdkfgeekarddjfksdjfkgeekarddskfjksdfgeekardxdjfksdfj'
In [38]: res = re.split(r'(geekard)', ms)    # geekard is a group, so it appears in the result list, but it is the __whole pattern__ that acts as the separator
In [39]: res
Out[39]: ['sdfsdkf', 'geekard', 'djfksdjfk', 'geekard', 'dskfjksdf', 'geekard', 'xdjfksdfj']
In [75]: re.split(r"(a)([0-9])(c)", 'ab1cb2cb2cd1e')    # the separator pattern never matches, so **the string is not split**
Out[75]: ['ab1cb2cb2cd1e']
In [77]: re.split(r"b[0-9]c", 'ab1cb2cb2cd1e')
Out[77]: ['a', '', '', 'd1e']    # adjacent separators leave two consecutive empty strings in the result
In [76]: re.split(r"(b)([0-9])(c)", 'ab1cb2cb2cd1e')
Out[76]: ['a', 'b', '1', 'c', '', 'b', '2', 'c', '', 'b', '2', 'c', 'd1e']    # this shows where the empty strings come from

Changed in version 2.7: Added the optional flags argument.

There is no re.find() function.

===== re.findall(pattern, string, flags=0) =====
The result is a list. **Some part of string** __must first match the whole pattern__; **what the result list contains then depends on how many groups pattern has:**
1. **If pattern has no group (no parentheses), the result is a list of the strings that matched the whole pattern.**
2. **If pattern has one group, the result is a list of the strings matched by that group (text matched outside the group is not returned).**
3. **If pattern has several groups, the result is a list of tuples, each tuple holding the strings matched by the groups for one match.**

Return **all non-overlapping matches** of pattern in string, as __a list of strings__. The string is scanned left-to-right, and matches are returned in the order found. If one or more **groups** are present in the pattern, return a list of groups; this will be **a list of tuples** if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

New in version 1.5.2.
Changed in version 2.4: Added the optional flags argument.

===== Examples: =====
In [46]: ms = 'sdfsdkfgeekarddjfksdjfkgeekarddskfjksdfgeekardxdjfksdfj'
In [48]: res = re.findall(r'geekard', ms)    # no group in the pattern: the result is **a list of the full matches**
In [49]: res
Out[49]: ['geekard', 'geekard', 'geekard']    # three full matches
In [50]: res = re.findall(r'(geekard)', ms)    # __the whole pattern must match first__; then the text matched by the group is returned for each match
In [51]: res
Out[51]: ['geekard', 'geekard', 'geekard']
In [52]: res = re.findall(r'()(geekard)', ms)    # the whole pattern must match first; with two groups, __each match yields a tuple__ holding the text matched by each group
In [53]: res
Out[53]: [('', 'geekard'), ('', 'geekard'), ('', 'geekard')]

==== re.finditer(pattern, string, flags=0) ====
Returns an iterable; each iteration yields a MatchObject instance (its groups() returns something only when pattern contains groups).

Return an iterator yielding __MatchObject__ instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.

New in version 2.2.
Changed in version 2.4: Added the optional flags argument.

===== Examples: =====
>>> import re
>>> ms = 'sdfsdkfgeekarddjfksdjfkgeekarddskfjksdfgeekardxdjfksdfj'
>>> res = re.finditer(r'geekard', ms)    # the pattern has **no group**
>>> for mobj in res:
...     mobj.groups()    # groups() covers every group number; __the pattern has no group, so each result is empty__
...
()
()
()
>>> res = re.finditer(r'geekard', ms)    # like findall, but returns an iterator that yields one MatchObject per match
>>> for mobj in res:    # mobj is a MatchObject
...     mobj.group()    # group() returns **the full matched string**
...
'geekard'
'geekard'
'geekard'
>>> res = re.finditer(r'(geekard)()', ms)    # the pattern has __two groups__, so groups() and group(1), group(2) can be used
>>> for mobj in res:
...     mobj.groups()
...
('geekard', '')    # **each result is a tuple**
('geekard', '')
('geekard', '')
>>> for mobj in res:    # the iterator is already exhausted, so nothing is printed
...     mobj.group()
...
>>> res = re.finditer(r'(geekard)()', ms)
>>> for mobj in res:
...     mobj.group()
...
'geekard'
'geekard'
'geekard'

===== re.sub(pattern, repl, string, count=0, flags=0) =====
pattern may be a string or an RE object. If count is 0 (the default), **every** substring of string that matches pattern is replaced.

Return the string obtained by replacing the **leftmost non-overlapping** occurrences of pattern in string by the replacement repl.
__The text in string that matches pattern is replaced by repl, and the resulting string is returned. If pattern contains groups, repl can refer to their matched text with \number, which makes it possible to keep selected parts of the original string.__
If the pattern isn't found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. **Backreferences**, such as \6, are replaced with **the substring matched by group 6** in the pattern. For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',    # the pattern matches the whole string, so the whole string is replaced by repl
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

===== Examples: =====
>>> ms = 'sdfsdkfgeekarddjfksdjfkgeekarddskfjksdfgeekardxdjfksdfj'
>>> res = re.sub(r'geekard', '11111', ms)    # replace every geekard in ms with 11111
>>> res
'sdfsdkf11111djfksdjfk11111dskfjksdf11111xdjfksdfj'
>>> res = re.sub(r'(geekard)', '11111', ms)    # same as above
>>> res
'sdfsdkf11111djfksdjfk11111dskfjksdf11111xdjfksdfj'
>>> res = re.sub(r'()(geekard)', '11111', ms)    # again, every geekard is replaced with 11111
>>> res
'sdfsdkf11111djfksdjfk11111dskfjksdf11111xdjfksdfj'
>>> res = re.sub(r'()(geekard)', '\111111', ms)    # repl is an ordinary string literal, so Python processes its escapes first: \111111 becomes \111 followed by 111, and \111 is the octal code of the capital letter I
>>> res
'sdfsdkfI111djfksdjfkI111dskfjksdfI111xdjfksdfj'
>>> res = re.sub(r'()(geekard)', '\1abc', ms)    # Python processes the escape first, so \1 becomes the character with ASCII code 1
>>> res
'sdfsdkf\x01abcdjfksdjfk\x01abcdskfjksdf\x01abcxdjfksdfj'
>>> res = re.sub(r'()(geekard)', r'\1abc', ms)    # with the r prefix Python leaves the escape alone and passes it on to the regular expression engine; group 1 matched the empty string
>>> res
'sdfsdkfabcdjfksdjfkabcdskfjksdfabcxdjfksdfj'
>>> res = re.sub(r'()(geekard)', r'\2abc', ms)    # \2 stands for **the text matched by the second group of the pattern**
>>> res
'sdfsdkfgeekardabcdjfksdjfkgeekardabcdskfjksdfgeekardabcxdjfksdfj'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single __match object__ argument, and returns the replacement **string**. For example:

>>> def dashrepl(matchobj):    # the argument is a MatchObject
...     if matchobj.group(0) == '-': return ' '    # the pattern passed to sub below has no parentheses, so only group(0) is available and groups() would be empty
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')    # {m,n} repeats **as often as possible**: the two '--' runs each become '-', and the single '-' becomes ' '
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'

The pattern may be a string or an RE object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer.
If omitted or zero, __all occurrences__ will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

In addition to character escapes and backreferences as described above, __\g<name>__ will use the substring matched by the group named name, as defined by the **(?P<name>...)** syntax. \g<number> uses the corresponding group number; __\g<2> is therefore equivalent to \2__, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference __\g<0>__ substitutes in the entire substring matched by the RE.

Changed in version 2.7: Added the optional flags argument.

==== re.subn(pattern, repl, string, count=0, flags=0) ====
Perform the same operation as sub(), but return a tuple **(new_string, number_of_subs_made)**.

Changed in version 2.7: Added the optional flags argument.

==== re.escape(string) ====
Return string with __all non-alphanumerics backslashed__; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
escape() backslash-escapes every non-alphanumeric character in string, so the result can be used as a pattern that matches string literally.

In [5]: re.escape(r"122jds\fd")
Out[5]: '122jds**\\\\f**d'
In [6]: re.escape("122jds\fd")
Out[6]: '122jds**\\\x0c**d'

==== re.purge() ====
Clear the regular expression cache.

==== exception re.error ====
Exception raised when a string passed to one of the functions here is **not a valid regular expression** (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern.

===== 7.2.3. Regular Expression Objects =====
__class re.RegexObject__
The RegexObject class supports the following methods and attributes:

==== search(string[, pos[, endpos]]) ====
Scan through string looking for a location where this regular expression produces a match, and return a corresponding **MatchObject** instance. Return **None** if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives **an index in the string** where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

>>> pattern = re.compile("d")
>>> pattern.search("dog")     # Match at index 0
<_sre.SRE_Match object at ...>
>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

==== match(string[, pos[, endpos]]) ====
If **zero or more** characters at the beginning of string match this regular expression, return a corresponding **MatchObject** instance. Return **None** if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o")
>>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
<_sre.SRE_Match object at ...>

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

==== split(string, maxsplit=0) ====
Identical to the split() function, using the compiled pattern.

==== findall(string[, pos[, endpos]]) ====
Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().

==== finditer(string[, pos[, endpos]]) ====
Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().

==== sub(repl, string, count=0) ====
Identical to the sub() function, using the compiled pattern.

==== subn(repl, string, count=0) ====
Identical to the subn() function, using the compiled pattern.

==== flags ====
The regex matching flags. This is a combination of the flags given to compile() and any (?...) inline flags in the pattern.

==== groups ====
**The number of** __capturing groups__ in the pattern.

==== groupindex ====
A dictionary mapping any symbolic **group names defined by (?P<id>) to group numbers**. The dictionary is empty if no symbolic groups were used in the pattern.

==== pattern ====
The pattern string from which the RE object was compiled.

===== 7.2.4. Match Objects =====
__class re.MatchObject__
Match objects always have **a boolean value of True**. Since match() and search() return __None__ when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)

Match objects support the following methods and attributes:

==== expand(template) ====
Return the string obtained by doing __backslash substitution on the template string template__, as done by the sub() method.
Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, __\g<name>__) are replaced by the contents of the corresponding group.

==== group([group1, ...]) ====
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

If the regular expression uses the (?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an IndexError exception is raised.

A moderately complicated example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'

Named groups can also be referred to by their index:

>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
>>> m.group(1)                        # Returns only the last match.
'c3'

==== groups([default]) ====
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None. (Incompatibility note: in the original Python 1.5 release, if the tuple was one element long, a string would be returned instead. In later versions (from 1.5.1 on), a singleton tuple is returned in such cases.)

For example:

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')

If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to None unless the default argument is given:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups()      # Second group defaults to None.
('24', None)
>>> m.groups('0')   # Now, the second group defaults to '0'.
('24', '0')

==== groupdict([default]) ====
Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

==== start([group]) ====
==== end([group]) ====
Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

m.string[m.start(g):m.end(g)]

Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.
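The relationship between start(), end() and group() described above can be checked directly. The following is only a small sketch for experimenting; the variable names (whole, empty, slices_agree, n, unused) and the second pattern '(a)|(b)' are made up for illustration and are not part of the documentation.

```python
import re

# The example from the text: 'b(c?)' searched in 'cba'.
m = re.search(r'b(c?)', 'cba')

whole = (m.start(0), m.end(0))   # indices of the whole match 'b'
empty = (m.start(1), m.end(1))   # group 1 matched the empty string

# For every group g that took part in the match,
# m.group(g) is the slice m.string[m.start(g):m.end(g)].
slices_agree = all(
    m.group(g) == m.string[m.start(g):m.end(g)] for g in (0, 1)
)

# Hypothetical second pattern: group 2 exists but does not contribute
# to the match, so start() and end() report -1 for it.
n = re.match(r'(a)|(b)', 'a')
unused = (n.start(2), n.end(2))

print(whole, empty, slices_agree, unused)
```

Asking for a group number that the pattern does not define, such as m.start(2) here, raises IndexError instead of returning -1.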
An example that will remove remove_this from email addresses:

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

==== span([group]) ====
For MatchObject m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

==== pos ====
The value of pos which was passed to the search() or match() method of the RegexObject. This is the index into the string at which the RE engine started looking for a match.

==== endpos ====
The value of endpos which was passed to the search() or match() method of the RegexObject. This is the index into the string beyond which the RE engine will not go.

==== lastindex ====
The integer index of the __last matched__ **capturing group**, or **None** if no group was matched at all. For example, the expressions **(a)b, ((a)(b)), and ((ab)) will have lastindex == 1** if applied to the string 'ab', while the expression (a)(b) will have lastindex == 2, if applied to the same string.

==== lastgroup ====
The name of the **last matched capturing group**, or None if the group didn’t have a name, or if no group was matched at all.
If the __last matched subgroup has a name and the pattern matches the string__, lastgroup is that name as a string.

==== re ====
The regular expression object whose match() or search() method produced this MatchObject instance.

==== string ====
The string passed to match() or search().

===== 7.2.5. Examples =====
==== 7.2.5.1. Checking For a Pair ====
In this example, we’ll use the following helper function to display match objects a little more gracefully:

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.**group()**, match.groups())

**match.group() returns the whole matched string, while match.groups() returns a tuple of the subgroups.**

Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))    # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"

That last hand, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))   # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))   # No pairs.
>>> displaymatch(pair.match("354aa"))   # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the group() method of MatchObject in the following manner:

>>> pair.match("717ak").group(1)
'7'

# Error because re.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    re.match(r".*(.).*\1", "718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
'a'

==== 7.2.5.2. Simulating scanf() ====
Python does not currently have an equivalent to scanf(). Regular expressions are generally more powerful, though also more verbose, than scanf() format strings.
The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.

scanf() Token       Regular Expression
%c                  .
%5c                 .{5}
%d                  [-+]?\d+
%e, %E, %f, %g      [-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?
%i                  [-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)
%o                  [-+]?[0-7]+
%s                  \S+
%u                  \d+
%x, %X              [-+]?(0[xX])?[\dA-Fa-f]+

To extract the filename and numbers from a string like

/usr/sbin/sendmail - 0 errors, 4 warnings

you would use a scanf() format like

%s - %d errors, %d warnings

The equivalent regular expression would be

(\S+) - (\d+) errors, (\d+) warnings

==== 7.2.5.3. search() vs. match() ====
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<_sre.SRE_Match object at ...>

Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:

>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef")  # No match
>>> re.search("^a", "abcdef")  # Match
<_sre.SRE_Match object at ...>

Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)    # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
<_sre.SRE_Match object at ...>

==== 7.2.5.4. Making a Phonebook ====
split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified by Python as demonstrated in the following example that creates a phonebook.

First, here is the input.
Normally it may come from a file, here we are using triple-quoted string syntax:

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string into a list with each nonempty line having its own entry:

>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address has spaces, our splitting pattern, in it:

>>> [re.split(":? ", entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name:

>>> [re.split(":? ", entry, 4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

==== 7.2.5.5. Text Munging ====
sub() replaces every occurrence of a pattern with a string or the result of a function.
This example demonstrates using sub() with a function to “munge” text, or randomize the order of all the characters in each word of a sentence except for the first and last characters:

>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

==== 7.2.5.6. Finding all Adverbs ====
findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if one was a writer and wanted to find all of the adverbs in some text, he or she might use findall() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

==== 7.2.5.7. Finding all Adverbs and their Positions ====
If one wants more information about all matches of a pattern than the matched text, finditer() is useful as it provides instances of MatchObject instead of strings. Continuing with the previous example, if one was a writer who wanted to find all of the adverbs and their positions in some text, he or she would use finditer() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
07-16: carefully
40-47: quickly

==== 7.2.5.8. Raw String Notation ====
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it.
For example, the two following lines of code are functionally identical:

>>> re.match(r"\W(.)\1\W", " ff ")
<_sre.SRE_Match object at ...>
>>> re.match("\\W(.)\\1\\W", " ff ")
<_sre.SRE_Match object at ...>

When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:

>>> re.match(r"\\", r"\\")
<_sre.SRE_Match object at ...>
>>> re.match("\\\\", r"\\")
<_sre.SRE_Match object at ...>
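One way to convince yourself that the raw and the escaped spellings really are the same pattern is to compare the string objects themselves. The snippet below is just an illustrative sketch; the variable names and the sample string subject are invented for it.

```python
import re

# The same pattern spelled two ways: raw, and with doubled backslashes.
raw_pattern = r"\W(.)\1\W"
escaped_pattern = "\\W(.)\\1\\W"
same_pattern = raw_pattern == escaped_pattern

# A literal backslash takes two characters inside the regular expression.
# Typed as an ordinary string literal, that needs four backslashes,
# but the resulting string still holds only two characters.
backslash_raw = r"\\"
backslash_plain = "\\\\"

# Hypothetical sample text: three characters (a, backslash, b).
subject = "a\\b"

m1 = re.search(backslash_raw, subject)
m2 = re.search(backslash_plain, subject)

print(same_pattern, len(backslash_raw), m1.span(), m2.span())
```

Both searches find the single backslash in subject at the same position, because the regular expression engine receives exactly the same two-character pattern in each case.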