>>> from fnmatch import fnmatchcase
>>> addresses = [
...     '5412 N CLARK ST',
...     '1060 W ADDISON ST',
...     '1039 W GRANVILLE AVE',
...     '2122 N CLARK ST',
...     '4802 N BROADWAY',
... ]
>>> [addr for addr in addresses if fnmatchcase(addr, '* ST')]
['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
>>> [addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]
['5412 N CLARK ST']
>>>
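This typically comes up when you use a dot (.) to match any character and forget that the dot does not match a newline. For example, matching C-style comments: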
>>> import re
>>> comment = re.compile(r'/\*(.*?)\*/')
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
... multiline comment */
... '''
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]
>>>
You can modify the pattern string to add support for newlines:
>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')
>>> comment.findall(text2)
[' this is a\nmultiline comment ']
>>>
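Instead of spelling out the newline in the pattern, the re.DOTALL flag makes the dot match every character, newlines included. A minimal sketch of that variant, reusing the same sample text (the variable name matches is only illustrative):

```python
import re

# re.DOTALL lets '.' match newlines too, so the pattern itself stays simple.
comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)

text2 = '''/* this is a
multiline comment */
'''

matches = comment.findall(text2)
print(matches)
```

For simple patterns the flag is usually enough. When a pattern is assembled from several pieces, a flag that changes what the dot means everywhere can be surprising, and the explicit (?:.|\n) form may be easier to reason about.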
>>> import unicodedata
>>> s = '\ufb01'  # A single character
>>> s
'ﬁ'
>>> unicodedata.normalize('NFD', s)
'ﬁ'
# Notice how the combined letters are broken apart here
>>> unicodedata.normalize('NFKD', s)
'fi'
>>> unicodedata.normalize('NFKC', s)
'fi'
>>>
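Handling Unicode characters with regular expressions
You are using regular expressions to process text, but you are concerned about how Unicode characters are handled.
By default, the re module already has basic support for some Unicode character classes. For example, \d already matches any Unicode digit character: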
>>> import re
>>> num = re.compile(r'\d+')
>>> # ASCII digits
>>> num.match('123')
<_sre.SRE_Match object at 0x1007d9ed0>
>>> # Arabic digits
>>> num.match('\u0661\u0662\u0663')
<_sre.SRE_Match object at 0x101234030>
>>>
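If you only want the ASCII digits 0-9, the re.ASCII flag turns off the Unicode-aware behavior of \d (as well as \w and \s). A small sketch comparing the two modes; the variable names are illustrative only:

```python
import re

unicode_digits = re.compile(r'\d+')          # default for str patterns: Unicode-aware
ascii_digits = re.compile(r'\d+', re.ASCII)  # restrict \d to [0-9]

arabic = '\u0661\u0662\u0663'  # ARABIC-INDIC DIGITS ONE, TWO, THREE

matches_unicode = unicode_digits.match(arabic) is not None
matches_ascii = ascii_digits.match(arabic) is not None
print(matches_unicode, matches_ascii)
```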
>>> import unicodedata
>>> import sys
>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode)
...                          if unicodedata.combining(chr(c)))
...
>>> b = unicodedata.normalize('NFD', a)
>>> b
'pýtĥöñ is awesome\n'
>>> b.translate(cmb_chrs)
'python is awesome\n'
>>>
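Building a translation table over the whole Unicode range takes a moment. For occasional use, a possible alternative is to filter the decomposed string one character at a time. The helper name strip_combining below is hypothetical and not part of the original text:

```python
import unicodedata

def strip_combining(text):
    """Decompose text (NFD) and drop every combining character."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# 'pýtĥöñ is awesome\n', written with escapes to keep the source ASCII-only
cleaned = strip_combining('p\u00fdt\u0125\u00f6\u00f1 is awesome\n')
print(repr(cleaned))
```

The translate() approach is likely the better choice when many strings are processed, since the table is built only once.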
>>> parts = ['Is', 'Chicago', 'Not', 'Chicago?']
>>> ' '.join(parts)
'Is Chicago Not Chicago?'
>>> ','.join(parts)
'Is,Chicago,Not,Chicago?'
>>>
>>> ''.join(parts)
'IsChicagoNotChicago?'
>>>
>>> a = 'Is Chicago'
>>> b = 'Not Chicago?'
>>> a + ' ' + b
'Is Chicago Not Chicago?'
>>>
If you want to combine two string literals in source code, simply place them next to each other; no plus sign (+) is needed.
>>> a = 'li''rui''long'
>>> a
'liruilong'
>>>
Hmm, this does not work for string variables (a bit naive of me...); it only applies to literals.
>>> a = 'li''rui''long'
>>> a
'liruilong'
>>> a a
  File "<stdin>", line 1
    a a
      ^
SyntaxError: invalid syntax
>>> a = a a
  File "<stdin>", line 1
    a = a a
          ^
SyntaxError: invalid syntax
>>>
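Adjacent-literal concatenation is resolved when the source is compiled, which is why it cannot apply to names. For variables, the usual options are the + operator, str.join(), or an f-string. A minimal sketch with illustrative variable names:

```python
a = 'li'
b = 'rui'
c = 'long'

with_plus = a + b + c            # explicit concatenation
with_join = ''.join([a, b, c])   # preferable when there are many pieces
with_fstring = f'{a}{b}{c}'      # formatted string literal (Python 3.6+)

print(with_plus, with_join, with_fstring)
```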
>>> s = '{name} has {n} messages.'
>>> s.format(name='Guido', n=37)
'Guido has 37 messages.'
>>>
If the values to be substituted can be found in variables in the current scope, you can combine format_map() with vars():
>>> s = '{name} has {n} messages.'
>>> name = 'Guido'
>>> n = 37
>>> s.format_map(vars())
'Guido has 37 messages.'
>>>
Another interesting feature of vars() is that it also works on object instances. More powerful than I had imagined...
>>> class Info:
...     def __init__(self, name, n):
...         self.name = name
...         self.n = n
...
>>> a = Info('Guido', 37)
>>> s.format_map(vars(a))
'Guido has 37 messages.'
>>>
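One drawback of format() and format_map() is that they do not cope well with missing variables. One way to avoid the resulting error is to define a dictionary class with a __missing__() method. Since Python 2.5, if a subclass of dict defines __missing__(), then dict[key] calls __missing__() to obtain a default value when the key does not exist.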
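The definitions used in the next session are not shown in the text, so the following is only a plausible sketch of them. The class name SafeSub is an assumption, and sub() is assumed to look names up in the caller's local variables through sys._getframe(), which is a CPython implementation detail:

```python
import sys

class SafeSub(dict):
    """dict subclass that leaves unknown placeholders untouched."""
    def __missing__(self, key):
        # Called by dict[key] (and therefore by format_map) for absent keys.
        return '{' + key + '}'

def sub(text):
    # Assumption: substitute from the caller's local variables (frame 1).
    return text.format_map(SafeSub(sys._getframe(1).f_locals))

# The missing key 'n' is left in place instead of raising KeyError.
partial = '{name} has {n} messages.'.format_map(SafeSub(name='Guido'))
print(partial)
```

Reaching into the caller's frame is convenient at the interactive prompt but easy to misuse elsewhere; passing an explicit mapping to format_map() is the more transparent choice in library code.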
>>> name = 'Guido'
>>> n = 37
>>> print(sub('Hello {name}'))
Hello Guido
>>> print(sub('You have {n} messages.'))
You have 37 messages.
>>> print(sub('Your favorite color is {color}'))
Your favorite color is {color}
>>>
If you do not use format() or format_map(), Python offers other ways to substitute values into a string:
>>> name = 'Guido'
>>> n = 37
>>> '%(name)s has %(n)s messages.' % vars()
'Guido has 37 messages.'
>>>
>>> import string
>>> name = 'Guido'
>>> n = 37
>>> s = string.Template('$name has $n messages.')
>>> s.substitute(vars())
'Guido has 37 messages.'
>>>
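string.Template also has a built-in answer to the missing-variable problem: safe_substitute() leaves unknown placeholders as they are, where substitute() would raise KeyError. A short sketch with made-up template text:

```python
import string

template = string.Template('$name has $n messages, favorite color: $color')
values = {'name': 'Guido', 'n': 37}

# substitute() would raise KeyError for the missing 'color';
# safe_substitute() leaves the unknown placeholder in place instead.
result = template.safe_substitute(values)
print(result)
```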
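Reformatting strings to a fixed column width
You have some long strings that you want to reformat to a given column width.
Use the textwrap module to format the output: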
>>> s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
... the eyes, not around the eyes, don't look around the eyes, \
... look into my eyes, you're under."
>>> import textwrap
>>> print(textwrap.fill(s, 70))
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.
>>> print(textwrap.fill(s, 40))
Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, initial_indent=' '))
 Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, subsequent_indent=' '))
Look into my eyes, look into my eyes,
 the eyes, the eyes, the eyes, not
 around the eyes, don't look around the
 eyes, look into my eyes, you're under.
>>>
>>> import os
>>> print(textwrap.fill(s, os.get_terminal_size().columns, initial_indent=' '))
 Look into my eyes, look into my eyes, the eyes, the eyes, the eyes, not around the eyes, don't look around the eyes, look into my eyes, you're under.
>>>
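os.get_terminal_size() raises OSError when no terminal is attached, for example when output is piped. One possible workaround is shutil.get_terminal_size(), which accepts a fallback size. A minimal sketch, where the 80x24 fallback is an arbitrary choice:

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# shutil.get_terminal_size() falls back to the given size when the
# terminal size cannot be determined.
width = shutil.get_terminal_size(fallback=(80, 24)).columns
wrapped = textwrap.fill(s, width)
print(wrapped)
```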
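Handling HTML and XML in strings
You want to replace HTML or XML entities such as &entity; or &#code; with their corresponding text. Alternatively, you need to escape certain characters in the text (such as <, >, or &).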
>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(s)
Elements are written as "<tag>text</tag>".
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> # Disable escaping of quotes
>>> print(html.escape(s, quote=False))
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
>>>
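To replace encoded entities in text, you need a different approach. If you are processing HTML or XML, try using a proper HTML or XML parser first.
For HTML, the HTMLParser.unescape() method has been removed; this is what happens on my Python 3.9: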
>>> from html.parser import HTMLParser
>>> p = HTMLParser()
>>> p.unescape(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'HTMLParser' object has no attribute 'unescape'
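On current Python versions the module-level function html.unescape() does this job. A brief sketch; the sample string is made up for illustration:

```python
import html

s = 'Spicy &quot;Jalape&#241;o&quot;.'

# html.unescape() converts named and numeric character references to text.
decoded = html.unescape(s)
print(decoded)
```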
For XML:
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'
>>>
>>> data = b'FOO:BAR,SPAM'
>>> import re
>>> re.split('[:,]',data)
['FOO', 'BAR', 'SPAM']
>>> re.split(b'[:,]',data)
['FOO', 'BAR', 'SPAM']
>>>
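On Windows (Python 3.10, judging by the traceback path), a str pattern cannot be applied to a bytes object. The output above matches Python 2, where b'...' is an ordinary str, so the difference most likely comes from the Python version rather than the operating system: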
>>> data = b'FOO:BAR,SPAM'
>>> import re
>>> re.split('[:,]',data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\python\Python310\lib\re.py", line 231, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: cannot use a string pattern on a bytes-like object
>>> re.split(b'[:,]',data)
[b'FOO', b'BAR', b'SPAM']
>>>
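In Python 3 the pattern and the subject must be of the same type. If text is what you are after, one option is to decode the bytes first and then work with ordinary str patterns. A small sketch, assuming the data is ASCII:

```python
import re

data = b'FOO:BAR,SPAM'

# Option 1: keep everything as bytes (the pattern must be bytes too).
as_bytes = re.split(rb'[:,]', data)

# Option 2: decode first, then use an ordinary str pattern.
# Assumption: the payload is ASCII; pick the real encoding otherwise.
as_text = re.split(r'[:,]', data.decode('ascii'))

print(as_bytes, as_text)
```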