Mirror of https://github.com/python/cpython.git (synced 2024-11-27 11:55:13 +08:00)
bpo-31714: Improved regular expression documentation. (#3907)
parent ef611c96ea
commit cd195e2a7a
@@ -153,8 +153,8 @@ These sequences can be included inside a character class. For example,
 ``','`` or ``'.'``.
 
 The final metacharacter in this section is ``.``. It matches anything except a
-newline character, and there's an alternate mode (``re.DOTALL``) where it will
-match even a newline. ``'.'`` is often used where you want to match "any
+newline character, and there's an alternate mode (:const:`re.DOTALL`) where it will
+match even a newline. ``.`` is often used where you want to match "any
 character".
 
 
@@ -168,14 +168,11 @@ wouldn't be much of an advance. Another capability is that you can specify that
 portions of the RE must be repeated a certain number of times.
 
 The first metacharacter for repeating things that we'll look at is ``*``. ``*``
-doesn't match the literal character ``*``; instead, it specifies that the
+doesn't match the literal character ``'*'``; instead, it specifies that the
 previous character can be matched zero or more times, instead of exactly once.
 
-For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``),
-``caaat`` (3 ``a`` characters), and so forth. The RE engine has various
-internal limitations stemming from the size of C's ``int`` type that will
-prevent it from matching over 2 billion ``a`` characters; patterns
-are usually not written to match that much data.
+For example, ``ca*t`` will match ``'ct'`` (0 ``'a'`` characters), ``'cat'`` (1 ``'a'``),
+``'caaat'`` (3 ``'a'`` characters), and so forth.
 
 Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
 engine will try to repeat it as many times as possible. If later portions of the
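The ``ca*t`` behaviour described in this hunk can be checked with a short sketch. The sample strings and the names ``pattern`` and ``matches`` are illustrative only and are not part of the patch.

```python
import re

# ``ca*t``: a 'c', then zero or more 'a' characters, then a 't'.
pattern = re.compile(r"ca*t")

# fullmatch() requires the whole string to match, so 'cbt' is rejected.
matches = {s: bool(pattern.fullmatch(s)) for s in ("ct", "cat", "caaat", "cbt")}
print(matches)
```

Using ``fullmatch()`` rather than ``search()`` keeps the check strict: a stray character anywhere in the string makes the result ``False``.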
@@ -185,7 +182,7 @@ fewer repetitions.
 A step-by-step example will make this more obvious. Let's consider the
 expression ``a[bcd]*b``. This matches the letter ``'a'``, zero or more letters
 from the class ``[bcd]``, and finally ends with a ``'b'``. Now imagine matching
-this RE against the string ``abcbd``.
+this RE against the string ``'abcbd'``.
 
 +------+-----------+---------------------------------+
 | Step | Matched | Explanation |
@@ -218,7 +215,7 @@ this RE against the string ``abcbd``.
 | | | it succeeds. |
 +------+-----------+---------------------------------+
 
-The end of the RE has now been reached, and it has matched ``abcb``. This
+The end of the RE has now been reached, and it has matched ``'abcb'``. This
 demonstrates how the matching engine goes as far as it can at first, and if no
 match is found it will then progressively back up and retry the rest of the RE
 again and again. It will back up until it has tried zero matches for
@@ -229,24 +226,23 @@ Another repeating metacharacter is ``+``, which matches one or more times. Pay
 careful attention to the difference between ``*`` and ``+``; ``*`` matches
 *zero* or more times, so whatever's being repeated may not be present at all,
 while ``+`` requires at least *one* occurrence. To use a similar example,
-``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match
-``ct``.
+``ca+t`` will match ``'cat'`` (1 ``'a'``), ``'caaat'`` (3 ``'a'``\ s), but won't
+match ``'ct'``.
 
 There are two more repeating qualifiers. The question mark character, ``?``,
 matches either once or zero times; you can think of it as marking something as
-being optional. For example, ``home-?brew`` matches either ``homebrew`` or
-``home-brew``.
+being optional. For example, ``home-?brew`` matches either ``'homebrew'`` or
+``'home-brew'``.
 
 The most complicated repeated qualifier is ``{m,n}``, where *m* and *n* are
 decimal integers. This qualifier means there must be at least *m* repetitions,
-and at most *n*. For example, ``a/{1,3}b`` will match ``a/b``, ``a//b``, and
-``a///b``. It won't match ``ab``, which has no slashes, or ``a////b``, which
+and at most *n*. For example, ``a/{1,3}b`` will match ``'a/b'``, ``'a//b'``, and
+``'a///b'``. It won't match ``'ab'``, which has no slashes, or ``'a////b'``, which
 has four.
 
 You can omit either *m* or *n*; in that case, a reasonable value is assumed for
 the missing value. Omitting *m* is interpreted as a lower limit of 0, while
-omitting *n* results in an upper bound of infinity --- actually, the upper bound
-is the 2-billion limit mentioned earlier, but that might as well be infinity.
+omitting *n* results in an upper bound of infinity.
 
 Readers of a reductionist bent may notice that the three other qualifiers can
 all be expressed using this notation. ``{0,}`` is the same as ``*``, ``{1,}``
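The qualifiers covered in this hunk (``+``, ``?`` and ``{m,n}``, including the forms with *m* or *n* omitted) can be exercised with a minimal sketch. The helper ``fully_matches`` and the extra sample strings are assumptions made for illustration; they do not appear in the patch.

```python
import re


def fully_matches(pattern, text):
    """Return True when *pattern* matches the whole of *text* (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


# ``+`` needs at least one 'a', so 'ct' is rejected.
plus = [fully_matches(r"ca+t", s) for s in ("cat", "caaat", "ct")]

# ``?`` makes the hyphen optional, but allows at most one.
optional = [fully_matches(r"home-?brew", s)
            for s in ("homebrew", "home-brew", "home--brew")]

# ``{1,3}`` accepts one to three slashes.
bounded = [fully_matches(r"a/{1,3}b", s)
           for s in ("a/b", "a//b", "a///b", "ab", "a////b")]

# Omitting n removes the upper bound; omitting m means a lower bound of 0.
no_upper = fully_matches(r"a/{2,}b", "a" + "/" * 50 + "b")
no_lower = fully_matches(r"a/{,2}b", "ab")

print(plus, optional, bounded, no_upper, no_lower)
```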
@@ -366,7 +362,7 @@ for a complete listing.
 | | returns them as an :term:`iterator`. |
 +------------------+-----------------------------------------------+
 
-:meth:`~re.regex.match` and :meth:`~re.regex.search` return ``None`` if no match can be found. If
+:meth:`~re.Pattern.match` and :meth:`~re.Pattern.search` return ``None`` if no match can be found. If
 they're successful, a :ref:`match object <match-objects>` instance is returned,
 containing information about the match: where it starts and ends, the substring
 it matched, and more.
@@ -388,24 +384,24 @@ Python interpreter, import the :mod:`re` module, and compile a RE::
 
 Now, you can try matching various strings against the RE ``[a-z]+``. An empty
 string shouldn't match at all, since ``+`` means 'one or more repetitions'.
-:meth:`match` should return ``None`` in this case, which will cause the
+:meth:`~re.Pattern.match` should return ``None`` in this case, which will cause the
 interpreter to print no output. You can explicitly print the result of
-:meth:`match` to make this clear. ::
+:meth:`!match` to make this clear. ::
 
 >>> p.match("")
 >>> print(p.match(""))
 None
 
 Now, let's try it on a string that it should match, such as ``tempo``. In this
-case, :meth:`match` will return a :ref:`match object <match-objects>`, so you
+case, :meth:`~re.Pattern.match` will return a :ref:`match object <match-objects>`, so you
 should store the result in a variable for later use. ::
 
 >>> m = p.match('tempo')
->>> m #doctest: +ELLIPSIS
+>>> m
 <re.Match object; span=(0, 5), match='tempo'>
 
 Now you can query the :ref:`match object <match-objects>` for information
-about the matching string. :ref:`match object <match-objects>` instances
+about the matching string. Match object instances
 also have several methods and attributes; the most important ones are:
 
 +------------------+--------------------------------------------+
@@ -430,17 +426,17 @@ Trying these methods will soon clarify their meaning::
 >>> m.span()
 (0, 5)
 
-:meth:`~re.match.group` returns the substring that was matched by the RE. :meth:`~re.match.start`
-and :meth:`~re.match.end` return the starting and ending index of the match. :meth:`~re.match.span`
-returns both start and end indexes in a single tuple. Since the :meth:`match`
-method only checks if the RE matches at the start of a string, :meth:`start`
-will always be zero. However, the :meth:`search` method of patterns
+:meth:`~re.Match.group` returns the substring that was matched by the RE. :meth:`~re.Match.start`
+and :meth:`~re.Match.end` return the starting and ending index of the match. :meth:`~re.Match.span`
+returns both start and end indexes in a single tuple. Since the :meth:`~re.Pattern.match`
+method only checks if the RE matches at the start of a string, :meth:`!start`
+will always be zero. However, the :meth:`~re.Pattern.search` method of patterns
 scans through the string, so the match may not start at zero in that
 case. ::
 
 >>> print(p.match('::: message'))
 None
->>> m = p.search('::: message'); print(m) #doctest: +ELLIPSIS
+>>> m = p.search('::: message'); print(m)
 <re.Match object; span=(4, 11), match='message'>
 >>> m.group()
 'message'
@@ -459,14 +455,14 @@ In actual programs, the most common style is to store the
 print('No match')
 
 Two pattern methods return all of the matches for a pattern.
-:meth:`~re.regex.findall` returns a list of matching strings::
+:meth:`~re.Pattern.findall` returns a list of matching strings::
 
 >>> p = re.compile('\d+')
 >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
 ['12', '11', '10']
 
-:meth:`findall` has to create the entire list before it can be returned as the
-result. The :meth:`~re.regex.finditer` method returns a sequence of
+:meth:`~re.Pattern.findall` has to create the entire list before it can be returned as the
+result. The :meth:`~re.Pattern.finditer` method returns a sequence of
 :ref:`match object <match-objects>` instances as an :term:`iterator`::
 
 >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
@@ -529,14 +525,14 @@ of each one.
 | | characters with the respective property. |
 +---------------------------------+--------------------------------------------+
 | :const:`DOTALL`, :const:`S` | Make ``.`` match any character, including |
-| | newlines |
+| | newlines. |
 +---------------------------------+--------------------------------------------+
-| :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches |
+| :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches. |
 +---------------------------------+--------------------------------------------+
-| :const:`LOCALE`, :const:`L` | Do a locale-aware match |
+| :const:`LOCALE`, :const:`L` | Do a locale-aware match. |
 +---------------------------------+--------------------------------------------+
 | :const:`MULTILINE`, :const:`M` | Multi-line matching, affecting ``^`` and |
-| | ``$`` |
+| | ``$``. |
 +---------------------------------+--------------------------------------------+
 | :const:`VERBOSE`, :const:`X` | Enable verbose REs, which can be organized |
 | (for 'extended') | more cleanly and understandably. |
@@ -549,27 +545,41 @@ of each one.
 
 Perform case-insensitive matching; character class and literal strings will
 match letters by ignoring case. For example, ``[A-Z]`` will match lowercase
-letters, too, and ``Spam`` will match ``Spam``, ``spam``, or ``spAM``. This
-lowercasing doesn't take the current locale into account; it will if you also
-set the :const:`LOCALE` flag.
+letters, too. Full Unicode matching also works unless the :const:`ASCII`
+flag is used to disable non-ASCII matches. When the Unicode patterns
+``[a-z]`` or ``[A-Z]`` are used in combination with the :const:`IGNORECASE`
+flag, they will match the 52 ASCII letters and 4 additional non-ASCII
+letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
+Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and
+'K' (U+212A, Kelvin sign). ``Spam`` will match ``'Spam'``, ``'spam'``,
+``'spAM'``, or ``'ſpam'`` (the latter is matched only in Unicode mode).
+This lowercasing doesn't take the current locale into account;
+it will if you also set the :const:`LOCALE` flag.
 
 
 .. data:: L
 LOCALE
 :noindex:
 
-Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale
-instead of the Unicode database.
+Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching dependent
+on the current locale instead of the Unicode database.
 
-Locales are a feature of the C library intended to help in writing programs that
-take account of language differences. For example, if you're processing French
-text, you'd want to be able to write ``\w+`` to match words, but ``\w`` only
-matches the character class ``[A-Za-z]``; it won't match ``'é'`` or ``'ç'``. If
-your system is configured properly and a French locale is selected, certain C
-functions will tell the program that ``'é'`` should also be considered a letter.
+Locales are a feature of the C library intended to help in writing programs
+that take account of language differences. For example, if you're
+processing encoded French text, you'd want to be able to write ``\w+`` to
+match words, but ``\w`` only matches the character class ``[A-Za-z]`` in
+bytes patterns; it won't match bytes corresponding to ``é`` or ``ç``.
+If your system is configured properly and a French locale is selected,
+certain C functions will tell the program that the byte corresponding to
+``é`` should also be considered a letter.
 Setting the :const:`LOCALE` flag when compiling a regular expression will cause
 the resulting compiled object to use these C functions for ``\w``; this is
 slower, but also enables ``\w+`` to match French words as you'd expect.
+The use of this flag is discouraged in Python 3 as the locale mechanism
+is very unreliable, it only handles one "culture" at a time, and it only
+works with 8-bit locales. Unicode matching is already enabled by default
+in Python 3 for Unicode (str) patterns, and it is able to handle different
+locales/languages.
 
 
 .. data:: M
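The four extra letters named in the new :const:`IGNORECASE` text can be probed with a short sketch. The variable names are illustrative and not part of the patch, and the characters are written as escapes so the code does not depend on the file encoding.

```python
import re

# The four non-ASCII letters listed in the new IGNORECASE paragraph.
extra_letters = ["\u0130", "\u0131", "\u017f", "\u212a"]

# Unicode (str) patterns: [a-z] with IGNORECASE also accepts these letters.
unicode_hits = [bool(re.fullmatch(r"[a-z]", ch, re.IGNORECASE))
                for ch in extra_letters]

# Adding re.ASCII restricts case-insensitive matching to ASCII letters.
ascii_hits = [bool(re.fullmatch(r"[a-z]", ch, re.IGNORECASE | re.ASCII))
              for ch in extra_letters]

# 'Spam' against a string that starts with LATIN SMALL LETTER LONG S.
long_s_unicode = bool(re.fullmatch(r"Spam", "\u017fpam", re.IGNORECASE))
long_s_ascii = bool(re.fullmatch(r"Spam", "\u017fpam", re.IGNORECASE | re.ASCII))

print(unicode_hits, ascii_hits, long_s_unicode, long_s_ascii)
```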
@@ -667,11 +677,11 @@ zero-width assertions should never be repeated, because if they match once at a
 given location, they can obviously be matched an infinite number of times.
 
 ``|``
-Alternation, or the "or" operator. If A and B are regular expressions,
-``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very
+Alternation, or the "or" operator. If *A* and *B* are regular expressions,
+``A|B`` will match any string that matches either *A* or *B*. ``|`` has very
 low precedence in order to make it work reasonably when you're alternating
-multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``,
-not ``Cro``, a ``'w'`` or an ``'S'``, and ``ervo``.
+multi-character strings. ``Crow|Servo`` will match either ``'Crow'`` or ``'Servo'``,
+not ``'Cro'``, a ``'w'`` or an ``'S'``, and ``'ervo'``.
 
 To match a literal ``'|'``, use ``\|``, or enclose it inside a character class,
 as in ``[|]``.
@@ -689,8 +699,7 @@ given location, they can obviously be matched an infinite number of times.
 >>> print(re.search('^From', 'Reciting From Memory'))
 None
 
-.. To match a literal \character{\^}, use \regexp{\e\^} or enclose it
-.. inside a character class, as in \regexp{[{\e}\^]}.
+To match a literal ``'^'``, use ``\^``.
 
 ``$``
 Matches at the end of a line, which is defined as either the end of the string,
@@ -725,7 +734,7 @@ given location, they can obviously be matched an infinite number of times.
 match when it's contained inside another word. ::
 
 >>> p = re.compile(r'\bclass\b')
->>> print(p.search('no class at all')) #doctest: +ELLIPSIS
+>>> print(p.search('no class at all'))
 <re.Match object; span=(3, 8), match='class'>
 >>> print(p.search('the declassified algorithm'))
 None
@@ -743,7 +752,7 @@ given location, they can obviously be matched an infinite number of times.
 >>> p = re.compile('\bclass\b')
 >>> print(p.search('no class at all'))
 None
->>> print(p.search('\b' + 'class' + '\b')) #doctest: +ELLIPSIS
+>>> print(p.search('\b' + 'class' + '\b'))
 <re.Match object; span=(0, 7), match='\x08class\x08'>
 
 Second, inside a character class, where there's no use for this assertion,
@@ -786,7 +795,8 @@ of a group with a repeating qualifier, such as ``*``, ``+``, ``?``, or
 
 Groups indicated with ``'('``, ``')'`` also capture the starting and ending
 index of the text that they match; this can be retrieved by passing an argument
-to :meth:`group`, :meth:`start`, :meth:`end`, and :meth:`span`. Groups are
+to :meth:`~re.Match.group`, :meth:`~re.Match.start`, :meth:`~re.Match.end`, and
+:meth:`~re.Match.span`. Groups are
 numbered starting with 0. Group 0 is always present; it's the whole RE, so
 :ref:`match object <match-objects>` methods all have group 0 as their default
 argument. Later we'll see how to express groups that don't capture the span
@@ -812,13 +822,13 @@ from left to right. ::
 >>> m.group(2)
 'b'
 
-:meth:`group` can be passed multiple group numbers at a time, in which case it
+:meth:`~re.Match.group` can be passed multiple group numbers at a time, in which case it
 will return a tuple containing the corresponding values for those groups. ::
 
 >>> m.group(2,1,2)
 ('b', 'abc', 'b')
 
-The :meth:`groups` method returns a tuple containing the strings for all the
+The :meth:`~re.Match.groups` method returns a tuple containing the strings for all the
 subgroups, from 1 up to however many there are. ::
 
 >>> m.groups()
@@ -1034,7 +1044,7 @@ using the following pattern methods:
 | ``sub()`` | Find all substrings where the RE matches, and |
 | | replace them with a different string |
 +------------------+-----------------------------------------------+
-| ``subn()`` | Does the same thing as :meth:`sub`, but |
+| ``subn()`` | Does the same thing as :meth:`!sub`, but |
 | | returns the new string and the number of |
 | | replacements |
 +------------------+-----------------------------------------------+
@@ -1043,10 +1053,10 @@ using the following pattern methods:
 Splitting Strings
 -----------------
 
-The :meth:`split` method of a pattern splits a string apart
+The :meth:`~re.Pattern.split` method of a pattern splits a string apart
 wherever the RE matches, returning a list of the pieces. It's similar to the
-:meth:`split` method of strings but provides much more generality in the
-delimiters that you can split by; string :meth:`split` only supports splitting by
+:meth:`~str.split` method of strings but provides much more generality in the
+delimiters that you can split by; string :meth:`!split` only supports splitting by
 whitespace or by a fixed string. As you'd expect, there's a module-level
 :func:`re.split` function, too.
 
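The contrast this hunk draws between the pattern-based ``split`` and the string method can be sketched as follows. The sample text and variable names are made up for illustration and do not come from the patch.

```python
import re

text = "alpha, beta;gamma ,delta"

# A regular expression can treat ',' or ';' with optional surrounding
# whitespace as one delimiter.
pattern_fields = re.split(r"\s*[,;]\s*", text)

# str.split() only splits on whitespace or on one fixed string, so the
# differently written delimiters are left in place.
string_fields = text.split(", ")

print(pattern_fields)
print(string_fields)
```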
@@ -1098,7 +1108,7 @@ Search and Replace
 ------------------
 
 Another common task is to find all the matches for a pattern, and replace them
-with a different string. The :meth:`sub` method takes a replacement value,
+with a different string. The :meth:`~re.Pattern.sub` method takes a replacement value,
 which can be either a string or a function, and the string to be processed.
 
 .. method:: .sub(replacement, string[, count=0])
@@ -1112,7 +1122,7 @@ which can be either a string or a function, and the string to be processed.
 replaced; *count* must be a non-negative integer. The default value of 0 means
 to replace all occurrences.
 
-Here's a simple example of using the :meth:`sub` method. It replaces colour
+Here's a simple example of using the :meth:`~re.Pattern.sub` method. It replaces colour
 names with the word ``colour``::
 
 >>> p = re.compile('(blue|white|red)')
@@ -1121,7 +1131,7 @@ names with the word ``colour``::
 >>> p.sub('colour', 'blue socks and red shoes', count=1)
 'colour socks and red shoes'
 
-The :meth:`subn` method does the same work, but returns a 2-tuple containing the
+The :meth:`~re.Pattern.subn` method does the same work, but returns a 2-tuple containing the
 new string value and the number of replacements that were performed::
 
 >>> p = re.compile('(blue|white|red)')
@@ -1206,24 +1216,24 @@ Use String Methods
 
 Sometimes using the :mod:`re` module is a mistake. If you're matching a fixed
 string, or a single character class, and you're not using any :mod:`re` features
-such as the :const:`IGNORECASE` flag, then the full power of regular expressions
+such as the :const:`~re.IGNORECASE` flag, then the full power of regular expressions
 may not be required. Strings have several methods for performing operations with
 fixed strings and they're usually much faster, because the implementation is a
 single small C loop that's been optimized for the purpose, instead of the large,
 more generalized regular expression engine.
 
 One example might be replacing a single fixed string with another one; for
-example, you might replace ``word`` with ``deed``. ``re.sub()`` seems like the
-function to use for this, but consider the :meth:`replace` method. Note that
-:func:`replace` will also replace ``word`` inside words, turning ``swordfish``
+example, you might replace ``word`` with ``deed``. :func:`re.sub` seems like the
+function to use for this, but consider the :meth:`~str.replace` method. Note that
+:meth:`!replace` will also replace ``word`` inside words, turning ``swordfish``
 into ``sdeedfish``, but the naive RE ``word`` would have done that, too. (To
 avoid performing the substitution on parts of words, the pattern would have to
 be ``\bword\b``, in order to require that ``word`` have a word boundary on
-either side. This takes the job beyond :meth:`replace`'s abilities.)
+either side. This takes the job beyond :meth:`!replace`'s abilities.)
 
 Another common task is deleting every occurrence of a single character from a
 string or replacing it with another single character. You might do this with
-something like ``re.sub('\n', ' ', S)``, but :meth:`translate` is capable of
+something like ``re.sub('\n', ' ', S)``, but :meth:`~str.translate` is capable of
 doing both tasks and will be faster than any regular expression operation can
 be.
 
@@ -1234,18 +1244,18 @@ can be solved with a faster and simpler string method.
 match() versus search()
 -----------------------
 
-The :func:`match` function only checks if the RE matches at the beginning of the
-string while :func:`search` will scan forward through the string for a match.
-It's important to keep this distinction in mind. Remember, :func:`match` will
+The :func:`~re.match` function only checks if the RE matches at the beginning of the
+string while :func:`~re.search` will scan forward through the string for a match.
+It's important to keep this distinction in mind. Remember, :func:`!match` will
 only report a successful match which will start at 0; if the match wouldn't
-start at zero, :func:`match` will *not* report it. ::
+start at zero, :func:`!match` will *not* report it. ::
 
 >>> print(re.match('super', 'superstition').span())
 (0, 5)
 >>> print(re.match('super', 'insuperable'))
 None
 
-On the other hand, :func:`search` will scan forward through the string,
+On the other hand, :func:`~re.search` will scan forward through the string,
 reporting the first match it finds. ::
 
 >>> print(re.search('super', 'superstition').span())
@@ -1284,12 +1294,12 @@ doesn't work because of the greedy nature of ``.*``. ::
 >>> print(re.match('<.*>', s).group())
 <html><head><title>Title</title>
 
-The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of
+The RE matches the ``'<'`` in ``'<html>'``, and the ``.*`` consumes the rest of
 the string. There's still more left in the RE, though, and the ``>`` can't
 match at the end of the string, so the regular expression engine has to
 backtrack character by character until it finds a match for the ``>``. The
-final match extends from the ``'<'`` in ``<html>`` to the ``'>'`` in
-``</title>``, which isn't what you want.
+final match extends from the ``'<'`` in ``'<html>'`` to the ``'>'`` in
+``'</title>'``, which isn't what you want.
 
 In this case, the solution is to use the non-greedy qualifiers ``*?``, ``+?``,
 ``??``, or ``{m,n}?``, which match as *little* text as possible. In the above
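The greedy and non-greedy results discussed in this hunk can be compared side by side with a short sketch; it reuses the sample string from the surrounding documentation, and the variable names are illustrative.

```python
import re

s = "<html><head><title>Title</title>"

# Greedy: ``.*`` first consumes the rest of the string, then backtracks
# to the last '>' it can find.
greedy = re.match(r"<.*>", s).group()

# Non-greedy: ``.*?`` stops at the first '>' that lets the match succeed.
non_greedy = re.match(r"<.*?>", s).group()

print(greedy)
print(non_greedy)
```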
@@ -1315,7 +1325,7 @@ notation, but they're not terribly readable. REs of moderate complexity can
 become lengthy collections of backslashes, parentheses, and metacharacters,
 making them difficult to read and understand.
 
-For such REs, specifying the ``re.VERBOSE`` flag when compiling the regular
+For such REs, specifying the :const:`re.VERBOSE` flag when compiling the regular
 expression can be helpful, because it allows you to format the regular
 expression more clearly.
 
@@ -1354,5 +1364,5 @@ Friedl's Mastering Regular Expressions, published by O'Reilly. Unfortunately,
 it exclusively concentrates on Perl and Java's flavours of regular expressions,
 and doesn't contain any Python material at all, so it won't be useful as a
 reference for programming in Python. (The first edition covered Python's
-now-removed :mod:`regex` module, which won't help you much.) Consider checking
+now-removed :mod:`!regex` module, which won't help you much.) Consider checking
 it out from your library.
@@ -14,8 +14,9 @@
 This module provides regular expression matching operations similar to
 those found in Perl.
 
-Both patterns and strings to be searched can be Unicode strings as well as
-8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
+Both patterns and strings to be searched can be Unicode strings (:class:`str`)
+as well as 8-bit strings (:class:`bytes`).
+However, Unicode strings and 8-bit strings cannot be mixed:
 that is, you cannot match a Unicode string with a byte pattern or
 vice-versa; similarly, when asking for a substitution, the replacement
 string must be of the same type as both the pattern and the search string.
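The rule that :class:`str` and :class:`bytes` cannot be mixed can be observed directly. This is a minimal sketch with made-up sample data; the names ``text_hit``, ``bytes_hit`` and ``mixed_error`` are illustrative.

```python
import re

# A str pattern against a str, and a bytes pattern against bytes, both work.
text_hit = re.search(r"\d+", "order 66").group()
bytes_hit = re.search(rb"\d+", b"order 66").group()

# Mixing the two types raises TypeError instead of silently failing.
try:
    re.search(rb"\d+", "order 66")
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None

print(text_hit, bytes_hit, type(mixed_error).__name__)
```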
@@ -81,9 +82,7 @@ strings to be matched ``'in single quotes'``.)
 
 Some characters, like ``'|'`` or ``'('``, are special. Special
 characters either stand for classes of ordinary characters, or affect
-how the regular expressions around them are interpreted. Regular
-expression pattern strings may not contain null bytes, but can specify
-the null byte using a ``\number`` notation such as ``'\x00'``.
+how the regular expressions around them are interpreted.
 
 Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc) cannot be
 directly nested. This avoids ambiguity with the non-greedy modifier suffix
@@ -94,16 +93,16 @@ the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
 
 The special characters are:
 
-``'.'``
+``.``
 (Dot.) In the default mode, this matches any character except a newline. If
 the :const:`DOTALL` flag has been specified, this matches any character
 including a newline.
 
-``'^'``
+``^``
 (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
 matches immediately after each newline.
 
-``'$'``
+``$``
 Matches the end of the string or just before the newline at the end of the
 string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
 matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
@@ -112,28 +111,28 @@ The special characters are:
 a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
 the newline, and one at the end of the string.
 
-``'*'``
+``*``
 Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
 many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
 by any number of 'b's.
 
-``'+'``
+``+``
 Causes the resulting RE to match 1 or more repetitions of the preceding RE.
 ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
 match just 'a'.
 
-``'?'``
+``?``
 Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
 ``ab?`` will match either 'a' or 'ab'.
 
 ``*?``, ``+?``, ``??``
 The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
 as much text as possible. Sometimes this behaviour isn't desired; if the RE
-``<.*>`` is matched against ``<a> b <c>``, it will match the entire
-string, and not just ``<a>``. Adding ``?`` after the qualifier makes it
+``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
+string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
 perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
 characters as possible will be matched. Using the RE ``<.*?>`` will match
-only ``<a>``.
+only ``'<a>'``.
 
 ``{m}``
 Specifies that exactly *m* copies of the previous RE should be matched; fewer
@@ -145,8 +144,8 @@ The special characters are:
 RE, attempting to match as many repetitions as possible. For example,
 ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
 lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
-example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
-followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
+example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
+followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
 modifier would be confused with the previously described form.
 
 ``{m,n}?``
@@ -156,7 +155,7 @@ The special characters are:
 6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
 while ``a{3,5}?`` will only match 3 characters.
 
-``'\'``
+``\``
 Either escapes special characters (permitting you to match characters like
 ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
 sequences are discussed below.
@@ -179,8 +178,8 @@ The special characters are:
 them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
 ``[0-5][0-9]`` will match all the two-digits numbers from ``00`` to ``59``, and
 ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
-``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
-it will match a literal ``'-'``.
+``[a\-z]``) or if it's placed as the first or last character
+(e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.
 
 * Special characters lose their special meaning inside sets. For example,
 ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
@ -201,13 +200,13 @@ The special characters are:
place it at the beginning of the set. For example, both ``[()[\]{}]`` and
``[]()[{}]`` will match a parenthesis.

``'|'``
``A|B``, where A and B can be arbitrary REs, creates a regular expression that
will match either A or B. An arbitrary number of REs can be separated by the
``|``
``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
will match either *A* or *B*. An arbitrary number of REs can be separated by the
``'|'`` in this way. This can be used inside groups (see below) as well. As
the target string is scanned, REs separated by ``'|'`` are tried from left to
right. When one pattern completely matches, that branch is accepted. This means
that once ``A`` matches, ``B`` will not be tested further, even if it would
that once *A* matches, *B* will not be tested further, even if it would
produce a longer overall match. In other words, the ``'|'`` operator is never
greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
character class, as in ``[|]``.
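The left-to-right behaviour of ``|`` described in this hunk can be illustrated with a minimal sketch. The sample strings are made up for the illustration.

```python
import re

# Alternatives are tried left to right; the first branch that succeeds is
# accepted, even when a later branch would produce a longer match.
short_first = re.match(r"a|ab", "ab")
long_first = re.match(r"ab|a", "ab")

print(short_first.group())  # a
print(long_first.group())   # ab

# A literal '|' can be matched by escaping it or by using a character class.
escaped = re.findall(r"\|", "x|y|z")
in_class = re.findall(r"[|]", "x|y|z")
print(escaped == in_class)  # True
```

When the longest alternative is wanted, listing the longer branch first is the usual workaround.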
@ -217,7 +216,7 @@ The special characters are:
start and end of a group; the contents of a group can be retrieved after a match
has been performed, and can be matched later in the string with the ``\number``
special sequence, described below. To match the literals ``'('`` or ``')'``,
use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

``(?...)``
This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
@ -232,10 +231,11 @@ The special characters are:
letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
and :const:`re.X` (verbose), for the entire regular expression. (The
flags are described in :ref:`contents-of-module-re`.) This
is useful if you wish to include the flags as part of the regular
expression, instead of passing a *flag* argument to the
:const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
for the entire regular expression.
(The flags are described in :ref:`contents-of-module-re`.)
This is useful if you wish to include the flags as part of the
regular expression, instead of passing a *flag* argument to the
:func:`re.compile` function. Flags should be used first in the
expression string.

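As a rough illustration of the inline-flag notation in this hunk, the sketch below compares a flag embedded in the pattern with the same flag passed to ``re.compile``. The pattern and sample inputs are assumptions chosen for the example.

```python
import re

# Flag written inline, at the very start of the pattern string.
inline = re.compile(r"(?i)spam")
# The same flag passed as an argument instead.
explicit = re.compile(r"spam", re.IGNORECASE)

for candidate in ("SPAM", "Spam", "eggs"):
    # Both compiled patterns are expected to agree on every input.
    print(candidate, bool(inline.match(candidate)), bool(explicit.match(candidate)))

# (?s) corresponds to re.S: '.' then also matches a newline.
dotall = re.search(r"(?s)a.b", "a\nb")
print(dotall is not None)  # True
```

The inline form is convenient when the pattern is stored as plain text, for example in a configuration file, and no separate flags argument can be supplied.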
@ -272,10 +272,10 @@ The special characters are:
| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
|                                       | * ``\1``                         |
+---------------------------------------+----------------------------------+
| when processing match object ``m``    | * ``m.group('quote')``           |
| when processing match object *m*      | * ``m.group('quote')``           |
|                                       | * ``m.end('quote')`` (etc.)      |
+---------------------------------------+----------------------------------+
| in a string passed to the ``repl``    | * ``\g<quote>``                  |
| in a string passed to the *repl*      | * ``\g<quote>``                  |
| argument of ``re.sub()``              | * ``\g<1>``                      |
|                                       | * ``\1``                         |
+---------------------------------------+----------------------------------+
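The three contexts listed in the table (inside the pattern, on a match object, and in a replacement string) can be exercised together. This is a sketch only: the pattern follows the ``quote`` group named in the table, and the sample text is invented for the example.

```python
import re

# A quoted string: the closing quote must equal the opening one.
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = "say 'hello' now"

m = pattern.search(text)
print(m.group(0))        # 'hello'  (including the quotes)
print(m.group("quote"))  # '
print(m.end("quote"))    # 5

# In a replacement string the group can be referenced by name or by number.
by_name = pattern.sub(r"\g<quote>X\g<quote>", text)
by_number = pattern.sub(r"\g<1>X\g<1>", text)
by_backslash = pattern.sub(r"\1X\1", text)
print(by_name)  # say 'X' now
```

``\g<1>`` is the safer numeric form when the reference is followed by a digit, since ``\10`` would be read as group 10.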
@ -289,18 +289,18 @@ The special characters are:

``(?=...)``
Matches if ``...`` matches next, but doesn't consume any of the string. This is
called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
followed by ``'Asimov'``.

``(?<=...)``
Matches if the current position in the string is preceded by a match for ``...``
that ends at the current position. This is called a :dfn:`positive lookbehind
assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
lookbehind will back up 3 characters and check if the contained pattern matches.
The contained pattern must only match strings of some fixed length, meaning that
``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
@ -358,26 +358,26 @@ character ``'$'``.

``\b``
Matches the empty string, but only at the beginning or end of a word.
A word is defined as a sequence of Unicode alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore Unicode character. Note that formally,
A word is defined as a sequence of word characters. Note that formally,
``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
(or vice versa), or between ``\w`` and the beginning/end of the string.
This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

By default Unicode alphanumerics are the ones used, but this can be changed
by using the :const:`ASCII` flag. Inside a character range, ``\b``
represents the backspace character, for compatibility with Python's string
literals.
By default Unicode alphanumerics are the ones used in Unicode patterns, but
this can be changed by using the :const:`ASCII` flag. Word boundaries are
determined by the current locale if the :const:`LOCALE` flag is used.
Inside a character range, ``\b`` represents the backspace character, for
compatibility with Python's string literals.

``\B``
Matches the empty string, but only when it is *not* at the beginning or end
of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
``\B`` is just the opposite of ``\b``, so word characters are
Unicode alphanumerics or the underscore, although this can be changed
by using the :const:`ASCII` flag.
``\B`` is just the opposite of ``\b``, so word characters in Unicode
patterns are Unicode alphanumerics or the underscore, although this can
be changed by using the :const:`ASCII` flag. Word boundaries are
determined by the current locale if the :const:`LOCALE` flag is used.

``\d``
For Unicode (str) patterns:
@ -387,11 +387,12 @@ character ``'$'``.
used only ``[0-9]`` is matched (but the flag affects the entire
regular expression, so in such cases using an explicit ``[0-9]``
may be a better choice).

For 8-bit (bytes) patterns:
Matches any decimal digit; this is equivalent to ``[0-9]``.

``\D``
Matches any character which is not a Unicode decimal digit. This is
Matches any character which is not a decimal digit. This is
the opposite of ``\d``. If the :const:`ASCII` flag is used this
becomes the equivalent of ``[^0-9]`` (but the flag affects the entire
regular expression, so in such cases using an explicit ``[^0-9]`` may
@ -412,7 +413,7 @@ character ``'$'``.
this is equivalent to ``[ \t\n\r\f\v]``.

``\S``
Matches any character which is not a Unicode whitespace character. This is
Matches any character which is not a whitespace character. This is
the opposite of ``\s``. If the :const:`ASCII` flag is used this
becomes the equivalent of ``[^ \t\n\r\f\v]`` (but the flag affects the entire
regular expression, so in such cases using an explicit ``[^ \t\n\r\f\v]`` may
@ -426,16 +427,21 @@ character ``'$'``.
``[a-zA-Z0-9_]`` is matched (but the flag affects the entire
regular expression, so in such cases using an explicit
``[a-zA-Z0-9_]`` may be a better choice).

For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set;
this is equivalent to ``[a-zA-Z0-9_]``.
this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
used, matches characters considered alphanumeric in the current locale
and the underscore.

``\W``
Matches any character which is not a Unicode word character. This is
Matches any character which is not a word character. This is
the opposite of ``\w``. If the :const:`ASCII` flag is used this
becomes the equivalent of ``[^a-zA-Z0-9_]`` (but the flag affects the
entire regular expression, so in such cases using an explicit
``[^a-zA-Z0-9_]`` may be a better choice).
``[^a-zA-Z0-9_]`` may be a better choice). If the :const:`LOCALE` flag is
used, matches characters which are neither alphanumeric in the current
locale nor the underscore.

``\Z``
Matches only at the end of the string.
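The str/bytes distinction for ``\w`` and the behaviour of ``\Z`` covered in this hunk can be sketched as follows. The sample text is an assumption chosen to contain non-ASCII letters.

```python
import re

text = "café_1 naïve"

# Unicode (str) pattern: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)
# With re.ASCII only [a-zA-Z0-9_] counts, so the accented letters split words.
ascii_words = re.findall(r"\w+", text, re.ASCII)
# Bytes pattern: ASCII-only by default, applied here to the UTF-8 encoding.
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)  # ['café_1', 'naïve']
print(ascii_words)    # ['caf', '_1', 'na', 've']
print(bytes_words)    # [b'caf', b'_1', b'na', b've']

# \Z matches only at the very end; '$' also matches before a trailing newline.
strict_end = re.search(r"end\Z", "the end\n")
loose_end = re.search(r"end$", "the end\n")
print(strict_end is None, loose_end is not None)  # True True
```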
@ -451,7 +457,7 @@ accepted by the regular expression parser::
only inside character classes.)

``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
patterns. In bytes patterns they are not treated specially.
patterns. In bytes patterns they are errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
@ -526,6 +532,7 @@ form.
Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
perform ASCII-only matching instead of full Unicode matching. This is only
meaningful for Unicode patterns, and is ignored for byte patterns.
Corresponds to the inline flag ``(?a)``.

Note that for backward compatibility, the :const:`re.U` flag still
exists (as well as its synonym :const:`re.UNICODE` and its embedded
@ -537,26 +544,40 @@ form.
.. data:: DEBUG

Display debug information about compiled expression.
No corresponding inline flag.


.. data:: I
IGNORECASE

Perform case-insensitive matching; expressions like ``[A-Z]`` will also
match lowercase letters. The current locale does not change the effect of
this flag. Full Unicode matching (such as ``Ü`` matching ``ü``) also
works unless the :const:`re.ASCII` flag is also used to disable non-ASCII
matches.
match lowercase letters. Full Unicode matching (such as ``Ü`` matching
``ü``) also works unless the :const:`re.ASCII` flag is used to disable
non-ASCII matches. The current locale does not change the effect of this
flag unless the :const:`re.LOCALE` flag is also used.
Corresponds to the inline flag ``(?i)``.

Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
If the :const:`ASCII` flag is used, only letters 'a' to 'z'
and 'A' to 'Z' are matched (but the flag affects the entire regular
expression, so in such cases using an explicit ``(?-i:[a-zA-Z])`` may be
a better choice).

.. data:: L
LOCALE

Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
current locale. The use of this flag is discouraged as the locale mechanism
is very unreliable, and it only handles one "culture" at a time anyway;
you should use Unicode matching instead, which is the default in Python 3
for Unicode (str) patterns. This flag can be used only with bytes patterns.
Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
dependent on the current locale. This flag can be used only with bytes
patterns. The use of this flag is discouraged as the locale mechanism
is very unreliable, it only handles one "culture" at a time, and it only
works with 8-bit locales. Unicode matching is already enabled by default
in Python 3 for Unicode (str) patterns, and it is able to handle different
locales/languages.
Corresponds to the inline flag ``(?L)``.

.. versionchanged:: 3.6
:const:`re.LOCALE` can be used only with bytes patterns and is
@ -577,6 +598,7 @@ form.
end of each line (immediately preceding each newline). By default, ``'^'``
matches only at the beginning of the string, and ``'$'`` only at the end of the
string and immediately before the newline (if any) at the end of the string.
Corresponds to the inline flag ``(?m)``.


.. data:: S
@ -584,6 +606,7 @@ form.

Make the ``'.'`` special character match any character at all, including a
newline; without this flag, ``'.'`` will match anything *except* a newline.
Corresponds to the inline flag ``(?s)``.


.. data:: X
@ -605,7 +628,7 @@ form.
\d * # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")


Corresponds to the inline flag ``(?x)``.


.. function:: search(pattern, string, flags=0)
@ -660,7 +683,7 @@ form.

If there are capturing groups in the separator and it matches at the start of
the string, the result will start with an empty string. The same holds for
the end of the string:
the end of the string::

>>> re.split('(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
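The effect of a capturing group in the separator, as described in this hunk, can be compared with the plain form in a small sketch. It uses raw strings for the patterns, which is an editorial choice and not part of the commit.

```python
import re

text = "...words, words..."

# With a capturing group, the separators themselves are kept in the result.
with_group = re.split(r"(\W+)", text)
# Without it, only the pieces between separators are returned.
without_group = re.split(r"\W+", text)

print(with_group)     # ['', '...', 'words', ', ', 'words', '...', '']
print(without_group)  # ['', 'words', 'words', '']

# The empty strings mark a separator at the very start or end of the input;
# they can be filtered out when only the words are of interest.
words = [piece for piece in without_group if piece]
print(words)          # ['words', 'words']
```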
@ -671,7 +694,7 @@ form.
.. note::

:func:`split` doesn't currently split a string on an empty pattern match.
For example:
For example::

>>> re.split('x*', 'axbc')
['a', 'bc']
@ -728,7 +751,7 @@ form.
converted to a single newline character, ``\r`` is converted to a carriage return, and
so forth. Unknown escapes such as ``\&`` are left alone. Backreferences, such
as ``\6``, are replaced with the substring matched by group 6 in the pattern.
For example:
For example::

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
... r'static PyObject*\npy_\1(void)\n{',
@ -736,8 +759,8 @@ form.
'static PyObject*\npy_myfunc(void)\n{'

If *repl* is a function, it is called for every non-overlapping occurrence of
*pattern*. The function takes a single match object argument, and returns the
replacement string. For example:
*pattern*. The function takes a single :ref:`match object <match-objects>`
argument, and returns the replacement string. For example::

>>> def dashrepl(matchobj):
... if matchobj.group(0) == '-': return ' '
@ -747,7 +770,7 @@ form.
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'

The pattern may be a string or a :class:`Pattern` object.
The pattern may be a string or a :ref:`pattern object <re-objects>`.

The optional argument *count* is the maximum number of pattern occurrences to be
replaced; *count* must be a non-negative integer. If omitted or zero, all
@ -809,6 +832,14 @@ form.
>>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
/|\-|\+|\*\*|\*

This function must not be used for the replacement string in :func:`sub`
and :func:`subn`; only backslashes should be escaped. For example::

>>> digits_re = r'\d+'
>>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
>>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
/usr/sbin/sendmail - \d+ errors, \d+ warnings

.. versionchanged:: 3.3
The ``'_'`` character is no longer escaped.

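The advice in this hunk about replacement strings can be sketched in two ways: escaping only the backslashes, as the added example does, or passing a function so that the returned text is used verbatim. The second variant is an editorial suggestion rather than something the commit documents here.

```python
import re

sample = "/usr/sbin/sendmail - 0 errors, 12 warnings"
literal_repl = r"\d+"  # text that should appear verbatim in the output

# Escape only the backslashes so sub() does not interpret "\d" as an escape.
safe_repl = literal_repl.replace("\\", r"\\")
escaped_result = re.sub(r"\d+", safe_repl, sample)

# A function's return value is inserted as-is, so no escaping is needed.
function_result = re.sub(r"\d+", lambda match: literal_repl, sample)

print(escaped_result)                     # /usr/sbin/sendmail - \d+ errors, \d+ warnings
print(escaped_result == function_result)  # True
```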
@ -880,12 +911,12 @@ attributes:
from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
expression object, ``rx.search(string, 0, 50)`` is equivalent to
``rx.search(string[:50], 0)``.
``rx.search(string[:50], 0)``. ::

>>> pattern = re.compile("d")
>>> pattern.search("dog") # Match at index 0
<re.Match object; span=(0, 1), match='d'>
>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
>>> pattern = re.compile("d")
>>> pattern.search("dog") # Match at index 0
<re.Match object; span=(0, 1), match='d'>
>>> pattern.search("dog", 1) # No match; search doesn't include the "d"


.. method:: Pattern.match(string[, pos[, endpos]])
@ -896,12 +927,12 @@ attributes:
different from a zero-length match.

The optional *pos* and *endpos* parameters have the same meaning as for the
:meth:`~Pattern.search` method.
:meth:`~Pattern.search` method. ::

>>> pattern = re.compile("o")
>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
<re.Match object; span=(1, 2), match='o'>
>>> pattern = re.compile("o")
>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
<re.Match object; span=(1, 2), match='o'>

If you want to locate a match anywhere in *string*, use
:meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
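The *pos* and *endpos* parameters discussed in the two hunks above are not quite the same as slicing the string, because ``'^'`` still refers to the real start of the string. A minimal sketch, with variable names chosen for illustration:

```python
import re

text = "dog"
plain = re.compile("o")
anchored = re.compile("^o")

no_pos = plain.match(text)       # None: "o" is not at index 0
with_pos = plain.match(text, 1)  # matching starts at index 1

# '^' still means the real beginning of the string, so pos=1 does not help,
# whereas slicing produces a new string that really starts with "o".
caret_with_pos = anchored.match(text, 1)
caret_on_slice = anchored.match(text[1:])

# endpos=2 limits the search to "do", so the final "g" is never seen.
limited = re.compile("g").search(text, 0, 2)

print(no_pos, with_pos.span())         # None (1, 2)
print(caret_with_pos, caret_on_slice is not None)  # None True
print(limited)                         # None
```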
@ -914,13 +945,13 @@ attributes:
match the pattern; note that this is different from a zero-length match.

The optional *pos* and *endpos* parameters have the same meaning as for the
:meth:`~Pattern.search` method.
:meth:`~Pattern.search` method. ::

>>> pattern = re.compile("o[gh]")
>>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".
>>> pattern.fullmatch("ogre") # No match as not the full string matches.
>>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits.
<re.Match object; span=(1, 3), match='og'>
>>> pattern = re.compile("o[gh]")
>>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".
>>> pattern.fullmatch("ogre") # No match as not the full string matches.
>>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits.
<re.Match object; span=(1, 3), match='og'>

.. versionadded:: 3.4

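To make the contrast with ``match`` explicit, the following sketch runs both methods on the same inputs, reusing the ``o[gh]`` pattern from the hunk above. The extra sample strings are assumptions.

```python
import re

pattern = re.compile(r"o[gh]")

# match() only requires the pattern to succeed at the start of the string.
prefix = pattern.match("ogre")
# fullmatch() additionally requires the match to reach the end.
whole = pattern.fullmatch("ogre")
exact = pattern.fullmatch("oh")
# With pos and endpos, "the whole string" means the selected region.
window = pattern.fullmatch("doggie", 1, 3)

print(prefix.group())  # og
print(whole)           # None
print(exact.group())   # oh
print(window.span())   # (1, 3)
```

``fullmatch`` is handy for validation, where trailing characters should cause a rejection without adding an explicit end anchor to the pattern.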
@ -934,14 +965,14 @@ attributes:

Similar to the :func:`findall` function, using the compiled pattern, but
also accepts optional *pos* and *endpos* parameters that limit the search
region like for :meth:`match`.
region like for :meth:`search`.


.. method:: Pattern.finditer(string[, pos[, endpos]])

Similar to the :func:`finditer` function, using the compiled pattern, but
also accepts optional *pos* and *endpos* parameters that limit the search
region like for :meth:`match`.
region like for :meth:`search`.


.. method:: Pattern.sub(repl, string, count=0)
@ -1024,7 +1055,7 @@ Match objects support the following methods and attributes:
pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
part of the pattern that did not match, the corresponding result is ``None``.
If a group is contained in a part of the pattern that matched multiple times,
the last match is returned.
the last match is returned. ::

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
@ -1041,7 +1072,7 @@ Match objects support the following methods and attributes:
string argument is not used as a group name in the pattern, an :exc:`IndexError`
exception is raised.

A moderately complicated example:
A moderately complicated example::

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
@ -1049,14 +1080,14 @@ Match objects support the following methods and attributes:
>>> m.group('last_name')
'Reynolds'

Named groups can also be referred to by their index:
Named groups can also be referred to by their index::

>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'

If a group matches multiple times, only the last match is accessible:
If a group matches multiple times, only the last match is accessible::

>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
>>> m.group(1) # Returns only the last match.
@ -1066,7 +1097,7 @@ Match objects support the following methods and attributes:
.. method:: Match.__getitem__(g)

This is identical to ``m.group(g)``. This allows easier access to
an individual group from a match:
an individual group from a match::

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m[0] # The entire match
@ -1085,7 +1116,7 @@ Match objects support the following methods and attributes:
many groups are in the pattern. The *default* argument is used for groups that
did not participate in the match; it defaults to ``None``.

For example:
For example::

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
@ -1093,7 +1124,7 @@ Match objects support the following methods and attributes:

If we make the decimal place and everything after it optional, not all groups
might participate in the match. These groups will default to ``None`` unless
the *default* argument is given:
the *default* argument is given::

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups() # Second group defaults to None.
@ -1106,7 +1137,7 @@ Match objects support the following methods and attributes:

Return a dictionary containing all the *named* subgroups of the match, keyed by
the subgroup name. The *default* argument is used for groups that did not
participate in the match; it defaults to ``None``. For example:
participate in the match; it defaults to ``None``. For example::

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
@ -1129,7 +1160,7 @@ Match objects support the following methods and attributes:
``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

An example that will remove *remove_this* from email addresses:
An example that will remove *remove_this* from email addresses::

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
@ -1175,7 +1206,7 @@ Match objects support the following methods and attributes:

.. attribute:: Match.re

The regular expression object whose :meth:`~Pattern.match` or
The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
:meth:`~Pattern.search` method produced this match instance.


@ -1213,7 +1244,7 @@ a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following:
To see if a given string is a valid hand, one could do the following::

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q")) # Valid.
@ -1224,7 +1255,7 @@ To see if a given string is a valid hand, one could do the following:
"<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such:
To match this with a regular expression, one could use backreferences as such::

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak")) # Pair of 7s.
@ -1326,7 +1357,7 @@ restrict the match at the beginning of the string::

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line.
beginning with ``'^'`` will match at the beginning of each line. ::

>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
@ -1342,7 +1373,7 @@ easily read and modified by Python as demonstrated in the following example that
creates a phonebook.

First, here is the input. Normally it may come from a file, here we are using
triple-quoted string syntax:
triple-quoted string syntax::

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
@ -1417,7 +1448,7 @@ Finding all Adverbs
:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, if one was a writer and wanted to
find all of the adverbs in some text, he or she might use :func:`findall` in
the following manner:
the following manner::

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
@ -1431,7 +1462,7 @@ If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings. Continuing with the previous example, if
one was a writer who wanted to find all of the adverbs *and their positions* in
some text, he or she would use :func:`finditer` in the following manner:
some text, he or she would use :func:`finditer` in the following manner::

>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
@ -1446,7 +1477,7 @@ Raw String Notation
Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical:
functionally identical::

>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
@ -1456,7 +1487,7 @@ functionally identical:
When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:
functionally identical::

>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
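The equivalence between the raw and non-raw spellings described in the last two hunks can be checked directly. The sketch below compares the string objects themselves before using them as patterns; the variable names are illustrative.

```python
import re

# Both literals produce the same two-character pattern: backslash, backslash.
raw_pattern = r"\\"
plain_pattern = "\\\\"
print(raw_pattern == plain_pattern, len(raw_pattern))  # True 2

# That pattern matches a single literal backslash in the subject string.
subject = "\\"  # one backslash character
found = re.match(raw_pattern, subject)
print(found.span())  # (0, 1)

# The backreference example works the same way in either notation.
raw_ref = re.match(r"\W(.)\1\W", " ff ")
plain_ref = re.match("\\W(.)\\1\\W", " ff ")
print(raw_ref.group() == plain_ref.group())  # True
```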