Dazbo's Advent of Code solutions, written in Python
Regular Expression HOWTOPython RE modulePython Regex moduleRegexrPyrexp
Regular expressions (often shortened to regex) are a way to look for matching patterns in any text. We can use the pattern to determine where the pattern appears in the text, or to do more sophisticated things like replacing patterns in text.
A pattern is something we want to match within a string. Patterns can be simple, or complex. Patterns include the text we want to look for, along with metacharacters which have special meanings.
Check out this tutorial for a guide on how to build patterns.
Then, make a note of the awesome regexr.com, which is a great place to test and build your regular expressions. It also includes some really useful cheat sheets and references.
Python provides a built-in library for working with regular expressions, called re
. This will generally be good enough for most of our regular expression needs in Python. However, there are some niche cases where you want to do stuff that re
doesn’t offer. In this case, try out the third party Python regex module, which is basically re
on steroids. E.g. finding overlapping pattern matches.
In general, the approach to regex in Python is to compile the pattern, and then use one of handful of methods to apply the pattern to a string or strings.
For example:
import re
# Assume we've loaded in some multiline text data into the variable `data`
# We want to match rows of data that looks like: 5-7 z: qhcgzzz
# We want to obtain 5, 7, z, and qhcgzzz as four separate variables
matcher = re.compile(r"(\d+)-(\d+) ([a-z]): ([a-z]+)")
for row in data:
match = matcher.match(row)
min_val, max_val, policy_char, token_str = match.groups()
Here:
r"regex"
, to avoid any need for convoluted escape characters. The r
prefix turns the str
into Python’s raw string format. In short, it’s just a good idea to always pass patterns in this raw format.match(row)
method looks for the pattern within the text called row
. In particular, the match()
method will look for the match from the beginning of the line of data.search()
instead of match()
.match()
or search()
are successful, they return a match
object. If not, they return None
.(group)
. When we call the groups()
method against any successful match
object, this returns a tuple of the four groups in our regex pattern.It can be useful to perform assignment at the same time as checking if match object was returned. For example, here we will only enter the if
block if a match was found. If a match was found, then the match object will have been assigned to the variable called match
:
if match := matcher.match(row):
# do stuff with match object
We don’t have to compile the pattern in advance. For example, we can do this:
# We're looking for data like... "25,50 -> 30,600"
for line in data:
x1, y1, x2, y2 = map(int, re.match(r"(\d+),(\d+) -> (\d+),(\d+)", line).groups())
lines.append(Line(x1, y1, x2, y2))
Here, we’re:
match()
against re
, rather than against a precompiled pattern.str
type to int
type, since we expect the data to always be numeric.Imagine that the numbers above might contain negative values. This update to the regex expression will allow us to capture both positive and negative values:
for line in data:
x1, y1, x2, y2 = map(int, re.match(r"(-?\d+),(-?\d+) -> (-?\d+),(-?\d+)", line).groups())
lines.append(Line(x1, y1, x2, y2))
We can actually name groups in the regex pattern itself. Then, instead of calling groups()
on a match object, we can instead call groupdict()
. This returns a dictionary, where the keys are the names of the groups, and the values are the string values from the match.
Compare these two approaches:
import re
test = "John Smith"
# First, just using groups() and then unpacking the tuple
name_pattern = r"(\w+) (\w+)"
if (match := re.match(name_pattern, test)):
first_name, last_name = match.groups()
print(f"Unpacking groups(): {first_name}, {last_name}")
# Now, using named groups and returning a dict
name_pattern = r"(?P<first>\w+) (?P<last>\w+)"
if (match := re.match(name_pattern, test)):
name_dict = match.groupdict()
print(f"Using groupdict(): {name_dict['first']}, {name_dict['last']}")
Output:
Unpacking groups(): John, Smith
Using groupdict(): John, Smith
Even more usefully, we can actually embed Python variables within the pattern string. You can see how this can be useful if using groupdict()
:
test = "John Smith"
first_name_grp = "first"
last_name_grp = "last"
name_pattern = rf"(?P<{first_name_grp}>\w+) (?P<{last_name_grp}>\w+)"
if (match := re.match(name_pattern, test)):
name_dict = match.groupdict()
print("Using groupdict() with variables in the pattern: "
f"{name_dict[first_name_grp]}, {name_dict[last_name_grp]}")
Note how we’re prefixing the pattern string with both r
to make it raw, and f
in order to use f-string interpoloation, i.e. so that we can reference variables like {first_name_grp}
within the string.
The finditer()
function is useful for iterating over non-overlapping matches of a regular expression pattern within a given string. Its first parameter is the pattern to search for, and the second is the string to search. It returns an iterator that produces match
objects for each match found.
For example:
import re
text = "Hello, Mycroft! Mycroft is a hunky cat."
pattern = r"Mycroft"
matches = re.finditer(pattern, text)
for match in matches:
print("Match found:", match.group())
print("Start position:", match.start())
print("End position:", match.end())
print("Match span:", match.span())
print()
Here is the output:
Match found: Mycroft
Start position: 7
End position: 14
Match span: (7, 14)
Match found: Mycroft
Start position: 16
End position: 23
Match span: (16, 23)
Use the sub()
method to replace occurrences of a match with a replacement string.
For example:
line = re.sub(r"(\d+)", r"RULE_\1", line)
Here, any number that we find is replaced by “RULE_” + number. E.g.
15
becomes RULE_15
.
The trick to this is to use \n
to reference the nth
group in the preceeding pattern.
This example turns any number n
into X(n). E.g. 456
becomes X(456)
:
re.sub(r"(\d+)", r"X(\1)", input)
Here’s a more sophisticated example. It takes a string like:
= x yz | ab c
and replaces it with:
= ((x yz) / (ab c))
line = re.sub(r"= (.*) \| (.*)$", r"= ((\1) / (\2))", line)
INSTR_PATTERN = re.compile(r"(\d+),(\d+) through (\d+),(\d+)")
for line in data:
match = INSTR_PATTERN.search(line)
assert match, "All instruction lines are expeted to match"
tl_x, tl_y, br_x, br_y = map(int, match.groups())
Here we’re processing multiple lines of data. We’re looking for lines that contain something like:
4,14 through 6,16
(\d+)
groups are used to capture each of the four numbers in the line.AssertionError
.groups()
method to return the four groups as a tuple
. We’re then using tuple unpacking to unpack the tuple into four separate variables.map()
function, to convert each variable from a str
to an int
. Here, the map()
function is applying the int()
function each member of the tuple that was passed in.Here is another way to retrieve matches and their groups. Note how each line expects to return a single match, which is why we always index the return value of findall()
with [0]
. This match contains our four groups, as a tuple.
pattern = re.compile(r"(\d+),(\d+) through (\d+),(\d+)")
print("\nUsing findall with existing pattern:")
for line in data:
tl_x, tl_y, br_x, br_y = map(int, pattern.findall(line)[0])
print(f"tl_x: {tl_x}, tl_y: {tl_y}, br_x: {br_x}, br_y: {br_y}")
print("\nUsing findall, pattern on the fly:")
for line in data:
tl_x, tl_y, br_x, br_y = map(int, re.findall(r"(\d+),(\d+) through (\d+),(\d+)", line)[0])
print(f"tl_x: {tl_x}, tl_y: {tl_y}, br_x: {br_x}, br_y: {br_y}")
The next two two blocks of code achieve the same outcome. The first obtains a match
and then the groups
:
boxes = []
p = re.compile(r"(\d+)x(\d+)x(\d+)") # our regex returns a match containing three groups
for line in data:
if match := p.match(line):
dims = list(map(int, match.groups())) # turn tuple of str into ints
boxes.append(Box(dims))
And here we use findall()
, which circumvents the need to first get the match
.
boxes = []
p = re.compile(r"(\d+)x(\d+)x(\d+)") # our regex returns a match containing three groups
for line in data:
dims = list(map(int, p.findall(line)[0]))
boxes.append(Box(dims))