Regular expressions are hard. There’s a lot to them that vary between different regular expression engines. They’re often considered to be an afterthought when it comes to a programmer’s tools due to the relative infrequency at which they’re used. Whenever they are actually needed, the programmer most likely only reads the minimum amount required that will fulfill the parsing problem at hand. Regular expressions are super cryptic to read and even more so when writing them in Java. While I don’t necessarily enjoy working with regular expressions, I highly respect those who are comfortable enough to take on any sort of parsing problem. This post is an attempt for me to solidify some of the basic and advanced things I’ve been picking up with regular expressions recently.
Greedy will grab more, lazy will grab less
When people keep mentioning use greedy vs lazy quantifiers they’re talking about the
? + * symbols. The default quantifiers will use greedy matching, meaning, they will try to match as many of the quantified character before it continues trying to find matches with the rest of the expression. Appending a
? to any quantifier will cause the regular expression engine to become ‘lazy’ and only match the as little as possible before it finds a complete match. While having a lazy engine may sound like a bad thing sometimes you just don’t want your engine working so hard and going overboard with its matches. Say you’re making a chat client and you want to intercept colon delimited messages so that you can eventually replace those with emoticons.
1 2 3 4 5 6 7 8
note how the greedy regexp returns a single value consisting of both the :smile: and :happy: combined into a single match, while the lazy regexp properly returns two values splitting up the smile and happy. The greedy
. character matcher kept going until the final
: was found at the end of the string, while the lazy
.? character matcher was content with ending the first match and the first sight of the
: after “smile” and started a new match when it found the first
: for “happy”.
Using matched content in replaces with groupings
This is something that so easy but not entirely apparent when initially learning about regular expressions (at least not for me). Surrounding an expression with parenthesis will create a grouping. Groupings are designated in the order that they are written in the expression. Groupings can be designated with
( ). A complicated pattern can be written at the start of the expression and reused as a grouping reference later in the expression. Lets say that we want to catch whenever a user writes duplicate words in the chat box because we’re nice and we want to catch users making silly mistakes.
1 2 3 4
Text that’s captured in a grouping can also be spit out to be used in the replacement string. So continuing with our chat emoticon example, lets say that you want to replace the matched
: delimited text with an image tag that points to the appropriate file for each emoticon.
1 2 3 4
Parenthesis are placed around the
\w+ and is accessed with the
$1 in the replacement text. Notice how the grouping is accessed in the replacement text with the
Lookahead and Lookbehind to validate a match
Lookahead and lookbehinds are additions to a standard expression that allow you to check whether the base expression either follows, or is followed by another expression. Lookahead and lookbehinds are not matched, as in they will not be returned by the regexp engine. This is helpful in that sometimes you just want to capture a subset of a complete expression. Lookaheads are performed with
(?!notsuchandsuch). Lookbehinds are performed with
(?<!nottextbehind). Continuing the chat box example, lets say that we want to implement a feature where users can vote in a poll. Users can vote for a certain option by suffixing ‘vote’ to a number. Our goal is to intercept that number tied to the ‘vote’.
1 2 3 4
Okay! That does it. I know there’s way more about regular expressions I gotta internalize but I think this is a good place to end for now.