Natural Language Processing: Regular Expressions
Pattern Matching
Data isn’t just about numbers. We live in a world where a wealth of information is contained in the written word, and we should learn to utilize computers to help us find insights in this powerful tool of communication.
One way we can begin to do this, is by matching patterns within larger documents and structures.
In Python, the common appraoch is to use the module re
which stands for regular expressions. You’re probably familiar with
regular expressions from another language, but we’ll do a quick review of the common codes via official documentation.
Now, let’s put some basic patterns to the test.
Regular Expression Module re
The re module contains useful tools for us to utilize.
re.findall(pattern, string)
- Finds multiple occurances of the specified pattern within the string
re.match(pattern, string)
- Finds first occurance of the specified pattern if it occurs from the first index position
re.search(pattern, string)
- Finds first occurance of the specified pattern anywhere within the string
re.split(pattern, string)
- Splits the string on each occurance of the pattern, useful for tokenization
Creating a Reg Exp Pattern
Let’s put together what we’ve covered so far. Let’s create a simple string with a recurring pattern, and use a regular expression to find the matches.
re.findall()
Our pattern will be a capital letter, from A to Z, followed by the non-alphanumeric -
character. To test our pattern,
the input string will be A-b-C-D-e-F-G-1-55-S
, we should return [‘A’, ‘C’, ‘D’, ‘F’, ‘G’], but not S because it doesn’t
have the -
symbol following it.
Our pattern will be, r"([A-Z])\-"
. Let’s break this down.
r"
initiates the pattern until the closing "
mark.
The []
brackets will contain a range to be considered by the algorithm, in this case [A-Z]
will look for every capital
letter.
\-
Adds the condition that the capital letter is followed by a dash mark.
Finally, we ()
group the range values to be our returned result, we should extract just the capital letters that meet the
conditions above, and then isolate the output to drop the dash symbol.
import re
string = "A-b-C-D-e-F-G-1-55-S"
pattern = r"([A-Z])\-"
print(re.findall(pattern, string))
Result:
['A', 'C', 'D', 'F', 'G']
Great, it looks like we were successful. We avoided the lowercase chars, the numbers and the capital S preceded by a dash.
re.split()
What if we wanted to gather everything that DIDN’T match our pattern? Well we could use the following appraoch using split()
rather than findall()
, and we’ll drop the ()
grouping demarcation. This should allow us to extract only the items
that broke our previous rules.
import re
string = "A-b-C-D-e-F-G-1-55-S"
pattern = r"[A-Z]\-"
print(re.split(pattern, string))
Output:
['', 'b-', '', 'e-', '', '1-55-S']