Natural Language Processing: Regular Expressions

Pattern Matching

Data isn’t just about numbers. We live in a world where a wealth of information is contained in the written word, and we should learn to utilize computers to help us find insights in this powerful tool of communication.

One way we can begin to do this is by matching patterns within larger documents and structures.

In Python, the common approach is to use the re module, whose name is short for regular expressions. You’re probably familiar with regular expressions from another language, but we’ll do a quick review of the common codes from the official documentation.

Now, let’s put some basic patterns to the test.
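As a quick warm-up, here is a minimal sketch of three of the most common codes: \d, \w and \s. The sample sentence and variable names are made up for illustration, and re.findall() is described in the next section.

```python
import re

# Hypothetical sample text, used only for this illustration
sample = "Order 66 shipped on 2018-03-28 to Unit_7."

# \d matches a single digit; + means "one or more of the preceding item"
digits = re.findall(r"\d+", sample)

# \w matches a letter, digit, or underscore
words = re.findall(r"\w+", sample)

# \s matches a whitespace character (space, tab, newline)
spaces = re.findall(r"\s", sample)

print(digits)       # ['66', '2018', '03', '28', '7']
print(words)        # ['Order', '66', 'shipped', 'on', '2018', '03', '28', 'to', 'Unit_7']
print(len(spaces))  # 6
```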

Regular Expression Module `re`

The re module provides several functions we will rely on:

re.findall(pattern, string) - Returns a list of all non-overlapping occurrences of the pattern within the string

re.match(pattern, string) - Matches the pattern only if it occurs at the very start of the string (index 0)

re.search(pattern, string) - Finds the first occurrence of the pattern anywhere within the string

re.split(pattern, string) - Splits the string on each occurrence of the pattern, useful for tokenization
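To see how the four functions differ, here is a small sketch run against a single made-up string (the text and variable names are assumptions for illustration only). Note that match() and search() return a match object, or None when nothing matches.

```python
import re

text = "cat bat cat"

# findall: every non-overlapping match, returned as a list of strings
all_cats = re.findall(r"cat", text)

# match: only succeeds if the pattern matches at the start of the string
bat_at_start = re.match(r"bat", text)          # None, "bat" is not at index 0
cat_at_start = re.match(r"cat", text).group()  # the matched text

# search: first match anywhere; span() gives its (start, end) indices
bat_anywhere = re.search(r"bat", text).span()

# split: break the string wherever the pattern matches
tokens = re.split(r"\s", text)

print(all_cats)      # ['cat', 'cat']
print(bat_at_start)  # None
print(cat_at_start)  # cat
print(bat_anywhere)  # (4, 7)
print(tokens)        # ['cat', 'bat', 'cat']
```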

Creating a Reg Exp Pattern

Let’s put together what we’ve covered so far. Let’s create a simple string with a recurring pattern, and use a regular expression to find the matches.

re.findall()

Our pattern will be a capital letter, from A to Z, followed by the non-alphanumeric - character. To test our pattern, the input string will be A-b-C-D-e-F-G-1-55-S. We should get back ['A', 'C', 'D', 'F', 'G'], but not S, because it doesn’t have the - symbol following it.

Our pattern will be r"([A-Z])\-". Let’s break this down.

r" starts a raw string literal, which runs until the closing " mark. The r prefix tells Python to leave backslashes alone, so they reach the regex engine unchanged.

The [] brackets define a character class. In this case, [A-Z] matches any single capital letter.

\- adds the condition that the capital letter is followed by a dash. The backslash is optional here, since - only has a special meaning inside [] brackets.

Finally, the () parentheses form a capture group around the range. When a pattern contains a group, findall() returns only the group’s contents, so we get just the capital letters that meet the conditions above, and the dash is dropped from the output.


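import re

# Test input: capitals, lowercase letters and numbers separated by dashes
string = "A-b-C-D-e-F-G-1-55-S"
# Capture a capital letter only when a dash follows it
pattern = r"([A-Z])\-"
print(re.findall(pattern, string))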

Result:


['A', 'C', 'D', 'F', 'G']

Great, it looks like we were successful. We skipped the lowercase characters, the numbers, and the final capital S, which is not followed by a dash.
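To make the role of the () group concrete, the sketch below runs the same search with and without it. The third pattern is an alternative not used elsewhere in this post: a lookahead, (?=-), which checks that a dash follows without making it part of the match.

```python
import re

string = "A-b-C-D-e-F-G-1-55-S"

# With a capture group, findall() returns only the captured letter
with_group = re.findall(r"([A-Z])\-", string)

# Without a group, findall() returns the whole match, dash included
without_group = re.findall(r"[A-Z]\-", string)

# Alternative: a lookahead requires the dash but does not consume it
lookahead = re.findall(r"[A-Z](?=-)", string)

print(with_group)     # ['A', 'C', 'D', 'F', 'G']
print(without_group)  # ['A-', 'C-', 'D-', 'F-', 'G-']
print(lookahead)      # ['A', 'C', 'D', 'F', 'G']
```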

re.split()

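What if we wanted to gather everything that DIDN’T match our pattern? We could use split() rather than findall(), and drop the () grouping. (If we kept the group, split() would also include the captured letters in its output.) This returns the pieces of the string that lie between matches, so we get the items that broke our previous rules, along with an empty string wherever two matches sit side by side.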


import re

string = "A-b-C-D-e-F-G-1-55-S"
# No capture group, so split() discards the matched text entirely
pattern = r"[A-Z]\-"
print(re.split(pattern, string))

Output:


['', 'b-', '', 'e-', '', '1-55-S']
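The raw split() output still carries empty strings and trailing dashes. One possible cleanup, sketched here as an assumption rather than part of the original walkthrough, is to filter out the empty pieces and strip the dashes from each end.

```python
import re

string = "A-b-C-D-e-F-G-1-55-S"
pieces = re.split(r"[A-Z]\-", string)

# Drop empty strings, then remove leading/trailing dashes from what remains
cleaned = [piece.strip("-") for piece in pieces if piece]

print(pieces)   # ['', 'b-', '', 'e-', '', '1-55-S']
print(cleaned)  # ['b', 'e', '1-55-S']
```

The final S stays attached to 1-55 because it is not followed by a dash, so the pattern never matches it and never splits there.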

Written on March 28, 2018