Diving deeper into the world of Regular Expressions in Ruby
If you need an introduction or a refresher, I recommend checking out A Quick Guide to RegEx in Ruby by Gabriel Demes. In his article, he covers the characters, modifiers, anchors and flags syntax commonly used in Regex.
Let’s quickly go over some basics and the methods associated with regex. In Ruby, we create a regular expression with // (two forward slashes). The pattern goes between the slashes, and options can follow the second slash.
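As a minimal sketch of the literal syntax (the pattern and the test string here are made up for illustration), this compares a plain pattern with one that carries the i option:

```ruby
# A regex literal: the pattern sits between the two slashes.
pattern = /ruby/

# Options follow the closing slash; `i` makes the match case-insensitive.
case_insensitive = /ruby/i

puts pattern.match?("I love Ruby")          # false, the capital R does not match
puts case_insensitive.match?("I love Ruby") # true
```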
The =~ operator (equal sign and tilde symbol) returns the index position of the first match, or nil if there is no match. The regex can be on either the left or the right hand side of =~. For example, we can use it to find the index position of the first letter e in phrase.
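Here is a small sketch of =~ in action. The contents of phrase are my own stand-in for illustration, so swap in whatever string you like:

```ruby
# Hypothetical sample string, used only for illustration.
phrase = "Regular expressions are a handy tool"

first_e  = (phrase =~ /e/)   # regex on the right hand side
flipped  = (/e/ =~ phrase)   # regex on the left hand side, same result
no_match = (phrase =~ /\d/)  # there are no digits in phrase

p first_e   # 1
p flipped   # 1
p no_match  # nil
```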
Next we have the scan method, which returns all the matches rather than their index positions. For example, we can use it to look for all the words within phrase.
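One way to sketch this, assuming a word means a run of one or more word characters (\w+) and using a made-up phrase:

```ruby
# Hypothetical sample string, used only for illustration.
phrase = "Regular expressions are a handy tool"

# scan returns an array containing every match of the pattern.
words = phrase.scan(/\w+/)

p words # ["Regular", "expressions", "are", "a", "handy", "tool"]
```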
We also have the sub and gsub methods available. Both methods take two arguments: the first is the pattern, and the second is the replacement. The sub method replaces only the first occurrence of the match, whereas gsub replaces all occurrences.
To substitute the first a in phrase with an *, we can use the sub method. And if we want to substitute every a in phrase with an *, we can use the gsub method.
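The difference is easiest to see side by side. This is a small sketch with a made-up phrase:

```ruby
# Hypothetical sample string, used only for illustration.
phrase = "Regular expressions are a handy tool"

first_only = phrase.sub(/a/, "*")   # replaces only the first a
every_one  = phrase.gsub(/a/, "*")  # replaces every a

puts first_only # Regul*r expressions are a handy tool
puts every_one  # Regul*r expressions *re * h*ndy tool
puts phrase     # the original string is left untouched
```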
Both sub and gsub have a bang (!) version that modifies the string in place, and the first argument of each can be either a regex or a regular string.
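A quick sketch of the bang versions, again with a made-up phrase. I work on a copy so the original stays intact, and note that the bang methods return nil when nothing was replaced:

```ruby
# Hypothetical sample string, used only for illustration.
phrase = "Regular expressions are a handy tool"
copy = phrase.dup

copy.sub!("handy", "cryptic") # a plain string works as the pattern too
copy.gsub!(/a/, "*")          # mutates copy in place

puts copy # Regul*r expressions *re * cryptic tool

# When nothing is replaced, the bang versions return nil.
nothing_replaced = copy.sub!("zebra", "x")
p nothing_replaced # nil
```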
And finally, the match method returns a MatchData object. If the pattern you specify is not found, nil is returned.
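As a rough sketch (the phrase and the capture group are my own choices for illustration), a MatchData object gives access to the whole match as well as any captured groups:

```ruby
# Hypothetical sample string, used only for illustration.
phrase = "Regular expressions are a handy tool"

result = phrase.match(/(\w+) tool/)
puts result[0] # handy tool (the whole match)
puts result[1] # handy      (the first capture group)

missing = phrase.match(/\d+/)
p missing # nil, because phrase has no digits
```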
Now that the basics are out of the way, let’s dive in! You can download today’s code at my GitHub page. The setup is very simple: I have a text file with four entries, and each entry has a fake name, phone number, street address, city, state and zip code. In the same folder, I have a regex.rb file, and right now it simply loads in the info.txt file.
The idea here is that we have some data coming in from user inputs, and we want to extract some useful information, like names, phone numbers and addresses. Pretend for a second that we have no control over the format of this data. Our job is to use regex to pull it out of the text file.
Let’s begin with the names. The good news for us is we can kind of break down the names into three distinct sections. So we have the names and their sections like this:
Pimento Thornwolf, Sr
Jane J. Wildfire
Celestial James Heathen
Mr. Parsley Densilated
- Pimento, Jane, Celestial, Mr, followed by an optional dot, then a space
- Thornwolf, J, James, Parsley, followed by an optional dot or comma, then a space
- Sr, Wildfire, Heathen, Densilated
Translated into code: at the beginning of the line, we have one or more letters, followed by zero or one dot, one space, one or more letters again, zero or one dot or comma, one space, and one or more letters at the end of the line.
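The description above can be sketched roughly like this. The variable names are mine, and I put the four names in a heredoc instead of reading info.txt so the snippet runs on its own:

```ruby
# The four names from the text file, inlined so the sketch is self-contained.
names_text = <<~TEXT
  Pimento Thornwolf, Sr
  Jane J. Wildfire
  Celestial James Heathen
  Mr. Parsley Densilated
TEXT

# ^            start of line
# [A-Za-z]+    one or more letters
# \.?          zero or one dot, then one space
# [A-Za-z]+    one or more letters
# [.,]?        zero or one dot or comma, then one space
# [A-Za-z]+$   one or more letters, then end of line
name_pattern = /^[A-Za-z]+\.? [A-Za-z]+[.,]? [A-Za-z]+$/

names = names_text.scan(name_pattern)
puts names
```

Keep in mind that in Ruby, ^ and $ match at the start and end of each line, not of the whole string, which is what lets scan pick up one name per line.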
The addresses are formatted in two lines. The first line is the street number and name, and the second line consists of a city, state and zip code. Let’s break it down further:
- We know the street number always begins with a digit, optionally followed by a dash and one or more digits, then a space.
- The street names start with one or more letters, then a space, then one or more occurrences of a word character (Den, of, Dr, 99).
- In the case of Hall of Nomads, we can use a wildcard with zero or more occurrences to close it off. In all other cases, the first line of the address already ends at the second step above.
- After a newline, the city name starts with a letter, and afterwards we can use the same wildcard technique to finish off the state and zip code.
This is what it looks like translated into code.
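Here is one way the address steps could be sketched. The two addresses below are invented stand-ins (only Hall of Nomads and Den come from the entries above), and the pattern is my reading of the steps, not necessarily the exact one in regex.rb:

```ruby
# Hypothetical addresses, made up for illustration only.
address_text = <<~TEXT
  742 Dragon Den
  Fakeville, CA 90000
  12-34 Hall of Nomads
  Nowhere City, NY 10000
TEXT

# ^\d+-?\d*    street number: digits, an optional dash, optional further digits
# [A-Za-z]+    after a space, the first word of the street name
# \w+          after another space, one or more word characters
# .*           wildcard for anything left on the line (" Nomads")
# \n[A-Za-z]   newline, then the city starts with a letter
# .*           wildcard for the rest of the city, state and zip code
address_pattern = /^\d+-?\d* [A-Za-z]+ \w+.*\n[A-Za-z].*/

addresses = address_text.scan(address_pattern)
addresses.each { |address| puts address, "---" }
```

Without the m option, the dot in Ruby does not match a newline, so each wildcard stops at the end of its own line.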
Now we are ready to pull out the phone numbers. We have the four numbers formatted like this:
We only have one common pattern amongst all four numbers, and that’s the three digits near the beginning. Starting with the fourth and most uncommon number, we can break down the steps:
- At the start of the line, there’s zero or one digit, followed by zero or one dash and zero or one left parenthesis.
- We have three consecutive digits in all four cases.
- Next there could be a dash, a right parenthesis and a space.
- In the first three cases, the middle parts are all digits, and in the last case we have letters. To denote one or more word characters (letter, digit, or underscore), we can use \w+
- In the first two cases and the last case, we have a dash, and in the third case we don’t have any. Then there are one or more word characters at the end.
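Putting the steps together, a sketch could look like the following. The four phone numbers are invented to fit the shapes described in the steps (they are not the ones from info.txt), and the pattern is one possible reading of those steps:

```ruby
# Hypothetical phone numbers, made up to fit the steps described above.
phones_text = <<~TEXT
  555-123-4567
  (555) 123-4567
  555 1234567
  1-(800) FAKE-NUMS
TEXT

# ^\d?-?\(?    zero or one digit, dash and left parenthesis
# \d{3}        three consecutive digits
# -?\)? ?      zero or one dash, right parenthesis and space
# \w+          the middle part, digits or letters
# -?           zero or one dash
# \w+$         one or more word characters at the end of the line
phone_pattern = /^\d?-?\(?\d{3}-?\)? ?\w+-?\w+$/

phones = phones_text.scan(phone_pattern)
puts phones
```

A pattern this permissive will also accept some strings that are not phone numbers, so treat it as a starting point and tighten it for real data.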
As an exercise, try adding some uncommon email and website addresses to each entry yourself, and see if you can pull them out. Regular expressions are a powerful yet somewhat cryptic tool for data manipulation. No matter which language you are using, I am sure you will encounter them on your journey. Until next time, happy coding!