Diving deeper into the world of Regular Expressions in Ruby

James JingChao Yu
5 min read · Jun 28, 2021

A regular expression, or regex, is a sequence of characters that specifies a search pattern. In today’s blog, I will be using Ruby to demonstrate how it works, but other programming languages such as Python and JavaScript support regex as well.

If you need an introduction or a refresher, I recommend checking out A Quick Guide to RegEx in Ruby by Gabriel Demes. In his article, he covers the characters, modifiers, anchors and flags syntax commonly used in Regex.

Let’s quickly go over some basics and methods associated with regex. In Ruby, we create a regular expression with // (two forward slashes). The pattern goes between the slashes, and options can follow the second slash:
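Here is a minimal sketch of a literal with and without an option. The pattern, the variable names and the sample string are made up for illustration:

```ruby
# Illustrative names and pattern only, not taken from the original post.
plain   = /ruby/     # the pattern sits between the two slashes
relaxed = /ruby/i    # the i option after the second slash ignores case

puts plain.class                   # prints Regexp
puts plain.match?("I love Ruby")   # prints false (capital R does not match)
puts relaxed.match?("I love Ruby") # prints true
```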

We can confirm that // is a regex by calling the class method on it

The =~ (equal sign and tilde symbol) will return the index position of the first match. The regex condition can either be on the left or right hand side of =~. For example, if we want to find the index position of the first letter e in phrase:
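A small sketch of both orderings, assuming a made-up value for phrase (the original string is not shown in the text):

```ruby
# `phrase` is a hypothetical sample string used only for illustration.
phrase = "The cat sat on the mat near a lake"
left  = phrase =~ /e/   # regex on the right hand side
right = /e/ =~ phrase   # regex on the left hand side

puts left   # prints 2
puts right  # prints 2
```

If the pattern is not found at all, =~ returns nil instead of an index.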

You can see that both forms produce the same output

Next we have the scan method, which returns all the matches rather than their index positions. If we wanted to find every occurrence of the word “the” within phrase, we can do something like this:

The output returns every “the” found within our phrase
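As a rough sketch, again with a made-up phrase, and with the i option added so that the capitalized “The” is picked up too:

```ruby
# `phrase` is a hypothetical sample string used only for illustration.
phrase = "The cat sat on the mat near a lake"

# scan collects every match into an array instead of stopping at the first one.
words = phrase.scan(/the/i)
p words  # prints ["The", "the"]
```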

We also have the sub and gsub methods available. Both methods take two arguments: the first is the pattern and the second is the replacement. The sub method replaces only the first occurrence of the match, whereas gsub replaces all occurrences.

To substitute the first a in phrase with an *, we can use the sub method. And if we want to substitute every a in phrase with an *, we can use the gsub method.
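A minimal sketch of the difference, assuming a made-up phrase:

```ruby
# `phrase` is a hypothetical sample string used only for illustration.
phrase = "The cat sat on the mat near a lake"

first_only = phrase.sub(/a/, "*")   # replaces the first a only
every_one  = phrase.gsub(/a/, "*")  # replaces every a

puts first_only  # prints The c*t sat on the mat near a lake
puts every_one   # prints The c*t s*t on the m*t ne*r * l*ke
puts phrase      # the original string is left unchanged
```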

Both the sub and gsub methods have a bang (!) version that modifies the string in place, and all four methods accept either a regex or a regular string as their first argument.
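A short sketch of the bang versions with both kinds of pattern. The phrase and the replacements are made up for illustration:

```ruby
# `phrase` is a hypothetical sample string; dup keeps it safely mutable.
phrase = "The cat sat on the mat near a lake".dup

phrase.sub!("cat", "dog")        # plain string pattern, changes phrase in place
puts phrase                      # prints The dog sat on the mat near a lake

phrase.gsub!(/\bthe\b/i, "a")    # regex pattern, changes every match in place
puts phrase                      # prints a dog sat on a mat near a lake

result = phrase.gsub!(/z/, "*")  # nothing to replace here
p result                         # prints nil
```

One thing to watch for: the bang versions return nil when nothing was replaced, so chaining further calls onto their return value can fail.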

And finally, the match method returns a MatchData object. If the pattern you specify is not found, nil is returned.
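A minimal sketch of both outcomes, with a made-up phrase and patterns:

```ruby
# `phrase` is a hypothetical sample string used only for illustration.
phrase = "The cat sat on the mat near a lake"

found = phrase.match(/(\w+) sat/)
puts found.class  # prints MatchData
puts found[0]     # prints cat sat (the whole match)
puts found[1]     # prints cat (the first capture group)

missing = phrase.match(/dog/)
p missing         # prints nil
```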

Now that the basics are out of the way, let’s dive in! You can download today’s code at my GitHub page. The setup is very simple: I have a text file with four entries, and each entry has a fake name, phone number, street address, city, state and zip code on separate lines. In the same folder, I have a regex.rb file, and right now it simply loads in the info.txt file.
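A sketch of what that loading step might look like. The fallback to an empty string is an assumption added here so the snippet still runs when info.txt is not present:

```ruby
# Assumes info.txt sits in the same folder as this script, as described above.
path = File.join(__dir__ || Dir.pwd, "info.txt")

# Assumption: fall back to an empty string if the file is missing.
info = File.exist?(path) ? File.read(path) : ""

puts info
```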

We load in our info.txt file and output it to the screen

The idea here is that we have some data coming in from user input, and we want to extract useful information such as names, phone numbers and addresses. Pretend for a second that we have no control over the format of this data. Our job is to use regex to pull the information out of the text file.

Let’s begin with the names. The good news for us is we can kind of break down the names into three distinct sections. So we have the names and their sections like this:

Pimento Thornwolf, Sr

Jane J. Wildfire

Celestial James Heathen

Mr. Parsley Densilated

  1. Pimento, Jane, Celestial, Mr, followed by an optional dot, then a space
  2. Thornwolf, J, James, Parsley, followed by an optional dot or comma, then a space
  3. Sr, Wildfire, Heathen, Densilated

Translated into code: at the beginning of the line, we have one or more letters, followed by zero or one dot, one space, one or more letters again, followed by zero or one dot or comma, one space, and one or more letters at the end.
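One way to write that description as a pattern, using the four names listed above. This is a sketch of the description rather than the exact code from the repository, and the variable names are illustrative:

```ruby
# The four names from the list above, one per line.
names = <<~TEXT
  Pimento Thornwolf, Sr
  Jane J. Wildfire
  Celestial James Heathen
  Mr. Parsley Densilated
TEXT

# ^        start of a line
# [a-z]+   one or more letters
# \.?      zero or one dot
# [.,]?    zero or one dot or comma
name_pattern = /^[a-z]+\.? [a-z]+[.,]? [a-z]+/i

matches = names.scan(name_pattern)
puts matches
```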

The i after our / / regex is the option to ignore case

The addresses are formatted in two lines. The first line is the street number and name, the second line consists of a City, State, and zip code. Let’s break it down further:

  1. We know the street number always begins with a digit, optionally followed by a dash and more digits, then a space.
  2. The street names start with one or more letters, then a space, then one or more word characters (Den, of, Dr, 99).
  3. In the case of Hall of Nomads, we can use a wildcard with a repetition quantifier to close off the rest of the line. In all other cases, the first line of the address already ends at step 2.
  4. After a newline, the city name starts with a letter, and from there we can use the same wildcard technique to finish off the state and zip code.

This is what it looks like translated into code:
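The sketch below follows the four steps. The sample addresses are invented for illustration (only the fragments Den, of, Dr, 99 and Hall of Nomads come from the text), and the exact quantifiers are an interpretation of the steps rather than the original code:

```ruby
# Hypothetical sample addresses; the real info.txt entries are not shown here.
addresses = <<~TEXT
  42 Dragon Den
  Brooklyn, NY 11201
  7-11 Hall of Nomads
  Queens, NY 11375
  1600 Maple Dr
  Albany, NY 12207
  88 Route 99
  Fresno, CA 93706
TEXT

# Step 1: \d+-?\d* and a space   street number, optional dash and more digits
# Step 2: [a-z]+ \w+             letters, a space, then word characters
# Step 3: .*                     wildcard for anything left on the street line
# Step 4: \n[a-z].+              newline, a letter, then the rest of the city line
address_pattern = /^\d+-?\d* [a-z]+ \w+.*\n[a-z].+/i

found_addresses = addresses.scan(address_pattern)
found_addresses.each { |address| puts address, "--" }
```

Each match spans two lines, because the dot wildcard stops at a newline and the pattern names the \n explicitly.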

Now we are ready to pull out the phone numbers. We have the four numbers formatted like this:

718-704-5199

(347) 206-0091

9293365110

1-800-Call-Me

We only have one common pattern amongst all four numbers, and that’s the three digits near the beginning. Starting with the fourth and most uncommon number, we can break down the steps:

  1. At the start of the line, there’s zero or one digit, followed by zero or one dash and zero or one left parenthesis.
  2. We have three consecutive digits in all four cases.
  3. Next there could be a dash, a right parenthesis and a space, each of them optional.
  4. In the first three cases, the middle parts are all digits, while the last case has letters. To denote one or more word characters (letter, digit, or underscore), we can use \w+.
  5. The first, second and last cases have a dash here, while the third has none, so the dash is optional. Then come one or more word characters at the end.

We are able to grab all four phone numbers
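The five steps can be sketched as a single pattern. This is an interpretation of the steps, not necessarily the original code, and it assumes the dashes in the data are plain ASCII hyphens and that the input holds only the phone lines:

```ruby
# The four phone numbers, one per line, written with ASCII hyphens (assumption).
phones = <<~TEXT
  718-704-5199
  (347) 206-0091
  9293365110
  1-800-Call-Me
TEXT

# Step 1: ^\d?-?\(?    optional digit, optional dash, optional left parenthesis
# Step 2: \d{3}        three consecutive digits
# Step 3: -?\)?\s?     optional dash, right parenthesis and space
# Step 4: \w+          one or more word characters
# Step 5: -?\w+        optional dash, then one or more word characters
phone_pattern = /^\d?-?\(?\d{3}-?\)?\s?\w+-?\w+/

numbers = phones.scan(phone_pattern)
puts numbers
```

A pattern this permissive would also pick up other lines that start with three digits, such as some street addresses, so on the full file it may need tightening.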

As an exercise, try adding some uncommon email and website addresses to each entry yourself, and see if you can pull them out. Regular expressions are a powerful yet somewhat cryptic tool for data manipulation. No matter which language you are using, I am sure you will encounter them on your journey. Until next time, happy coding!
