
Practical Regex #2

Practical regex case study for developers
September 2020


As a developer, you’re constantly working with large amounts of text such as source code, logs, and data files. Often you need to extract, replace, or manipulate that text, and regex can help you.

Today I present variants of a real case study where I personally used regex. These examples build upon my last post so check that out too.


Case study example

This is a snippet of the CSV file used throughout this case study.

reddit_id,colorblind_comment,score,title,url,created
...
8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172
8bzdr8,False,99626,"Gaze and foot placement when walking over rough terrain (article link in comments) [OC]",https://v.redd.it/h0f0m4v5nor01,1523628194
fpga3f,False,99488,"[OC] To show just how insane this week's unemployment numbers are, I animated initial unemployment insurance claims from 1967 until now. These numbers are just astonishing.",https://i.redd.it/tch0t0is32p41.gif,1585245693
i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703
fxoxti,False,98067,"Coronavirus Deaths vs Other Epidemics From Day of First Death (Since 2000) [OC]",https://v.redd.it/yemjrb1p9rr41,1586422082
...

(Top 100 posts from r/dataisbeautiful on Reddit where 'colorblind' is mentioned in the comments)


Text extraction: Extract urls from data

There are a ton of things you could extract here but I’ll go with this:

Extract image urls from posts with colorblind comments

You could use Excel or write a script here, but here is why I would use regex:


Step #1: Select all lines where the colorblind column is True

.*,True,.*


Step #2: Limit to lines that have an image url

.*,True,.*http.*\.(png|jpg).*
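
If you want to sanity-check these two patterns outside the editor, here is a minimal TypeScript sketch. It assumes the JavaScript regex flavor (the one VS Code's editor search uses) and hard-codes three rows from the snippet above. The variable names are my own and purely illustrative.

```typescript
// Three rows copied from the CSV snippet above (header row omitted).
const rows: string[] = [
  '8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172',
  'i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703',
  'fxoxti,False,98067,"Coronavirus Deaths vs Other Epidemics From Day of First Death (Since 2000) [OC]",https://v.redd.it/yemjrb1p9rr41,1586422082',
];

// Step #1: the colorblind column is True.
const step1 = /.*,True,.*/;
// Step #2: the colorblind column is True AND a png/jpg url follows it.
const step2 = /.*,True,.*http.*\.(png|jpg).*/;

const colorblindRows = rows.filter((row) => step1.test(row));
const colorblindImageRows = rows.filter((row) => step2.test(row));

// Print the reddit_id (first column) of each row that survived a step.
console.log(colorblindRows.map((row) => row.split(",")[0]).join(" "));      // i2vx78
console.log(colorblindImageRows.map((row) => row.split(",")[0]).join(" ")); // i2vx78
```

With only these three rows both steps keep the same single line. On the full file, Step #2 would additionally drop True rows whose url is a gif or a video link.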


Step #3: Match just the url

(?<=.*,True,.*,)http.*\.(png|jpg)(?=.*)

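Let’s break up our regex so far into parts:

part_1   Before the url   .*,True,.*
part_2   The url          http.*\.(png|jpg)
part_3   After the url    .*

We want to isolate part_2, so we wrap part_1 in a lookbehind, (?<=part_1), and part_3 in a lookahead, (?=part_3). Both must still be present for a line to match, but neither is included in the match itself. Note that the regex above also adds a trailing comma to part_1.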
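
To see the lookbehind and lookahead in action, here is a small TypeScript sketch using two rows from the snippet. It relies on the JavaScript regex flavor allowing `.*` inside a lookbehind, which some other engines reject, so treat it as an illustration rather than something portable. The variable names are hypothetical.

```typescript
// One row with colorblind = True and one with colorblind = False.
const trueLine =
  'i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703';
const falseLine =
  '8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172';

// (?<=part_1,) part_2 (?=part_3): only part_2 ends up in the match.
const urlOnly = /(?<=.*,True,.*,)http.*\.(png|jpg)(?=.*)/;

const hit = trueLine.match(urlOnly);
const miss = falseLine.match(urlOnly);

console.log(hit ? hit[0] : null);   // https://i.redd.it/jskjkodg3se51.png
console.log(miss ? miss[0] : null); // null
```

The match is the bare url, without the columns before or after it, which is exactly what you want to copy out.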


Step #4: Extract the text

A modern IDE should now let you select your matches. In VS Code you can select every match from the find box (Alt + Enter) and then copy them all at once.

Urls successfully extracted! You could take this further by running this over your entire repo using global search (Ctrl + Shift + F in VS Code). Show me an Excel script that does that!


Text replace: Reformat url structure

Sometimes we don’t want to extract text but rather update it. For this example we’ll update all image urls in the file so that the file extension moves in front of the image ID, turning <host>/<id>.png into <host>/png/<id>.


Step #1: Match image urls (with ID match)

http.*/.*\.(png|jpg)


Step #2: Create regex groups

(http.*)/(.*)\.(png|jpg)

Everything wrapped with () creates a regex group which we can reference in our replace command.

Given the url https://i.redd.it/jskjkodg3se51.png we must create groups for the following parts to transform them as desired:

Group 1   https://i.redd.it   matched by (http.*)
Group 2   jskjkodg3se51       matched by (.*)
Group 3   png                 matched by (png|jpg)
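
If you're unsure what each group captures, you can print them. This is a minimal TypeScript sketch under the assumption of the JavaScript regex flavor. The only difference from the regex above is that the `/` is escaped as `\/`, because it sits inside a regex literal.

```typescript
// Same pattern as Step #2; "\/" is just "/" escaped for a JS regex literal.
const grouped = /(http.*)\/(.*)\.(png|jpg)/;

const m = "https://i.redd.it/jskjkodg3se51.png".match(grouped);

if (m) {
  console.log(m[1]); // https://i.redd.it
  console.log(m[2]); // jskjkodg3se51
  console.log(m[3]); // png
}
```

Because `.*` is greedy, group 1 runs up to the last `/` in the url, which is why the image ID ends up alone in group 2.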


Step #3: Replace

(http.*)/(.*)\.(png|jpg) + $1/$3/$2

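Open the replace UI (Ctrl + h in most IDEs).

We can now reference groups in replace commands as $1, $2, and $3, numbered in the order their opening brackets appear in the regex.

Our search and replace commands will therefore look like this:

Search:  (http.*)/(.*)\.(png|jpg)
Replace: $1/$3/$2

That regex replace will change urls like so:

Before: https://i.redd.it/jskjkodg3se51.png
After:  https://i.redd.it/png/jskjkodg3se51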
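
The same search and replace can be reproduced in code. Here is a hedged TypeScript sketch, assuming JavaScript's `String.prototype.replace`, which happens to use the same `$1` style references as the replace box. The rows come from the snippet above and the variable names are illustrative.

```typescript
// A png row and a gif row from the CSV snippet.
const pngLine =
  'i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703';
const gifLine =
  '8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172';

// Search: (http.*)/(.*)\.(png|jpg)   Replace: $1/$3/$2
const search = /(http.*)\/(.*)\.(png|jpg)/g;
const replacement = "$1/$3/$2";

const updatedPng = pngLine.replace(search, replacement);
const updatedGif = gifLine.replace(search, replacement);

// Only the url column of the png row changes; the gif row is left alone.
console.log(updatedPng);
console.log(updatedGif === gifLine); // true
```

The url in the first row becomes https://i.redd.it/png/jskjkodg3se51 while the columns around it stay as they were.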

Nice huh? You can also do this globally via Ctrl + Shift + H in VS Code to search and replace over your whole repo. Very useful.


Powerful stuff eh? Once you learn how to build up a regex like this (hint: practice makes perfect) you can do this kind of stuff in seconds. Searching, extracting, and transforming text will become second nature.

Well that’s it for now! I’ve got 2 more sections of this case study ready to go but this article was getting long. So next week we’ll learn how to use multi-cursors to augment regex even further.

It’s seriously powerful stuff! Until then 👋