Practical Regex #2

Case study example
Text extraction: Extract urls from data
Text replace: Reformat url structure

As a developer, you’re constantly working with large amounts of text such as source code, logs, and data files. Often you need to extract, replace, or manipulate that text, and regex can help you.

Today I present variants of a real case study where I personally used regex. These examples build upon my last post so check that out too.

Practical Regex #1


Case study example

This is a snippet of this csv file which you can also play with here.

reddit_id,colorblind_comment,score,title,url,created
...
8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172
8bzdr8,False,99626,"Gaze and foot placement when walking over rough terrain (article link in comments) [OC]",https://v.redd.it/h0f0m4v5nor01,1523628194
fpga3f,False,99488,"[OC] To show just how insane this week's unemployment numbers are, I animated initial unemployment insurance claims from 1967 until now. These numbers are just astonishing.",https://i.redd.it/tch0t0is32p41.gif,1585245693
i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703
fxoxti,False,98067,"Coronavirus Deaths vs Other Epidemics From Day of First Death (Since 2000) [OC]",https://v.redd.it/yemjrb1p9rr41,1586422082
...

(Top 100 posts from r/dataisbeautiful on Reddit where 'colorblind' is mentioned in the comments)

Text extraction: Extract urls from data

There are a ton of things you could extract here but I’ll go with this:

Extract image urls from posts with colorblind comments

You could use Excel or write a script here, but this is why I would use regex:

Quick and easy to do from your IDE
Not limited to CSVs - this could be source code, which isn’t so easy to handle via Excel/scripts
Took me under a minute to do (with prior regex knowledge)

Step #1: Select all lines where the colorblind column is True

.*,True,.*

True match the word ‘True’
, match commas (so we don’t match ‘True’ in the title text)
.* match everything before and after
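If you'd rather try the step 1 pattern outside an editor, here is a minimal Python sketch. The three rows are copied from the snippet above; the variable names are just for illustration.

```python
import re

# Three rows copied from the CSV snippet above.
rows = [
    '8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172',
    'i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703',
    'fxoxti,False,98067,"Coronavirus Deaths vs Other Epidemics From Day of First Death (Since 2000) [OC]",https://v.redd.it/yemjrb1p9rr41,1586422082',
]

step_1 = re.compile(r".*,True,.*")

# fullmatch requires the pattern to cover the whole line,
# which is what the .* on both sides is there for.
colorblind_rows = [row for row in rows if step_1.fullmatch(row)]

for row in colorblind_rows:
    print(row)
```

Only the `i2vx78` row is kept, because it is the only one with `,True,` between commas.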

Step #2: Limit to lines that have an image url

.*,True,.*http.*\.(png|jpg).*

http.* match the start of the url and everything after (not strictly necessary for filtering lines, but needed for the next step)
\.(png|jpg) match .png or .jpg extensions
http.*\.(png|jpg) combine to match urls with an image extension
.*,True,.* + http.*\.(png|jpg) + .* combine step 1 regex, the image url, and everything after. Now it matches lines with colorblind comments AND image urls
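A quick way to sanity check the step 2 pattern is to run it against rows that should and shouldn't match. In this Python sketch the middle row is made up for illustration (it is not in the dataset); the other two come from the snippet above.

```python
import re

step_2 = re.compile(r".*,True,.*http.*\.(png|jpg).*")

rows = [
    # Real row: colorblind comment AND a .png url.
    'i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703',
    # Made-up row (hypothetical, not in the dataset): colorblind comment but a video url.
    'xxxxxx,True,1,"Hypothetical video post [OC]",https://v.redd.it/xxxxxxxx,1',
    # Real row: no colorblind comment, and a .gif url.
    '8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172',
]

# True where the whole line satisfies both conditions.
results = [step_2.fullmatch(row) is not None for row in rows]
print(results)
```

Only the first row passes both conditions, so the list is `[True, False, False]`.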

Step #3: Match just the url

(?<=.*,True,.*)http.*\.(png|jpg)(?=.*)

Let’s break up our regex so far into parts:

`part_1`	Before the url	`.*,True,.*`
`part_2`	The url	`http.*\.(png|jpg)`
`part_3`	After the url	`.*`

We want to isolate part_2, so we must exclude part_1 and part_3 like this:

(?<= ... ) ... lookbehind: exclude stuff before a match (it must be there, but it isn’t part of the match)
... (?= ... ) lookahead: exclude stuff after a match (same idea, but for what follows)
(?<= part_1 ) + part_2 + (?= part_3 ) exclude part_1 and part_3
(?<=.*,True,.*)http.*\.(png|jpg)(?=.*) substitute in our actual regex
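One caveat if you move this out of the editor: not every regex engine accepts a lookbehind that contains .* in it. Python's built-in re module, for example, only allows fixed-width lookbehinds, so it rejects this pattern. A rough equivalent, sketched here under that assumption, is to wrap part_2 in a capture group and read the group instead of the whole match.

```python
import re

# part_1 + (part_2) + part_3. The outer parentheses capture the url.
# (?:...) is a non-capturing group, so the url is group 1.
pattern = re.compile(r".*,True,.*(http.*\.(?:png|jpg)).*")

row = 'i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703'

match = pattern.fullmatch(row)
url = match.group(1) if match else None
print(url)
```

This prints `https://i.redd.it/jskjkodg3se51.png`, the same text the lookaround version selects in the editor.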

Step #4: Extract the text

A modern IDE should now let you select your matches. In VS Code you can select and copy your matches like this:

Alt + Enter to select matches
Ctrl + c to copy matches
Ctrl + v to paste matches (in a separate file)

Urls successfully extracted! You could take this further by running this over your entire repo using global search (Ctrl + Shift + f in VS Code). Show me an Excel script that does that!

Text replace: Reformat url structure

Sometimes we don’t want to extract text but rather update it. For this example we’ll update all image urls in the file like so:

from this: https://i.redd.it/jskjkodg3se51.png
to this: https://i.redd.it/png/jskjkodg3se51

Step #1: Match image urls (with ID match)

http.*/.*\.(png|jpg)

This is the url regex we made before, but with an image ID match that we’ll use in the next step
http.* match start of url and everything after
/.* match forward slash and everything after (matches the image ID)
\.(png|jpg) match .png or .jpg extensions

Step #2: Create regex groups

(http.*)/(.*)\.(png|jpg)

Wrapping part of a regex in () creates a group, which we can reference in our replace command.

Given the url https://i.redd.it/jskjkodg3se51.png we must create groups for the following parts to transform them as desired:

https://i.redd.it the domain part
jskjkodg3se51 the image ID part
png the image extension part

Let’s group those parts of our regex:

(http.*) groups the domain part
/(.*) groups the image ID part (note the slash is excluded)
\.(png|jpg) groups the image extension part (note this was already grouped and that . is excluded)
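To see exactly what each group captures, you can inspect them in a few lines of Python. This is only an illustrative sketch; the variable names are mine.

```python
import re

pattern = re.compile(r"(http.*)/(.*)\.(png|jpg)")

match = pattern.fullmatch("https://i.redd.it/jskjkodg3se51.png")

# groups() returns the captured text of group 1, 2, and 3 in order.
domain, image_id, extension = match.groups()

print(domain)     # https://i.redd.it
print(image_id)   # jskjkodg3se51
print(extension)  # png
```

Note that the slash and the dot sit outside the parentheses, so neither ends up in any group.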

Step #3: Replace

(http.*)/(.*)\.(png|jpg) + $1/$3/$2

Open the replace UI (Ctrl + h in most IDEs)

We can now reference groups in replace commands as follows:

$0 is the entire match
$1, $2, $3 are the first, second, third groups etc

Our search and replace commands will therefore look like this:

Search: (http.*)/(.*)\.(png|jpg)
Replace: $1/$3/$2

That regex replace will change urls like so:

https://i.redd.it/jskjkodg3se51.png -> https://i.redd.it/png/jskjkodg3se51
https://i.redd.it/mkdikcce8gh31.jpg -> https://i.redd.it/jpg/mkdikcce8gh31
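The same search and replace can be sketched in Python with re.sub. One difference to be aware of: Python writes group references in the replacement as \1, \2, \3 rather than $1, $2, $3.

```python
import re

pattern = re.compile(r"(http.*)/(.*)\.(png|jpg)")

urls = [
    "https://i.redd.it/jskjkodg3se51.png",
    "https://i.redd.it/mkdikcce8gh31.jpg",
]

# \1 = domain, \3 = extension, \2 = image ID (the editor's $1/$3/$2).
reformatted = [pattern.sub(r"\1/\3/\2", url) for url in urls]

for before, after in zip(urls, reformatted):
    print(before, "->", after)
```

The output matches the two transformations listed above.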

Nice huh? You can also do this globally via Ctrl + Shift + h in VS Code to search and replace over your whole repo. Very useful.

Powerful stuff eh? Once you learn how to build up a regex like this (hint: practice makes perfect) you can do this kind of stuff in seconds. Searching, extracting, and transforming text will become second nature.

Well that’s it for now! I’ve got 2 more sections of this case study ready to go but this article was getting long. So next week we’ll learn how to use multi-cursors to augment regex even further.

It’s seriously powerful stuff! Until then 👋
