I’ve been reading Mastering Regular Expressions by Jeffrey E.F. Friedl, and since nobody in my life (aside from my wife) cares, I thought I’d share something I’m pretty proud of: my first set of regular expressions, which I wrote myself to manipulate the text I’m working with.
What I’m so happy about is that I wrote these expressions. I understand exactly what they do and the purpose of each character in each expression.
I’ve used regex in the past, mostly stuff cobbled together from Stack Overflow, but I never really understood how the expressions worked or what they meant, just that they did what I needed them to do at the time.
I’m only about 10% of the way through the book, but already I understand so much more than I ever did about regex (I also recognize I have a lot to learn).
I wrote the expressions to be used with egrep and sed to generate and clean up a list of filenames pulled out of tarballs (movies I’ve ripped from my DVD collection and tarballed to archive them).
The first expression I wrote was this one used with tar and egrep to list the files in the tarball and get just the name of the video file:
tar -tzvf file.tar.gz | egrep -o '\/[^/]*\.m(kv|p4)' > movielist
Which gives me a list of movies of which this is an example:
/The.Hunger.Games.(2012).[tmdbid-70160].mp4
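To try the pattern without a tarball handy, here is a minimal sketch that feeds one made-up listing line through the same expression. The owner, size, date, and the movies/ directory are assumptions for illustration, and extract_name is just a hypothetical helper name. It uses grep -E instead of egrep and drops the backslash before the slash, which should not change what is matched.

```shell
# Hypothetical `tar -tzvf` output line; owner, size, date and directory are made up.
listing='-rw-r--r-- user/group 734003200 2024-01-01 12:00 movies/The.Hunger.Games.(2012).[tmdbid-70160].mp4'

# /        a literal slash (the last path separator before the file name)
# [^/]*    any run of characters that are not a slash
# \.m      a literal dot followed by "m"
# (kv|p4)  either "kv" or "p4", so the name ends in .mkv or .mp4
# -o prints only the matched part of the line.
extract_name() {
  printf '%s\n' "$1" | grep -Eo '/[^/]*\.m(kv|p4)'
}

extract_name "$listing"
```

Note that the slash in user/group does not produce a match, because nothing after it (up to the next slash) ends in .mkv or .mp4.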
Then I used sed with the expression groups to remove:
- the leading forward slash
- everything from .[ to the end
- all of the periods in between words
And the last expression checks for one or more spaces and replaces them with a single space.
This is the full sed command:
sed -Eie 's/^\///; s/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//; s/[^a-zA-Z0-9\(\)&-]/ /g; s/ +/ /g' movielist
Which leaves me with a pretty list of movies that looks like this:
The Hunger Games (2012)
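For anyone who wants to watch the four substitutions work on the example name, here is a small sketch that runs them on standard input instead of editing movielist in place. clean_name is a hypothetical helper name, and the parentheses inside the bracket expression are written without backslashes, which should behave the same for these names. One thing worth checking on your own system: as far as I know, sed reads -Eie as -E plus -i with a backup suffix of "e", so the original command may leave a backup file called movieliste behind.

```shell
# Each -e is one step of the cleanup:
#   1. strip the leading slash
#   2. strip ".[tag-digits].mp4" or ".[tag-digits].mkv"
#   3. turn every character that is not a letter, digit, (, ), & or - into a space
#   4. squeeze runs of spaces down to a single space
clean_name() {
  printf '%s\n' "$1" | sed -E \
    -e 's/^\///' \
    -e 's/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//' \
    -e 's/[^a-zA-Z0-9()&-]/ /g' \
    -e 's/ +/ /g'
}

clean_name '/The.Hunger.Games.(2012).[tmdbid-70160].mp4'
```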
I’m sure this could be done more elegantly, and I’m happy for any feedback on how to do that! For now, I’m just excited that I’m beginning to understand regex and how to use it!
Edit: fixed title so it didn’t say “regex expressions”
It does feel good! And thanks for that xkcd! That one’s new to me.
Ah…the days when perl was the shit and python was still a glimmer in the eye of some frustrated programmer.
I relearn regex from scratch every time I need to use it.
This is the way.
It is a great book, although a bit outdated. In particular, egrep is no longer recommended; grep -E is a more portable synonym.
Some notes on your script:
- You don’t need to escape slashes in a grep regex. In the sed s/// command, it is better to use another delimiter character like # so you can leave slashes unescaped there too.
- You usually don’t need to pipe grep into sed; sed -n with a regex address and an explicit print command gives the same result as grep.
- You could omit the leading slash in your egrep regex, so you won’t need to remove it later.
So I would do the same with
tar -tzvf file.tar.gz | sed -En '/\.(mp4|mkv)$/{s#^.*/##; s#\.\[.*##; s#[^a-zA-Z0-9()&-]# #g; s/ +/ /g; p}'
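A quick way to check that one-liner without a real tarball is to pipe a few made-up listing lines through the same sed script. This is only a sketch: the listing lines (owner, sizes, dates, cover.jpg) are invented for illustration, and list_titles is a hypothetical helper name. The script is the same as above, just spread over several lines.

```shell
# Reads `tar -tzvf`-style lines on stdin and prints cleaned-up movie titles.
# -n suppresses automatic printing, so only lines that reach "p" are shown.
# The address /\.(mp4|mkv)$/ plays the role of grep: other lines are skipped.
list_titles() {
  sed -En '/\.(mp4|mkv)$/{
    s#^.*/##
    s#\.\[.*##
    s#[^a-zA-Z0-9()&-]# #g
    s/ +/ /g
    p
  }'
}

# Hypothetical listing: a directory, a non-video file, and one movie.
printf '%s\n' \
  'drwxr-xr-x user/group 0 2024-01-01 12:00 movies/' \
  '-rw-r--r-- user/group 52341 2024-01-01 12:00 movies/cover.jpg' \
  '-rw-r--r-- user/group 734003200 2024-01-01 12:00 movies/The.Hunger.Games.(2012).[tmdbid-70160].mp4' |
  list_titles
```

Because s#^.*/## is greedy, it removes everything up to the last slash, including the permissions and date columns, so no separate grep step is needed.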
Just adding my congrats. Good job, OP. Regex is super useful stuff.
Thank you!!!
Just to chip in because I haven’t seen it mentioned yet: I find LLMs like ChatGPT or Microsoft Copilot are really good at making regexes and also at explaining them. So if you’re learning them, or just want to get the darned thing to work so you can go to bed, those are a good resource.
You know, I haven’t yet used ChatGPT for anything, I might check it out for this reason.
I use it to tell me which page of the Pathfinder 1e manual I should look on for the rules I need.
Nice! Learning regular expressions is one of those things where it’s absurd but once you do it, you can solve problems that bedevil whole industries.
Thanks!
And it still kinda breaks my brain when I look at an expression. When I just look at it, it looks like utter gibberish, but when I say to myself, “okay, what’s this doing?” and go through it character by character, it turns into something I can comprehend.
“regex” means “regular expression”, so “regex expression” means “regular expression expression”.
Dang! I read through my post three times to make sure I didn’t do that and completely missed that I did it right in the title. (Now fixed).
I’ll have to check out this book. Just remember: HTML cannot be parsed with regex.
Well, technically it is possible with a regex dialect that has lookarounds, but it is overcomplicated. There’s really no reason to do it.
Thanks for that link.
I think the most impressive part of this is that your wife cares.
…does she have a sister?
I’m currently seeing a girl I started dating after she had problems with her regex and I helped her out.
So far so good.
@sab @prowess2956 @harsh3466 now you have two problems, but you don’t know it yet
She does, but I’d stay away from the sister. 🤣
Give a man a regular expression and he’ll match a string… teach him to make his own regular expressions and you’ve got a man with problems. – yakugo in http://regex.info/blog/2006-09-15/247#comment-3022 (and yes, it is http://, never https://, for this domain)
Guess I’ve got problems!
I highly recommend https://alf.nu/RegexGolf?world=regex&level=r00
That looks like a great way to practice
It’s definitely a way to get your regex-fu to the next level, especially if you have people to compete against.
Oh gosh. There are regex competitions out there, aren’t there.
Yup, including for the largest “in production” regular expression….
I stumbled upon this regex crossword puzzle a while back. I was never good enough to get it, but it seems like it could be fun.
That’s cool! Kudos!
My biggest project was to remove leading and trailing whitespaces but I think I failed twice 😅
🤣
I went through about 20 iterations to get all of this to work correctly.
Why spend 20 minutes manually changing text in a file, when you can spend 90 minutes figuring out a single RegEx to do it?
So much truth here.
I was wondering a few years ago how far you could get with implementing some simple markup syntax with just regex. Turns out, surprisingly far, but once stuff starts going wrong you’re in a less than ideal situation.
https://github.com/bwachter/awfulcms/blob/master/lib/AwfulCMS/SynBasic.pm
Congrats on your learning! I did a similar thing with music and converting all random songs to mp3