Arman's stuff
Regular Expressions

(Fri Aug 7 11:19:17 2009)

A regex (short for "regular expression") is a very powerful way to search text.

Say, for example, you had a document in which you wanted to change "them" into "him". With a standard search-and-replace, you could easily switch every occurrence of "them" with "him" - but that means instead of "anthem" and "theme" you get "anhim" and "hime".

What you need is a regex:
$string =~ s/them/him/g;
That means, "every time you find 'them', replace it with 'him'". The g at the end means "every time, not just the first one you find." But that's what we had before, so let's add a little bit:
$string =~ s/ them / him /g;
Now it says, "every time you find 'them' *with a space character in front and behind*, replace it with the word 'him', also with a space in front and behind." Tada! Now anthem is safe, because there is no space right before its 'them', and theme is safe, because there is no space right after its 'them'. But now we aren't replacing 'them' in this phrase: "look at them. How did"
Why not? Because in that sentence, 'them' ends with a period, not a space! Now what?
$string =~ s/\Wthem\W/ him /g;
That's a little better. \W means "any non-word character", which is everything that isn't a letter, a digit, or an underscore. Spaces, periods, and apostrophes all count as non-word characters. Though now, it changes "look at them. How did" into "look at him  How did" - it turned the period into a space! So, now, we have to remember what that \W character really was:
$string =~ s/(\W)them(\W)/$1him$2/g;
The parentheses say, "Remember this bit for later." The $1 and $2 hold what the parentheses remembered. So, this will find the word 'them', with a non-word character before and after, and replace it with the word 'him', with the same non-word characters before and after. Whew! Finally, it works the way we want.
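
Here's a quick way to watch those attempts side by side. It's only a sketch: the sample sentence is made up, and the variable names ($plain, $spaces, $captured, $boundary) are my own picks for illustration. The last substitution uses \b, Perl's zero-width word-boundary assertion, which is not one of the patterns above; I've thrown it in because (\W)them(\W) needs a real character on each side and so skips a 'them' at the very start or end of the string.

```perl
use strict;
use warnings;

# Made-up sample text containing the tricky cases discussed above.
my $text = "Sing the anthem for them. The theme suits them fine.";

# Attempt 1: plain replace, which also mangles "anthem" and "theme".
(my $plain = $text) =~ s/them/him/g;

# Attempt 2: spaces on both sides, which misses "them." before the period.
(my $spaces = $text) =~ s/ them / him /g;

# Attempt 3: capture the neighbouring non-word characters and put them back.
(my $captured = $text) =~ s/(\W)them(\W)/$1him$2/g;

# Not from the text above: \b is a zero-width word boundary, so it also
# catches a "them" sitting at the very start or end of the string.
(my $boundary = "them and them") =~ s/\bthem\b/him/g;

print "$_\n" for $plain, $spaces, $captured, $boundary;
```

That prints:
Sing the anhim for him. The hime suits him fine.
Sing the anthem for them. The theme suits him fine.
Sing the anthem for him. The theme suits him fine.
him and him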

But what if you wanted to replace "them" and "theem" and "theeeeeeeem" with "him"? That's easy!
$string =~ s/(\W)the+m(\W)/$1him$2/g;
The only character I added was that plus sign. That means, "match 1 or more of the previous character." So now the pattern matches a single e or any longer run of them. If you wanted to match "thm" as well, you could use * instead; it means "match 0 or more of the previous character."
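
To see the difference between + and * for yourself, here's a small sketch. The word list and the %plus and %star hashes are made up for the test, and I've added ^ and $ so each pattern has to cover the whole word rather than just part of it.

```perl
use strict;
use warnings;

# Made-up test words with zero, one, two, and many e's.
my @words = ("thm", "them", "theem", "theeeeeeeem");

my (%plus, %star);
for my $word (@words) {
    # ^ and $ pin the pattern to the whole word (an addition for this test).
    $plus{$word} = ($word =~ /^the+m$/) ? "yes" : "no";   # + needs at least one e
    $star{$word} = ($word =~ /^the*m$/) ? "yes" : "no";   # * is happy with none
    print "$word: the+m $plus{$word}, the*m $star{$word}\n";
}
```

Only "thm" gets a "no", and only from the + version; everything else matches both.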

There are lots of useful things - say, for instance, you want to parse an HTML file and strip out all the HTML tags:
$string =~ s/<[^>]*>//g;
That says, match everything that starts with <, has any number of any character except >, and ends with >, and replace it with nothing. Useful! But hard, too. So what's the hardest regex you could get? Don't ask. Really. There are ways to look ahead of the text, look behind the text, count word boundaries... it can get insane. For instance, here is a handy little regex:
$string =~ m/^[a-z0-9&'+=_-]+(\.[a-z0-9&'+=_-]+)*\@([a-z0-9]+(-+[a-z0-9]+)*\.)+[a-z]{2,3}$/;

The 'm' at the beginning means 'match', instead of 'search and replace' like 's' does. So what does that monster do? It checks that an email address has a sensible shape. It says: match if the part before the @ contains a-z, 0-9, and any of &'+=_-. but doesn't start or end with a dot and doesn't have double dots; and the part AFTER the @ contains a-z, 0-9, dots, and dashes, but doesn't start or end with a dot or a dash, or have double dots; and the whole thing ends with a dot and two or three letters. And that's still a pretty simple example!
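
If you want to poke at these last two patterns, here's a sketch that wraps them in two little subroutines. The names strip_tags and looks_like_email are hypothetical, and the test strings are made up. The tag pattern uses * after the character class so the tag body can be any length. The email pattern is written to do what the description says: it is anchored with ^ so the whole string has to fit, and every dot or dash must be followed by at least one more letter or digit.

```perl
use strict;
use warnings;

# Hypothetical helper: delete anything shaped like <...>.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;   # * lets the tag body be any length, even empty
    return $html;
}

# Hypothetical helper: returns 1 if the address fits the pattern, 0 if not.
# The leading ^ is an assumption of this sketch so the whole string must fit.
sub looks_like_email {
    my ($address) = @_;
    return $address =~ m/^[a-z0-9&'+=_-]+(\.[a-z0-9&'+=_-]+)*\@([a-z0-9]+(-+[a-z0-9]+)*\.)+[a-z]{2,3}$/ ? 1 : 0;
}

print strip_tags("<p>Hello, <b>world</b>!</p>"), "\n";

# qw() keeps the @ signs literal, so nothing gets interpolated.
for my $address (qw(bob.smith@example.com bob..smith@example.com
                    .bob@example.com bob@my-site.co.uk bob@-site.com)) {
    printf "%-24s %s\n", $address, looks_like_email($address) ? "ok" : "rejected";
}
```

The first line printed is "Hello, world!". Of the addresses, only bob.smith@example.com and bob@my-site.co.uk come out "ok". Keep in mind the {2,3} at the end also rejects perfectly real addresses whose last part is longer than three letters (something@example.info, say), and the pattern only knows lowercase letters.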

