[Ilugc] [TIP] perl tutorial IV regular expressions

From: girishvenkatachalam@xxxxxxxxx (Girish Venkatachalam)
Date: Tue, 6 Dec 2011 11:13:00 +0530

A regular expression is an expression that describes the real thing.

It is not the real thing.

It is regular because it is a pattern: it captures what is regular about
the strings you want to find.

Let us look at an example.

$foo = "Girish is here";

if($foo =~ /\w/) {
        print "\$foo contains alphanumeric characters\n";
}

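The =~ is the binding operator. It applies the regex on the right to the
string on the left and is true if the pattern matches.

You use it in a condition the way you use ne or eq in perl, or the ==
check in most languages.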

And the // is the regex container.

Anything inside that is an expression. Like

\w
\d
\S
[a-z]
.*

It is a little difficult to learn but it is well worth it.

For instance at this minute you can easily memorize:

\w - Any word character (alphanumeric or underscore, 0-9a-zA-Z_ )

\d - Any digit 0123456789

\S - Any non-whitespace character (any character other than space,
tab or newline)

[a-z] - It is a character class that matches any one character from a to z.
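
Here is a small sketch to try the four patterns above. The sample strings and the helper name classes_in are made up for illustration. It only reports which pattern matches somewhere in each string.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ("Girish", "2011", "   ", "a_b", "HELLO");

# Returns the patterns from the list above that match somewhere in $s.
sub classes_in {
    my ($s) = @_;
    my @found;
    push @found, '\w'    if $s =~ /\w/;     # a letter, digit or underscore
    push @found, '\d'    if $s =~ /\d/;     # a digit
    push @found, '\S'    if $s =~ /\S/;     # anything that is not whitespace
    push @found, '[a-z]' if $s =~ /[a-z]/;  # a lowercase letter
    return @found;
}

for my $s (@samples) {
    print "'$s' matches: ", join(" ", classes_in($s)), "\n";
}
```

Note that "HELLO" matches \w but not [a-z], because the class only lists lowercase letters.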

Now regular expressions like these are always used inside the two slash
characters like this. //.

/\w+/ is used to mean one or more word characters in a row.

/\d+/ means one or more contiguous digits.

First we will focus on regular expression matching. That is the tough
part. Tomorrow we will look at substitution with the s/// operator.

The match is actually the m// operator but usually nobody uses m. People
just use //.
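
A quick sketch to show that both spellings do the same thing. The variable names $with_m and $without_m are made up for this example.

```perl
use strict;
use warnings;

my $foo = "Girish is here";

# Both lines run the same match; the leading m is optional with slashes.
my $with_m    = ($foo =~ m/here/) ? 1 : 0;
my $without_m = ($foo =~ /here/)  ? 1 : 0;

print "m/here/ matched: $with_m, /here/ matched: $without_m\n";
```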

So to match a sequence of digits, we write it like this.

$s = "I am a string with 1234";

Now

if($s =~ /\d+/) {
        print "digits match.\n";
}

There are two more twists.

The () are called capturing parentheses, or a capture group, inside a //
regex container.

So /\d+/ and /(\d+)/ match exactly the same text, but with one
difference.

The text matched inside the () is stored in a special variable.

This is the real power of regex.

It is stored in $1.

So if you print it you will get 1234.

if($s =~ /(\d+)/) {
   print $1;
}
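
The same idea extends to more than one pair of (). This goes a little beyond the example above: the second group lands in $2. The input string and the names $first and $second are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with two runs of digits separated by a dash.
my $s = "order 1234-56 shipped";
my ($first, $second);

if ($s =~ /(\d+)-(\d+)/) {
    # Groups are numbered by their opening parenthesis, left to right.
    ($first, $second) = ($1, $2);
    print "first=$first second=$second\n";
}
```

It is safest to read $1 and $2 only inside the if, after the match has succeeded.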

I have to explain a few more concepts and we will see more examples
tomorrow.

+

is a repetition operator.

There are other repetition operators:

{1,n} - 1 to n repetitions

But + is cool.  + is actually {1,} (1 or more, with no upper bound)

It is nice and can be used very often for most basic
needs.
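
A small sketch comparing the repetition operators on the string used earlier. It assumes one thing not explained in this post: a match in list context returns what the () captured. The variable names are made up.

```perl
use strict;
use warnings;

my $s = "I am a string with 1234";

# {1,2} takes at least 1 and at most 2 digits, so it stops early.
my ($bounded) = $s =~ /(\d{1,2})/;

# + and {1,} are the same quantifier: 1 or more, no upper bound.
my ($plus)       = $s =~ /(\d+)/;
my ($open_ended) = $s =~ /(\d{1,})/;

print "bounded=$bounded plus=$plus open_ended=$open_ended\n";
```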

\d+

as we saw above means get us all the digits that occur next to each other.

In

"134343 is the string"

it will get us all of the digits.

But not here.

"324s343 3r4 is 33"

In this case a plain \d+ match stops at the first run of digits, 324. We
need something else to match all the digits.

It takes more than the single match we have seen so far.
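
To see the problem, here is a sketch on that second string. The first part uses only what this post covers. The second part is an assumption about where to go next: it uses the /g modifier, which this post has not introduced, as one possible way to collect every run.

```perl
use strict;
use warnings;

my $s = "324s343 3r4 is 33";

# A single match stops at the first run of digits.
my ($first_run) = $s =~ /(\d+)/;

# Assumption: the /g modifier, not covered in this post. In list context
# it returns every non-overlapping match instead of just the first.
my @all_runs = $s =~ /\d+/g;

print "first run: $first_run\n";
print "all runs:  @all_runs\n";
```

Even then we get five separate runs, not one number, so we still have to decide what is regular in the input.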

So always remember that regular expressions need an ability to think
about what is regular in the input string.

Only then can we construct a meaningful and correct regex.

Another concept is called anchors.

Most of the time we look for a pattern at a particular position.

We want to find any string occurring in the beginning of a line or after
a space or something.

Or at the end of the line.

So you say

/^string$/

to say that the whole line is exactly the literal "string", with nothing
before or after it.

^ is the beginning of the line.

$ is the end of the line.

This means we can say.

/^The/

for all lines beginning with The.

/end$/

for all lines ending with "end".
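
The three anchored patterns above can be tried like this. The sample lines are made up, and grep is used only as a short way to keep the lines that match.

```perl
use strict;
use warnings;

# Hypothetical lines, chosen only for illustration.
my @lines = ("The end", "This is The end", "string", "a string", "The endless");

# grep keeps each element for which the regex matches.
my @begins_with_the = grep { /^The/ }     @lines;
my @ends_with_end   = grep { /end$/ }     @lines;
my @exactly_string  = grep { /^string$/ } @lines;

print "begin with The: ", join(", ", @begins_with_the), "\n";
print "end with end:   ", join(", ", @ends_with_end), "\n";
print "exactly string: ", join(", ", @exactly_string), "\n";
```

"The endless" begins with The but does not end with "end", and "a string" fails /^string$/ because of the ^ anchor.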

-Girish

--
G3 Tech
Networking appliance company
web: http://g3tech.in  mail: girish at g3tech.in
