$ cat "

Emacs replace-regexp-fu

"

The time has come to take a deeper look at the super useful Emacs function replace-regexp.

The scenario:

Let's say you have an XML document that looks something like this:

<Person FirstName='Steve' LastName='Smith' Phone='555-12345' Title='Mr.' BirthDate='1950-01-01' />

and you want to turn it into C# code, like this:


var p = new Person
{
FirstName = "Steve",
LastName = "Smith",
Phone = "555-12345",
Title = "Mr.",
BirthDate = DateTime.Parse("1950-01-01")
}

So, a perfect time to put our regex skills to the test. Since Emacs is my editor of choice, replace-regexp is what I'll use to get the job done. The regex for extracting the values we are interested in will look something like this:

<Person FirstName='\(.*?\)' LastName='\(.*?\)' Phone='\(.*?\)' Title='\(.*?\)' BirthDate='\(.*?\)' />

Note that in Emacs regexps you have to escape the parens to create a capture group; unescaped parens match literal parentheses. This is kind of backwards compared to most other regex implementations, but comes in handy when performing search & replace on Lisp code :)

The replace string will look like this:

var p = new Person
{
FirstName = "\1",
LastName = "\2",
Phone = "\3",
Title = "\4",
BirthDate = DateTime.Parse("\5")
}
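
If you would rather run the same kind of replacement from Lisp code than from the minibuffer, here is a minimal sketch using replace-regexp-in-string. The helper name demo-person-to-csharp is made up for this example, and the element is cut down to two attributes to keep it short. Note that inside a Lisp string every backslash has to be doubled.

```elisp
;; Hypothetical helper, not part of Emacs: rewrites a cut-down <Person>
;; element (two attributes only) as a C# object initializer.
(defun demo-person-to-csharp (xml)
  "Return XML with each two-attribute <Person> element rewritten as C#."
  (replace-regexp-in-string
   ;; Inside a Lisp string \( is written \\( and \) is written \\).
   "<Person FirstName='\\(.*?\\)' LastName='\\(.*?\\)' />"
   ;; \\1 and \\2 refer to the first and second capture group.
   "var p = new Person { FirstName = \"\\1\", LastName = \"\\2\" };"
   xml
   t))  ; FIXEDCASE: keep the replacement's case exactly as written

(demo-person-to-csharp "<Person FirstName='Steve' LastName='Smith' />")
;; => "var p = new Person { FirstName = \"Steve\", LastName = \"Smith\" };"
```

When typing the expression interactively in replace-regexp, the backslashes are entered once, exactly as shown in the regex above.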

Ok, so mission accomplished. However, a week later the format is extended to include the address of the person as well:


<Person FirstName='Steve' LastName='Smith' Phone='555-12345' Title='Mr.' BirthDate='1950-01-01'>
<Address PostalCode='12345' State='Florida' City='Jacksonville' Street='Some street' />
</Person>

and the desired C# code:

var p = new Person
{
FirstName = "Steve",
LastName = "Smith",
Phone = "555-12345",
Title = "Mr.",
BirthDate = DateTime.Parse("1950-01-01"),

Address = new Address
{
PostalCode = "12345",
State = "Florida",
City = "Jacksonville",
Street = "Some street"
}
}

Multi-line time! The new regex now includes line breaks. To enter these, you can either write the expression outside of the minibuffer and yank it in when executing the replace-regexp command, or you can enter a newline in the minibuffer by typing C-q C-j. Here is the regex:

<Person FirstName='\(.*?\)' LastName='\(.*?\)' Phone='\(.*?\)' Title='\(.*?\)' BirthDate='\(.*?\)'>
<Address PostalCode='\(.*?\)' State='\(.*?\)' City='\(.*?\)' Street='\(.*?\)' />
</Person>

and the replace expression:

var p = new Person
{
FirstName = "\1",
LastName = "\2",
Phone = "\3",
Title = "\4",
BirthDate = DateTime.Parse("\5"),

Address = new Address
{
PostalCode = "\6",
State = "\7",
City = "\8",
Street = "\9"
}
}

Still quite straightforward, as long as you get the newlines right in the search expression.
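
As a rough sketch of the multi-line case in Lisp code: in a Lisp string the newline is simply written as "\n", which plays the same role as C-q C-j in the minibuffer. The helper name demo-person-address-to-csharp is invented for this example, and the elements are reduced to one attribute each.

```elisp
;; Hypothetical helper, not part of Emacs: matches a <Person> element that
;; spans three lines and pulls one attribute from each of the first two.
(defun demo-person-address-to-csharp (xml)
  "Return XML with each three-line <Person> element rewritten as C#."
  (replace-regexp-in-string
   ;; The "\n" at the end of the first two pieces is a literal newline.
   (concat "<Person FirstName='\\(.*?\\)'>\n"
           "<Address City='\\(.*?\\)' />\n"
           "</Person>")
   "FirstName = \"\\1\", City = \"\\2\""
   xml
   t))  ; FIXEDCASE: keep the replacement's case exactly as written

(demo-person-address-to-csharp
 "<Person FirstName='Steve'>\n<Address City='Jacksonville' />\n</Person>")
;; => "FirstName = \"Steve\", City = \"Jacksonville\""
```

Since "." does not match a newline in Emacs regexps, each capture group stays within its own line.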

Ok, so yet another week goes by, and now there is one small addition to the format: there should be a "MiddleName" attribute added to the person element:


<Person FirstName='Steve' MiddleName='F.' LastName='Smith' Phone='555-12345' Title='Mr.' BirthDate='1950-01-01'>
<Address PostalCode='12345' State='Florida' City='Jacksonville' Street='Some street' />
</Person>

Here is the matching C# code:

var p = new Person
{
FirstName = "Steve",
MiddleName = "F.",
LastName = "Smith",
Phone = "555-12345",
Title = "Mr.",
BirthDate = DateTime.Parse("1950-01-01"),

Address = new Address
{
PostalCode = "12345",
State = "Florida",
City = "Jacksonville",
Street = "Some street"
}
}

Only a minor change, so the regex should only need a small tweak. However, adding this field brings us up to 10 match groups. Let's say we continue with the same pattern and just add the tenth group to the regex and reference it as \10 in the replace expression, like this:


var p = new Person
{
FirstName = "\1",
MiddleName = "\2",
LastName = "\3",
Phone = "\4",
Title = "\5",
BirthDate = DateTime.Parse("\6"),

Address = new Address
{
PostalCode = "\7",
State = "\8",
City = "\9",
Street = "\10"
}
}

The output from the replace will then be:


var p = new Person
{
FirstName = "Steve",
MiddleName = "F.",
LastName = "Smith",
Phone = "555-12345",
Title = "Mr.",
BirthDate = DateTime.Parse("1950-01-01"),

Address = new Address
{
PostalCode = "12345",
State = "Florida",
City = "Jacksonville",
Street = "Steve0"
}
}

If you take a closer look at the Street value, you see that it is actually "Steve0", which is not remotely what you wanted it to be. Instead of referencing the 10th capture group, \10 is actually a reference to the 1st capture group immediately followed by a zero. The reason for this is that Emacs only allows a single digit following the backslash.
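
The single-digit behaviour is easy to reproduce in isolation. Here is a small sketch with a toy regex of ten one-letter groups; the helper name demo-backslash-ten is made up for the example.

```elisp
;; Hypothetical helper, not part of Emacs: shows how \10 is read in a
;; replacement string.
(defun demo-backslash-ten (text)
  "Replace the ten-group match in TEXT using the replacement \\10."
  (replace-regexp-in-string
   ;; Ten capture groups, each matching a single letter.
   "\\(a\\)\\(b\\)\\(c\\)\\(d\\)\\(e\\)\\(f\\)\\(g\\)\\(h\\)\\(i\\)\\(j\\)"
   ;; Read as group 1 followed by a literal 0, not as group 10.
   "\\10"
   text
   t))  ; FIXEDCASE: no case conversion of the replacement

(demo-backslash-ten "abcdefghij")
;; => "a0"
```

Group 10 matched "j", but the result is "a0", just like "Steve0" in the Street value.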

What to do now? We'll have to bring out the big guns. It's Lisp time!

Emacs lets you embed Lisp code within your replace expression by prefixing it with "\,". A useful function for this case is match-string, which takes an integer specifying the number of the capture group to reference. The new expression will then be:


var p = new Person
{
FirstName = "\,(match-string 1)",
MiddleName = "\,(match-string 2)",
LastName = "\,(match-string 3)",
Phone = "\,(match-string 4)",
Title = "\,(match-string 5)",
BirthDate = DateTime.Parse("\,(match-string 6)"),

Address = new Address
{
PostalCode = "\,(match-string 7)",
State = "\,(match-string 8)",
City = "\,(match-string 9)",
Street = "\,(match-string 10)"
}
}

Tada! Now we're up and rolling again.
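
To see match-string reach past group 9 outside of an interactive replace, here is a minimal sketch that does the search and replace in a temporary buffer. The helper name demo-tenth-group and the toy ten-group regex are assumptions made for the example.

```elisp
;; Hypothetical helper, not part of Emacs: replaces each ten-group match
;; with the text of group 10 wrapped in square brackets.
(defun demo-tenth-group (text)
  "Return TEXT with every ten-group match replaced by [group 10]."
  (let ((ten-groups
         "\\(a\\)\\(b\\)\\(c\\)\\(d\\)\\(e\\)\\(f\\)\\(g\\)\\(h\\)\\(i\\)\\(j\\)"))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      (while (re-search-forward ten-groups nil t)
        ;; match-string takes the group number as an integer, so 10 works.
        ;; The last argument makes the replacement literal, so nothing in
        ;; it is interpreted as a backslash reference.
        (replace-match (concat "[" (match-string 10) "]") t t))
      (buffer-string))))

(demo-tenth-group "abcdefghij")
;; => "[j]"
```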

For the fun of it, let's say that another week goes by and yet another attribute is added, this time it is NickName:

<Person FirstName='Steve' MiddleName='F.' LastName='Smith' NickName='Stevenizzle' Phone='555-12345' Title='Mr.' BirthDate='1950-01-01'>
<Address PostalCode='12345' State='Florida' City='Jacksonville' Street='Some street' />
</Person>

Since we've removed the limitation of 9 capture groups we can just modify the regex to add the new capture group and reference.

<Person FirstName='\(.*?\)' MiddleName='\(.*?\)' LastName='\(.*?\)' NickName='\(.*?\)' Phone='\(.*?\)' Title='\(.*?\)' BirthDate='\(.*?\)'>
<Address PostalCode='\(.*?\)' State='\(.*?\)' City='\(.*?\)' Street='\(.*?\)' />
</Person>

and the replace expression, with the NickName line added but the group numbers after it not yet updated:

var p = new Person
{
FirstName = "\,(match-string 1)",
MiddleName = "\,(match-string 2)",
LastName = "\,(match-string 3)",
NickName = "\,(match-string 4)",
Phone = "\,(match-string 4)",
Title = "\,(match-string 5)",
BirthDate = DateTime.Parse("\,(match-string 6)"),

Address = new Address
{
PostalCode = "\,(match-string 7)",
State = "\,(match-string 8)",
City = "\,(match-string 9)",
Street = "\,(match-string 10)"
}
}

As you can see, since we reference the groups by number, we need to increment a bunch of numbers. This is typically a boring thing to do. Since we're already in the regex mindset, let's go regex on our regex and add 1 to all the numbers that need incrementing. Fortunately we already know how to embed Lisp in our replace expression, so all we need to do is to hack away at some lovely Lisp code.

The search expression:

\([0-9]+\)

and the replace expression:

\,(+ 1 (string-to-number (match-string 1)))

By applying this regex and the replace expression above to the lines from the second one referencing match group 4 (the Phone line) and downwards, we get this beauty:

var p = new Person
{
FirstName = "\,(match-string 1)",
MiddleName = "\,(match-string 2)",
LastName = "\,(match-string 3)",
NickName = "\,(match-string 4)",
Phone = "\,(match-string 5)",
Title = "\,(match-string 6)",
BirthDate = DateTime.Parse("\,(match-string 7)"),

Address = new Address
{
PostalCode = "\,(match-string 8)",
State = "\,(match-string 9)",
City = "\,(match-string 10)",
Street = "\,(match-string 11)"
}
}
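
The same increment can be sketched as a small Lisp function, here using replace-regexp-in-string with a function as the replacement. The name demo-increment-numbers is made up for the example.

```elisp
;; Hypothetical helper, not part of Emacs: adds 1 to every run of digits.
(defun demo-increment-numbers (text)
  "Return TEXT with every decimal number in it incremented by one."
  (replace-regexp-in-string
   "[0-9]+"
   ;; Called once per match with the matched digits as a string.
   (lambda (digits)
     (number-to-string (1+ (string-to-number digits))))
   text
   t    ; FIXEDCASE: no case conversion
   t))  ; LITERAL: insert the function's result verbatim

(demo-increment-numbers "Title = \"\\,(match-string 5)\"")
;; => "Title = \"\\,(match-string 6)\""
```

When working interactively, Emacs also provides \#1 inside a "\," expression as shorthand for group 1 converted to a number, so \,(1+ \#1) should do the same job as the replace expression above.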

which in turn gives us the final result:


var p = new Person
{
FirstName = "Steve",
MiddleName = "F.",
LastName = "Smith",
NickName = "Stevenizzle",
Phone = "555-12345",
Title = "Mr.",
BirthDate = DateTime.Parse("1950-01-01"),

Address = new Address
{
PostalCode = "12345",
State = "Florida",
City = "Jacksonville",
Street = "Some street"
}
}

So with the ability to embed Lisp code into your regexes, you are only limited by your imagination and possibly your Lisp skills :)

Written by Erik Öjebo 2011-11-14 16:48
