March 12, 2015 by Daniel P. Clark

Substitution with Regex Groupings

As I continue to grow in experience, I have been looking into how to do the kind of in-place substitution that I used to perform with array matching (split-map-join).  What I’m referring to isn’t just a letter-for-letter substitution, but something that finds a match and modifies it.  For example, if I wanted to quote words prefixed with a pound sign, I used to do this:

str = 'A sentence referring to a #method and what it does.'

arr = str.split(' ')
arr = arr.map {|word|
  if word[0] == "#"
    '"' + word + '"'
  else
    word
  end
}
str = arr.join(' ')
# => 'A sentence referring to a "#method" and what it does.'

This was one of my earliest Ruby techniques for parsing strings.  After learning a bit more, my code looked more like this:

str = 'A sentence referring to a #method and what it does.'

arr = str.split(' ')
arr = arr.map {|word|
  !!word[0]["#"] ? "\"#{word}\"" : word
}
str = arr.join(' ')
# => 'A sentence referring to a "#method" and what it does.'

And then, after finding gsub, I didn’t take full advantage of it:

str = 'A sentence referring to a #method and what it does.'

matches = str.split(' ').select {|word| !!word[0]["#"] }
matches.each {|word|
  str = str.gsub(word, "\"#{word}\"")
}
puts str
# => 'A sentence referring to a "#method" and what it does.'
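One more reason to move away from the loop version: as far as I can tell, it misbehaves when the same word shows up more than once, because every pass through the loop rewrites all occurrences again.  Here is a small sketch of that, next to a single regex-based gsub for comparison.  The sample sentence and the variable names are just made up for illustration.

```ruby
str = 'Call #run and then #run again.'

# Loop version: "#run" is selected twice, and each pass quotes every occurrence.
looped = str.dup
str.split(' ').select { |word| word.start_with?('#') }.each do |word|
  looped = looped.gsub(word, "\"#{word}\"")
end

# Single pass with a capture group: each occurrence is quoted exactly once.
single_pass = str.gsub(/(#\S*)/, '"\1"')

puts looped       # the words end up double-quoted twice
puts single_pass  # the words are quoted once
```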

Needless to say, looking back, these ways of doing things are very wasteful and don’t need to be written out at such length.  Learning regex, match data, and gsub more fully has simplified this down to one simple gsub command that works with regex groupings.  I’ll show you the style I like the most, and then I’ll detail some alternatives.

str = 'A sentence referring to a #method and what it does.'

str.gsub(/(?<foo>#\S*)/, '"\k<foo>"')
# => 'A sentence referring to a "#method" and what it does.'

The ?<foo> syntax I use above names the group foo, and \k<foo> refers back to it.  The group matches whatever pattern follows ?<foo> inside the enclosing () parentheses, and the matched text is put wherever \k<foo> appears in the second parameter.  I’ve chosen the word foo here, and the group name syntax, because words that resemble English are usually easier to follow and learn.
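A short sketch of two details around named groups that are easy to trip over: the same name can be read from the MatchData object, and the replacement string needs its backslash doubled if you write it in double quotes.  The variable names here are only for illustration.

```ruby
str = 'A sentence referring to a #method and what it does.'

# The named group is also available on the MatchData returned by match.
match = str.match(/(?<foo>#\S*)/)
found = match[:foo]

# Single quotes keep the backslash literal, so \k<foo> reaches gsub intact.
single_quoted = str.gsub(/(?<foo>#\S*)/, '"\k<foo>"')

# Inside double quotes the backslash has to be escaped as \\k<foo>.
double_quoted = str.gsub(/(?<foo>#\S*)/, "\"\\k<foo>\"")

puts found
puts single_quoted
puts single_quoted == double_quoted
```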

Regex groups don’t have to be named; they can also be numbered in the order in which their opening parentheses appear.  Each group is delimited by a pair of () parentheses.  Instead of \k<foo> you simply use \1 for the first group (that is, the first set of parentheses, not the first match in the string).

str = "A #method and a #method with a :symbol together."

str.gsub(/(#\S*)/, '<\1>')
# => "A <#method> and a <#method> with a :symbol together."
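To make the numbering a little more concrete, here is a sketch of three related forms: \0 for the whole match, \1 for the first group, and the block form, where the same group shows up as $1.  The variable names are my own.

```ruby
str = "A #method and a #method with a :symbol together."

# \0 stands for the entire match, so no capture group is needed.
whole = str.gsub(/#\S*/, '<\0>')

# \1 stands for the first capture group; here it leaves out the leading '#'.
inner = str.gsub(/#(\S*)/, '<\1>')

# The block form exposes the same group as $1 and allows arbitrary Ruby.
upper = str.gsub(/#(\S*)/) { "<#{$1.upcase}>" }

puts whole
puts inner
puts upper
```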

And to get an idea of how multiple groups behave, I will use an or pipe | between two () groups.

str = "A #method and a #method with a :symbol together."

str.gsub(/(#\S*)|(:\S*)/, '<\1><\2>')
# => "A <#method><> and a <#method><> with a <><:symbol> together."

Notice that because the two groups are alternatives joined with |, only one of them captures anything for a given match, so one of the two <> sections is always empty.  Notice also the order in which they print.
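If the empty <> sections are unwanted, one option is the block form of gsub, where the group that did not take part in the match is simply nil.  This is only a sketch; the choice of <> and [] as wrappers is arbitrary.

```ruby
str = "A #method and a #method with a :symbol together."

result = str.gsub(/(#\S*)|(:\S*)/) do
  # Only one side of the alternation matches, so the other group is nil.
  $1 ? "<#{$1}>" : "[#{$2}]"
end

puts result
```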

Now, putting one group inside the other, we’ll have the second (inner) group match starting at the letter o.

str = "A #method and a #method with a :symbol together."

str.gsub(/(#\S*(o\S*)\S*)/, '<\1><\2>')
# => "A <#method><od> and a <#method><od> with a :symbol together."

Notice that each match produced a value for both \1 and \2, because the inner group is part of the outer one.  If you want several patterns that each get their own replacement, just chain another gsub onto the first, so that each gsub call targets one specific replacement.
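A minimal sketch of chaining, using the same sample sentence: the first call wraps the pound words and the second call wraps the symbols.  Keep in mind that the second gsub runs on the output of the first, so the order can matter if the patterns overlap.

```ruby
str = "A #method and a #method with a :symbol together."

# Each gsub handles one pattern and one replacement.
result = str.gsub(/(#\S*)/, '<\1>').gsub(/(:\S*)/, '[\1]')

puts result
```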

A cool feature available with gsub is being able to replace data with a Hash of Key-Value pairs which will substitute exact matches.  In my testing with this I found it didn’t work on complex results such as HTML/XML tag substitution.  But you can keep it simple.

mappers = {"o" => "oo", "a" => "ay"}

"Welcome to Canada".gsub(/[oa]/, mappers)
# => "Welcoome too Caynayday"
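One thing to watch with the Hash form: the lookup uses the matched text as the key, and a match with no key is replaced by an empty string.  The sketch below widens the pattern to include e to show that, and then uses a Hash with a default block as one possible way to leave unknown matches alone.  The names dropped, safe, and kept are just for illustration.

```ruby
mappers = {"o" => "oo", "a" => "ay"}

# "e" matches the pattern but has no key, so it is replaced with "".
dropped = "Welcome to Canada".gsub(/[oae]/, mappers)

# A default block that returns the key keeps unknown matches unchanged.
safe = Hash.new { |_hash, key| key }
safe.update(mappers)
kept = "Welcome to Canada".gsub(/[oae]/, safe)

puts dropped
puts kept
```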

Going back to nested matches they may come in handy for something like a Pig Latin translator.

sentence = "I want to travel the world and see all the wonderful sights"

consonants = (('b'..'z').to_a - ['e','i','o','u']).join
# => "bcdfghjklmnpqrstvwxyz"

sentence.gsub(/(([#{consonants}])(\S*))/, '\3\2ay')
# => "I antway otay raveltay hetay orldway adnay eesay allay hetay onderfulway ightssay"

It’s not a perfect Pig Latin translator, but it’s passable.  Here there are three sets of parentheses, (()()), and we’re only using the inner two, referenced as \2 and \3.  The outer parentheses capture the whole match as \1, which we don’t use here; the pattern would match the same text without them.  The first inner group matches a single lowercase consonant, and the second inner group matches the rest of the word.  Then we just swap them and add ay to the end for Pig Latin.  Because the pattern isn’t anchored to the start of a word, a word beginning with a vowel is split at its first consonant, which is why and becomes adnay.
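Here is a hedged variation that only drops the outer group and anchors the match to the start of a word with \b, so words beginning with a vowel are left alone.  It also uses the character class intersection that Tom points out in the comments.  It is still only a sketch and not a complete Pig Latin translator.

```ruby
sentence = "I want to travel the world and see all the wonderful sights"

# [[a-z]&&[^aeiou]] is "a lowercase letter that is not a vowel".
# \b makes sure the consonant is the first letter of a word.
result = sentence.gsub(/\b([[a-z]&&[^aeiou]])(\w*)/, '\2\1ay')

puts result
```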

Summary

gsub rocks!  I’ve only recently learned about group variables in regex matchers.  I found out about them while looking into a better way to do inline substitution for my new gem color_pound_spec_reporter.  I wanted colors in my test output, but I didn’t want to have to do some complex string splitting just to wrap the ANSI color methods around it with map.  But now with gsub that’s super easy.

e.g.

"this is true".gsub(/(?<foo>true)/, ANSI::Code.green('\k<foo>'))

And all looks lovely; true shows up in the color green.  Hopefully this will be as invaluable to you as it is to me.  Please feel free to comment, share, subscribe to my RSS Feed, and follow me on twitter @6ftdan!

God Bless!
-Daniel P. Clark

Image by Ian D. Keating via the Creative Commons Attribution 2.0 Generic License.

P.S. I’m still not a master at using regex and I realize that there may be better ways to do things.  I’ve only written this to illustrate substitution examples with gsub.


Comments

  1. Roberto Decurnex
    March 12, 2015 - 1:01 pm

    On the first example you should return the word inside the map or you would get something like [nil, nil, …, '"word"', nil, …]

  2. Hettomei
    March 12, 2015 - 4:12 pm

    I think

    consonants = "bcdfghjklmnpqrstvwxyz"

    is better than a range to an array minus another array all joined 😉

    • Daniel P. Clark
      March 12, 2015 - 5:52 pm

      I suppose running

      (('b'..'z').to_a - ['e','i','o','u']).join
      # => "bcdfghjklmnpqrstvwxyz"

      once in a console to grab the string and set the return value in the code would make more sense. So yes, you are right. I was thinking in terms of less typing, but I can accomplish that once and save the result for more efficient usage.

  3. Tom
    March 16, 2015 - 6:02 am

    Rather than: /([#{consonants}])…./, here’s a cool trick you can do with ruby regex:

    /[[a-z]&&[^aeiou]]/

    See the note in ruby docs about set intersection: http://ruby-doc.org/core-2.2.1/Regexp.html#class-Regexp-label-Character+Classes

    • Daniel P. Clark
      March 16, 2015 - 9:25 am

      Thanks for the tip! That’s awesome!

  4. pguardiario
    March 19, 2015 - 5:41 am

    Some of those groups are unnecessary,
    str.gsub(/(#\S*)/, '')
    is the same as:
    str.gsub(/#\S*/, '')
