Wednesday, January 6, 2010

Regular Expression 1 : Basics and findAllIn

This is the first post in a series on using regular expressions in Scala. An earlier post, Matching Regular Expressions, is also worth a look, although the tips it covers will be revisited in this series.

Perhaps the most important thing for regular expressions in Scala is to be aware of the raw string syntax:
  /*
  Normal strings treat the \ character as the escape character,
  so this fails
  */
  scala> val normalString = "\.+((xyz)|(abc))"
  <console>:1: error: invalid escape character
         val normalString = "\.+((xyz)|(abc))"
                              ^
  /*
  Raw strings are great for regular expressions because you don't have to
  escape \ characters
  */
  scala> val rawString = """\.+((xyz)|(abc))"""
  rawString: java.lang.String = \.+((xyz)|(abc))

The next thing is to realize that one can easily create a Regex object from any string with the r method:
  scala> val regex = """\.+((xyz)|(abc))""".r
  regex: scala.util.matching.Regex = \.+((xyz)|(abc))
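
As a small compilable sketch of the two points so far, the following compares a conventionally escaped string with the raw string form and then builds a Regex with r. The object name RawStringDemo and the sample input "file..abc" are made up for illustration and are not part of the original post.

```scala
import scala.util.matching.Regex

object RawStringDemo {
  // A normal string literal needs every backslash doubled.
  val escaped: String = "\\.+((xyz)|(abc))"

  // A raw (triple-quoted) literal keeps the backslash exactly as written.
  val raw: String = """\.+((xyz)|(abc))"""

  // The r method compiles the pattern into a scala.util.matching.Regex.
  val regex: Regex = raw.r

  def main(args: Array[String]): Unit = {
    println(escaped == raw)                 // both literals denote the same pattern text
    println(regex.findFirstIn("file..abc")) // first match wrapped in an Option
  }
}
```

Both literals produce the same pattern text; the raw form is simply easier to read once a pattern contains several backslashes.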

A Regex has the standard matching methods one might expect, but let's look at findAllIn and the associated MatchData for now. findAllIn returns a MatchIterator, which is an Iterator[String] with MatchData mixed in. Iterating over the MatchIterator yields the full matched string each time; if you need the subgroups, call matchData to convert the MatchIterator to an Iterator[Match].
  // findAllIn returns an iterator over the matches.
  scala> "l|he".r findAllIn "hello xyz" foreach {println _}
  he
  l
  l
  // The elements of a MatchIterator are Strings (not Match objects),
  // so call matchData to iterate over Match objects instead.
  // Note: m.matched is the full matched String, so mkString puts a comma between its characters
  scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched mkString ",")}
  h,e
  l
  l
  // to access subgroups use the matchData method and then subgroups
  // Note: there are 3 subgroups in the regex; a group that did not take part in a match is null
  scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println( m.subgroups mkString ",")}
  h,e,null
  null,null,l
  null,null,l
  /*
  if matched is called the full match is returned (as if you did not convert the iterator to an Iterator[Match])
  */
  scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched)}
  he
  l
  l
  /*
  The following demonstrates more of the methods on Match
  Essentially the elements are:
  (start index of match, end index of match, string before match, string after match, string the match was performed on)
  */
  scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println((m.start, m.end, m.before, m.after, m.source))}
  (0,2,,llo xyz,hello xyz)
  (2,3,he,lo xyz,hello xyz)
  (3,4,hel,o xyz,hello xyz)
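
Because unmatched subgroups come back as null, one possible way to make them safer to handle is to wrap each one in an Option. The sketch below uses only the findAllIn, matchData and subgroups calls shown above; the object name SubgroupDemo and the helper safeSubgroups are hypothetical names chosen for this example.

```scala
import scala.util.matching.Regex

object SubgroupDemo {
  val pattern: Regex = "(h)(e)|(l)".r

  // For every match, wrap each subgroup in Option so that a group which
  // did not take part in the match becomes None instead of null.
  def safeSubgroups(input: String): List[List[Option[String]]] =
    pattern.findAllIn(input).matchData
      .map(m => m.subgroups.map(g => Option(g)))
      .toList

  def main(args: Array[String]): Unit =
    safeSubgroups("hello xyz").foreach(println)
}
```

Option(g) returns None when g is null, so code that consumes the result can pattern match on Some and None rather than checking for null.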

The last methods to look at for this topic are findFirstIn, findFirstMatchIn and the Regex constructor:
  /*
  Group names can be assigned if the Regex constructor is used
  */
  scala> val withNames = new util.matching.Regex("(h)(e)|(l)", "h", "e", "l")
  withNames: scala.util.matching.Regex = (h)(e)|(l)
  scala> withNames findFirstIn "hello xyz"
  res28: Option[String] = Some(he)
  /*
  I know a match will be found so I am extracting the value from the Option by assigning it to Some(he)
  */
  scala> val Some(he) = withNames findFirstMatchIn "hello xyz"
  he: scala.util.matching.Regex.Match = he
  scala> he.groupNames
  res29: Seq[String] = Array(h, e, l)
  scala> he.group("h")
  res30: String = h
  scala> he.group("e")
  res31: String = e
  scala> he.group(1)
  res32: String = h
  scala> he.group(2)
  res33: String = e
  // Uh oh. NullPointer warning!
  scala> he.group(3)
  res34: String = null
  scala> he.groupCount
  res35: Int = 3
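
When it is not certain that a match exists, or that a given group took part in it, a sketch along the following lines avoids both the failing Some(...) extraction and the null group. It relies only on findFirstMatchIn and group(Int); the object name FirstMatchDemo and the helper firstGroup are assumptions made for this illustration, and the helper expects an index between 0 and groupCount.

```scala
import scala.util.matching.Regex

object FirstMatchDemo {
  val pattern: Regex = "(h)(e)|(l)".r

  // Returns the given group of the first match, or None when the input
  // has no match or the group did not take part in the match.
  def firstGroup(input: String, index: Int): Option[String] =
    pattern.findFirstMatchIn(input).flatMap(m => Option(m.group(index)))

  def main(args: Array[String]): Unit = {
    println(firstGroup("hello xyz", 1)) // group 1 of the first match
    println(firstGroup("hello xyz", 3)) // group 3 is null for the match "he"
    println(firstGroup("xyz", 1))       // no match at all
  }
}
```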

5 comments:

  1. Thanks for posting this. I wasn't aware of Scala's built-in regex or the raw string syntax.

  2. By the way, I recently found a way to work around nullable subgroups in pattern matching:

    val pattern = "(h)(e)|(l)".r
    for {
    m <- pattern findAllIn "hello xyz" matchData
    } m match {
    case pattern(x: String, y: String, z) => println("First pattern")
    case pattern(x, y, z: String) => println("Second pattern")
    case _ => println("Weird pattern")
    }

    When one specifies the expected type (in this case, String), Scala won't successfully match nulls against it.

  3. Is this correct in your examples?

    "(h)(e)|(l)".r findAllIn "hello xyz" foreach { m => println(m.matched mkString ",")}


    It seems to me that m is a String in this case, so the call to matched would fail. Do you need to call matchData on the MatchIterator returned by findAllIn?

  4. Thanks jeff I fixed the post. The correct line is:

    ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched mkString ",")}
