Perhaps the most important thing for regular expressions in Scala is to be aware of the raw string syntax:
- /*
- normal strings treat the \ character as the escape character
- so this fails
- */
- scala> val normalString = "\.+((xyz)|(abc))"
- <console>:1: error: invalid escape character
- val normalString = "\.+((xyz)|(abc))"
- ^
- /*
- raw strings are great for regular expressions because you don't have to
- escape \ characters
- */
- scala> val rawString = """\.+((xyz)|(abc))"""
- rawString: java.lang.String = \.+((xyz)|(abc))
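As a small sketch of the point above, the raw string and a normal string with doubled backslashes denote the same pattern text. The object and value names here are made up for illustration.

```scala
object RawStringDemo {
  // A raw (triple-quoted) string keeps each backslash literally.
  val raw: String = """\.+((xyz)|(abc))"""

  // The same text in a normal string needs the backslash doubled.
  val escaped: String = "\\.+((xyz)|(abc))"

  def main(args: Array[String]): Unit = {
    // Both literals produce exactly the same characters.
    println(raw == escaped)
  }
}
```

The raw form is usually easier to read once a pattern contains several backslashes.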
The next thing is to realize that one can easily create a Regex object from any string with the r method:
- scala> val regex = """\.+((xyz)|(abc))""".r
- regex: scala.util.matching.Regex = \.+((xyz)|(abc))
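A minimal sketch of putting such a Regex to use, assuming a few made-up input strings. findFirstIn is part of scala.util.matching.Regex, and the pattern field exposes the underlying java.util.regex.Pattern for a whole-string check.

```scala
object RegexFromStringDemo {
  // Build the Regex from a raw string with the r method.
  val regex = """\.+((xyz)|(abc))""".r

  // First occurrence anywhere in the input, wrapped in an Option.
  val found: Option[String] = regex.findFirstIn("name...abc")

  // No dots followed by xyz or abc, so nothing is found.
  val notFound: Option[String] = regex.findFirstIn("no dots here")

  // Whole-string match through the underlying java.util.regex.Pattern.
  val wholeMatch: Boolean = regex.pattern.matcher("..xyz").matches()

  def main(args: Array[String]): Unit = {
    println(found)
    println(notFound)
    println(wholeMatch)
  }
}
```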
A Regex has the standard matching methods one might expect, but let's look at findAllIn and the associated MatchData for now. findAllIn returns a MatchIterator, which is an Iterator[String] with MatchData. Iterating over the MatchIterator returns the full matched string each time; if you need the subgroups, convert the MatchIterator to an Iterator[Match] with the matchData method.
- // findAllIn returns an iterator over the matches.
- scala> "l|he".r findAllIn "hello xyz" foreach {println _}
- he
- l
- l
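- // Each match can have multiple groups
- // Note: each element of a MatchIterator is a String (not a Match object), so matchData is needed to get Match objects
- // matched is the full matched String, so mkString "," joins its characters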
- scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched mkString ",")}
- h,e
- l
- l
- // to access subgroups use the matchData method
- // Note: there are 3 subgroups in the regex
- scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println( m.subgroups mkString ",")}
- h,e,null
- null,null,l
- null,null,l
- /*
- if matched is called the full match is returned (as if you did not convert the iterator to an Iterator[Match])
- */
- scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched)}
- he
- l
- l
- /*
- The following demonstrates more of the methods on Match
- Essentially the elements are:
- (start index of match, end index of match, string before match, string after match, string the match was performed on)
- */
- scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println((m.start, m.end, m.before, m.after, m.source))}
- (0,2,,llo xyz,hello xyz)
- (2,3,he,lo xyz,hello xyz)
- (3,4,hel,o xyz,hello xyz)
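Outside the REPL it is often handier to collect the matches into lists than to print them. The following is a sketch under that assumption; the object and value names are illustrative. It wraps each subgroup in Option so that groups that did not participate in a match show up as None rather than null, and it uses findAllMatchIn, which yields Match objects directly.

```scala
object MatchDataDemo {
  private val pattern = "(h)(e)|(l)".r

  // Full matched strings, as produced by iterating the MatchIterator directly.
  val matches: List[String] = pattern.findAllIn("hello xyz").toList

  // Subgroups of every match; Option(_) turns a null group into None.
  val groups: List[List[Option[String]]] =
    pattern.findAllIn("hello xyz").matchData
      .map(m => m.subgroups.map(Option(_)))
      .toList

  // (start, end) offsets of each match, via findAllMatchIn.
  val spans: List[(Int, Int)] =
    pattern.findAllMatchIn("hello xyz").map(m => (m.start, m.end)).toList

  def main(args: Array[String]): Unit = {
    println(matches)
    println(groups)
    println(spans)
  }
}
```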
The last methods to look at for this topic are findFirstIn, findFirstMatchIn and the Regex constructor:
- /*
- Group names can be assigned if the Regex constructor is used
- */
- scala> val withNames = new util.matching.Regex("(h)(e)|(l)", "h", "e", "l")
- withNames: scala.util.matching.Regex = (h)(e)|(l)
- scala> withNames findFirstIn "hello xyz"
- res28: Option[String] = Some(he)
- /*
- I know a match will be found, so I extract the value from the Option by pattern matching on Some(he)
- */
- scala> val Some(he) = withNames findFirstMatchIn "hello xyz"
- he: scala.util.matching.Regex.Match = he
- scala> he.groupNames
- res29: Seq[String] = Array(h, e, l)
- scala> he.group("h")
- res30: String = h
- scala> he.group("e")
- res31: String = e
- scala> he.group(1)
- res32: String = h
- scala> he.group(2)
- res33: String = e
- // Uh oh. NullPointer warning!
- scala> he.group(3)
- res34: String = null
- scala> he.groupCount
- res35: Int = 3
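One way to avoid the null shown above is to wrap the group lookup in Option. This is only a sketch: groupOf is a hypothetical helper, not part of the Regex API, and it also covers the case where no match is found at all instead of assuming a Some.

```scala
import scala.util.matching.Regex

object SafeGroupDemo {
  private val pattern: Regex = "(h)(e)|(l)".r

  // Hypothetical helper: group i of the first match, or None when there is
  // no match or when group i did not participate in the match.
  // Note: an index larger than groupCount still throws an exception.
  def groupOf(input: String, i: Int): Option[String] =
    pattern.findFirstMatchIn(input).flatMap(m => Option(m.group(i)))

  def main(args: Array[String]): Unit = {
    println(groupOf("hello xyz", 1))
    println(groupOf("hello xyz", 3))
    println(groupOf("xyz", 1))
  }
}
```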
Thanks for posting this. I wasn't aware of Scala's built-in regex or the raw string syntax.
By the way, I recently found a way to work around nullable subgroups in pattern matching:
val pattern = "(h)(e)|(l)".r
for {
m <- pattern findAllIn "hello xyz" matchData
} m match {
case pattern(x: String, y: String, z) => println("First pattern")
case pattern(x, y, z: String) => println("Second pattern")
case _ => println("Weird pattern")
}
When one specifies the expected type (in this case, String), Scala won't successfully match nulls against it.
I like that!
Is this correct in your examples?
"(h)(e)|(l)".r findAllIn "hello xyz" foreach { m => println(m.matched mkString ",")}
It seems to me that m is a String in this case, so the call to matched would fail. Do you need to call matchData on the MatchIterator returned by findAllIn?
Thanks, Jeff. I fixed the post. The correct line is:
("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched mkString ",")}