Monday, January 11, 2010

Regular Expression 3: Regex matching

This post covers essentially the same ground as Matching Regular Expressions but goes into a bit more detail. I recommend reading both posts, since each contains information the other lacks.

The main new point here is that more advanced matching techniques can be used and, more importantly, that every group is matched, including groups nested inside another group.

Note: The examples use Scala 2.8. Most examples will work with 2.7 but I believe the last example is Scala 2.8 only.

    scala> val date = "11/01/2010"
    date: java.lang.String = 11/01/2010
    scala> val Date = """(\d\d)/(\d\d)/(\d\d\d\d)""".r
    Date: scala.util.matching.Regex = (\d\d)/(\d\d)/(\d\d\d\d)
    /*
    When a Regex object is used as a pattern, each group is bound to a variable
    */
    scala> val Date(day, month, year) = date
    day: String = 11
    month: String = 01
    year: String = 2010
    scala> val Date = """(\d\d)/((\d\d)/(\d\d\d\d))""".r
    Date: scala.util.matching.Regex = (\d\d)/((\d\d)/(\d\d\d\d))
    /*
    This example demonstrates that every group, including nested ones, must be bound;
    otherwise a MatchError is thrown
    */
    scala> val Date(day, monthYear, month, year) = date
    day: String = 11
    monthYear: String = 01/2010
    month: String = 01
    year: String = 2010
    scala> val Date(day, month, year) = date
    scala.MatchError: 11/01/2010
    at .<init>(<console>:5)
    at .<clinit>(<console>)
    // but placeholders work in Regex matching as well:
    scala> val Date(day, _, month, year) = date
    day: String = 11
    month: String = 01
    year: String = 2010
    scala> val Names = """(\S+) (\S*)""".r
    Names: scala.util.matching.Regex = (\S+) (\S*)
    scala> val Names(first, second) = "Jesse Eichar"
    first: String = Jesse
    second: String = Eichar
    /*
    If you want to use Regexes in assignment you must be sure the match will work.
    Otherwise you should do real matching (a match expression with a fallback case)
    */
    scala> val Names(first, second) = "Jesse"
    scala.MatchError: Jesse
    at .<init>(<console>:5)
    at .<clinit>(<console>)
    scala> val M = """\d{3}""".r
    M: scala.util.matching.Regex = \d{3}
    /*
    There must be a group in the Regex for a variable to be bound; without one, M(m) can never match
    (here "Jan" also simply does not match \d{3})
    */
    scala> val M(m) = "Jan"
    scala.MatchError: Jan
    at .<init>(<console>:5)
    at .<clinit>(<console>)
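
As a hedged sketch of the "real matching" mentioned above, the following wraps the nested-group Date regex in a function that returns an Option instead of throwing a MatchError. It also shows one way to test a Regex that has no groups. The names SafeDateMatch, toParts, ThreeDigits and isThreeDigits are made up for this illustration and are not part of the original examples.

```scala
object SafeDateMatch {
  // Same nested-group regex as in the examples above
  val Date = """(\d\d)/((\d\d)/(\d\d\d\d))""".r

  // Hypothetical helper: returns None instead of throwing a MatchError
  def toParts(s: String): Option[(String, String, String)] = s match {
    // The placeholder skips the nested month/year group
    case Date(day, _, month, year) => Some((day, month, year))
    case _ => None
  }

  // A Regex without groups has nothing to bind, so ask the underlying
  // java.util.regex.Pattern whether the whole string matches
  val ThreeDigits = """\d{3}""".r
  def isThreeDigits(s: String): Boolean =
    ThreeDigits.pattern.matcher(s).matches()
}
```

With these definitions, SafeDateMatch.toParts("11/01/2010") yields Some((11,01,2010)) and SafeDateMatch.toParts("Jesse") yields None, so the caller decides what to do when the input does not match.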

The following are a few more complex examples:
    scala> val Date = """((\d\d)/(\d\d)/(\d{4}))|((\w{3}) (\d\d),\s?(\d{4}))""".r
    Date: scala.util.matching.Regex = ((\d\d)/(\d\d)/(\d{4}))|((\w{3}) (\d\d),\s?(\d{4}))
    /*
    The Regex contains an alternation (|), so only half of the groups will be non-null.
    If the first group is a String (non-null), the next three groups hold day/month/year.
    Otherwise, if the 5th group is a String, the pattern was "month day, year".
    Lastly, a catch-all.
    */
    scala> def printDate(date:String) = date match {
         | case Date(_:String,day,month,year,_,_,_,_) => (day,month,year)
         | case Date(_,_,_,_,_:String,month,day,year) => (day,month,year) // process month
         | case _ => ("x","x","x")
         | }
    printDate: (date: String)(String, String, String)
    scala> printDate("Jan 01,2010")
    res0: (String, String, String) = (01,Jan,2010)
    scala> printDate("01/01/2010")
    res1: (String, String, String) = (01,01,2010)
    /*
    A silly example which drops the first element of the date string.
    Not useful, but it demonstrates that we are matching against a sequence, so
    _* can be used to match the rest of the groups
    */
    scala> def split(date:String) = date match {
         | case d @ Date(_:String ,_*) => d drop 3
         | case d @ Date(_,_,_,_,_:String,_*) => d drop 4
         | case _ => "boom"
         | }
    split: (date: String)String
    scala> split ("Jan 31,2004")
    res5: String = 31,2004
    scala> split ("11/12/2004")
    res6: String = 12/2004
    /*
    This is just a reminder that findAllIn returns an iterator which (since it is probably a short iterator) can be converted to a sequence and processed with pattern matching
    */
    scala> val Seq(one,two,_*) = ("""\d\d/""".r findAllIn "11/01/2010" ).toSeq
    one: String = 11/
    two: String = 01/
    scala> val Seq(one,two) = ("""\d\d/""".r findAllIn "11/01/2010" ).toSeq
    one: String = 11/
    two: String = 01/
    // drop the first two matches and assign the rest to d
    scala> val Seq(_,_,d @ _*) = ("""\d\d/""".r findAllIn "11/01/20/10/" ).toSeq
    d: Seq[String] = ArrayBuffer(20/, 10/)
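
One possible alternative to the alternation-based Date regex above is to use one Regex per date format, so that no case has to deal with null groups. This is only a sketch under that assumption; DateParsing, Slashed, Named and parseDate are hypothetical names, not from the original examples.

```scala
object DateParsing {
  // One regex per format instead of a single regex with an alternation
  val Slashed = """(\d\d)/(\d\d)/(\d{4})""".r
  val Named = """(\w{3}) (\d\d),\s?(\d{4})""".r

  // Hypothetical helper: normalizes both formats to (day, month, year)
  def parseDate(date: String): Option[(String, String, String)] = date match {
    case Slashed(day, month, year) => Some((day, month, year))
    case Named(month, day, year)   => Some((day, month, year))
    case _                         => None // catch-all for anything else
  }
}
```

Each case now binds exactly three groups, at the cost of trying the patterns one after the other rather than in a single pass.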

9 comments:

  1. What version of Scala are you using? The last set of examples doesn't work for me in the 2.7.5 REPL. I get the following for the very last example (assuming you meant "d : _*" instead of "d @ _*"):

    error: ')' expected but identifier found.

    And for the two examples prior to that, I get this error:

    error: recursive value x$1 needs type

    So I wrote it as

    val Seq(one:String, two:String) = ...

    But then I got this:

    error: value toSeq is not a member of scala.util.matching.Regex.MatchIterator

    So my first thought was that I might need 2.8 to make this work. Is that true?

  2. I am using Scala 2.8. I will update the post.

  3. What exactly do you mean by:
    /*
    If you want to use Regex's in assignment you must be sure the match will work. Otherwise you should do real matching
    */

    I am trying to use this type of assignment matching with the RegEx object thinking that a non-match will just not match and continue, but Scala exits and from what I've read, you're not supposed to catch MatchErrors -- does this mean assignment with the RegEx object should only EVER be used if you're certain there will be a match?

  4. I want to detect whether something is a comment of the form /*....*/. How can I match the * character?

  5. @Tong: This is standard regex stuff; there is nothing special in Scala about escaping * and other regex characters. The standard escape is \ but you can use quoting as well, like:

    \Qthis is a special quoted regex section *\E

    Of course it is better to use triple-quoted regexes so you don't have to escape your escapes as in Java.

  6. @Eric. If you do something like:

    val RegEx = "(h.*?)".r

    val RegEx(hParam) = "what's up"

    The previous line will throw an exception because "what's up" doesn't start with an 'h'. If you are not absolutely sure the match will succeed you should do:

    val hParam = "what's up" match {
    case RegEx(h) => h
    case _ => "doesn't match h"
    }

    to handle the case where the regex doesn't apply.

  7. I tried to do what Tong asked.

    Here is my code: ("/\\*.\\*/").r

    val s = "/**something*/"
    s.r

    and my error is: Dangling meta character "*" near....

    I fixed my String like this:
    val s = "/*\\*something*/"
    s.r

    It works, but I don't know how it works. Can you explain it to me, Jesse?

  8. As I said before, there is nothing really Scala-related in this question. If you google regex in Java you will see how you need to do escapes.

    However, in this case you are trying to turn your string s into a regular expression. There is no matching going on. If you want to match /** something */ you might do something like:


    scala> val regex = """/\*.*\*/""".r
    regex: scala.util.matching.Regex = /\*.*\*/

    scala> val Comment = """/\*\*(.*)\*/""".r
    Comment: scala.util.matching.Regex = /\*\*(.*)\*/

    scala> val Comment(c) = "/** hello */"
    c: String = " hello "

    Comment is a regex-based extractor. I used """ so I don't need to use double \\ for escaping each *. The parentheses are around .* because that matches anything, so the group will capture everything between a /** and */.

  9. I want to write something like this:
    m = [0-9]*
    n = \.[0-9]*
    o = mn?

    How can I do this in Scala?
