Showing posts with label regex. Show all posts

Wednesday, February 3, 2010

Regex ReplaceAllIn

Note: Updated on Feb 13th for the newer API on Scala 2.8 trunk. (This is life on the bleeding edge, thanks Daniel).

A couple of new methods have just been added to Regex in Scala 2.8. You will need a build of Scala 2.8 more recent than Scala2.8-Beta1.

The methods replace text using a regular expression, and to say they are useful is an understatement. Let's take a look:
    scala> val quote = """I don't like to commit myself about heaven and hell - you see, I have friends in both places.
         | Mark Twain"""
    quote: java.lang.String =
    I don't like to commit myself about heaven and hell - you see, I have friends in both places.
    Mark Twain
    scala> val expr = "e".r
    expr: scala.util.matching.Regex = e
    /*
    This first method is neither new nor interesting, but the new methods are related to it,
    so let's start with the basic form of replaceAllIn
    */
    scala> expr.replaceAllIn(quote, "**")
    res1: String =
    I don't lik** to commit mys**lf about h**av**n and h**ll - you s****, I hav** fri**nds in both plac**s.
    Mark Twain
    // this does the same thing
    scala> quote.replaceAll("e","**")
    res2: java.lang.String =
    I don't lik** to commit mys**lf about h**av**n and h**ll - you s****, I hav** fri**nds in both plac**s.
    Mark Twain
    /*
    Now things get interesting.  Using this form of replaceAllIn we can determine the replacement on a case-by-case basis.
    It provides the Match object as the parameter, so you have complete access to all
    the matched groups, the location of the match, etc.
    The method takes a Match => String function.  Very, very powerful.
    (The replacement here is random, so the output varies from run to run.)
    */
    scala> expr.replaceAllIn(quote, s => if(util.Random.nextBoolean) "?" else "*")
    res5: String =
    I don't lik? to commit mys?lf about h?av?n and h?ll - you s*?, I hav? fri*nds in both plac*s.
    Mark Twain
    /*
    Another example using some of the Match functionality
    */
    scala> expr.replaceAllIn(quote, m => m.start.toString)
    res6: String =
    I don't lik11 to commit mys26lf about h37av40n and h48ll - you s5960, I hav68 fri73nds in both plac90s.
    Mark Twain
    /*
    Another crazy useful method is replaceSomeIn.  It is similar to the replaceAllIn that takes a function, except that the function in replaceSomeIn returns an Option.  If it returns None there is no replacement; otherwise the replacement is performed.  Very nice when dealing with complex regular expressions.
    In this example we replace with - every 'e' whose match starts at or before index 50 in the string
    */
    scala> expr.replaceSomeIn(quote, m => if(m.start > 50) None else Some("-"))
    res3: String =
    I don't lik- to commit mys-lf about h-av-n and h-ll - you see, I have friends in both places.
    Mark Twain
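
The two function-based forms can also be compared side by side outside the REPL. The following is only a minimal sketch: the object name ReplaceSketch, the shortened text and the date-like string are made up for illustration and do not come from the session above.

```scala
import scala.util.matching.Regex

// Hypothetical names and inputs, chosen only for this sketch.
object ReplaceSketch {
  val text = "I have friends in both places."
  val e: Regex = "e".r

  // Match => String: replace every 'e' with the index at which it was found.
  val indexed: String = e.replaceAllIn(text, m => m.start.toString)

  // Match => Option[String]: only matches starting at or before index 10 are replaced.
  val partial: String = e.replaceSomeIn(text, m => if (m.start > 10) None else Some("-"))

  // The Match also exposes the groups, so they can be reordered in the replacement.
  val swapped: String =
    """(\d\d)/(\d\d)""".r.replaceAllIn("11/01", m => m.group(2) + "/" + m.group(1))
}
```

One thing to watch for: in current Scala versions the replacement string treats $ and \ specially (for group references), so a literal replacement containing them is usually passed through Regex.quoteReplacement first.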

Monday, January 11, 2010

Regular Expression 3: Regex matching

This post covers basically the same things as Matching Regular Expressions but goes into a bit more detail. I recommend reading both posts since there is unique information in each.

The main new point shown here is that more advanced matching techniques can be used and, more importantly, that every group is matched, including groups nested inside another group.

Note: The examples use Scala 2.8. Most examples will work with 2.7 but I believe the last example is Scala 2.8 only.

    scala> val date = "11/01/2010"
    date: java.lang.String = 11/01/2010
    scala> val Date = """(\d\d)/(\d\d)/(\d\d\d\d)""".r
    Date: scala.util.matching.Regex = (\d\d)/(\d\d)/(\d\d\d\d)
    /*
    When a Regex object is used in matching, each group is assigned to a variable
    */
    scala> val Date(day, month, year) = date
    day: String = 11
    month: String = 01
    year: String = 2010
    scala> val Date = """(\d\d)/((\d\d)/(\d\d\d\d))""".r
    Date: scala.util.matching.Regex = (\d\d)/((\d\d)/(\d\d\d\d))
    /*
    This example demonstrates that all groups must be assigned; if not, a MatchError is thrown
    */
    scala> val Date(day, monthYear, month, year) = date
    day: String = 11
    monthYear: String = 01/2010
    month: String = 01
    year: String = 2010
    scala> val Date(day, month, year) = date
    scala.MatchError: 11/01/2010
    at .<init>(<console>:5)
    at .<clinit>(<console>)
    // but placeholders work in Regex matching as well:
    scala> val Date(day, _, month, year) = date
    day: String = 11
    month: String = 01
    year: String = 2010
    scala> val Names = """(\S+) (\S*)""".r
    Names: scala.util.matching.Regex = (\S+) (\S*)
    scala> val Names(first, second) = "Jesse Eichar"
    first: String = Jesse
    second: String = Eichar
    /*
    If you want to use Regexes in assignment you must be sure the match will work.  Otherwise you should do real matching
    */
    scala> val Names(first, second) = "Jesse"
    scala.MatchError: Jesse
    at .<init>(<console>:5)
    at .<clinit>(<console>)
    scala> val M = """\d{3}""".r
    M: scala.util.matching.Regex = \d{3}
    /*
    There must be a group in the Regex or the match will fail.
    ("Jan" would not match \d{3} in any case, but "123" fails the same way
    because the pattern M(m) expects exactly one group.)
    */
    scala> val M(m) = "Jan"
    scala.MatchError: Jan
    at .<init>(<console>:5)
    at .<clinit>(<console>)
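
The "real matching" mentioned above, for input that may not match, could look roughly like this. It is a minimal sketch: SafeDateSketch and parse are hypothetical names, and returning an Option is just one way to avoid the MatchError.

```scala
// Hypothetical helper, not part of the original session.
object SafeDateSketch {
  // Same nested-group pattern as above: day, (month/year), month, year.
  private val Date = """(\d\d)/((\d\d)/(\d\d\d\d))""".r

  // Returns None instead of throwing a MatchError when the input does not match.
  def parse(s: String): Option[(String, String, String)] = s match {
    case Date(day, _, month, year) => Some((day, month, year))
    case _                         => None
  }
}
```

The placeholder skips the combined month/year group, exactly as in the assignment form, but the catch-all case means a string such as "Jesse" simply yields None.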

The following are a few more complex examples:
    scala> val Date = """((\d\d)/(\d\d)/(\d{4}))|((\w{3}) (\d\d),\s?(\d{4}))""".r
    Date: scala.util.matching.Regex = ((\d\d)/(\d\d)/(\d{4}))|((\w{3}) (\d\d),\s?(\d{4}))
    /*
    The Regex contains an alternation (|), so only half of the groups will be non-null.
    If the first group is a String (non-null), the next three groups
    are day/month/year.
    Otherwise, if the 5th group is a String, the pattern is month day, year.
    Lastly there is a catch-all.
    */
    scala> def printDate(date:String) = date match {
         | case Date(_:String,day,month,year,_,_,_,_) => (day,month,year)
         | case Date(_,_,_,_,_:String,month,day,year) => (day,month,year) // process month
         | case _ => ("x","x","x")
         | }
    printDate: (date: String)(String, String, String)
    scala> printDate("Jan 01,2010")
    res0: (String, String, String) = (01,Jan,2010)
    scala> printDate("01/01/2010")
    res1: (String, String, String) = (01,01,2010)
    /*
    A silly example which drops the first element of the date string.
    It is not useful, but it demonstrates that we are matching against a sequence, so
    _* can be used to match the rest of the groups
    */
    scala> def split(date:String) = date match {
         | case d @ Date(_:String ,_*) => d drop 3
         | case d @ Date(_,_,_,_,_:String,_*) => d drop 4
         | case _ => "boom"
         | }
    split: (date: String)String
    scala> split ("Jan 31,2004")
    res5: String = 31,2004
    scala> split ("11/12/2004")
    res6: String = 12/2004
    /*
    This is just a reminder that findAllIn returns an iterator which (since it is probably a short iterator) can be converted to a sequence and processed with matching
    */
    scala> val Seq(one,two,_*) = ("""\d\d/""".r findAllIn "11/01/2010" ).toSeq
    one: String = 11/
    two: String = 01/
    scala> val Seq(one,two) = ("""\d\d/""".r findAllIn "11/01/2010" ).toSeq
    one: String = 11/
    two: String = 01/
    // drop the two first matches and assign the rest to d
    scala> val Seq(_,_,d @ _*) = ("""\d\d/""".r findAllIn "11/01/20/10/" ).toSeq
    d: Seq[String] = ArrayBuffer(20/, 10/)
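
An alternative to the typed patterns used with the alternation above is to split the expression into two Regex objects, so that no group can be null. This is only a sketch of that design choice; TwoPatternSketch, Numeric and Named are names invented here.

```scala
// Hypothetical rewrite of printDate using one Regex per date format.
object TwoPatternSketch {
  private val Numeric = """(\d\d)/(\d\d)/(\d{4})""".r
  private val Named   = """(\w{3}) (\d\d),\s?(\d{4})""".r

  // Each case binds exactly the groups of its own pattern, so no null checks are needed.
  def printDate(date: String): (String, String, String) = date match {
    case Numeric(day, month, year) => (day, month, year)
    case Named(month, day, year)   => (day, month, year)
    case _                         => ("x", "x", "x")
  }
}
```

The trade-off is two patterns to maintain instead of one, in exchange for cases that are shorter and easier to read.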

Friday, January 8, 2010

Regular Expression 2 : The rest Regex class

This is the second installment of Regular expressions in Scala. In the first installment the basics were shown and a few of the methods in the Regex class were inspected. This topic will look at the rest of the methods in the Regex class.

Regex.findPrefixMatchOf
    /*
    Returns the match if the regex matches a prefix of the string
    */
    scala> "(h)(e)|(l)".r findPrefixMatchOf "hello xyz"
    res2: Option[scala.util.matching.Regex.Match] = Some(he)
    scala> "lo".r findPrefixMatchOf "hello xyz"
    res3: Option[scala.util.matching.Regex.Match] = None
    /*
    The method is essentially the same as anchoring the regex with the ^ boundary character
    and using findFirstMatchIn
    */
    scala> "^ab".r findFirstMatchIn "ababab"
    res8: Option[scala.util.matching.Regex.Match] = Some(ab)
    scala> "^ab".r findFirstMatchIn "hababab"
    res9: Option[scala.util.matching.Regex.Match] = None
    /*
    findPrefixOf is the same but returns the matched string instead
    */
    scala> "ab".r findPrefixOf "haababab"
    res11: Option[String] = None
    scala> "ab".r findPrefixOf "ababab"
    res12: Option[String] = Some(ab)
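
Since findPrefixOf returns an Option[String], it combines nicely with map. A small sketch, assuming a hypothetical helper leadingNumber that peels a run of digits off the front of a string:

```scala
// Hypothetical helper built on findPrefixOf; names are for illustration only.
object PrefixSketch {
  private val Number = """\d+""".r

  // Splits a leading run of digits from the rest of the input, if there is one.
  def leadingNumber(s: String): Option[(Int, String)] =
    Number.findPrefixOf(s).map(digits => (digits.toInt, s.drop(digits.length)))
}
```

Digits that appear later in the string are ignored, because only a match at the very start counts as a prefix.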

Regex.replaceAllIn -- Essentially the same as using String.replaceAll
Regex.replaceFirstIn -- Essentially the same as using String.replaceFirst
    scala> "(h)(e)|(l)".r replaceAllIn ("hello xyz","__")
    res13: String = ______o xyz
    scala> "hello xyz" replaceAll ("(h)(e)|(l)","__")
    res14: java.lang.String = ______o xyz
    scala> "hello xyz" replaceFirst ("(h)(e)|(l)","__")
    res16: java.lang.String = __llo xyz
    scala> "(h)(e)|(l)".r replaceFirstIn ("hello xyz","__")
    res17: String = __llo xyz

This next section is not Scala-specific, but because Regex does not provide a way to set flags such as CASE_INSENSITIVE or DOTALL, it is useful to show how to set them as part of the standard regex syntax.
    // examples based on the Java tutorial at: http://www.javaranch.com/journal/2003/04/RegexTutorial.htm#flags
    scala> val input = """Hey, diddle, diddle,
         | |The cat and the fiddle,
         | |The cow jumped over the moon.
         | |The little dog laughed
         | |To see such sport,
         | |And the dish ran away with the spoon.""".stripMargin
    input: String =
    Hey, diddle, diddle,
    The cat and the fiddle,
    The cow jumped over the moon.
    The little dog laughed
    To see such sport,
    And the dish ran away with the spoon.
    // by default a regex is case sensitive
    scala> """the \w+?(?=\W)""".r findAllIn input foreach (println _)
    the fiddle
    the moon
    the dish
    the spoon
    /* the (?i) makes the match case insensitive. The complete set of options is:
    (?idmsux)
      • i - case insensitive
      • d - only unix lines are recognized as end of line
      • m - enable multiline mode
      • s - . matches any character, including line end
      • u - enables Unicode-aware case folding
      • x - permits whitespace and comments in the pattern
    */
    scala> """(?i)the \w+?(?=\W)""".r findAllIn input foreach (println _)
    The cat
    the fiddle
    The cow
    the moon
    The little
    the dish
    the spoon
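
To see the effect of two of these inline flags in isolation, here is a minimal sketch. FlagSketch and its two-line input are invented for this example; the (?i) and (?s) flags themselves are standard java.util.regex syntax.

```scala
// Hypothetical two-line input, used only to contrast the flags.
object FlagSketch {
  val input = "The cat\nand the fiddle"

  // Without flags only the lower-case "the" matches.
  val sensitive: List[String] = """the \w+""".r.findAllIn(input).toList

  // (?i) switches on case-insensitive matching for the whole pattern.
  val insensitive: List[String] = """(?i)the \w+""".r.findAllIn(input).toList

  // (?s) lets . match line terminators, so the match can span both lines.
  val dotAll: Option[String] = """(?s)cat.*fiddle""".r.findFirstIn(input)

  // Without (?s) the . stops at the line break and nothing is found.
  val noDotAll: Option[String] = """cat.*fiddle""".r.findFirstIn(input)
}
```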

Wednesday, January 6, 2010

Regular Expression 1 : Basics and findAllIn

This is the first of a series on regular expression use in Scala. A related earlier post is also worth looking at, although the same tips will be revisited in this series: Matching Regular Expressions.

Perhaps the most important thing for regular expressions in Scala is to be aware of the raw string syntax:
    /*
    normal strings treat the \ character as the escape character
    so this fails
    */
    scala> val normalString = "\.+((xyz)|(abc))"
    <console>:1: error: invalid escape character
           val normalString = "\.+((xyz)|(abc))"
                                ^
    /*
    raw strings are great for regular expressions because you don't have to
    escape \ characters
    */
    scala> val rawString = """\.+((xyz)|(abc))"""
    rawString: java.lang.String = \.+((xyz)|(abc))
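
For comparison, the normal string does compile once every backslash is doubled, and it then holds the same characters as the raw string. A minimal sketch (RawStringSketch is a name made up for this example):

```scala
// Hypothetical object comparing the two ways of writing the same pattern.
object RawStringSketch {
  // In a normal string every backslash has to be doubled.
  val escaped = "\\.+((xyz)|(abc))"

  // In a raw (triple-quoted) string the backslash is taken literally.
  val raw = """\.+((xyz)|(abc))"""

  // Both literals produce exactly the same String.
  val same: Boolean = escaped == raw

  // The pattern matches one or more dots followed by xyz or abc.
  val found: Option[String] = raw.r.findFirstIn("..xyz")
}
```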

The next thing is to realize that one can easily create a Regex object from any string with the r method:
    scala> val regex = """\.+((xyz)|(abc))""".r
    regex: scala.util.matching.Regex = \.+((xyz)|(abc))

A regex has the standard matching methods one might expect, but let's look at findAllIn and the associated MatchData for now. findAllIn returns a MatchIterator, which is an Iterator[String] with MatchData. Iterating over the MatchIterator returns the full matched string; if you need the subgroups you will need to convert the MatchIterator to an Iterator[Match] with the matchData method.
    // findAllIn returns an iterator over the matches.
    scala> "l|he".r findAllIn "hello xyz" foreach {println _}
    he
    l
    l
    // Each match can have multiple groups
    // Note: the elements of a MatchIterator are Strings (not Match objects);
    // matched is the full matched String, so mkString only joins its characters here
    scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched mkString ",")}
    h,e
    l
    l
    // to access subgroups use the matchData method
    // Note: there are 3 subgroups in the regex
    scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println( m.subgroups mkString ",")}
    h,e,null
    null,null,l
    null,null,l
    /*
    if matched is called the full match is returned (as if you did not convert the iterator to an Iterator[Match])
    */
    scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched)}
    he
    l
    l
    /*
    The following demonstrates more of the methods on Match
    Essentially the elements are:
    (start index of match, end index of match, string before match, string after match, string the match was performed on)
    */
    scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.start, m.end, m.before, m.after, m.source)}
    (0,2,,llo xyz,hello xyz)
    (2,3,he,lo xyz,hello xyz)
    (3,4,hel,o xyz,hello xyz)
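
The null entries printed by subgroups above can be made safer by wrapping each group in an Option. This is a minimal sketch under that assumption; GroupSketch is a hypothetical name and the wrapping is a convention of this example, not something Regex does for you.

```scala
// Hypothetical object collecting the subgroups of every match.
object GroupSketch {
  private val regex = "(h)(e)|(l)".r

  // One List per match; each subgroup is wrapped in Option so that
  // groups which did not take part in the match become None instead of null.
  val groups: List[List[Option[String]]] =
    regex.findAllIn("hello xyz").matchData.map(m => m.subgroups.map(Option(_))).toList
}
```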

The last methods to look at for this topic are findFirstIn, findFirstMatchIn and the Regex constructor:
    /*
    Group names can be assigned if the Regex constructor is used
    */
    scala> val withNames = new util.matching.Regex("(h)(e)|(l)", "h", "e", "l")
    withNames: scala.util.matching.Regex = (h)(e)|(l)
    scala> withNames findFirstIn "hello xyz"
    res28: Option[String] = Some(he)
    /*
    I know a match will be found so I am extracting the value from the Option by assigning it to Some(he)
    */
    scala> val Some(he) = withNames findFirstMatchIn "hello xyz"
    he: scala.util.matching.Regex.Match = he
    scala> he.groupNames
    res29: Seq[String] = Array(h, e, l)
    scala> he.group("h")
    res30: String = h
    scala> he.group("e")
    res31: String = e
    scala> he.group(1)
    res32: String = h
    scala> he.group(2)
    res33: String = e
    // Uh oh. NullPointer warning!
    scala> he.group(3)
    res34: String = null
    scala> he.groupCount
    res35: Int = 3
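
One way to defuse the null returned by he.group(3) is to go through Option, both for the match itself and for the group. A minimal sketch, with NullGroupSketch and firstGroup as hypothetical names:

```scala
// Hypothetical helper that never returns null.
object NullGroupSketch {
  private val regex = "(h)(e)|(l)".r

  // Looks up a group by index in the first match and hides the null behind an Option.
  def firstGroup(input: String, index: Int): Option[String] =
    regex.findFirstMatchIn(input).flatMap(m => Option(m.group(index)))
}
```

This also removes the need for the val Some(he) = ... assignment, which throws a MatchError when no match is found.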

Matching in for-comprehensions

At a glance a for-comprehension appears to be equivalent to a Java for-loop, but it is much, much more than that. As shown in the post for-comprehensions, a for-comprehension can have guards that filter which elements are processed:
    scala> for ( x <- 1 to 10; if (x >4) ) println(x)
    5
    6
    7
    8
    9
    10

They can be used to construct new collections:
    scala> for( i <- List( "a", "b", "c") ) yield "Word: "+i
    res1: List[java.lang.String] = List(Word: a, Word: b, Word: c)

They can contain multiple generators:
    scala> for {x <- 1 to 10
         |      if(x%2 == 0)
         |      y <- 1 to 5} yield (x,y)
    res1: scala.collection.immutable.IndexedSeq[(Int, Int)] = IndexedSeq((2,1), (2,2), (2,3), (2,4), (2,5), (4,1), (4,2), (4,3), (4,4), (4,5), (6,1), (6,2), (6,3), (6,4), (6,5), (8,1), (8,2), (8,3), (8,4), (8,5), (10,1), (10,2), (10,3), (10,4), (10,5))

What has not been covered is that the generators also perform pattern matching:
    scala> for ( (x,y) <- (6 to 1 by -2).zipWithIndex) println (x,y)
    (6,0)
    (4,1)
    (2,2)

This is not surprising as this also occurs during normal assignment. But what is interesting is that the pattern matching can act as a guard as well. See Extractor examples and Assignment and Parameter Objects for more information on pattern matching and extractors.
    scala> val args = Array( "h=2", "b=3")
    args: Array[java.lang.String] = Array(h=2, b=3)
    scala> val Property = """(.+)=(.+)""".r
    Property: scala.util.matching.Regex = (.+)=(.+)
    scala> for {Property(key,value) <- args } yield (key,value)
    res0: Array[(String, String)] = Array((h,2), (b,3))
    scala> Map(res0:_*)
    res1: scala.collection.immutable.Map[String,String] = Map(h -> 2, b -> 3)
    scala> res1("h")
    res3: String = 2

Now, just for fun, here is a similar example that uses symbols instead of strings for the keys:
    scala> val args = Array( "h=2", "b=3")
    args: Array[java.lang.String] = Array(h=2, b=3)
    scala> val Property = """(.+)=(.+)""".r
    Property: scala.util.matching.Regex = (.+)=(.+)
    scala> for {Property(key,value) <- args } yield (Symbol(key),value)
    res0: Array[(Symbol, String)] = Array(('h,2), ('b,3))
    scala> Map(res0:_*)
    res1: scala.collection.immutable.Map[Symbol,String] = Map('h -> 2, 'b -> 3)
    scala> res1('h)
    res2: String = 2
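
The guard-like behaviour of the pattern is easier to see when one of the arguments does not match. The sketch below uses collect with the same Property pattern, a close relative of the for-comprehension that spells out the filtering explicitly; PropertySketch and the extra "verbose" argument are assumptions made for this example.

```scala
// Hypothetical helper; collect keeps only the elements the pattern matches.
object PropertySketch {
  private val Property = """(.+)=(.+)""".r

  // Entries that do not match the pattern are skipped rather than raising a MatchError.
  def parse(args: Seq[String]): Map[String, String] =
    args.collect { case Property(key, value) => key -> value }.toMap
}
```

For instance, PropertySketch.parse(Seq("h=2", "verbose", "b=3")) keeps the two key=value pairs and silently drops "verbose".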

Wednesday, September 16, 2009

Matching Regular expressions

This topic is derived from the blog post: Using pattern matching with regular expressions in Scala

The Regex class in Scala provides a very handy feature that allows you to match against regular expressions. This makes dealing with certain types of regular expression very clean and easy to follow.

What needs to be done is to create a Regex object and assign it to a val. It is recommended that the name of the val start with an uppercase letter; see the topic on matching for the assumptions matching makes based on the first letter of the identifier in a case clause.

There is nothing like examples to help explain an idea:
    // I normally use raw strings (""") for regular expressions so that I don't have to escape all \ characters
    // There are two ways to create Regex objects.
    // 1. Use RichString's r method
    // 2. Create it using the normal Regex constructor
    scala> val Name = """(\w+)\s+(\w+)""".r
    Name: scala.util.matching.Regex = (\w+)\s+(\w+)
    scala> import scala.util.matching._
    import scala.util.matching._
    // Notice the val name starts with an upper case letter
    scala> val Name = new Regex("""(\w+)\s+(\w+)""")
    Name: scala.util.matching.Regex = (\w+)\s+(\w+)
    scala> "Jesse Eichar" match {
         | case Name(first,last) => println("found: ", first, last)
         | case _ => println("oh no!")
         | }
    (found: ,Jesse,Eichar)
    scala> val FullName = """(\w+)\s+(\w+)\s+(\w+)""".r
    FullName: scala.util.matching.Regex = (\w+)\s+(\w+)\s+(\w+)
    // If you KNOW that the match will work you can assign it to a variable
    // Only do this if you are sure the match will work, otherwise you will get a MatchError
    scala> val FullName(first, middle, last) = "Jesse Dale Eichar"
    first: String = Jesse
    middle: String = Dale
    last: String = Eichar
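
When it is not known in advance which shape the input has, the two patterns can be tried in turn inside a single match. A minimal sketch, with NameSketch and describe as hypothetical names:

```scala
// Hypothetical helper trying the three-part pattern first, then the two-part one.
object NameSketch {
  private val Name     = """(\w+)\s+(\w+)""".r
  private val FullName = """(\w+)\s+(\w+)\s+(\w+)""".r

  // A Regex used as a pattern must match the whole string, so the cases do not overlap.
  def describe(s: String): String = s match {
    case FullName(first, middle, last) => "first=" + first + " middle=" + middle + " last=" + last
    case Name(first, last)             => "first=" + first + " last=" + last
    case _                             => "no match"
  }
}
```

The catch-all case plays the role of the "oh no!" branch above and avoids the MatchError that the assignment form would throw.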