Monday, August 6, 2012

Scala-IO Core: Long Traversable

The LongTraversable trait is one of the most important types in Scala IO. Input provides a uniform way of creating views on the data (as a string, as a byte array, or as a LongTraversable of something like bytes).

LongTraversable is a scala.collection.Traversable with some extra capabilities. A few of the salient points of LongTraversable are:
  • It is a lazy/non-strict collection similar to Stream. In other words, you can perform operations like map, flatMap, filter, collect, etc. without accessing the resource
  • Methods like slice and drop will (if possible for the resource) skip the dropped bytes without reading them
  • Each usage of the LongTraversable will typically open and close the underlying resource.
  • Has methods that one typically finds in Seq, for example: zip, apply, containsSlice
  • Has methods that take or return Longs instead of Ints, like ldrop, lslice, ltake, lsize
  • Has a limitFold method that allows fold-like behaviour with extra features like skip and early termination
  • Can be converted to an AsyncLongTraversable which has methods that return Futures instead and won't block the program
  • Can be converted to a Process object for advanced data processing pipelines
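
To make the first point concrete, here is a small sketch of the same lazy behaviour using only the standard library's view (Scala 2.13 or later assumed). It is an analogy, not scala-io code: LazyViewSketch, run and the produced counter are names invented for this illustration, and the counter stands in for reads against a real resource.

```scala
// Plain Scala collections only (Scala 2.13+); no scala-io calls are used here.
object LazyViewSketch {
  // Returns (elements produced before forcing, forced result, elements produced after forcing).
  def run(): (Int, List[Int], Int) = {
    var produced = 0
    // Stand-in for a resource: every element "read" bumps the counter.
    val source = (1 to 1000).view.map { i => produced += 1; i }
    // Building the pipeline does not touch the source.
    val pipeline = source.filter(_ % 2 == 0).map(_ * 10).take(3)
    val before = produced
    // toList plays the role of force: only now are elements produced.
    val result = pipeline.toList
    (before, result, produced)
  }
}
```

Calling LazyViewSketch.run() should report that nothing was produced while the pipeline was being built, and that only a small prefix of the 1000 elements was produced once the result was forced.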
Example usage:
import scalax.io._
import java.net.URL
 
val file1 = Resource.fromURL(new URL("http://www.scala-lang.org"))
val file2 = Resource.fromURL(new URL("http://www.camptocamp.com"))
 
// scala-io versions > 0.4.1 will have method
//   Resource.fromURLString("http://xyz.com")
// but earlier versions use an overloaded method:
//   Resource.fromURL("http://www.scala-lang.org")
 
// A simple example of comparing all bytes in
// one file with those of another.
// combining zip with sliding is a good way to perform operations
// on sections of two (or more) files
val zipped = file1.bytes.zip(file2.bytes) map {
  case (file1Byte, file2Byte) =>
    file2Byte < file1Byte
}
 
// take the first 5 results and load them into memory
val fiveBytes = zipped.take(5).force
 
// for debugging in the REPL let's print them out
fiveBytes mkString ","
 
// Add a line number to each line in a file
//
// Note: Since methods in an Input object return LongTraversableView objects,
// none of the zip examples open the file. To do that you must call
// force or some other method that forces a read to take place.
val addedLineNumbers = file1.lines().zipWithIndex.map {
  case (line,idx) => idx+" "+line
}
// print out second group of 5 lines
 
addedLineNumbers.drop(5).take(5) foreach println
 
// check if file 1 startsWith file 2
file1.bytes.startsWith(file2.bytes)
 
// The number of consecutive lines starting at 0 containing <
file1.lines().segmentLength(_ contains "<",0)
 
// check if all lines in file1 are the same as in file2 ignoring case
file1.lines().corresponds(file2.lines())(_ equalsIgnoreCase _)
 
// Check if file1 has the same bytes as file2
file1.bytes.sameElements(file2.bytes)
 
// silly example, but it shows that the value
// being compared can be any traversable
file1.bytes.sameElements(1 to 30)
 
// use sliding to visit each 1008 bytes.
// map splits the window into two parts, block and checksum
val blocks = file1.bytes.sliding(1008,1008).map{_ splitAt 1000}
 
// grouped is sliding(size,size) so the following is equivalent
val blocks2 = file1.bytes.grouped(1008).map{_ splitAt 1000}
 
blocks2 foreach {
  case (block,checksum) =>
    // verify checksum and process
    println(block take 5)
}
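
The block/checksum split at the end of the listing can be tried without any network resource. The sketch below uses an in-memory Seq[Byte] in place of file1.bytes; BlockSplitSketch and its two methods are hypothetical names for this illustration, and the sizes are parameters rather than the 1000 and 8 used above.

```scala
// Plain Scala collections only; an in-memory Seq stands in for file1.bytes.
object BlockSplitSketch {
  // Cuts data into windows of blockSize + checksumSize bytes and
  // splits each window into a (block, checksum) pair.
  def viaGrouped(data: Seq[Byte], blockSize: Int, checksumSize: Int): List[(Seq[Byte], Seq[Byte])] =
    data.grouped(blockSize + checksumSize).map(_.splitAt(blockSize)).toList

  // Same result, written with sliding(size, size) instead of grouped(size).
  def viaSliding(data: Seq[Byte], blockSize: Int, checksumSize: Int): List[(Seq[Byte], Seq[Byte])] = {
    val window = blockSize + checksumSize
    data.sliding(window, window).map(_.splitAt(blockSize)).toList
  }
}
```

One thing this makes easy to see: when the data length is not a multiple of the window size, the last window is short, so its checksum part may be truncated or empty. Real code that verifies checksums would need to handle that case.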

The limitFold method can be quite useful to process only a portion of the file if you don't know ahead of time what the indices of the portion are:
import scalax.io._
import java.net.URL
 
val in:Input = Resource.fromURL(new URL("http://www.camptocamp.com"))
 
/**
 * Skip first 10 bytes and sum a random number of bytes up
 * to 20 bytes
 */
in.bytes.drop(10).take(20).limitFold(10) {
  case (acc, next) if util.Random.nextBoolean => End(acc + next)
  case (acc, next) => Continue(acc + next)
}
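
For readers who want to see the idea behind limitFold in isolation, here is a minimal sketch of a fold with early termination over a plain Iterable. It is not the scala-io implementation: Step, Continue and End below are stand-ins defined locally that only mirror the names used above, the skip feature is left out, and sumUntil is a made-up helper that replaces the random stop condition with a deterministic one.

```scala
// A self-contained emulation; nothing here comes from scala-io.
object LimitFoldSketch {
  // Hypothetical stand-ins mirroring the names Continue and End used above.
  sealed trait Step[A]
  final case class Continue[A](value: A) extends Step[A]
  final case class End[A](value: A) extends Step[A]

  // Folds over data, stopping as soon as f returns End.
  def limitFold[A, B](data: Iterable[A], init: B)(f: (B, A) => Step[B]): B = {
    val it = data.iterator
    var acc = init
    var done = false
    while (!done && it.hasNext) {
      f(acc, it.next()) match {
        case Continue(v) => acc = v
        case End(v)      => acc = v; done = true
      }
    }
    acc
  }

  // Sums elements until the running total reaches limit, then stops.
  def sumUntil(data: Iterable[Int], limit: Int): Int =
    limitFold(data, 0) { (acc, next) =>
      val total = acc + next
      if (total >= limit) End(total) else Continue(total)
    }
}
```

The point of the design is that the function passed in decides, element by element, whether the traversal goes on, so the caller does not need to know the end index in advance.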

1 comment:

  1. I think
    // print out second 5 characters
    should be:
    // print out second 5 lines
