21 June 2013

scoobi 0.7.0

This is a greatly refactored and enhanced version!

New Features

  • new API for persisting DLists and DObjects
  • "Checkpoints" to save intermediary results (with customisable expiry policy)
  • Scoobi REPL to run experiments with your cluster
  • Reduction datatype and combinators to create combiner functions
  • Hadoop Counters can be used in parallelDo operations
  • works on Scala 2.10
  • refactored the ScoobiApp trait with several command-line options
  • "map-only" jobs don’t use a reducer

Improvements

  • new DList methods: shuffle, isEqualTo, stratify, zipWithIndex, diff, distinctDiff
  • DObjects can be saved and loaded
  • joinFullOuter operation in the relational library
  • a Scoobi job fails when an intermediate Hadoop job fails
  • added more logging, plus a URL to the job on the job tracker when there is a failure
  • added type/schema checking in Sequence and Avro datasources
  • a DList can now be created with Text elements (or any Writable)
  • input checks can be customised when reading files
  • #85: set the minimum and maximum number of reducers
  • #129: added the ability to set the input size threshold per reducer
  • #193: a DList can be used in a for comprehension
  • #235: new WireFormat for Generic Avro records

Changes

  • Grouping definitions must now return scalaz.Ordering values instead of plain Ints
  • combine operations now take a Reduction object instead of a function. See the Reduction object for a list of combinators to create Reductions
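
To illustrate why a combiner might be packaged as a value rather than passed as a bare function, here is a minimal, self-contained sketch in plain Scala. It does not use Scoobi: the `Reduction` case class, its `zip` method, `sumInt`, `maxInt` and `combineValues` are hypothetical stand-ins written for this example, and the real Reduction object and its combinators may differ in names and signatures.

```scala
object ReductionSketch {
  // Hypothetical stand-in, NOT Scoobi's own class: an associative
  // binary operation wrapped as a value so that it can be composed.
  final case class Reduction[A](reduce: (A, A) => A) {
    // Build a reduction on pairs from two independent reductions.
    def zip[B](other: Reduction[B]): Reduction[(A, B)] =
      Reduction[(A, B)] { case ((a1, b1), (a2, b2)) =>
        (reduce(a1, a2), other.reduce(b1, b2))
      }
  }

  // Two basic reductions, named here for illustration only.
  val sumInt: Reduction[Int] = Reduction[Int](_ + _)
  val maxInt: Reduction[Int] = Reduction[Int]((a, b) => math.max(a, b))

  // What a combiner does for a single key: fold the (non-empty)
  // group of values into one value using the reduction.
  def combineValues[A](values: Iterable[A], r: Reduction[A]): A =
    values.reduce(r.reduce)

  def main(args: Array[String]): Unit = {
    val grouped = Map("a" -> Seq(1, 2, 3), "b" -> Seq(10, 20))
    // Sum the values of each key.
    val sums = grouped.map { case (k, vs) => (k, combineValues(vs, sumInt)) }
    println(sums)
    // Compute a sum and a maximum in a single pass over pairs.
    println(combineValues(Seq((1, 5), (2, 3)), sumInt.zip(maxInt)))
  }
}
```

The point of the sketch is the design choice: once the operation is a value, small reductions can be composed into larger ones (here with the hypothetical `zip`) instead of writing a new function by hand for every combination.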

Fixes

  • fixed the computation of splits when creating a ChannelInputFormat
  • #183: fixed DList.distinct
  • #200: enabled the use of ‘\001’ as a separator for TextDelimitedFile
  • #211: better display of file sizes
  • #239: FileUtil.copy in mixed S3/HDFS

Scoobi is a productivity library for writing Hadoop jobs in Scala.

For more information visit: http://nicta.github.com/scoobi.

