Spark Summit East 2017: Spark + Parquet In Depth

This is a talk I gave at Spark Summit East 2017 with my boss/mentor Robbie Strickland.

Parquet is a big data storage format that was integral to our analytics workflow. This talk details the ins and outs of the format in connection with Apache Spark.

Spark Summit EU 2017: Spark-Bench

spark-bench is an open-source benchmarking tool, and it’s also so much more.

spark-bench is a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. spark-bench began as a benchmarking suite to get timing numbers on very specific algorithms, mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases.

This talk will discuss the high-level design and capabilities of spark-bench before walking through some major, practical use cases. Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and I/O-intensive workloads; and, yes, even benchmarking. In particular, this talk will address the use of spark-bench in developing new features for Spark core.


Spark-Bench is a ground-up rewrite I did of a benchmarking suite pioneered by folks at IBM Research.

While researchers are great at researching, they also have a strong affinity for bash and Java. The ground-up rewrite moved multiple independent Maven projects into one central Scala project built using SBT. All configuration was moved from a scattering of bash variables to one centralized config file.

Spark-bench is now more than just a benchmarking suite: it’s a flexible platform for simulation and comparison of Spark use cases of all sorts.


Streamsx.Cassandra is a toolkit for operators that connect IBM Streams to a Cassandra cluster.

It is in use for production applications at The Weather Company.


The HokeyPokeyTree is a generic tree structure useful for many different applications.

The HokeyPokeyTree API gives users the ability to insert right child nodes, take right child nodes out, insert right child nodes, and shake them all about. After insertion and removal and insertion of nodes, the tree randomizes such that nodes are randomly assigned new parents and children.

Indeed, this is what it’s all about.
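
The HokeyPokeyTree operations described above could be sketched roughly as follows. This is a minimal illustration in Python, not the actual HokeyPokeyTree API: the method names (put_right_child_in, take_right_child_out, shake_all_about), the choice to treat the last child as the "right" child, and the decision to keep the root fixed during the shake are all assumptions made for the example.

```python
import random


class HokeyPokeyTree:
    """Illustrative tree: each node keeps an ordered list of children,
    and the "right child" is assumed to be the last one in that list."""

    def __init__(self, value, seed=None):
        self.value = value
        self.children = []
        self._rng = random.Random(seed)

    def put_right_child_in(self, value):
        """Append a new rightmost child and return it."""
        child = HokeyPokeyTree(value)
        self.children.append(child)
        return child

    def take_right_child_out(self):
        """Remove and return the rightmost child (with its subtree), or None."""
        return self.children.pop() if self.children else None

    def values(self):
        """Return all values in the tree in pre-order."""
        out = [self.value]
        for child in self.children:
            out.extend(child.values())
        return out

    def shake_all_about(self):
        """Rebuild the tree below this node with randomly chosen parents.

        Assumption: this node stays the root. Every other value is
        re-attached under a node picked at random from those already
        placed, so the result is always one connected tree.
        """
        rest = self.values()[1:]
        self._rng.shuffle(rest)
        self.children = []
        placed = [self]
        for value in rest:
            parent = self._rng.choice(placed)
            placed.append(parent.put_right_child_in(value))


# Usage: put a right child in, take it out, put it back in, then shake.
tree = HokeyPokeyTree("root", seed=1)
a = tree.put_right_child_in("a")
a.put_right_child_in("c")
tree.put_right_child_in("b")
removed = tree.take_right_child_out()
tree.put_right_child_in(removed.value)
tree.shake_all_about()
```

Shaking keeps the same set of values and only changes which node is whose parent, which matches the description of nodes being assigned new parents and children at random.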