Query2 and differential dataflow

Along with timely-dataflow, Frank McSherry develops and maintains another library focusing on efficient data processing: differential-dataflow. This week, we’ll have a look at what it brings to the table.

Query2 in timely dataflow

Last week, we established that timely-dataflow rocks. We showed that it lets us crunch data with an order of magnitude better cost-efficiency than Redshift or Spark on EC2.

Timely is great, but it can be a bit intimidating. It’s lower-level than Spark, and takes us back a bit to the era of hand-written Hadoop map/reduce. So this week, we will take the time to translate our good old friend Query 2, step by step, into its timely-dataflow implementation.

Embrace the glow cloud

This is part #5 of a series about a BigData in Rust experiment.

We are working on a simple query nicknamed Query2 and comparing our results to the BigData benchmark.
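
For readers landing here first, this is roughly the shape of the computation. It is a minimal sketch, assuming Query2 is the aggregation query of the BigData benchmark (sum of ad revenue grouped by a prefix of the source IP). The function name `query2`, the `(source_ip, ad_revenue)` record shape and the sample rows are made up for illustration and are not the code from the series.

```rust
use std::collections::HashMap;

/// Sum ad revenue per source-IP prefix, in the spirit of Query2.
/// Assumption: each record is reduced to `(source_ip, ad_revenue)`;
/// the real benchmark table carries more columns.
fn query2<'a, I>(visits: I, prefix_len: usize) -> HashMap<String, f64>
where
    I: IntoIterator<Item = (&'a str, f64)>,
{
    let mut totals: HashMap<String, f64> = HashMap::new();
    for (source_ip, ad_revenue) in visits {
        // Keep only the first `prefix_len` characters as the grouping key.
        let prefix: String = source_ip.chars().take(prefix_len).collect();
        // Accumulate the revenue for that key, starting from zero.
        *totals.entry(prefix).or_insert(0.0) += ad_revenue;
    }
    totals
}

fn main() {
    // Hypothetical sample rows, not benchmark data.
    let visits = vec![
        ("10.0.0.1", 0.5),
        ("10.0.0.2", 1.25),
        ("192.168.1.1", 2.0),
    ];
    let totals = query2(visits, 6);
    for (prefix, revenue) in &totals {
        println!("{} -> {}", prefix, revenue);
    }
}
```

The whole query boils down to one pass over the input and one hash map update per row, which is why the hashing and parsing costs discussed in the other posts matter so much.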

Hashes to hashes

Okay. So this post was supposed to be about running on a cluster. I promise we will come to that eventually, but this week I got a bit side-tracked. Serendipity happened! We will have to dive into the characteristics of Rust's HashMap.

Let's optimize

I/O and CPU

We have seen in part #1 that my laptop processes 30GB of deflated CSV in about 11 minutes. If we want to do better, the first step is to find out what our bottleneck is. The code was presented in part #2.

For years, we have worked under the assumption that I/O was the limiting factor in data processing. With SSDs and PCIe disks, all this has changed. Believe me, or re-run the bench and look at top: it’s very obvious that we are now CPU-bound.