Frequently I’m asked what all the hype behind this Big Data nonsense is about. To some extent I think dropping the ‘Big’ would get more buy-in from developers. Streaming architectures built on technologies such as Kafka, Storm, Spark and Flink really can scale to the ‘Big’ in Big Data, but focusing on volume misses the real use case: these architectures let us alert end users of significant events in real time. Consider, for example, alerting users the moment air pollution becomes unacceptable.
What we used to do (and sometimes still do …)
Throughout the day we feed our sensor readings through an ingest pipeline into our relational database. Periodically we run an SQL query something like the following:
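The original query isn’t reproduced here, but a minimal sketch of the idea, using SQLite and an assumed `readings(sensor_id, ts, pm25)` schema with a hypothetical acceptable limit of 50, might look like:

```python
import sqlite3

# Hypothetical schema and threshold: the post doesn't show the real table
# layout, so sensor_id / ts / pm25 and the limit of 50 are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, ts INTEGER, pm25 REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("s1", 100, 40.0), ("s1", 160, 70.0), ("s2", 160, 80.0)],
)

# The periodic check: mean pollution per sensor over the last 60 seconds,
# keeping only sensors whose mean exceeds the acceptable level.
QUERY = """
SELECT sensor_id, AVG(pm25) AS avg_pm25
FROM readings
WHERE ts >= ? - 60
GROUP BY sensor_id
HAVING AVG(pm25) > 50
"""
now = 160
alerts = conn.execute(QUERY, (now,)).fetchall()
print(alerts)
```

The key point is the schedule: this query only finds a problem when it happens to run, which is exactly the weakness discussed next.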
But what’s wrong with this?
- Sure, we’ll eventually discover that air pollution rose above an acceptable level, but not until the next execution of the query.
- Relational databases typically sit on one machine, so there’s a limit to the storage and compute we can put in that box; we can’t scale to IoT numbers of sensors. Even before hitting that limit, we’ll reach a point where we need multiple ingest processes to deploy and manage.
What innovating companies are currently doing
Kafka is a message broker that scales across machines and provides resilience: lose a machine and you don’t lose any of the data currently in the pipeline. Combine it with a processing engine such as Apache Spark and you can scale to IoT proportions. Spark Streaming processes data in micro-batches, letting you perform calculations over the most recent time window and alert immediately if pollution is high. The core Spark logic looks something like the below:
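The Spark snippet itself isn’t shown here, but the per-window computation it performs (group readings into time windows, take the mean, alert when it crosses a threshold) can be sketched in plain Python. The window length, threshold, and input values are illustrative assumptions:

```python
# Plain-Python sketch of the per-window logic a Spark Streaming job would run.
# Window length (10 s) and threshold (50.0) are illustrative assumptions.
WINDOW_SECONDS = 10
THRESHOLD = 50.0

def window_alerts(readings):
    """readings: list of (arrival_time, pm25). Groups readings into
    processing-time windows and returns windows whose mean exceeds
    the threshold, keyed by window start time."""
    windows = {}
    for arrival, pm25 in readings:
        key = arrival // WINDOW_SECONDS
        windows.setdefault(key, []).append(pm25)
    return {
        key * WINDOW_SECONDS: sum(vals) / len(vals)
        for key, vals in windows.items()
        if sum(vals) / len(vals) > THRESHOLD
    }

batch = [(1, 40.0), (3, 45.0), (12, 60.0), (14, 70.0)]
print(window_alerts(batch))  # only the window starting at 10 exceeds 50
```

Note the grouping key is the *arrival* time; that choice is exactly what the next section picks apart.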
But again what’s wrong with this?
- Spark micro-batches on processing time, i.e. the time an event is observed by the machine processing it, not event time, i.e. the time the event actually originated. This has big implications for our pollution monitoring. Say a group of sensors goes offline for a few seconds, then comes back up and starts sending the data buffered during the outage. That stale data gets processed along with a later window’s data, giving us erroneous air-quality readings.
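To make the failure concrete, here is a small sketch (times and readings invented) of buffered data landing in the wrong processing-time window:

```python
# Each reading carries an event time (when it was sensed) and an arrival
# time (when the pipeline saw it). A sensor drops offline around t=10 and
# flushes its buffer at t=25. All times and values are invented.
readings = [
    {"event": 11, "arrival": 25, "pm25": 20.0},  # buffered while offline
    {"event": 12, "arrival": 25, "pm25": 22.0},  # buffered while offline
    {"event": 25, "arrival": 25, "pm25": 90.0},  # genuinely recent spike
]

def mean_by(key):
    """Mean pollution per 10-second window, keyed by the chosen timestamp."""
    windows = {}
    for r in readings:
        windows.setdefault(r[key] // 10 * 10, []).append(r["pm25"])
    return {w: sum(v) / len(v) for w, v in windows.items()}

# Processing time lumps everything into the window starting at 20,
# dragging the mean down and masking the genuine spike.
print(mean_by("arrival"))  # {20: 44.0}
# Event time keeps the stale readings in their own window.
print(mean_by("event"))    # {10: 21.0, 20: 90.0}
```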
- While Spark can scale to IoT proportions, its scaling is not independent of the window size: at high throughput you need windows of at least tens of seconds. That doesn’t seem right; sometimes we want small windows.
Take Spark Streaming, layer on logic for processing windows based on event time, remove the dependence of throughput on window size, and you’ve got Apache Flink. It has a similar relationship with Kafka, and it looks and feels like Spark. The ideas behind event-time processing came out of Google Cloud Dataflow, but with Flink you get an open-source implementation. Let’s specify that ordinarily sensor readings don’t arrive more than 2 seconds late:
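The Flink code for this isn’t shown here; what it encodes is a bounded-out-of-orderness watermark. The idea, sketched in plain Python (timestamps in seconds are an assumption), is that the watermark trails the highest event time seen by the allowed lateness:

```python
# Sketch of a bounded-out-of-orderness watermark: the watermark trails
# the maximum event time seen so far by the allowed lateness (2 seconds).
MAX_LATENESS = 2

class BoundedOutOfOrdernessWatermark:
    def __init__(self, max_lateness):
        self.max_lateness = max_lateness
        self.max_event_time = float("-inf")

    def on_event(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)

    def current_watermark(self):
        # The system assumes no event older than this will still arrive,
        # so windows ending at or before it are safe to fire.
        return self.max_event_time - self.max_lateness

wm = BoundedOutOfOrdernessWatermark(MAX_LATENESS)
for t in [5, 7, 6]:  # events may arrive slightly out of order
    wm.on_event(t)
print(wm.current_watermark())  # 7 - 2 = 5
```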
And then the Flink program is something like the following:
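The actual Flink program isn’t reproduced here; a plain-Python sketch of what it computes, under assumed values for the window size, threshold, lateness, and input records, is: event-time tumbling windows per sensor, each fired once the watermark passes the window’s end.

```python
# Plain-Python sketch of the event-time job: tumbling windows per sensor,
# fired when the watermark (max event time minus allowed lateness) passes
# the window end. Window size, threshold, and inputs are assumptions.
WINDOW = 10
THRESHOLD = 50.0
LATENESS = 2

def run(events):
    """events: list of (sensor_id, event_time, pm25) in arrival order.
    Returns (sensor_id, window_start, mean) for windows over threshold."""
    pending = {}  # (sensor, window_start) -> list of readings
    alerts = []
    watermark = float("-inf")
    for sensor, ts, pm25 in events:
        pending.setdefault((sensor, ts // WINDOW * WINDOW), []).append(pm25)
        watermark = max(watermark, ts - LATENESS)
        # Fire every window whose end the watermark has now passed.
        for s, start in [k for k in pending if k[1] + WINDOW <= watermark]:
            vals = pending.pop((s, start))
            mean = sum(vals) / len(vals)
            if mean > THRESHOLD:
                alerts.append((s, start, mean))
    return alerts

events = [("s1", 3, 60.0), ("s1", 8, 70.0), ("s1", 9, 80.0), ("s1", 13, 40.0)]
print(run(events))  # the first window fires once the watermark reaches 11
```

Readings up to 2 seconds late still land in the right window, because the window only fires once the watermark guarantees they have arrived.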
Note that the code to calculate the mean is a bit more verbose than in Spark. And just as Flink is advancing its Table API and SQL support, Spark has caught on to Flink’s advances and now enables event-time processing in its new Structured Streaming API.
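The verbosity comes from Flink’s incremental aggregation style: rather than calling a ready-made mean, you typically fold a (sum, count) accumulator element by element and divide once at the end. A plain-Python sketch of that pattern (function names mirror the shape of the accumulator API, not its exact signatures):

```python
# Sketch of the (sum, count) accumulator pattern behind incremental
# mean aggregation: add each element as it arrives, divide once at the end.
def create_accumulator():
    return (0.0, 0)

def add(acc, value):
    return (acc[0] + value, acc[1] + 1)

def merge(a, b):
    # Merging two partial accumulators, needed when windows are combined.
    return (a[0] + b[0], a[1] + b[1])

def get_result(acc):
    total, count = acc
    return total / count

acc = create_accumulator()
for v in [40.0, 60.0, 80.0]:
    acc = add(acc, v)
print(get_result(acc))  # 60.0
```

The upside of the verbosity is constant memory per window: the job keeps two numbers per sensor instead of every reading.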
If your use case demands answers as things are happening, then streaming could be a great fit, regardless of how ‘Big’ your data is. Kafka, Spark and Flink are all great tools for the job. The full example code can be found on GitHub.