In the past I’ve worked on many systems which process data in batches. Typically the data would have been loaded in real time into relational databases optimised for writes, and then at periodic intervals (or overnight) the data would be extracted, transformed and loaded into a data warehouse optimised for reads. Usually these transformations would involve denormalisation and/or aggregation of the data to improve the read performance for analytics after loading.
In many cases this approach still holds strong today, particularly if you are working with bounded data, i.e. the data is known, fixed and unchanging. Usually it involves looking at what happened historically, processing a batch only after the data for that period has been collected in its entirety, with little or no late data expected. It also relies on you having the time to process batches: if your batch runs overnight but takes more than 24 hours to process all the data, then you’re constantly falling behind! Batching can also be incredibly slow at surfacing insights from your data, as it’s generally only processed and made available long after the data was originally collected. A typical use case for batching would be a monthly or quarterly sales report, for example.
However, in today’s world much of our data is unbounded: it’s infinite, unpredictable and unordered. There is also an ever-increasing demand to gain insights from data much more quickly. For example, think of all the telemetry logs being generated by your infrastructure right now; you probably want to detect potential problems and worrying trends as they are developing, and react proactively, not after the fact when something has failed.
The major downside to a streaming architecture is that the computation part of your pipeline may only have seen a subset of all the data points for a given period. It doesn't have a complete picture of the data, so depending on your use case the results may not be completely accurate. It's essentially providing higher availability of data at the expense of completeness/correctness. For example, it may show a total number of activities up until ten minutes ago, but it may not have seen all of that data yet.
Much unbounded data can be thought of as an immutable, append-only log of events, and this gave birth to the lambda architecture, which attempts to combine the best of both the batch and streaming worlds: a "fast" streaming layer which processes data with near real-time availability, and a "slow" batch layer which sees all the data and corrects any discrepancies in the streaming computations. The problem now is that we've got two pipelines to code, maintain and keep in sync. Critics argue that the lambda architecture was created because of limitations in existing technologies: that it's a hybrid approach to making two or more technologies work together. Hence a simplification evolved in the form of the kappa architecture, where we dispense with the batch processing system completely. In Part 1 we described such an architecture. The kappa architecture has a canonical data store for the append-only, immutable logs; in our case user behavioural events were stored in Google Cloud Storage or Amazon S3. These logs are fed through a streaming computation system which populates a serving layer store (e.g. BigQuery). For our purposes we considered a number of streaming computation systems, including Kinesis, Flink and Spark, but Apache Beam was our overall winner!
Apache Beam originated at Google as part of its Dataflow work on distributed processing. We won’t cover the history here, but technically Apache Beam is an abstraction: a unified programming model for developing both batch and streaming pipelines. That alone gives us several advantages. Firstly, we don’t have to write two data processing pipelines, one for batch and one for streaming, as we would in a lambda architecture. We can reuse the same logic for both and simply change how it is applied. Apache Beam essentially treats batch as a stream, as in a kappa architecture.
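The "batch as a stream" idea can be sketched in plain Python (this is not Beam itself; the event shape and transform are invented for illustration). The point is that one lazily-evaluated transform runs unchanged over a bounded list or an unbounded iterator:

```python
from typing import Iterable, Iterator


def keep_logins(events: Iterable[dict]) -> Iterator[dict]:
    """The transform: pass through only login events, tagging each one.

    Because it consumes any iterable lazily, the same code handles a
    bounded batch (a finite list) or an unbounded stream (a generator).
    """
    for event in events:
        if event["type"] == "login":
            yield {**event, "processed": True}


# Bounded "batch" input: a finite collection we already hold in full.
batch = [{"type": "login", "user": "a"}, {"type": "click", "user": "a"}]
print(list(keep_logins(batch)))  # only the login event survives

# Unbounded "stream" input: a generator that could, in principle, never end.
def stream():
    yield {"type": "login", "user": "b"}
    yield {"type": "login", "user": "c"}
    # ...in a real system this would block here waiting for more events

print([e["user"] for e in keep_logins(stream())])
```

Beam's PTransforms generalise this same idea, with the runner deciding how the transform is scheduled and distributed.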
Secondly, because it’s a unified abstraction we’re not tied to a specific streaming technology to run our data pipelines. Apache Beam supports multiple runners, including Google Cloud Dataflow, Apache Flink and Apache Spark (see the Capability Matrix for a full list). We can take advantage of the common features of streaming technologies without having to learn the nuances of any particular one. Over time, as new and existing streaming technologies develop, we should see their support within Apache Beam grow too, and hopefully we’ll be able to take advantage of new features through our existing Apache Beam code rather than an expensive switch to a new technology, with the rewrites, retraining and so on that entails. Hopefully over time the Apache Beam model will become the standard and other technologies will converge on it, something which is already happening with the Flink project.
Apache Beam is open source and has SDKs available in Java, Python and Go. Everything we like at Bud! It also natively supports a number of IO connectors for various data sources and sinks, including GCP (PubSub, Datastore, BigQuery etc.), AWS (SQS, SNS, S3), HBase, Cassandra, Elasticsearch, Kafka and MongoDB. The SDKs are continually expanding and the options increasing, and where there isn't a native implementation of a connector it's very easy to write your own.
To give one example of how we used this flexibility, initially our data pipelines (described in Part 1) existed solely in Google Cloud Platform. We used the native Dataflow runner to run our Apache Beam pipeline. When we deployed on AWS we simply switched the runner from Dataflow to Flink. This was so easy we actually retrofitted it back on GCP for consistency. There’s also a local (DirectRunner) implementation for development.
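In practice, switching runners largely comes down to the options passed when the pipeline is launched. A hedged sketch for a Python SDK pipeline (the script name, project, bucket and Flink master address are all placeholders, and the exact flags vary by runner and SDK version):

```shell
# Local development:
python pipeline.py --runner=DirectRunner

# Google Cloud Dataflow (project, region and bucket are placeholders):
python pipeline.py --runner=DataflowRunner \
    --project=my-gcp-project --region=europe-west2 \
    --temp_location=gs://my-bucket/tmp

# Apache Flink (the JobManager address is a placeholder):
python pipeline.py --runner=FlinkRunner --flink_master=localhost:8081
```

The pipeline code itself stays the same; only the launch configuration changes.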
Apache Beam differentiates between event time and processing time, and tracks the difference between them as a watermark. This allowed us to apply windowing and detect late data whilst processing our user behaviour data. For example, take the case where a user goes offline to catch an underground train but continues to use your mobile application. When they resurface much later, you may suddenly receive all of those logged events at once. The processing time is now well ahead of the event time, but Apache Beam allows us to deal with this late data in the stream and make corrections if necessary, much like the batch layer would in a lambda architecture. In our case we even used the built-in session windowing to detect periods of user activity and release these for persistence to our serving layer store, so updates would be available for analysis for a whole "session" once we detected that the session had completed, i.e. after a period of user inactivity.
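The intuition behind session windowing can be shown in a few lines of plain Python (this is not Beam's implementation; the gap duration and timestamps are invented for illustration, and real session windows also involve watermarks and late-data handling). Events for a key are grouped together until the gap of inactivity exceeds a timeout:

```python
def assign_sessions(timestamps, gap):
    """Group a list of event timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous event
    exceeds `gap` seconds -- the same intuition as Beam's session
    windows, minus watermarks and late-data corrections.
    """
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)  # inactivity: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions


# Activity at t=0,10,20, then silence, then a burst at t=100,105.
# With a 30-second gap this yields two sessions.
print(assign_sessions([0, 10, 20, 100, 105], gap=30))
# -> [[0, 10, 20], [100, 105]]
```

In the streaming case a session can only be "released" once the watermark has passed the end of its gap, which is exactly the point at which we persisted updates to the serving layer.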
Looking at the downsides, Apache Beam is still a relatively young technology (Google first donated the project to Apache in 2016) and the SDKs are still under development. For example, we discovered that some of the windowing behaviour we required didn’t work as expected in the Python implementation, so we switched to Java to get the parameters we needed. It's also currently lacking a large community or mainstream adoption, so it can be difficult to find help when the standard documentation or API aren't clear. As mentioned above, I often found myself reading the more mature Java API when I found the Python documentation lacking, and I ended up emailing the official Beam groups on a couple of occasions. You won't find any answers on StackOverflow just yet!
It can also be difficult to debug your pipelines or figure out issues in production, particularly when they are processing large amounts of data very quickly. In these cases I recommend using the TestPipeline class and writing as many test cases as possible to prove out your data pipelines and make sure they handle all the scenarios you expect.
Overall though, these are minor downsides which will improve over time, so investing in Apache Beam still looks a strong decision for the future.
I hope you enjoy these blogs. We have many more interesting data engineering projects here at Bud and we're currently hiring developers, so please take a look at the current open roles on our careers site.
We put out a newsletter roughly once a month with highlights from the blog and updates on new roles. Sign up if that's your thing.
Bud® is authorised and regulated by the Financial Conduct Authority under registration number 765768 + 793327.