The talk is about how Apache Pulsar can have topic backlogs of unlimited size, opening up a whole array of Big Data use-cases that are not possible with other messaging systems. We also delve into tiered storage, which can make these massive backlogs very cheap.
Messaging systems are an essential part of any real-time analytics engine. A common pattern is to feed a user event stream into a processing engine, show the result to the user, capture feedback from the user, push the feedback back into the event stream, and so on. The quality of the result shown to the user is often a function of the amount of data in the event stream, so the more your event stream scales, the better you can serve your users.
Messaging systems have recently started to push into the field of long-term data storage and event stores, where you cannot compromise on retention. If data is written to the system, it must stay there.
Infinite retention can be challenging for a messaging system. As data grows for a single topic, you need to start storing different parts of the backlog on different sets of machines without losing consistency.
In this talk, I will describe how Pulsar uses Apache BookKeeper in its segment oriented architecture. BookKeeper provides a unit of consensus called a ledger. Pulsar strings together a number of BookKeeper ledgers to build the complete topic backlog. Each ledger in the topic backlog is independent of all previous ledgers with regards to location. This allows us to scale the size of the topic backlog simply by adding more machines. When the storage node is added to a Pulsar cluster, the brokers will detect it, and gradually start writing new data to the new node. There’s no disruptive rebalancing operation necessary.
Of course, adding more machines will eventually get very expensive. This is where tiered storage comes in. With tiered storage, parts of the topic backlog can be moved to cheaper storage such as Amazon S3 or Google Cloud Storage. I will also discuss the architecture of tiered storage, and how it is a natural continuation of Pulsar’s segment oriented architecture.
Finally, if you start storing data for a long time in Pulsar, you may want a means to query it. I will introduce our SQL implementation, based on the Presto query engine, which allows users to easily query topic backlog data, without having to read the whole thing.