You have probably heard that the world is moving from batch to real-time. While traditional “business intelligence” has come a long way in the past 20 years, the world of real-time analytics is still in its early days. Traditional BI had its Renaissance moment with the advent of Big Data technologies such as Hadoop, and cloud data lakes and warehouses have since brought everyone into the Modern era.
But these traditional BI tools are built to assist strategic decision making at the executive level. When product teams, marketing teams and other business operations teams want to make data-driven decisions in the moment, these tools fall short, and there is a growing need for a more modern set of tools that can power the world of “operational intelligence”. The need of the hour is to empower business operations teams with real-time answers and systems that support tactical decision making, so that they can do their jobs better. This is what real-time analytics is all about. If batch analytics helped your exec team strategize better, real-time analytics will enable every team in your company to make better decisions.
I saw this happen firsthand at Facebook from 2007 to 2015. When I discuss this topic with friends, most ask me how Facebook’s product managers and growth teams made data-driven decisions on a daily basis to launch successful products and accelerate Facebook’s growth. Many factors contributed to this, and in this post I will discuss in more depth one real-time analytics tool that exemplifies the point: Deltoid, Facebook’s A/B experiments platform. It is a great example of a tool that made all Facebook product managers data-driven on a daily basis.
Deltoid powered by Scuba & Laser
Deltoid was Itamar Rosenn’s brainchild. Itamar is one of the most prolific data scientists that I have ever had the pleasure of working with, and I am sure that whatever he is working on now, the world will be looking for it 4-5 years from now. If you are interested in learning more about Deltoid and have 20 minutes to spare, I strongly encourage you to listen to this excellent tech talk by Itamar from back in 2014. It is the best public presentation about Deltoid that I could find.
Itamar’s talk describes the goals of a powerful A/B experiments framework, the backend data management challenges associated with it and what an ideal solution would look like. The talk is also possibly the best argument I can put forth on why powerful next-gen real-time apps, such as A/B experiments systems, should be built in the cloud and not on traditional data management tools and open-source technologies such as Apache Druid or Elasticsearch.
Deltoid was built on top of data management systems called Scuba and Laser that I helped build and scale at Facebook. If you ever come across an ex-Facebook product manager or developer and ask them what tool they miss the most from Facebook, you will invariably get either Deltoid or Scuba as the answer. It should be no surprise to anyone that Rockset is heavily inspired by both Scuba and Laser, amongst other things that Rockset’s founding team had previously worked on.
An A/B experiments platform is a perfect example of a real-time analytics tool, and we will look a bit closer at the system’s requirements to understand why traditional big data management tools don’t cut it.
Requirements for an ideal A/B experiments platform
- Speed with scalable real-time ingest: This will help product teams make decisions in days instead of weeks. This is really important, since the faster the results arrive, the more experiments they will run. This will have a direct and immediate impact on how quickly your product and growth teams move to reach their goals. Itamar talks about the big impact of increased iteration speed at length in his talk.
- Multi-dimensional data from multiple sources: Almost every part of A/B testing analysis involves combining the real-time event stream with one or more dimension tables, such as users, products, devices or experiments data, which often come from different data sources. Each of those data sources is itself constantly evolving, so any A/B experiments platform needs to bring in data from multiple different sources in real-time.
- Sub-second queries with interactive slicing & dicing: Product teams are not just making pass/fail judgments on their A/B experiments. They need to drill down and interrogate the data interactively to build new hypotheses, come up with better ideas and design follow-up experiments.
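To make these requirements concrete, here is a minimal sketch of the kind of query an A/B experiments platform has to answer: join the event stream with dimension data at query time, then slice a metric by experiment variant and one more dimension. All table contents and the function name `click_through_rate` are hypothetical, made up for illustration; this is not how Deltoid itself is implemented.

```python
from collections import defaultdict

# Hypothetical dimension tables keyed by user id (illustrative data only).
users = {
    1: {"country": "US"},
    2: {"country": "IN"},
    3: {"country": "US"},
    4: {"country": "IN"},
}
experiments = {1: "control", 2: "control", 3: "treatment", 4: "treatment"}

# Hypothetical event stream: one row per user action, carrying only the user id.
events = [
    {"user_id": 1, "action": "view"},
    {"user_id": 1, "action": "click"},
    {"user_id": 1, "action": "view"},
    {"user_id": 2, "action": "view"},
    {"user_id": 3, "action": "view"},
    {"user_id": 3, "action": "click"},
    {"user_id": 4, "action": "view"},
    {"user_id": 4, "action": "view"},
    {"user_id": 4, "action": "click"},
]


def click_through_rate(events, users, experiments, slice_by):
    """Join each event with its dimensions, then group by (variant, slice value)."""
    views = defaultdict(int)
    clicks = defaultdict(int)
    for event in events:
        uid = event["user_id"]
        key = (experiments[uid], users[uid][slice_by])
        if event["action"] == "view":
            views[key] += 1
        elif event["action"] == "click":
            clicks[key] += 1
    # Clicks divided by views for every group that has at least one view.
    return {key: clicks[key] / views[key] for key in views}


ctr_by_country = click_through_rate(events, users, experiments, "country")
```

Changing `slice_by` to another attribute is the "slicing & dicing" step. The hard part in production is answering that question in under a second while the events and the dimension data are both still changing.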
First attempt using streaming JOINs failed
Facebook’s first attempt was quite traditional. The idea was to heavily denormalize the input event stream using streaming JOINs and then just load it into an in-memory analytics system called Scuba.
This architecture did not work. As Itamar said in the talk, “The reason this architecture doesn’t work is due to data explosion.” Denormalizing means copying all the details of the 3 dimension tables (users, devices and experiments) into every row of the real-time event stream, which is the fact table. The resulting data explosion is so massive that even Facebook could not afford it.
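A quick back-of-the-envelope calculation shows why denormalization blows up. The sketch below compares storing a full copy of every dimension row on every event against storing each dimension row once. Every number in it (event counts, row sizes, table sizes) is a made-up assumption chosen for illustration, not a figure from Facebook or from the talk.

```python
def denormalized_bytes(num_events, event_bytes, dimension_bytes):
    """Storage when every event row carries a full copy of each dimension row."""
    return num_events * (event_bytes + sum(dimension_bytes.values()))


def normalized_bytes(num_events, event_bytes, dimension_bytes, dimension_rows):
    """Storage when events keep only join keys and each dimension row is stored once."""
    dims = sum(dimension_bytes[name] * dimension_rows[name] for name in dimension_bytes)
    return num_events * event_bytes + dims


# Hypothetical workload: one billion events of 100 bytes each (join keys included).
num_events = 1_000_000_000
event_bytes = 100

# Hypothetical width (bytes per row) and size (row count) of each dimension table.
dimension_bytes = {"users": 2_000, "devices": 500, "experiments": 300}
dimension_rows = {"users": 10_000_000, "devices": 20_000_000, "experiments": 10_000}

denorm = denormalized_bytes(num_events, event_bytes, dimension_bytes)
norm = normalized_bytes(num_events, event_bytes, dimension_bytes, dimension_rows)
blowup = denorm / norm  # how many times larger the denormalized stream is
```

Under these assumed numbers the denormalized stream is more than 20 times larger than the normalized layout, and the gap widens as the dimension rows get wider or the event volume grows.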
Real-time analytics needs full SQL support
Facebook solved the issue by pre-sharding all the data sets on the JOIN key, which is the “user id” in this case. While that made the problem tractable, it wasn’t flexible enough for all of their needs. Itamar’s talk ends with a dream real-time analytics stack that has the following:
- Full-featured SQL
- Built-in long-term retention
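For readers unfamiliar with the pre-sharding approach mentioned above, here is a minimal sketch of the idea: route both the events and the user rows to shards by user id, so that each shard can do its JOIN locally without copying dimension data into every event. The shard count, the modulo routing and all function names are assumptions for illustration, not details of Facebook’s implementation.

```python
NUM_SHARDS = 4  # hypothetical shard count


def shard_for(user_id):
    """Route a row to a shard by its join key (simple modulo, for illustration)."""
    return user_id % NUM_SHARDS


def build_shards(events, users):
    """Place each event and each user row on the shard chosen by its user id."""
    shards = [{"events": [], "users": {}} for _ in range(NUM_SHARDS)]
    for user_id, attrs in users.items():
        shards[shard_for(user_id)]["users"][user_id] = attrs
    for event in events:
        shards[shard_for(event["user_id"])]["events"].append(event)
    return shards


def local_join(shard):
    """Join events with user rows using only data that lives on this shard."""
    return [{**event, **shard["users"][event["user_id"]]} for event in shard["events"]]


# Hypothetical sample data.
sample_users = {uid: {"country": "US" if uid % 2 else "IN"} for uid in range(1, 9)}
sample_events = [{"user_id": uid, "action": "view"} for uid in range(1, 9)]

shards = build_shards(sample_events, sample_users)
joined = [row for shard in shards for row in local_join(shard)]
```

The limitation is also visible here: the JOIN is only local because both sides are partitioned on user id. A query that joins on a different key would need the data to be re-sharded first, which is one way to read the inflexibility described above.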
With the advent of real-time analytics solutions like Rockset, six years after the talk was originally presented, this is no longer just a dream. Anyone can build a world-class A/B experiments platform, or a similar class of real-time apps, on Rockset with built-in real-time ingest and full-featured SQL at massive scale in the cloud.
If you are interested in hearing more about Rockset or have a question, I’d love to hear from you. You can also join us for our upcoming tech talk to learn more about what it takes to build a real-time A/B experiments platform at massive scale.