Venrock

The Biggest Challenges in Serving 50 Million Daily Active Users with Anupam Singh, VP Engineering at Roblox

Venrock partner Ethan Batraski speaks with Anupam Singh, VP of engineering, growth & ML at Roblox, about the biggest challenges in serving Roblox’s 50 million daily active users. Singh also discusses the strategy behind configuring and scaling a bare metal infrastructure and the biggest challenges in adopting and deploying a commercial solution internally.

[00:01:25] Anupam’s start in the industry
[00:03:40] The biggest challenges in serving 50 million daily active users
[00:07:10] Dealing with unpredictable loads
[00:10:02] The strategy behind configuring and scaling of bare metal infrastructure
[00:15:30] The biggest challenges in adopting and deploying a commercial solution internally
[00:17:59] Rapid fire questions

Full Transcript

[00:00:46] Ethan Batraski: In this episode, we’ll be talking to Anupam Singh, VP of engineering at Roblox, about his experience building to support over a billion hours a week of activity. We’ll take a peek under the hood of Roblox’s analytics infrastructure, go into detail on why they decided to build on bare metal, and more. Stay tuned for a fascinating conversation. Before we jump in, I’d love to give you an opportunity to introduce yourself to the listeners.

[00:01:16] Anupam Singh: I’m Anupam Singh. I’m VP of engineering at Roblox, responsible for growth, machine learning, and analytics.

[00:01:25] Ethan Batraski: Before we jump in, we’ve got a lot of fun topics that we can cover. Why don’t we do a little icebreaker? Listeners would love to know, what was your first job? What started Anupam in the industry?

[00:01:36] Anupam Singh: That brings back memories. I was the engineer responsible for the SQL grammar at Informix, a database company. That was my first job.

[00:01:48] Ethan Batraski: And what did that entail? What did databases look like then, and what was your day-to-day?

[00:01:54] Anupam Singh: So interestingly, some of it has not changed. So, there was this, you know, giant SQL grammar file, which tried to make sure that the syntax matched SQL. So, it was the entry point of the SQL compiler, which meant anybody who wanted to change anything in the SQL language for that company, for that database, would have to do a code review with me, because I owned the grammar. So that was a fun challenge for a new engineer who has to interact with every group in the company that needs to change the syntax of SQL. And at that time, databases were so new that you couldn’t even agree on what creating a table looks like. So that was a fun time in databases.

[00:02:42] Ethan Batraski: How often do you add syntax versus remove?

[00:02:44] Anupam Singh: At that time, it was all additive, and then fighting with the ANSI SQL committee that this syntax should be standardized. So every vendor was trying to do proprietary SQL extensions, then trying to push them back into the Standards Committee so that they could have differentiators in the market.

[00:03:03] Ethan Batraski: Love it. And how much of that syntax still exists today?

[00:03:07] Anupam Singh: Almost all of it, which is scary, because SQL was supposed to be dead multiple times, right? We’ve called it NoSQL, we’ve called it Big Data, and now machine learning, and yet, from what I remember, there are still 21 million SQL programmers on the planet. And I’m sure we’re undercounting.

[00:03:31] Ethan Batraski: So right. Now there’s NewSQL, there’s SQLite and … it’s the language that will never die, and it keeps going through a resurgence.

[00:03:39] Anupam Singh: Every few years.

[00:03:40] Ethan Batraski: [laughs]. Switching topics, I’d love to chat a little bit more about Roblox. Today, Roblox has 50 million or more daily active users. As you think about the scale and scope that Roblox serves and offers, what are the biggest challenges and differences that you think about when you have to build and serve at this scale?

[00:04:02] Anupam Singh: So going further on those numbers, an interesting one that we publicly talked about is a billion hours of activity on our platform every week, and that’s happening right now. So imagine capturing all that activity, which is a 3D immersive world. So imagine this call was going on inside of our platform. Our avatars would be talking to each other, we would be shaking hands, maybe even high-fiving each other. Capturing all of that activity and finding value, economic value, civility signals, growth and personalization signals, is the job of the Roblox data team. We create a platform so that our various partners, including our external developers, can access the data in a privacy protected manner, but also a highly performing manner.

[00:05:08] Ethan Batraski: A billion hours a week.

[00:05:10] Anupam Singh: A billion hours.

[00:05:10] Ethan Batraski: Incredible figure.

[00:05:13] Anupam Singh: And every human, every five to 10 seconds, actually changes something in their interactions. So you might be waving to me, you might be walking over to me, you might be excited. We had an Elton John concert, and of course, people were jumping around. And the job of the analytics platform is to capture almost that emotion at a second-to-second level, and then publish it out to our developers so that they can use it to build better experiences.

[00:05:43] Ethan Batraski: So give us a sense of what’s happening under the hood to capture events at that level and, I assume, in some form of real time.

[00:05:53] Anupam Singh: Yeah. So under the hood, there are real-time use cases and batch use cases. It is essentially a giant data lake, a very big data lake, let’s just say, without going into the numbers.

[00:06:06] Anupam Singh: And on top of that data lake, there are Spark, Hive, and Presto clusters, there are elastic clusters that are used to transform the data, enrich the data, and then publish it as dashboards or analytical endpoints. Now there’s this new thing that’s been happening over the last one or two years where many of these pipelines are feeding machine learning models. So if you ask me today where we’re seeing growth in our usage, it is that almost every software engineer has some machine learning use cases. So they use the data pipeline as an endpoint to start their machine-learning journey. And that itself is sometimes real-time, and sometimes it’s a training process that looks at five years’ worth of data. So the machine learning use cases are where all the action is right now.

[00:07:10] Ethan Batraski: And at this scale, as you deploy Spark, Hive, and Presto in this distributed fashion, do these technologies naturally scale, or is there a lot of infrastructure you have to build around them in order to scale and adapt to what seem to be unpredictable loads, as new games are played and increased numbers of users come in and out?

[00:07:35] Anupam Singh: Yeah. So we have many places where thundering herds are very important because they are part of our experiences. When the Elton John concert starts, a million people might want to join in at the same time. So it’s not even a game where you need to find somebody to play with. It’s literally a music concert that a million people are simultaneously attending. That puts a lot of back pressure on both the real-time and the batch pipelines. Now, here’s some good news for data professionals: in the end, it is still a badly written join.

[00:08:13] Anupam Singh: That is consuming too much data without a filter. So those parts are still the same. What has changed for the data professional is that the underlying hardware is almost too elastic. The underlying hardware keeps going up and down, and autoscaling. And we were all told three or five years ago that autoscaling was going to take away the profession of the database administrator who had to tune queries. We wouldn’t need that anymore because all these black box solutions would magically configure clusters for us. They would maximize CPU. When needed, they would create spot instances and run on them first.

[00:09:01] Anupam Singh: Our experience is that these black box solutions just hide inefficiencies, which then come back as very large cloud expenditure bills. So a bad join can actually cost you not just performance, but a massive amount of budget. So that’s what’s changed for data professionals: mistakes in performance tuning can actually kill your budget.

[00:09:30] Ethan Batraski: I think you just said what every data professional has wanted to describe, but didn’t know exactly how to describe.

[00:09:37] Anupam Singh: Yep. Runaway joins and badly written aggregations are now going to cost you hundreds of thousands of dollars. Earlier, they would just cost you customer satisfaction.

[00:09:50] Anupam Singh: So to me, the need for performance tuning, for configuring your data warehouse properly, has actually gone up, not down.
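
As an illustration of the "badly written join consuming too much data without a filter" point, here is a minimal Python sketch. The `events` and `users` tables, their columns, and the helper names are all hypothetical and are not Roblox's schema or tooling; the sketch only shows why applying the filter before the join shrinks the work while returning the same rows.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Inner-join two lists of dicts on `key` using an in-memory hash index."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Hypothetical data: 10,000 events spread over 100 users and 30 days.
events = [{"event_id": i, "user_id": i % 100, "day": i % 30} for i in range(10_000)]
users = [{"user_id": u, "tier": "free" if u % 2 else "paid"} for u in range(100)]

def join_then_filter(day):
    # Badly written: join everything, then throw most of it away.
    joined = hash_join(events, users, "user_id")
    return len(joined), [row for row in joined if row["day"] == day]

def filter_then_join(day):
    # Better: restrict the large side first, then join only what is needed.
    wanted = [e for e in events if e["day"] == day]
    joined = hash_join(wanted, users, "user_id")
    return len(joined), joined

slow_rows, slow_result = join_then_filter(day=0)
fast_rows, fast_result = filter_then_join(day=0)
print(slow_rows, fast_rows, slow_result == fast_result)
```

Both versions return identical rows, but the first materializes every joined row before filtering. On a metered cluster, that difference in rows processed is the kind of waste that shows up on the bill rather than only in latency.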

[00:10:02] Ethan Batraski: So true. So Roblox is known for solving this problem by taking more of a bare metal approach to its infrastructure. Help make it clear what prompted that move. What’s the strategy behind configuring and scaling bare metal infrastructure?

[00:10:19] Anupam Singh: So firstly, our needs are very unique. When you press play and get into our world, you want to co-experience with your friends and family and even new people that you meet online. And there’s an image of satisfaction that you want when you press play. If there is lag, if there’s buffering, you’re not going to enjoy that music concert. You’re not going to enjoy kicking around a soccer ball with me if, imagine, the soccer ball just hangs in the air. It’s not entertaining for you.

[00:10:58] Anupam Singh: So think about that part, and now think about doing it at even the scale that we have publicly published. So all of these people are co-experiencing the internet, co-experiencing actually the real world. And so our hardware profile, our networking profile, is unique. And for us to do that on public cloud would be prohibitive for our developers. We would not be able to support the million developers that create experiences on our platform. And second, our users and players would find the experience laggy, which, again, is not going to endear our product to millions of people. So that’s why we’ve had to create data centers around the world that are specific to our player satisfaction.

[00:11:58] Ethan Batraski: I love that focus on player satisfaction and responsiveness. I think that makes so much sense. What are some of the challenges that you had to overcome in having to replicate what the clouds tend to offer around performance tuning and elasticity? Were there certain unique approaches that Roblox took in making this kind of infrastructure work for the responsiveness required?

[00:12:25] Anupam Singh: Yeah. So some of that is just making sure that the dependency graph of microservices is managed well. So we have to call 60-odd microservices. And we have to make sure that if one or two of them are having a blip, the entire homepage doesn’t go away. So that’s how core reliability works. At our scale, it’s very interesting. So that’s part one. Part two of it is pushing almost every technology further than it has gone before. So our data pipeline orchestration system is something that you’re familiar with, Airflow. We push the boundary to the point that we might be one of the biggest installations on the planet for Airflow. So that’s been the second challenge for us. We use a lot of open source, but we push it to its limits. And then we make it work for our scale, and then contribute it back to the community. So that’s our philosophy with using open source on top of bare metal.
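
The idea that a blip in one or two microservices should not take down the whole homepage can be sketched as follows. This is a minimal illustration using Python's `asyncio`, not a description of Roblox's implementation; the service names (`friends`, `recommendations`, `avatar`), the timeout values, and the fallback values are all hypothetical.

```python
import asyncio

async def call_with_fallback(name, service, timeout_s, fallback):
    """Call one dependency; on timeout or error, return a fallback instead of failing."""
    try:
        return name, await asyncio.wait_for(service(), timeout=timeout_s)
    except Exception:
        # A slow or failing dependency degrades one section of the page only.
        return name, fallback

# Hypothetical dependencies standing in for real microservice calls.
async def friends():
    await asyncio.sleep(0.01)
    return ["alice", "bob"]

async def recommendations():
    await asyncio.sleep(1.0)  # simulated blip: slower than the timeout allows
    return ["experience-a"]

async def avatar():
    raise RuntimeError("simulated blip: service error")

async def build_homepage():
    # Fan out to all dependencies concurrently; each one is isolated by its own fallback.
    results = await asyncio.gather(
        call_with_fallback("friends", friends, 0.1, []),
        call_with_fallback("recommendations", recommendations, 0.1, []),
        call_with_fallback("avatar", avatar, 0.1, None),
    )
    return dict(results)

page = asyncio.run(build_homepage())
print(page)
```

Here the page still renders the friends list while the two misbehaving sections fall back to empty defaults. With 60-odd dependencies, the same pattern would need per-service choices about which sections are optional and which are truly required.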

[00:13:39] Ethan Batraski: I love that the community gets to benefit from a scale that it probably won’t have to face until two, three, four, five years from now. But you’re effectively paving the path and the trail for what they can come to expect when they need that scale.

[00:13:51] Anupam Singh: Yeah. And then, for cloud services, I will play the contrarian here and say that cloud services almost suffer from too much elasticity. So to give an example, we had a very large, let’s say, 400-node cluster for Spark, and it was expanding and contracting every 15 minutes. Whenever it saw events coming in, it would expand, and then it would contract. But it was almost chasing its own tail, because more time was being spent expanding the cluster, making sure there were spot instances available to expand it, and then contracting it. And in the end, we realized that a 30-node fixed cluster instead of a 400-node elastic cluster would achieve the same throughput for our customers.

[00:14:45] Ethan Batraski: Wow.

[00:14:46] Anupam Singh: So elasticity is very promising in marketing materials. But in production, sometimes elasticity is actually more hurtful than beneficial.
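
A rough back-of-the-envelope model of the "chasing its own tail" effect, using the 400-node, 30-node, and 15-minute figures from the conversation. The model itself is an assumption made for illustration: it treats throughput as proportional to node count times the share of each cycle not spent resizing, and it assumes the elastic cluster sits at its full size. The real cluster's behavior was certainly more complicated, and the function names are hypothetical.

```python
def useful_fraction(cycle_min, resize_min):
    """Share of each scaling cycle left for real work after resize overhead."""
    return max(cycle_min - resize_min, 0) / cycle_min

def effective_nodes(nodes, cycle_min, resize_min):
    """Node count that would do the same work with zero resize overhead."""
    return nodes * useful_fraction(cycle_min, resize_min)

def break_even_resize_min(elastic_nodes, fixed_nodes, cycle_min):
    """Resize overhead per cycle at which the elastic cluster only matches the fixed one."""
    return cycle_min * (1 - fixed_nodes / elastic_nodes)

# Figures from the conversation; the linear model is the assumption.
overhead = break_even_resize_min(elastic_nodes=400, fixed_nodes=30, cycle_min=15)
print(round(overhead, 3))
print(effective_nodes(nodes=400, cycle_min=15, resize_min=overhead))
```

Under this simplified model, the elastic cluster only matches the fixed one when nearly all of each cycle goes to resizing. In practice the elastic cluster was presumably not at 400 nodes the whole time, so the takeaway is the direction of the effect, not the exact number.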

[00:14:56] Ethan Batraski: Yeah. I think there’s a false understanding around the benefits of elasticity. And frankly, it feels like most teams just set an arbitrary number for what they believe they need, you know, “I need 10 m5.larges,” without realizing how memory-intensive their job is, or how interconnected the matrices are, and so on.

[00:15:20] Anupam Singh: And as the services have become more obfuscated, you don’t even know what to measure, to say whether your query was efficient or not. So that’s something that we live with every day.

[00:15:30] Ethan Batraski: When you think about the potential in working with commercial solutions, what are the biggest challenges in adopting and deploying a commercial solution internally, given your scale?

[00:15:41] Anupam Singh: Yeah. So I’ve been a vendor most of my career, and one thing that I advise my vendor partners is: if I’m your largest customer, accept it and be very candid about it. Don’t let me discover it during an upgrade.

[00:16:01] Anupam Singh: So that’s the first question that I have for the, let’s say, give or take, 15 vendors that we work with in the data space.

[00:16:13] Anupam Singh: Some of them are very clear and accepting that we are their biggest installation, and therefore, they work very closely with us. Before they come to us with a feature, they actually come to us during their planning stages so that we can inject our scaling requirements into their planning. So that would be one-third of the vendors. Another one-third of the vendors help us whenever we have to upgrade or whenever we have to tune. So they work closely with us. We don’t inject our requirements into their road map. But when we are moving to a new release or a new feature, their engineering teams work directly with us.

[00:16:55] Anupam Singh: Sadly, I would say there’s the last one-third who over-promise. And then when their 600-node cluster goes down, they’re not able to help us. And we might have a three- or four-hour outage because they are scrambling to go from level one support to level two support to level three, while we are almost immediately starting to think: are we the biggest on this open source platform? And should we be reconsidering our vendor suite?

[00:17:31] Ethan Batraski: That’s very sage advice: engaging Roblox early and deeply on a road map, providing very clear hands-on support, and under-promising and really being ready to scale. And it pushes on most early-stage companies when they wonder why a large company won’t adopt their platform. It’s the reason you highlighted. The risk is not worth the reward.

[00:17:58] Anupam Singh: Yeah.

[00:17:59] Ethan Batraski: I feel like we can talk about this for hours. And why don’t we end with a fun quick fire? So I’m gonna ask you a question between two things. And you tell me what comes to mind first. Postgres or MySQL?

[00:18:13] Anupam Singh: Postgres.

[00:18:13] Ethan Batraski: Postgres all the way. I love it. Rust or Golang?

[00:18:15] Anupam Singh: Go, Golang.

[00:18:18] Ethan Batraski: Okay. ETL or ELT?

[00:18:21] Anupam Singh: Would love to be ELT, but really ETL.

[00:18:23] Ethan Batraski: Okay. [laughs]. Batch or real-time?

[00:18:27] Anupam Singh: Real-time.

[00:18:29] Ethan Batraski: A favorite Roblox game?

[00:18:31] Anupam Singh: Doors. It’s a horror genre.

[00:18:33] Ethan Batraski: Doors. I have seen that one. My sons played it actually this weekend.

[00:18:37] Anupam Singh: Yeah, it’s scary.

[00:18:39] Ethan Batraski: Snowflake or Databricks?

[00:18:40] Anupam Singh: None.

[00:18:42] Ethan Batraski: Build and test locally or remotely?

[00:18:45] Anupam Singh: Remotely.

[00:18:46] Ethan Batraski: Okay. And then last, but not least, Kubernetes. For or against?

[00:18:50] Anupam Singh: For, very strongly for.

[00:18:52] Ethan Batraski: Strongly for. I love it. I love the strong stance.

[00:18:55] Anupam Singh: Mm-hmm.

[00:18:55] Ethan Batraski: Anupam, this has been great. Thank you so much for sharing your sage wisdom and advice, and insights. Everyone listening to this podcast has certainly learned a ton from this, so hope we get to do this again soon.

[00:19:07] Anupam Singh: Yes, absolutely. This was fun, making me think about vendors, how do we partner with our open source community, and getting reminded of our scale, which is always fun to think about.

[00:19:22] Ethan Batraski: You guys are building a generational company and you have the scale to prove it, so excited to see where you take it next.

[00:19:30] Anupam Singh: Thank you, Ethan. This was fun.

Follow Ethan: https://twitter.com/ethanjb

Follow Ganesh: https://twitter.com/gan3sh