👋 Hi, this is Gergely with a free issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers. If you’ve been forwarded this email, you can subscribe here. Building Bluesky: a Distributed Social Network (Real-World Engineering Challenges)Bluesky is built by around 10 engineers, and has amassed 5 million users since publicly launching in February this year. A deep dive into novel design decisions, moving off AWS, and more.Before we start: AI tooling for software development feels like it has hit "peak hype" across mainstream media. We would like to do a "reality check" and find out how engineers and teams are using these tools (and which tools/use cases are genuinely efficient). Please help us by filling out this survey. We will share the full report with all of you who share detailed insights. Thanks for your help! ‘Real-world engineering challenges’ is a series in which we interpret interesting software engineering or engineering management case studies from tech companies. Bluesky is known as a Twitter-alternative. It launched two years ago, with an invite-only beta launch last year. It’s already grown to an impressive 5.5 million registered users. Interestingly for software engineers, Bluesky is also a fascinating engineering project unlike any other mainstream social network. Martin Kleppman, author of the Designing Data Intensive Applications book, is involved as a technical advisor, and has published a paper outlining the novel approaches Bluesky has taken. The biggest differences between Bluesky and other large social networks:
Other social networks have achieved some of these things; such as Mastodon allowing users to own their data and identity, and Meta achieving eye-catching growth by getting 100 million users in just a week. Still, only Bluesky has pulled off them all. Today, we dive into how Bluesky is built, sitting down with its two founding engineers: Daniel Holmgren and Paul Frazee. They take us through:
1. Development timelineBluesky has been in development for just over 2 years, and has been publicly available for around 12 months. Here’s the timeline: Adding in the three phases we’ll discuss below: Phase 1: ExperimentationThe first 10 months of the project between January and October 2022 were all about exploration, and the team started to work fully in the open after 4 months. The first project the team open sourced was Authenticated Data Experiment (ADX), an experimental personal data server and a command-line client, accompanied by a network architecture overview. In April 2022, heavy Twitter user, Elon Musk, raised the prospect of potentially acquiring the site, which created interest in alternatives to the bird app, as any major change in a market-leading social network does. The first commit for the Bluesky mobile app was made in June 2022, and Paul Frazee worked on it. It started as a proof-of-concept to validate that the protocol worked correctly, and to aid protocol development via real-world use. Conventional wisdom says that prototypes are thrown away after serving their purpose. However, in this case this mobile app that a single person had built, became the production app, following the unforeseen spike of interest in it caused by takeover news at Twitter. This is a good reminder that real world events can push conventional wisdom out of the window! In October 2022, the team announced the Authenticated Transfer Protocol (AT Protocol) and the app’s waitlist, just a few days after news that Elon Musk was to acquire Twitter. This led many tweeters to seek alternative social networks, and drove a major signup spike for Bluesky’s private beta. This development put pressure on the Bluesky team to seize the unexpected opportunity by getting the protocol and app ready for beta users. See details on the AT Protocol. Phase 2: invite-only launch and the first 1M usersIn October 2022, Bluesky consisted solely of Jay Graber CEO, and two software engineers; Daniel and Paul. Engineer #3, Devin, joined the same month. Announcing the AT Protocol and waitlist generated some media buzz and Bluesky attracted more interest during this period. In March 2023, the company was confident that the protocol and mobile app were stable enough to invite more users by sending invites. “Blocking” was implemented in a single night. After the app opened up to more users, there was an influx of offensive posts and of users verbally harassing other accounts. This made it clear that implementing blocks to restrict individual accounts from viewing and commenting on a user’s posts, was urgently-needed functionality. The three earliest developers – Paul, Devin and Daniel – jumped on a call, then got to work. In the community, developers saw the pull requests (PRs) on this feature appear on GitHub, and started to point out bugs, and cheer on the rapid implementation. They wrapped it up and launched the feature by the end of the same day. To date, this is the most rapidly-built feature, and is still used across the protocol and the app! In June 2023, Bluesky passed the 100,000-users milestone when the team numbered 6 developers, who’d shipped features like custom feeds, blocking and muting, moderation controls, and custom domains. A web application built on React Native was also in production. In September 2023, Bluesky passed 1 million users – a 900,000 increase in just 3 months! Phase 3: Preparing for public launchIn the 6 months following the 1 million-user milestone, the focus was on preparing to open up Bluesky to the public with no waitlist or throttling of invites. Federation (internal.) To prepare for “proper” federation, the team made architecture changes to enable internal federation of Bluesky servers. Federation is a key concept in distributed networks. It means a group of nodes can send messages to one another. For Bluesky, it meant that – eventually – users should be able to run their own PDS instances that host their own user information (and user information of users on that server.) And the Bluesky network operates seamlessly with this distributed backend. A new logo and a reference to Twitter. The team prepared a new logo for launch, and announced it in December 2023: The butterfly logo is intended as a symbol of freedom and change. Existing centralized social media platforms – like X (formerly Twitter,) Instagram, TikTok, and Youtube – are platforms that want to lock users into their website and apps. Bluesky, on the other hand, offers its protocol, but doesn’t dictate which apps or websites people use. It doesn’t even want to dictate the hosting of content: Let’s dive into each phase of the building process. 2. Experimentation phaseDuring Bluesky’s first 9 months (January-September 2022) two software engineers built the protocol and apps – Daniel Holmgren and Paul Frazee – and Jay the CEO signed off design decisions. The first couple of months were about experimenting and tech “spiking,” which means timeboxing the time and effort spent building and trying out ideas. Here’s Paul:
When the direction wasn’t clear, the team kept trying out new approaches, says Daniel:
Development principlesThe still-small team set up principles to ensure continuous progress:
Approach to building a new, novel decentralized protocolThe team prioritized flexible design choices in order to not lock themselves into a technology, until they knew exactly what they were building. Not coupling the data layer too closely with Postgres is an example of this. See below. Building for flexibility, not scalability, was deliberate. The idea was to swap this approach to prioritize scale once everyone knew exactly what to build. The knowledge that decisions are hard to undo made the team’s own decision-making more thorough, Daniel reflects:
Inventing new approaches was never a goal. The original idea was to take a protocol or technology off the shelf, and push it as far as possible to reveal a requirement that didn’t quite fit. For example, Lexicon – the schema used to define remote procedure call (RPC) methods and record types – started out as JSON schemas. The team tried hard to keep it lightweight, and stuck to JSON schemas. But they ended up bending over backwards to make it work. In the end, the team decided to fork off from JSON schemas and added features to it, which is how Lexicon was born. Bluesky gets criticism for inventing new approaches which are non-standard across decentralized networks. Paul explains it like this:
Bluesky takes inspiration from existing web technologies. As Daniel puts it:
3. v1 architecture: not really scalable and not federated – yetInfrastructure choicesPostgreSQL was the team’s database of choice when starting development. Postgres is often called the “Swiss Army knife of databases” because it’s speedy for development, great for prototyping, with a vast number of extensions. One drawback is that Postgres is a single bottleneck in the system, which can cause issues when scaling to handle massive loads that never materialize for most projects. For the team, using Postgres worked really well while they were unsure exactly what they were building, or how they would query things. Paul’s summary of the choice to use Postgres:
AWS infrastructure was what the team started with because it’s quick to set up and easy to use, says Daniel:
The first infra hire at Bluesky, Jake Gold, iterated on the AWS setup:
To facilitate deployments on AWS, the team used infrastructure-as-code service, Pulumi. Modularizing the architecture for an open network was an effort the team kicked off early. The goal of modularization was to spin out parts of the network which users could host themselves. Daniel says:
Personal Data ServerAt first, the architecture of Bluesky consisted of one centralized server, the PDS (Personal Data Server.) The strategy was to split this centralized service into smaller parts and allow for federation, eventually. Bluesky being a federated network means individual users can run their own “Bluesky instance” and curate their own network. The feed generatorIn May 2023, the Bluesky team moved the feed generator to its own role. This service allows any developer to create a custom algorithm, and choose one to use. Developers can spin up a new Feed Generator service and make it discoverable to the Bluesky network, to add a new algorithm. Bluesky also allows users to choose from several predefined algorithms. The Feed Generator interface was the first case of Bluesky as a decentralized network. From then, the Bluesky network was not solely the services which the Bluesky team operated, it was also third-party services like Feed Generator instances that plugged into the Bluesky network. Dedicated “Appview” serviceFor the next step, the view logic was moved from the PDS, to an “Appview” service. This is a pretty standard approach for backend systems, to move everything view-related to its own service, and not to trouble other systems with presenting data to web and mobile applications. Relays to crawl the networkIn the future, there could be hundreds or thousands of PDSs in the Bluesky network. So, how will all the data be synchronized with them? The answer is that a “crawler” will go through all these PDSs. In preparation for this crawl the team introduced a Relay service: 4. v2 architecture: scaleable and federatedThe v1 architecture needed to evolve in order to support full federation, and the team always planned to move on from it. But they expected v1 to last longer than only 6 months. FederationFederation sandbox. Before shipping a first version of federation, the team built a Federation Sandbox to test the architecture, as a safe space to try new features like modulation and curation tooling. Internal federation. To prepare for federation proper, the next refactoring was to add support for multiple Personal Data Servers. As a first step, the Bluesky team did this internally. Users noticed nothing of this transition, which was intentional, and Bluesky was then federated! Proving that federation worked was a large milestone. As a reminder, federation was critical to Bluesky because it made the network truly distributed. With federation, any user can run their own Bluesky server. The “internally federated” PDS servers worked exactly like a self-hosted PDS. Bluesky made one addition, to wrap the internal PDS servers into a new service called “Entryway,” which provides the “bsky.social” identity to the PDSes. Entryway will become the “official” Bluesky OAuth authorization server for users who choose bsky.social servers, and one operated as a self-hosted server. Later, Bluesky increased the number of internal PDS servers from 10 to 20 for capacity reasons, and to test that adding PDS servers worked as expected. External federation. With everything ready to support self-hosted Personal Data Servers, Bluesky flipped to switch, and started to “crawl” those servers in February 2024: To date, Bluesky has more than 300 self-hosted PDSs. This change has made the network properly distributed, anyone wanting to own their data on Bluesky can self-host an instance. Over time, we could also see services launch which self-host instances and allow for full data ownership in exchange for a fee. Appview: further refactoringRecently, Bluesky further refactored its Appview service, and pulled out the moderation functionality into its own service, called Ozone: Users can run their own Ozone service – meaning to be a moderator in the Bluesky system. Here are details on how to self-host this service, and more about Ozone. An architectural overview, with Martin KleppmanMartin is the author of the popular software engineering book, Designing Data Intensive Applications, and he also advises the Bluesky team in weekly calls. Martin and the Bluesky team published a paper describing the Bluesky system, Bluesky and the AT Protocol: Usable decentralized social media. In it, they offer a detailed overview of the architecture: The diagram above shows how data flows occur in the application:
5. Scaling the database layerScaling issues with PostgresScaling issues emerged 2-3 months after the public beta launch in mid-2023.
As a reminder, the team was still tiny when all these scaling challenges emerged. There were only 6 developers (Daniel, Devin, Bryan and Jake on the backend, and Paul and Ansh on the frontend). Then in summer 2023, Daniel had a dream:
ScyllaDB replacing PostgresThe team knew they needed a horizontally scalable data storage solution, with fine-grained control of how data is indexed and queried. ScyllaDB was an obvious choice because it supports horizontal scalability due to being a wide-column database (a NoSQL type.) Wide-column databases store data in flexible columns that can be spread across multiple servers or database rows. They can also support two rows having different columns, which gives a lot more flexibility for data storage! The biggest tradeoffs:
The team was satisfied with their early choice of Postgres, says Daniel:
SQLiteScyllaDB is used for the Appview, which is Bluesky’s most read-heavy service. However, the Personal Data Servers use something else entirely: SQLite. This is a database written in the C language which stores the whole database in a single file on the host machine. SQLite is considered “zero configuration,” unlike most other databases that require service management – like startup scripts – or access control management. SQLite requires none of this and can be started up from a single process with no system administrative privileges. It “just works.” Daniel explains why SQLite was ideal for the PDSs:
Migrating the PDSs from Postgre to SQLite created fantastic improvement in operations, Daniel adds:
6. Infra stack: from AWS to on-premBluesky’s infrastructure was initially hosted on Amazon Web Services (AWS) and the team used infrastructure-as-a-code service, Pulumi. This approach let them move quickly early on, and also to scale their infra as the network grew. Of course, as the network grew so did the infrastructure bill. Move to on-premCost and performance were the main drivers in moving on-prem. The team got hardware that was more than 10x as powerful as before, for a fraction of the price. How was this decision made? A key hire played a big role. Bluesky’s first hire with large-scale experience was Jake Gold, who joined in January 2023, and began a cost analysis of AWS versus on-prem. He eventually convinced the team to make this big change. But how did the team forecast future load, and calculate the hardware footprint they’d need? Daniel recalls:
Becoming cloud-agonistic was the first step in moving off AWS. By June 2023, six months after Jake joined, Bluesky’s infrastructure was cloud agonistic. Bluesky always has the option of using AWS to scale if needed, and is designed in a way that it would not be overly difficult to stand up additional virtual machines on AWS, if the existing infrastructure has capacity or scaling issues. Today, the Personal Data Servers are bare-metal servers hosted by cloud infrastructure vendor, Vultr. Bluesky currently operates 20 and shards them so that each PDS supports about 300,000 users. Bluesky’s load by the numbersCurrently, Bluesky’s system sees this sort of load:
7. Reality of building a social networkTo close, we (Gergely and Elin) asked the teams some questions on what it’s like to build a high-growth social network. What is a typical firefighting issue you often encounter?
What were the events referred to as “Elon Musk?”
How are outages different for a social network?
The whole developer team is on Bluesky, and actively responding to user feedback. How do you do this, and why?
TakeawaysGergely here. Many thanks to Daniel and Paul for part one of this deep dive into how Bluesky works! You can try out Bluesky for yourself, learn more about Bluesky’s AT Protocol, or about its architecture. And I’m also on Bluesky. Decentralized architectures require a different way of thinking. I’ll be honest, I’m so used to building and designing “centralized” architecture, that the thought of servers being operated outside of the company is very alien. My immediate thoughts were:
I’m delighted we did a deep dive about Bluesky because it has forced me to think more broadly. A server drawing on a diagram no longer just means “a group of our servers,” it can also mean “plus, a group of external servers.” Once this is understood, it’s easy. And this skill of designing distributed and federated systems may be useful in the future, as I expect the concept of distributed architecture to become more popular. It’s impressive what a tiny team of experienced engineers can build. I had to triple-check that Bluesky’s core team was only two engineers for almost nine months, during which time they built the basics of the protocol, and made progress with the iOS and Android apps. Even now, Bluesky is a very lean team of around 12 engineers for the complexity they build with and the company’s growth. In the next part of this deep dive into Bluesky, we cover more on how the team works. Owning your own infrastructure instead of using the cloud seems a rational choice. Bluesky found large savings by moving off AWS once they could forecast the type of load they needed. Jake Gold, the engineer driving this transition, has been vocal about how cloud providers have become more expensive than many people realize. Speaking on the podcast, Last Week in AWS, he said:
Don’t forget, it’s not only Bluesky which rejects cloud providers for efficiency. We previously did a deep dive into travel booking platform Agoda, and why it isn’t on the cloud. I’m slowly changing my mind about decentralized and federated social networks. I also tried out Mastodon, which is another federated social network, when it launched. At the time, Mastodon felt a lot more clunky in onboarding than Bluesky. You had to choose a server to use, but different servers have different rules, whereas Bluesky was much smoother. Still, as a user, I was blissfully unaware of how different these social networks are from the dominant platforms. It was only by learning about Bluesky’s architecture that I appreciated the design goals of a decentralized social network. Currently, mainstream social networks are operated exclusively by the company that owns them. But a decentralized network allows servers to be operated by other teams/organizations/individuals. This might not seem like a big deal, but it means a social network is no longer dependent on the moderation policies of a parent company. Decentralized social networks also allows users to use custom algorithms, websites and mobile apps, which creates opportunities for developers to build innovative experiences. In contrast, you cannot build a custom third-party client for X, Threads, or LinkedIn. I’m still unsure how much mainstream appeal decentralized social networks hold for non-technical people, but I’m rooting for Bluesky, Mastodon, and the other decentralized social apps. Perhaps they can challenge Big Tech’s dominance of social media, or at least change people’s understanding of what a social network can be. In a follow-up issue, we’ll look deeper into the engineering culture at Bluesky: the company culture, a deeper look at the tech stack, and how they are building seemingly so much with a surprisingly small team and company. I suspect we can all learn a lot in how a dozen engineers help a startup scale to more than 5 million users. Enjoyed this issue? Subscribe to get this newsletter every week 👇 Featured Pragmatic Engineer JobsImportant update: this jobs board will be retired on 27 April - so this is the last newsletter issue it is shown in. Read the reasons why I decided sunset this job board. Featured Pragmatic Engineer jobs:
You’re on the free list for The Pragmatic Engineer. For the full experience, become a paying subscriber. Many readers expense this newsletter within their company’s training/learning/development budget. This post is public, so feel free to share and forward it. |
Search thousands of free JavaScript snippets that you can quickly copy and paste into your web pages. Get free JavaScript tutorials, references, code, menus, calendars, popup windows, games, and much more.
Building Bluesky: a Distributed Social Network (Real-World Engineering Challenges)
Subscribe to:
Post Comments (Atom)
Why Carousels Are a Poor Way to View Product Images
Introducing the scrolling image stream ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ...
-
code.gs // 1. Enter sheet name where data is to be written below var SHEET_NAME = "Sheet1" ; // 2. Run > setup // // 3....
No comments:
Post a Comment