👋 Hi, this is Gergely with a subscriber-only issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers. If you’ve been forwarded this email, you can subscribe here. Startups on hard mode: Oxide. Part 1: HardwareWhat is tougher than building a software-only or hardware-only startup? Building a combined hardware and software startup. This is what Oxide is doing, as they build a “cloud computer.” A deepdive.What does an early-stage startup look like? Usually, there’s an office with devs working on laptops, a whiteboard with ideas, and lots of writing of code. Sometimes, there’s no physical office because team members work remotely, sharing ideas on virtual whiteboards. But what about companies which aren’t “pure” software startups, and which focus on hardware? I recently got a glimpse of one in San Francisco, at the offices of Oxide Computer Company, which is a startup building a new type of server. I was blown away by the working environment and the energy radiating from it. This is their office: Some things are hard to describe without experiencing them, and this includes being at a hardware+software startup right as the first product is being finished, with the team already iterating on it. In today’s issue, we cover hardware at Oxide:
The bottom of this article could be cut off in some email clients. Read the full article uninterrupted, online. 1. Why build a new type of cloud computer?If you want to build an application or service for millions of users, there are two main options for the infrastructure:
In data centers, the unit of measurement is “one rack.” A rack is a storage unit that can hold a few dozen servers; often referred to as “pizza box servers” because of their shape. Thicker types are called “rack servers.” Side note: an alternative to pizza box servers is blade servers, inserted into blade enclosures as building blocks in data centers. Here’s a “pizza box server” that online travel booking platform Agoda utilized heavily: And here’s some commodity servers in Oxide’s office: The rack in the image above was operating during my visit. It was loud and generated a lot of heat as expected, and there were lots of cables. It’s messy to look at and also to operate: the proprietary PC-era firmware causes security, reliability and performance issues. “PC-era” refers to the 1980s – early-2000s period, before x86 64-bit machines became the servers of choice. Elsewhere, Big Tech companies have manufactured their own highly optimized racks and servers, but these aren’t for sale. The likes of Meta, Google, and Amazon no longer use traditional racks, and have “hyper-scaled” their servers to be highly energy efficient, easier to maintain, and with few cables. Joe Kava, VP of Google's Data Center Operations, described these racks back in 2017:
Back to Oxide, whose vision is to build a cloud computer that incorporates the technological advances of Big Tech’s cloud racks, but makes them available to all. What if smaller tech companies could purchase energy-efficient servers like those that Meta, Amazon, Google and Microsoft have designed for themselves, and which customers of the big cloud providers like AWS, GCP, and Azure use – but without being locked in? This is what the Oxide Computer offers, and I’ve seen one of its first racks. It appears similar in size to a traditional rack, but the company says it actually occupies 33% less space than a traditional rack, while offering the same performance. It’s much quieter than an everyday commodity server; in comparison the Gigabyte and Tyan servers are ear-splitting, and there are hardly any cables compared to a typical server. The benefits of the Oxide computer compared to traditional racks:
Oxide’s target customer is anyone running large-scale infrastructure on-prem for regulatory, security, latency, or economic reasons. The Oxide rack comes with 2,048 CPU cores (64 cores per “sled,” where one sled is Oxide’s version of a “rackmount server”,) 16-32TB of memory (512GB or 1TB of memory per sled) and 1PB (petabyte) of storage (32TB storage per sled). See full specification. This kind of setup makes sense for companies that already operate thousands of CPU cores. For example, we previously covered how Agoda operated 300,000 CPU cores in its data centers in 2023; at such scale investing in racks like Oxide’s could make sense. Companies in the business of selling virtual machines as a service might also find this rack an interesting investment to save money on operations, compared to traditional racks. An interesting type of customer are companies running thousands of CPU cores in the public cloud, but which are frustrated by network latencies. There’s a growing sense that multi-tenancy in public clouds; where one networking switch servers several racks and customers, causes worse latency which cannot be debugged or improved. In contrast, an Oxide rack offers dedicated rack space in data centers. Using these servers can also considerably reduce network latencies because the customer can choose the data center they use, based on their own regional needs. Customers also get full control over their networking and hardware stack – something not possible to do when using a cloud provider. Oxide doesn’t target smaller startups that only need a few hundred CPU cores. For these businesses, using cloud providers, or buying/renting and operating smaller bare metal servers is the sensible solution. 2. Building a networking switch from scratchIn server manufacturing, where does innovation come from? I asked Bryan:
Oxide had to design two pieces of hardware from scratch: the switch and the server. Why build a switch instead of integrating a third-party switch?Oxide’s mission is to build their own cloud computer. Building a custom server usually means taking off-the-shelf components for a system and integrating it all together, including the server chassis, a reference design system board, and a separately-developed network switch. A “reference design” is a blueprint for a system containing comprehensive guidance on where to place its elements, that’s been certified to work as intended: it should not overheat, or cause unexpected interference. However, Oxide also needed to build their own networking switch, as well as build a custom server – which is quite an undertaking! This need came from the constraint that Oxide wanted to control the entire hardware stack, end-to-end. A networking switch is a “mini-computer” in itself. So in practice, they designed and built two computers, not just one. Producing a separate switch meant integrating a switching application-specific integrated circuit (ASIC), management CPU, power supplies, and physical network ports. Oxide’s goals for this switch were:
Building the custom networking switch took around 2 years, from designing it in March 2020, to the first unit being assembled in January 2022. Building custom hardware almost always comes with unexpected challenges. In the case of the networking switch, the team had to work around an incorrect voltage regulator on the board, marked with yellow tape in the image above. 3. Using proto boards to build hardware fasterProto boards is short for “prototype printed circuit boards,” which help the company test small components to ensure they work independently. Once validated, those components can be used as building blocks.
First prototype boardFor the initial prototype printed circuit board, the team started with the service processor. This was a well understood, very critical part of the server and the switch. The team decided to build it from two off-the shelf microcontrollers, and built a prototype board around this: Starting out, the company took inspiration from robotics. Founding engineer, Cliff L. Biffle, shares: The team found that It’s possible to bite off more than you can chew, even with prototype circuit boards, The first board was a success in that it worked: upon start up, the hardware and software “came up” (in electrical engineering, “coming up” refers to the successful powering up and initializing of an electronic system board.) But the time it took to develop was longer than the team wanted. It turns out this prototype board was too highly integrated, with too many moving parts and too many I/O pins. There was simply too much on the board for it to be productive. The team learned that progres would be faster with multiple, simpler boards as pluggable hardware modules, instead of one complicated board with lots of functionality and many fixed functions. As engineer Rick Altherr – who worked on the board – noted:
A more modular approachBefore building the service processor board, the team separated out the “root of trust” (RoT) functionality onto a separate board. The RoT hardware is the foundational base upon which the security and trust of the entire system are built. A RoT has “first instruction integrity,” guaranteeing exactly which instructions run upon startup. The RoT sets up the secure boot and permanently locks the device to ensure ongoing secure operation. Below is the prototype of Oxide’s RoT module: Other modules the Oxide team built included a power module, a multiplexer (a device with multiple input signals, which selects which signal to send to the output): Over time, the team found the right balance of how much functionality the prototype board needed. Below is an evolved prototype board version of the service processor: The Oxide team calls this board a “workhorse” because they can plug in so many modules, and do so much testing and hardware and software development with it. Here’s an example: A prototype board is unit testing for hardware. In software development, unit tests ensure that components continue to work correctly as the system is modified. Oxide found that prototype boards come pretty close to this approach, and allowed Oxide to iterate much faster on hardware design, than by manufacturing and validating test devices. Using smart workarounds to iterate faster...Subscribe to The Pragmatic Engineer to read the rest.Become a paying subscriber of The Pragmatic Engineer to get access to this post and other subscriber-only content. A subscription gets you:
|
Search thousands of free JavaScript snippets that you can quickly copy and paste into your web pages. Get free JavaScript tutorials, references, code, menus, calendars, popup windows, games, and much more.
Startups on hard mode: Oxide. Part 1: Hardware
Subscribe to:
Post Comments (Atom)
When Bad People Make Good Art
I offer six guidelines on cancel culture ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏...
-
code.gs // 1. Enter sheet name where data is to be written below var SHEET_NAME = "Sheet1" ; // 2. Run > setup // // 3....
No comments:
Post a Comment