Penguin Computing is Meta’s architecture and managed services partner. Can you talk about what kind of work that entails with the new AI Research SuperCluster?
Let me take you down memory lane just a little bit, because I’d like to give context. We’ve been a provider to Meta for many years. Actually, some of the early workstations or servers that Mark Zuckerberg used [for Facebook] were Penguin Computing-branded, and we’ve got some pictures that show it, so that is how it all started.
But it really started with their [first-generation cluster] back in 2017. And it’s mentioned in their public announcements: Out of a request for information with over a dozen responders, we were selected as the key and sole partner for the architecture, design and deployment of their Facebook AI Research cluster — FAIR is the acronym. And the reason we got selected was really because we had a very different approach to solving for the ask. A lot of other manufacturers will think about what you need as building blocks without getting into the details of the application. Drawing on our 20 years of HPC background, we spent more time understanding the workload and the bottlenecks that they had experienced. And from there, we made a pretty innovative proposal to them that piqued their interest. So we got the nod back in 2017. And that started the partnership.
In the last four years, we built up more than just a seller-buyer relationship. It was a very embedded collaboration on the FAIR project and cluster, where we looked at the expansion from day one. We built in a lot of automation together as a joint team, a lot of coding for the deployment, utilization and management of the cluster — and all this with a keen eye to uptime, performance and [return on investment] for Meta. Because this type of resource is a sizeable investment, and if you don’t carefully design the architecture, you can over-design in areas where you don’t bring any value to your end users. And FAIR was really a good success story for both companies. I wanted to tell you the history, because it didn’t happen yesterday. It was four years in the making.
But coming to today and the announcement, let me explain a little bit more of what we do: We became very aware of their need based on their workload. And for [the AI Research SuperCluster], we designed the whole cluster with Meta, which included several components. Penguin Computing is really an integrator: We bring together components that already exist from partners like Nvidia or Pure Storage, as well as others in the industry.
In the case of Meta, those were the two that were selected, plus some of our own elements for the cache storage — because those seemed to be the right components for Meta. So we designed the whole solution for them. We made sure that we performance-tested against their requirements, and that’s an important part: The requirements were understood through the four years at FAIR, and we were able to get through this performance assessment fairly rapidly. Then once we had that, we had to assemble all the components, build them in our factories, deliver them, deploy them [and handle] the very mundane aspects of cabling, making sure all the networking is done right and the setup is right.
But really what’s unique about this deployment is the scale — deploying that many nodes and that big a storage footprint. I would say a lot of companies in the industry can do it at a smaller scale, even at a reasonable scale; my experience over the past few years has demonstrated that there are a lot of players in that space. But delivering at this scale, integrated with a team that has really demanding requirements, in the short period of time that we did — and coordinated with the operations team — is not a simple task.
When you operate such a cluster, you have to understand the requirements around security, performance, availability and all the unique aspects of a company like Meta that need to be taken into consideration. That’s been our value to Meta. And as we look at how this is evolving over time, as the announcement described, we’re looking at the next phase of this Research SuperCluster, which will be even more scaled compared to what we have today.