Breaking Dawn: Inside the UK’s newest AI supercomputer
To visit the University of Cambridge’s West Cambridge campus is to be immersed in cutting-edge science and technology.
Entering the site, it is impossible to miss the enormous gleaming white structure of the Ray Dolby Centre, soon to be the new home of the Cavendish Laboratory, the university department where the neutron and the electron were discovered. If you’re planning a trip from one lab to another, you can take a (very slow) ride in one of the driverless shuttle buses that can often be found traversing the campus.
Tucked away in a quiet corner near the M11 motorway, on the appropriately named Ada Lovelace Road, you will also find what purports to be the UK’s fastest AI supercomputer.
Housed in the West Cambridge data center, the Dawn supercomputer has been set up as part of a government push to boost the UK’s national compute power and support scientific projects across the country.
Dawn, er, broke in February with the official launch of the machine, but it is the second phase of the project, which could see its capabilities increase tenfold, where the benefits could be truly transformational for researchers.
A new Dawn
Dawn is managed through the Cambridge Open Zettascale Lab, part of Cambridge University’s Research Computing Services department. The department provides AI, data storage, and High-Performance Computing (HPC) services to the university’s research teams and other academic institutions around the country.
“Most of our business comes through the funding councils to provide national services,” explains Dr. Paul Calleja, director of Research Computing. “Prior to Dawn, we probably had 3,000 x86 servers of various generations. We normally have three generations of Intel chips – at the moment we have Cascade Lake, Ice Lake, and Sapphire Rapids – and in terms of AI, we have a large estate of Nvidia GPUs.
“We’ve seen demand for HPC services growing at a fairly predictable rate, but AI has really rocketed in the last couple of years.”
Indeed, the AI boom of the last 18 months, driven initially by the popularity of OpenAI’s ChatGPT generative AI system, caught many businesses and governments around the world off guard. But by the time generative AI emerged, the UK was already evaluating its processing capabilities as part of the Future of Compute review, commissioned by the government in the summer of 2022.
The review, published in March 2023[1], found that the UK’s compute ecosystem was not meeting the needs of users and “limiting the UK’s scientific capability and inhibiting scientific breakthroughs.”
“We had to do something because for the last 10-15 years we’ve been radically underfunding this space in the UK,” says Dr. Calleja, whose department contributed to the review process. “It was always going to say we needed to dramatically increase our spend on AI and HPC.”
An open approach to AI hardware
And so it was that Dawn was announced in November, to coincide with the UK’s AI safety summit, held at Bletchley Park. The machine was part of a £300 million ($374m) plan to build what ministers describe as a new national Artificial Intelligence Research Resource (AIRR). This also involves constructing another AI supercomputer, the Isambard-AI, at the University of Bristol.
Built by Intel and Dell in partnership with UK vendor StackHPC, Dawn contains 512 4th Generation Intel Xeon Scalable processors and 1,024 Intel Data Center GPU Max 1550 accelerators on 256 Dell PowerEdge XE9640 server nodes, offering up to 128 gigabytes of high bandwidth memory.
The system benefits from direct liquid cooling.
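Those headline totals imply a particular per-node layout. The short Python sketch below works it out from the figures quoted above, on the assumption (not stated in the article) that the processors and accelerators are spread evenly across the nodes; the function name is purely illustrative.

```python
# Figures quoted in the article for Dawn phase one.
CPUS = 512    # 4th Generation Intel Xeon Scalable processors
GPUS = 1024   # Intel Data Center GPU Max 1550 accelerators
NODES = 256   # Dell PowerEdge XE9640 server nodes


def per_node(total: int, nodes: int) -> int:
    """Return devices per node, assuming an even split (an assumption)."""
    if total % nodes != 0:
        raise ValueError("devices do not divide evenly across nodes")
    return total // nodes


cpus_per_node = per_node(CPUS, NODES)
gpus_per_node = per_node(GPUS, NODES)
print(f"{cpus_per_node} CPUs and {gpus_per_node} GPUs per node")
```

Under that assumption, each node would hold two processors and four accelerators.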
Dr. Calleja says the decision to use Intel’s hardware over that of GPU market leader Nvidia was driven by two factors: the vendor’s commitment to open architecture and a desire to diversify the UK’s hardware supply chain. The fact that Intel, along with Dell, offered to fully fund phase one of the machine’s development probably didn’t hurt, either.
“Nvidia’s direction is clear,” Dr. Calleja says. “Every move it makes is to make its ecosystem more proprietary so there’s more lock-in and it can keep its share price at the current level. It’s madness and it has to change because the ecosystem needs competition.
“As a nation, we have to be mindful of supply chain diversity for price, lead time, and security reasons. So if the UK is looking to invest hundreds of millions of pounds in hardware, my argument is that you invest two-thirds of that in the current market leader, and invest one-third in growing the supply chain.
“The Intel ecosystem currently has a smaller number of large users than the Nvidia one, but we’re used to that with novel architectures. They’ve invested significantly in phase one of this system, and it’s a win-win because they get their technology out there, and the government wins because it gets access to an innovative system.”
Dr. Calleja says Intel’s “long history” of open software standards around its x86 processor ecosystem stands it in good stead to make inroads in AI, despite Nvidia’s current dominance of the market. “The x86 ecosystem is so rich because of Intel’s commitment to being open,” he says. “98 percent of scientific workloads are run on x86, and Intel invests in that and is employing a similar philosophy with GPUs.”
Where Nvidia has its CUDA development framework for AI, Intel has oneAPI, a cross-platform set of tools. “This means that if you develop on oneAPI, you can run that code on an Nvidia system, an Intel system, or an AMD system,” Dr. Calleja says. “That, I believe, will eventually win out and I already have clients who use Nvidia but don’t want to develop on CUDA because they don’t want to get locked in.
“Big scientific software projects are on 30-year investment cycles, whereas hardware cycles are more likely to be four years, so it’s much better if you can be cross-platform. For me, it’s a no-brainer, but Intel needs momentum and Nvidia has a ten-year headstart.”
The Dell PowerEdge racks feature direct liquid cooling, meaning liquid is sprayed onto a cold plate attached directly to the hot components. Dr. Calleja says this is in keeping with the ethos of the West Cambridge data center. “The data center is ten years old, and when we built it we wanted to be ready to use water [for cooling],” he says. “So all our racks have water-cooling rear doors by legacy.
“For a long time, we stayed away from having water in the servers, because it wasn’t necessary, but now it is. We have a hybrid set-up using water and air, and our last generation of non-water cooled chips will be the Cascade Lakes, which are going in now. The newer ones are all water-cooled.”
Dr. Calleja says that thanks to its hybrid cooling set-up, the data center as a whole achieves a Power Usage Effectiveness (PUE) score of 1.14.
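PUE is the ratio of the total power drawn by a facility to the power that reaches its IT equipment, so a score of 1.0 would mean no overhead at all. As a rough illustration of what 1.14 implies, the Python sketch below applies that ratio to a hypothetical IT load of 1,000 kW; the load figure and function names are assumptions made for the example, not numbers from the data center.

```python
PUE = 1.14  # figure quoted by Dr. Calleja for the data center as a whole


def facility_power_kw(it_load_kw: float, pue: float = PUE) -> float:
    """Total facility power implied by an IT load (PUE = total / IT)."""
    return it_load_kw * pue


def overhead_fraction(pue: float = PUE) -> float:
    """Share of total facility power that does not reach IT equipment."""
    return (pue - 1.0) / pue


it_load_kw = 1000.0  # hypothetical IT load, not a figure from the article
total_kw = facility_power_kw(it_load_kw)
overhead_kw = total_kw - it_load_kw
print(f"Total: {total_kw:.0f} kW, overhead: {overhead_kw:.0f} kW "
      f"({overhead_fraction():.1%} of total)")
```

In other words, for every kilowatt delivered to servers, roughly 0.14 kW more goes on cooling, power distribution, and other overheads.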
Through the partnership with StackHPC, researchers wishing to access Dawn can do so through a single cloud-based control pane, known as Scientific OpenStack, which also allows them to utilize the rest of the compute power in the Research Computing Services estate.
“It’s optimized for HPC and AI, and it allows us to do many different things in a secure environment,” explains Dr. Calleja. “On top of that, we can deploy software-defined research platforms for our customers. We’ve moved completely into the DevOps way of working and that’s really revolutionized how we operate in terms of usability, portability, and security.”
Beautiful Dawn
Whether Dawn’s claim to be the UK’s fastest AI supercomputer stacks up at this point is dubious.
Data released by Intel at the SC23 conference in Denver, Colorado, last year shows that the system achieved a peak of 19 petaflops of benchmarked FP64 performance. This puts it on a par with the UK’s current top supercomputer, the Archer2 system, which has a peak benchmarked performance of 20 petaflops.
Archer2 is in position 39 in the Top500 list of the world’s most powerful supercomputers, and Dawn phase one’s stated specs would put it 41st in the list.
However, while this measurement indicates the system’s general performance levels, it does not measure its ability to run AI workloads. So far Intel does not appear to have published FP8 or FP16 performance information, which could show how Dawn compares to other systems for these specific tasks.
Full details for phase two of Dawn, which will be delivered later this year, have yet to be released either. This will be funded by the UK government, which has promised £500 million ($626m) for the project, and Dr. Calleja expects that, when commissioned, it will offer a 10x performance increase, far outstripping the power of Archer2. A ten-fold improvement on 20 petaflops would rank Dawn Phase Two in the top ten of the current Top500 global list of most powerful supercomputers.
The current phase of Dawn is already being put through its paces by researchers in Cambridge and beyond. One of the first publicly revealed uses for the system is a joint project by the UK Atomic Energy Authority (UKAEA) and Cambridge University, which will see researchers develop a simulation of a planned fusion reactor to speed up the development of the technology.
The two agencies are using Dawn to create a digital twin of the Spherical Tokamak for Energy Production prototype fusion power plant, which is scheduled to create a “burning plasma” by 2035 and net electricity production by 2040.
Nuclear fusion has the potential to provide limitless sustainable power by mimicking conditions in the Sun, fusing light atoms into heavier ones.
However, current working assumptions suggest the technology remains decades away from reality. Dawn could help speed up this process.
Speaking in February, Dr. Rob Akers, director of computing programs at UKAEA, said: “Having access to powerful systems like Dawn is pivotal to positioning the UK at the forefront of an emerging technology and industry.
“The ultimate prize will be ‘bottling a star’ – harnessing fusion energy here on Earth, and shifting the needle towards a carbon-free world.”
It is also hoped that it can be used in drug discovery, to enable the development of personalized medicines based on an individual’s DNA, and to help scientists model climate change and its impact.
Dr. Calleja expects Dawn, along with the other UK supercomputing infrastructure that forms AIRR, to be put to use at Whitehall, too. “Away from science, another big driver for this is the government’s own use cases,” he says. “It’s looking at how we make these technologies available to government departments so that they can make efficiency gains.”
Having been in post for 17 years, he has witnessed several big shifts in the HPC landscape, and believes AI and AI-focused supercomputers like Dawn are about to change the course of the development of so-called exascale machines, the next generation of HPC, which promises to offer exponentially more compute power.
An exascale system, Frontier, already exists in the US, and China reportedly has two in operation, though it has not submitted them for benchmarking. The EU has two in development, one of which will be hosted in Germany, and in October the UK revealed that Edinburgh University had been selected as the location for the UK’s first exascale installation. It is hoped work on the system will begin in 2025 and that, once complete, it will offer a 50x performance increase compared to Archer2.
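The 50x figure lines up with the exascale label. A minimal Python sketch of the arithmetic, using the 20-petaflop peak quoted earlier for Archer2 and the standard conversion of 1,000 petaflops to one exaflop (the variable and function names are illustrative):

```python
PETAFLOPS_PER_EXAFLOP = 1000  # 1 exaflop = 10**18 FLOPS = 1,000 petaflops

ARCHER2_PEAK_PF = 20  # peak benchmarked petaflops, as quoted in the article
SPEEDUP = 50          # hoped-for improvement over Archer2


def to_exaflops(petaflops: float) -> float:
    """Convert a petaflop figure to exaflops."""
    return petaflops / PETAFLOPS_PER_EXAFLOP


target_pf = ARCHER2_PEAK_PF * SPEEDUP
target_ef = to_exaflops(target_pf)
print(f"{target_pf} petaflops = {target_ef} exaflops")
```

On those numbers, a 50x improvement would land on the one-exaflop threshold.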
However, Dr. Calleja argues these machines may turn out to be “yesterday’s solution to tomorrow’s problem.” He explains: “Exascale is a challenge, those machines are large and difficult to operate, and getting code to run at that scale is problematic.
“The promise of exascale is that you can solve bigger problems by essentially brute-forcing these problems with a bigger machine. Now you can train an AI model to do it for you instead.”
He adds: “AI is going to have a big impact on the traditional HPC market. The community is smart and has had to adapt to changes in the past, and we’re at another epoch-changing moment now. I would argue we’re already in the post-exascale era because with AI you can get a lot more bang for your buck.”
References
- [1] March 2023 (www.gov.uk)