Deep learning has revolutionized the way humans interact with data across virtually every field imaginable. However, the tip of the spear continues to be scientific research, where high-performance computing (HPC) and artificial intelligence (AI) have practically merged to become powerful drivers of innovation. On the leading edge of this revolution are organizations pushing the boundaries of what deep learning can achieve, like the National Energy Research Scientific Computing Center (NERSC) at the Department of Energy’s Lawrence Berkeley National Laboratory. NERSC is one of the leading supercomputing centers dedicated to supporting energy, science, and technology research.
With that in mind, Wahid Bhimji, Group Lead for Data & AI Services (and Division Deputy for AI and Science) at NERSC, shared some insights into the challenges and opportunities that deep learning at scale presents, as well as what others can learn from NERSC's efforts. At November's SC24 conference, Bhimji is leading a tutorial, "Deep Learning at Scale," co-hosted by experts from NVIDIA and Oak Ridge National Laboratory. The session will dive into the strategies and approaches researchers are using to solve the increasingly difficult problems they face as they scale their deep learning workloads.
The Growing Pains of AI: Efficiency, Reuse, and Scale
The evolution of AI in science and technology has been gradual, punctuated by moments of groundbreaking innovation, according to Bhimji.
“Machine learning has been used in the sciences for decades, but the recent revolution has been driven by deep learning and other modern AI techniques,” he explained.
This shift became significant at NERSC with projects on the now-retired Cori supercomputer, where AI transitioned from proof-of-concept to a critical research tool.
“Now we see HPC and AI have come together more seamlessly. We see it in areas like large language models with industry really pushing the envelope,” Bhimji added, highlighting the progress in both scientific and industrial applications.
However, scaling deep learning models introduces new complexities. “It’s not as simple as taking something that works on a single GPU and scaling it up onto a large HPC machine,” Bhimji noted. The scaling process – whether by distributing data, tasks, or model components – varies significantly based on the specific problem, making it resource-intensive.
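To make the first of those strategies concrete: in data-parallel training, each GPU holds a full copy of the model and works on its own shard of the data, with gradients averaged across workers every step. Below is a minimal sketch of that pattern using PyTorch's DistributedDataParallel; the toy model, synthetic data, and launcher assumptions (e.g. torchrun or srun setting RANK, LOCAL_RANK, and WORLD_SIZE) are illustrative placeholders, not the tutorial's actual code.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Assumes a launcher such as torchrun (or srun) has set RANK, LOCAL_RANK, WORLD_SIZE.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model and synthetic data stand in for a real scientific workload.
model = DDP(torch.nn.Linear(32, 1).cuda(local_rank), device_ids=[local_rank])
data = TensorDataset(torch.randn(4096, 32), torch.randn(4096, 1))
sampler = DistributedSampler(data)        # each rank sees its own shard of the data
loader = DataLoader(data, batch_size=64, sampler=sampler)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(3):
    sampler.set_epoch(epoch)              # reshuffle the shards each epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                   # gradients are all-reduced across ranks here
        opt.step()

dist.destroy_process_group()
```

Launched with, say, `torchrun --nproc_per_node=4 train.py`, the same script runs on one node or many. Distributing the model itself (tensor or pipeline parallelism) requires a different and more invasive set of changes, which is part of why the right scaling strategy depends so heavily on the problem.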
Reusing the model efficiently after training presents another challenge, especially for models consuming substantial HPC resources. This issue is exacerbated by the scarcity of tools designed for use beyond large language models.
Bhimji emphasized: “Different use cases and models require different approaches, and that diversity adds to the challenge.” He pointed out the limited availability of tools facilitating model reuse across various domains, underscoring the need for more versatile solutions in the field.
Optimization Strategies: Start Small, Scale Smart
Addressing these scaling challenges requires a multi-layered approach to optimization, one that balances computational efficiency with scalability.
“Here, it’s important to ensure the HPC system is well-configured for distributed learning,” Bhimji explained.
Optimization begins with fine-tuning on a single GPU, using profilers to identify bottlenecks and track improvements. “Once that’s tuned, you can begin to think about scaling to larger systems,” he said.
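As a rough sketch of what that first step can look like, PyTorch's built-in profiler (one common choice among several profiling tools, not necessarily the one used at NERSC) can wrap a few training iterations and report where GPU time is going; the small model below is purely illustrative.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative single-GPU model and batch; a real workload is profiled the same way.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                            torch.nn.Linear(512, 10)).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(256, 512, device="cuda")
y = torch.randint(0, 10, (256,), device="cuda")

# Profile a handful of training steps on both CPU and GPU.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Rank operators by GPU time to find bottlenecks before trying to scale out.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```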
This stepwise approach ensures the model is efficient before taking on the complexities of distributed training: only once the model is optimized on a single device does scaling to larger HPC systems begin, with careful adjustments to parallelization techniques and distributed learning configurations.
As models grow, techniques like parallelization and mixed precision become essential to use the GPU architecture efficiently without overburdening resources.
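Mixed precision is one of the more approachable of these techniques: run most of the arithmetic in a lower-precision format while keeping the parts that need it in full precision. A minimal sketch using PyTorch's automatic mixed precision (AMP) utilities might look like the following; the model and data are placeholders, and this is a generic example rather than a NERSC recipe.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss to avoid fp16 gradient underflow

# Synthetic data purely for illustration.
x = torch.randn(512, 1024, device="cuda")
y = torch.randn(512, 1024, device="cuda")

for step in range(100):
    opt.zero_grad()
    with torch.cuda.amp.autocast():    # run eligible ops in reduced precision
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)                   # unscales gradients, then takes the optimizer step
    scaler.update()
```

The same pattern composes with the data-parallel setup sketched earlier, which is part of what makes it attractive when training scales out across many GPUs.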
“You often have to adjust and fine-tune the model’s settings, which can be both costly and time-consuming,” Bhimji explained.
Another key aspect of optimization is hyperparameter tuning, which becomes increasingly complex at scale. “Hyperparameters that work well on a single GPU don’t necessarily scale seamlessly to larger systems,” Bhimji noted. As models grow larger, the need for smarter hyperparameter search strategies and automated tools becomes even more critical to ease the burden of tuning at scale.
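One widely cited example of this effect is the learning rate: when data-parallel training multiplies the effective batch size by the number of workers, a rate tuned on one GPU often has to be rescaled and warmed up. The snippet below sketches the common "linear scaling rule" heuristic; it is a rule of thumb rather than a guarantee, and the specific values are illustrative.

```python
import os
import torch

# World size as set by the launcher; defaults to 1 for single-GPU runs.
world_size = int(os.environ.get("WORLD_SIZE", 1))

base_lr = 1e-3                    # learning rate tuned on a single GPU
scaled_lr = base_lr * world_size  # linear scaling rule: effective batch grows with world_size

model = torch.nn.Linear(32, 1)
opt = torch.optim.SGD(model.parameters(), lr=scaled_lr)

# Warm the learning rate up over the first 500 steps to keep early training stable.
warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=500)
```

Beyond such heuristics, automated search frameworks that distribute trials across many nodes are where the smarter search strategies and tools Bhimji mentions come in.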
The Power of Collaboration
While the challenges of scaling deep learning models and optimizing them across HPC systems are significant, they also present a unique opportunity for collaboration, one that Bhimji and his partners are embracing with their upcoming tutorial at SC24. Getting a full picture of these complexities (and how to overcome them) requires the collective expertise of scientists, engineers, and researchers across various fields.
Bhimji noted: “Bringing these techniques to a broader audience is part of what our tutorial is about – expanding these approaches beyond just large language model frameworks to something more universally applicable.” Through this type of collaboration, breakthroughs in AI and HPC are achieved, driving the field forward.
SC24 is a crucial gathering for this type of shared learning and innovation. Bhimji has seen firsthand the power of such events, collaborating with industry partners on SC tutorials about deep learning since SC18. By bringing together experts from around the world, SC24 offers a platform where the latest ideas and advancements can be exchanged, leading to new solutions for today’s challenges. From practical applications in AI to emerging technologies like quantum computing, the SC24 Conference provides the tools and insights needed to tackle the most pressing issues in supercomputing.
Join Us in Atlanta
Collaboration and continuous learning are key to realizing supercomputing’s full potential. SC24 offers an opportunity to expand your knowledge and experiences within the HPC community.
Attendees engage with technical presentations, papers, workshops, tutorials, posters, and Birds of a Feather sessions – all designed to showcase the latest innovations and practical applications in AI and HPC. The conference offers a unique platform where experts from leading manufacturers, research organizations, industry, and academia come together to share insights and advancements that are driving the future.
Join us for a week of innovation at SC24 in Atlanta, November 17-22, 2024, where you can discover the future of quantum, supercomputing, and more. Registration is open!