Semiconductor Engineering sat down to discuss the challenges of designing and testing multi-die systems, including how to ensure they will work as expected, with Bill Mullen, Ansys fellow; John Ferguson, senior director of product management at Siemens EDA; Chris Mueth, senior director of new markets and strategic initiatives at Keysight; Albert Zeng, senior engineering group director at Cadence; and Anand Thiruvengadam, senior director and head of AI product management at Synopsys. What follows are excerpts of that discussion, which was held in front of a live audience at ESD Alliance 2025. To view part one of this discussion, click here.
L-R: Ansys’ Mullen; Siemens’ Ferguson; Keysight’s Mueth; Cadence’s Zeng; Synopsys’ Thiruvengadam.
SE: In order to manage the heat in 3D-ICs, you have to know what’s going on inside a chip or chiplets at any point in time, right?
Zeng: Yes, and it's a very challenging problem. First, you have to bring the hardware and software together. You need a very sophisticated thermal management system to understand what is coming in. You can have a sophisticated heat map that shows, 'For this type of load, what would be the power model?' And then, based on that, you adjust the thermal controls.
Thiruvengadam: But you do have to model that. Thermal management and power management are not new, but this has gotten a lot more complex. We used to get by with just a few sensors scattered around, and a simple power management or thermal management system that would inform on the power states. But now you need distributed sensors and more sophisticated software to process and dynamically calibrate the power states.
Mueth: The complication is because of the integration. In the old days you would just put on a bigger heat sink. But you can’t do that here. You have size constraints from the get-go. That’s why you’re doing chiplets in the first place.
Zeng: There are two challenges here. One, now you need to think about where you place the thermal sensors. You need to do some analysis. That's a thermal mitigation problem. For the typical thermal management system, you've got control in the millisecond time frame. Second, you need to start running really long thermal simulations to validate your thermal management system. Those are significant challenges for thermal management.
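To make the kind of closed-loop control described above concrete, here is a minimal sketch of a thermal management loop in Python. The sensor IDs, thresholds, and the read_sensor/set_power_state hooks are hypothetical placeholders (simulated here), not any vendor's API.

```python
# Minimal sketch of a closed-loop thermal management policy: poll distributed
# sensors on a millisecond time frame and adjust power states with hysteresis.
import random
import time

SENSOR_IDS = ["die0_core", "die0_edge", "die1_core", "hbm_stack"]  # distributed sensors (assumed)
THROTTLE_C = 95.0   # assumed junction-temperature throttle threshold, in Celsius
RESTORE_C = 85.0    # lower restore threshold provides hysteresis so states don't toggle every cycle

def read_sensor(sensor_id: str) -> float:
    """Stand-in for reading an on-die thermal sensor; here it just simulates a value."""
    return random.uniform(80.0, 100.0)

def set_power_state(domain: str, state: str) -> None:
    """Stand-in for programming a power-management controller."""
    print(f"{domain}: power state -> {state}")

def thermal_loop(cycles: int = 100, period_ms: float = 1.0) -> None:
    """Poll every sensor each cycle and throttle or restore the associated power domain."""
    throttled: set[str] = set()
    for _ in range(cycles):
        for sid in SENSOR_IDS:
            temp = read_sensor(sid)
            if temp > THROTTLE_C and sid not in throttled:
                set_power_state(sid, "low")       # back off the hot region
                throttled.add(sid)
            elif temp < RESTORE_C and sid in throttled:
                set_power_state(sid, "nominal")   # restore once it has cooled
                throttled.remove(sid)
        time.sleep(period_ms / 1000.0)

if __name__ == "__main__":
    thermal_loop()
```

The gap between the throttle and restore thresholds is the usual way to keep such a controller from oscillating on every polling cycle.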
SE: And it goes even further, right? You've got to add redundancy on the interconnects, because the interconnects are not lasting as long. You've got time-dependent dielectric breakdown, which is accelerated in some of these devices. You also have things that have been developed and worked with individually, but never actually put together in the same design and tested over time.
Ferguson: The idea of looking at a device’s lifetime has been around for a long time. But this new aspect of what we do for lifetime management and analysis is accelerating. There’s a lot of work being done to figure out how exactly we nail that down. There’s pretty good academic research, but how we take that and apply it to real-world scenarios — and ensure we’re getting the right answers — is the next step.
Mullen: Thermal is at the center of a lot of the reliability concerns. Electromigration has been important for a long time, and it's directly related to temperature, which restricts the lifecycle. But now that you have stacked dies, you get thermal cycles. The actual physical integrity of the system is challenged. Bump connections or bonds can break or develop high resistance. There are a lot of failure modes in these systems that are very challenging.
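One way to see why temperature dominates these lifetime concerns is Black's equation for electromigration, MTTF = A * J^(-n) * exp(Ea / kT). The short Python sketch below uses illustrative values for the activation energy and operating temperatures, not characterized numbers for any particular process.

```python
# Hedged illustration of Black's equation: at a fixed current density, the relative
# MTTF between two temperatures is exp(Ea/k * (1/T2 - 1/T1)).
import math

K_BOLTZMANN_EV = 8.617e-5   # Boltzmann constant, eV/K
EA_EV = 0.85                # assumed activation energy for a copper interconnect, eV

def mttf_ratio(t1_c: float, t2_c: float) -> float:
    """Relative electromigration lifetime at temperature t2_c versus t1_c (same current density)."""
    t1, t2 = t1_c + 273.15, t2_c + 273.15
    return math.exp(EA_EV / K_BOLTZMANN_EV * (1.0 / t2 - 1.0 / t1))

# A hotspot running 105C instead of 85C:
print(f"Relative lifetime: {mttf_ratio(85.0, 105.0):.2f}x")  # ~0.23x, i.e. roughly 4x shorter
```

With these assumed values, a hotspot running 20°C hotter sees roughly a 4x shorter electromigration lifetime, which is why hotspots and thermal cycling in stacked dies are so consequential.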
SE: Another piece of this is how to test these designs. Everybody wants everything to shift left, but we now have to design a DFT strategy all the way at the beginning and figure out where we’re going to be testing, what we’re going to be testing, and how we’re going to interpret those results. How do we do that?
Ferguson: The diagnosis part is not too difficult. We have some of the standards already in place. And for the most part, if you're LVS-clean, and you've designed everything carefully, then from an input pin to an output pin you can do the diagnosis anywhere within that three-dimensional system. There are some challenges around how to do the physical test. You can put a standalone die or chiplet on a test bench, and if it's a known-good die you can put this into the system and everything's going to work great. The problem is that when you put it into the system, it's getting hot, it's working, it's being stressed, and it's not going to behave the same. Now the question is whether it's within my specs. That's a whole new question. We need to figure that out. So how do you do it? Some companies will do known-good die, known-good stacks, and known-good packages, and they'll put those together and get it a little bit better. But there's still a problem. Even if you want to take the whole 3D-IC assembly and say, 'The whole thing is known-good, and I can use this in whatever I'm putting it into,' you still have a problem. It's gone through this manufacturing process where you've heated it and it has warpage. Now I put it on the test bench and probe it, but is it actually connecting to the right things? Probably not. So now you've got a whole new issue. There probably are some ways to do it right. We can model the warpage, and we can tell you where you may need to have a longer one in this location and a shorter one in this location. But this still needs to be solved. It's an outstanding issue.
Mueth: You can break it down to a couple classes of problems. One is the performance testing. It's hard to get test points on a chiplet, but following some industry standards, test equipment companies will innovate around that. If you're looking at how to get to a one-test solution, where you have a magic-wand system-level test that will exercise the chip in such a way that you know you have a good assembly, that's easier said than done. Built-in self-tests are important for these kinds of assemblies because you can't probe everything. If you're looking at the structural integrity of the die, which is the main bottleneck, how do you test that? You can test it with a TDR (time domain reflectometer) system, which allows you to actually peer into the package itself by probing pins on the package. You can probe and essentially look inside like an X-ray machine, and you can deduce defects inside that package with a TDR system. So that's one way of doing it. And of course, the ultimate shift left is to simulate this stuff up front rigorously so that you have fewer concerns about the package integrity on the back end.
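As a rough illustration of the TDR approach Mueth describes, the distance to an impedance discontinuity can be estimated from the round-trip delay of the reflected edge and the signal velocity in the package dielectric. The effective permittivity and delay values below are illustrative assumptions, not measurements from any particular package.

```python
# Sketch: locate a defect from a TDR reflection. Distance = (velocity * round-trip delay) / 2.
C_MM_PER_PS = 0.2998          # speed of light, mm per picosecond
RELATIVE_PERMITTIVITY = 3.8   # assumed effective dielectric constant of the substrate

def defect_distance_mm(round_trip_delay_ps: float) -> float:
    """Distance from the probed pin to the impedance discontinuity, in millimeters."""
    velocity = C_MM_PER_PS / RELATIVE_PERMITTIVITY ** 0.5  # signal velocity in the dielectric
    return velocity * round_trip_delay_ps / 2.0            # divide by 2 for the round trip

# A reflection seen 65 ps after the incident edge lands about 5 mm into the package:
print(f"Defect located ~{defect_distance_mm(65.0):.1f} mm from the probed pin")
```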
Thiruvengadam: The problems are compounded when you look at the analog portions of chips. A lot of the standard DFT techniques are applicable to digital circuits. But when you go to analog, you start to lose that visibility and coverage. You can still have analog tests and ATBs (analog test buses) and all that. But that problem is a lot more complex when it comes to the analog portions of the chip. That’s where alternative techniques like defect simulation, injecting defects, and other methods can be applied.
Mullen: Complexity is a challenge. Techniques like built-in self-test or boundary scan can help us test individual dies and make sure that they're well covered, but you've got to control and observe each individual fault to be able to detect a problem, whether it's a delay, a stuck-at fault, or something else. The challenge with 3D-ICs is that bump counts are in the millions, and very soon there will be billions of these connections. And they need to be very high speed, so you're not going to want to have a lot of test overhead associated with them. Still, there has to be a way to test each and every bump connection from the outside.
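For readers less familiar with the fault-model terms used here, the sketch below is a toy illustration of stuck-at fault detection on a three-gate circuit, far simpler than production ATPG or BIST. The circuit, net names, and fault list are hypothetical; the principle is that a pattern detects a fault if the output with the fault injected differs from the fault-free output.

```python
# Toy stuck-at fault simulation over an exhaustive pattern set.
from itertools import product

def circuit(a, b, c, stuck=None):
    """A tiny combinational circuit; `stuck` optionally pins a named net to 0 or 1."""
    stuck = stuck or {}
    n1 = stuck.get("n1", a & b)       # AND gate driving net n1
    n2 = stuck.get("n2", n1 | c)      # OR gate driving net n2
    return stuck.get("out", n2 ^ a)   # XOR gate driving the primary output

faults = [("n1", 0), ("n1", 1), ("n2", 0), ("n2", 1), ("out", 0), ("out", 1)]
patterns = list(product([0, 1], repeat=3))

detected = set()
for a, b, c in patterns:
    good = circuit(a, b, c)                       # fault-free response
    for net, val in faults:
        if circuit(a, b, c, stuck={net: val}) != good:
            detected.add((net, val))              # this pattern exposes the fault

print(f"Fault coverage: {len(detected)}/{len(faults)}")
```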
Zeng: For traditional EMIR analysis, people run a lot of DFT patterns. With 3D-ICs, the complexity of the testing will grow exponentially. There’s a chance that if we are able to create an AI system that can go along with these test patterns and quickly provide feedback, and at the same time test new patterns, that could be a benefit. But it’s very difficult.
SE: We’ve seen a continual decline in first-time silicon success. Given that verification has always constituted the lion’s share of the design cycle, do you foresee this changing? Can we take a little longer on the verification to improve the rate of success? Some of these chips cost $100 million-plus to create.
Ferguson: This is where AI will help a lot because it can give us some opportunities to figure out how to do things faster. Ultimately, that’s what AI is good for — not so much to do things always correctly, but to get you close to an answer faster.
Mueth: There are two areas where that can come in. One is using AI to look at your test coverage and determine which tests or simulations you can get rid of in order to minimize your testing. The other is to reduce iterations by looking at the data and predicting failures through AI. The kinds of failures you can predict might even point to a root cause. It depends on how you set up the model metadata that you collect, but you can point to root causes. That's been demonstrated.
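A simplified, non-AI baseline for the test-reduction idea Mueth mentions is greedy coverage-preserving pruning: keep only the tests needed to retain the coverage of the full suite. The test names and coverage sets below are hypothetical; a learned model would instead rank or predict which tests are worth keeping.

```python
# Greedy set cover over a test-to-coverage map: drop tests whose coverage is redundant.
def minimize_tests(coverage: dict[str, set[str]]) -> list[str]:
    """Pick tests greedily until everything the full suite covered is covered again."""
    remaining = set().union(*coverage.values())
    kept = []
    while remaining:
        # choose the test covering the most still-uncovered items
        best = max(coverage, key=lambda t: len(coverage[t] & remaining))
        kept.append(best)
        remaining -= coverage[best]
    return kept

coverage = {
    "t1": {"f1", "f2", "f3"},
    "t2": {"f2", "f3"},
    "t3": {"f4"},
    "t4": {"f3", "f4"},
}
print(minimize_tests(coverage))  # -> ['t1', 't3'], which preserves the coverage of all four tests
```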
SE: But root cause analysis is getting a lot more complicated these days, right?
Mueth: Yes, as the products get more complex. It’s multi-dimensional, and it’s not easy.
Thiruvengadam: AI will have the biggest impact here because it applies to pretty much the entire design flow. For verification it can provide faster turnaround time for the same coverage, or higher coverage overall. You can achieve both those goals with AI. And, of course, compute efficiency is another big important focus for AI. On the test side, similar gains can be achieved. For example, you can do a test pattern reduction with AI. AI will have a fairly disruptive impact in a positive way on all aspects of the design flow. But I don't think you can ever extend the verification cycle because the pressure is on to reduce that as much as you can.
Zeng: I totally agree. AI can be used on verification to reduce the space significantly. There already are some products on the market as proof of that.