How Supercomputing Touches the World(s)

Notes
Transcript

From the plastic case protecting your phone to rovers on Mars to vaccines — supercomputers have played a role in just about everything around us. And many of those projects have rolled through one of the biggest supercomputing centers in the world — the Texas Advanced Computing Center (TACC). In this episode, we talk to undercover superhero Dan Stanzione, executive director of TACC, about the many discoveries and innovations his supercomputers have had a role in, and what it’s like to oversee it all. Whether it be Rommie Amaro’s recent COVID-19 breakthroughs or assisting emergency responders after a hurricane, Dan and TACC are making a real difference behind the scenes of society.

Credits

Interview with Dr. Dan Stanzione, Executive Director, Texas Advanced Computing Center (TACC), UT-Austin
Producers: Taylore Ratsep, Jolie Hales
Hosts: Ernest de Leon, Jolie Hales
Writer: Ernest de Leon
Editor: Jolie Hales

Follow Dan Stanzione and TACC

Take a 3D Tour of the TACC Data Center

*TACC Supercomputer Stampede (Source: TACC)*

Referenced on the Podcast

Mars Rover Pics

Mars Curiosity Image Gallery

*Mars Rover Curiosity Selfie (Source: NASA)*

Mars Spirit and Opportunity Image Gallery

*Mars Rover Opportunity at Rock Abrasion Target ‘Potts’ (Source: NASA)*

The Martian movie trailer

Ernest de Leon:
If this podcast is getting too serious, I need to stop. Hello, everyone. I’m Ernest de Leon.

Jolie Hales:
And I’m Jolie Hales. Welcome to the Big Compute Podcast. Here, we celebrate innovation in a world of virtually unlimited compute. We do it one important story at a time. We’re talking about those stories behind scientists and engineers who are embracing the power of high performance computing to better the lives of all of us.

Ernest de Leon:
From the products we use every day to the technology of tomorrow, high performance computing plays a direct role in making it all happen, whether people know it or not.

Jolie Hales:
So, for this episode, we’re going to do something a little bit different than we usually do.

Ernest de Leon:
Yes, because we have the opportunity to talk to a who’s who in supercomputing.

Jolie Hales:
Rather than just listen to my voice explain a topic today, we thought that we’d let our expert do most of the talking.

Ernest de Leon:
Yeah, considering he’s worked with high performance computing projects in every category and doesn’t have just one focus.

Jolie Hales:
Right, people we’ve talked to recently specialize in an aspect of COVID or they specialize in tsunamis. Whereas Dan, since he’s a who’s who in supercomputing, has pretty much worked with all of that. Though, as I understand it, before we even get going, Ernest, you have something you want to ask me first?

Ernest de Leon:
Yes. Jolie, do you love Mars?

Jolie Hales:
Mars like the planet or the candy bar? Wait, is that the candy bar company?

Ernest de Leon:
I think it’s actually both, now that you bring that up. I was primarily referencing the red planet.

Jolie Hales:
Do I love Mars? I’ve never been. I hear it’s wonderful in the winter. I’m sure it would be really cool. So, yes. Yes, Ernest. I love Mars.

Ernest de Leon:
Excellent.

Jolie Hales:
Tell me more about Mars. I think that’s where we’re going.

Ernest de Leon:
Have you seen the movie The Martian?

Jolie Hales:
I have seen the movie The Martian.

Mark Watney:
This will come as quite a shock to my crewmates and to NASA and to the entire world, but I’m still alive. Surprise.

Ernest de Leon:
So, let me tell you a little story.

Jolie Hales:
Okay. I like stories.

Ernest de Leon:
In the movie The Martian, the protagonist, Mark Watney, ends up stranded on Mars due to an unfortunate series of catastrophic events on the Martian surface.

Jolie Hales:
Who’s the actor that played that? Was it Matt Damon?

Ernest de Leon:
It was Matt Damon. That’s right.

Jolie Hales:
Cool, sorry to interrupt.

Ernest de Leon:
No worries.

Jolie Hales:
Just got to get that picture in my head, especially if it’s with Matt Damon.

Ernest de Leon:
Andy Weir is the author of the novel if anyone wants to read it. I’ve read the novel. It’s excellent. So, I won’t ruin the movie for you or the novel, which I highly recommend you read before you see the movie.

Jolie Hales:
Too late.

Ernest de Leon:
But let’s just say that Mark Watney is in a pretty bad spot.

Jolie Hales:
Uh-oh.

Mark Watney:
Now, you can either accept that, or you can get to work.

Ernest de Leon:
The telecom equipment that he has is unusable at a certain point. His communication with Earth is now dead silent.

Jolie Hales:
Oh, that’s so uncomfortable.

Ernest de Leon:
Yes. After some time, Mark Watney realizes that there is one other place that has telecom equipment he could maybe use. I emphasize the maybe here. So, he sets out to find NASA’s original Mars rover, Sojourner, but more importantly, the lander named Pathfinder. Pathfinder has the telecom equipment Mark needs to reestablish communication with Earth. Unfortunately, the original Pathfinder and Sojourner Team ceased communicating with Earth back in September of 1997.

Jolie Hales:
Is that a real thing?

Ernest de Leon:
That’s a real thing. Of course, this movie takes place even in the future from now. So, you’re talking even more decades between when this original thing went offline as opposed to the movie timeline. We’ll get into that here in a second. Mars rovers have a long and storied history starting with Pathfinder and Sojourner and continuing with Spirit, Opportunity and Curiosity.

News Clip:
Breaking news this morning, Curiosity is a smashing success. The NASA Mars rover touched down this morning right there on the red planet. A daring mission with more to come.

News Clip:
The landing caps a journey that lasted more than eight months and covered more than 350 million miles.

Ernest de Leon:
The original Pathfinder and Sojourner team had a total service life of 83 sols. Let me interject here and note that a sol is equal to one revolution of Mars, similar to an Earth day. On Mars, however, that revolution takes approximately 24 hours, 39 minutes, and 35 seconds by Earth time measurements.

Jolie Hales:
Wait, because I know sol is sun in Spanish. That’s what I know of a sol. Okay. So, a sol is a measurement of time.

Ernest de Leon:
Right.

Jolie Hales:
It’s equal to one revolution of Mars.

Ernest de Leon:
Right. So, here on Earth, we call that a day or a day-night cycle, right?

Jolie Hales:
Yeah.

Ernest de Leon:
We got 24 hours. However, Mars takes-

Jolie Hales:
A little bit longer.

Ernest de Leon:
… a little bit longer.

Jolie Hales:
Okay. So, 83 sols is like 83 days and change.

Ernest de Leon:
Exactly.

Jolie Hales:
Okay. That’s not very much.

Ernest de Leon:
No. There’s a reason for that, right? So, the design of these things has evolved over time. The Spirit rover landed on Mars on January 4th, 2004 and sent data back to NASA for over six years. Well, it sent its last communication on March 22nd of 2010.

Jolie Hales:
That was the first Mars rover?

Ernest de Leon:
No, the first was Sojourner. This is the second one. Yes.

Jolie Hales:
Got you.

Ernest de Leon:
The Opportunity rover landed on the Martian surface on January 25th, 2004. Thus far, it has surpassed all previous service lives at 5,352 sols, which is approximately 5,498 Earth days, which is almost 15 Earth years or 8 Martian years.
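As a rough check on the numbers above, the sol-to-Earth-day conversion can be sketched in a few lines of Python. The sol length of 24 hours, 39 minutes, and 35 seconds is the figure quoted earlier in the episode; the function and constant names are illustrative only and do not come from any NASA tool.

```python
# Length of one sol (one Martian day) in seconds, using the
# 24 h 39 min 35 s figure quoted in the episode.
SOL_SECONDS = 24 * 3600 + 39 * 60 + 35
EARTH_DAY_SECONDS = 24 * 3600


def sols_to_earth_days(sols: float) -> float:
    """Convert a duration in sols to Earth days."""
    return sols * SOL_SECONDS / EARTH_DAY_SECONDS


if __name__ == "__main__":
    # Service lives as quoted in the conversation.
    for label, sols in [("Pathfinder/Sojourner", 83), ("Opportunity", 5352)]:
        days = sols_to_earth_days(sols)
        print(f"{label}: {sols} sols is about {days:.0f} Earth days "
              f"({days / 365.25:.1f} Earth years)")
```

With these inputs, 83 sols works out to a bit over 85 Earth days, and 5,352 sols to roughly 5,499 Earth days, which is within a day or so of the figure quoted above. The small gap probably comes down to rounding and how the endpoints are counted.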

Jolie Hales:
What?

Ernest de Leon:
Yes.

Jolie Hales:
How come it lasted so long?

Ernest de Leon:
Well, again, the technology is getting better and better as these things evolve.

News Clip:
When this little rover landed, the objective was to have it be able to move 1,100 yards and survive for 90 days on Mars, 90 sols. Instead, here we are, 14 years later, after 28 miles of travel. Today, we get to celebrate the end of this mission.

Ernest de Leon:
Opportunity sent its last communication to Earth on June 10th, 2018. So, you’re seeing a common thread here. The original three units are no longer in operation.

Jolie Hales:
Uh-huh (affirmative).

Ernest de Leon:
The Curiosity rover landed on Mars on August 6th, 2012. As of December 2020, when we were recording this podcast, over eight Earth years later, the Curiosity rover is still in service and still sending data back to Earth.

Jolie Hales:
Interesting. So, Curiosity is actually in use today. And then the other rovers are just hanging out on Mars not doing anything now.

Ernest de Leon:
Right, they’re currently defunct.

Jolie Hales:
I guess I always pictured us taking them back to Earth, but that wouldn’t make sense to do so.

Ernest de Leon:
Right, we can’t retrieve them. Now, keep in mind that Curiosity is not the longest serving rover. It’s 8 years. Opportunity was almost 15 years in Earth time. So, a little over half, but it’s still in service. So, imagine for a minute if NASA had the technology to extend the lifespan of the rover significantly, right? So not 8 years, not 15 years, but let’s say 50 years or 100 years. Instead of Mark Watney risking his life to go find a dead rover and lander, NASA could have located the nearest functional rover to Mark and redirected it to go to him before he lost his original telecom equipment as a precautionary measure. Yes, so extending the distance traveled and overall service life of future Mars rovers is just one of the many problems that NASA scientists are trying to solve with the help of supercomputers.

Jolie Hales:
I see where you’re going with this. Okay.

Ernest de Leon:
Yes, more specifically, the supercomputers at the Texas Advanced Computing Center or TACC for short. Much of the computational heavy lifting is done within supercomputers at the TACC, but NASA is looking into onboard high performance computing for rovers. The TACC is helping lead the charge. This isn’t the first time the TACC has surfaced in recent stories about supercomputing, some on this very podcast.

Jolie Hales:
I was going to say, I remember talking about the TACC. We talked about it on one of our COVID episodes.

Ernest de Leon:
Yes. So, we wanted to delve a little deeper into what the TACC is and why it is so important. Enter.

Jolie Hales:
Dan, is it Dan Stanzione? Is that how you say your name?

Dan Stanzione:
That is correct. There’s a broad family split between Stanzione, Stanzione, and the Italian Stanzione, but I’m a Stanzione. Yes.

Ernest de Leon:
I love it when there are family disputes on how to spell or pronounce last names. My name has been butchered my entire life. So, one day, I’ll share that story.

Jolie Hales:
Suspense.

Ernest de Leon:
Dan is a pretty interesting guy. When asked about what he does for fun outside of work, he said.

Dan Stanzione:
There’s stuff beyond work in life? I wasn’t aware of that.

Jolie Hales:
I know, isn’t that crazy.

Dan Stanzione:
It’s Election Day. So, today, it’s reloading election results over and over again. Hitting the reload button on various and sundry websites over and over again.

Ernest de Leon:
I bet our listeners can tell when we recorded this with Dan.

Dan Stanzione:
No, I’m pretty committed to this stuff, spend some time on a boat here and there. When I’m not doing this, living in Texas, I obviously watch a lot of football.

Ernest de Leon:
What is your title? What are your responsibilities?

Dan Stanzione:
The titles are long and sound more important than they probably are, but I’m the Associate Vice President for Research at the University of Texas at Austin. At universities, if you’re an Associate Vice President, that means you have a good parking spot. That’s really the only purpose. But then beyond that, I’m the Executive Director of the Texas Advanced Computing Center, run a center here of about 180 people who are involved in building really big computers and then finding ways to use large computers to do science and engineering work for both the University of Texas and for other people doing unclassified research all around the country and the world.

Ernest de Leon:
Having a good parking spot definitely means something.

Jolie Hales:
I didn’t have one at my last company. Let’s just say I had a couple people hit and run on my car, because it was parked on the street. That was fun.

Ernest de Leon:
That sucks.

Jolie Hales:
I know.

Ernest de Leon:
The TACC is doing some really great work in the world today. But before we delve into that, what does a day in the life of Dan look like?

Dan Stanzione:
I’m a scientist who’s never had to pick a field. One day, we might be working on hurricane forecasts and impacts on buildings and structures and doing simulation for that. The next day, it’s genomics and drug discovery. When you’re involved in computing, you’re involved in every facet of science and engineering around the world. A lot of it is the less exciting part of keeping everybody paid and running reports, but the fun parts are getting to work with the really cool scientists that we work with all around the world and getting to design and build the next generation of the world’s biggest machines.

Jolie Hales:
That is so interesting. I mean, I’m curious, what did you study to land you in this place? Because it sounds like now, you’re probably very well versed in everything from physics to chemistry to life sciences. You probably know at least a little bit about all of it at this point.

Dan Stanzione:
A little bit is probably the functional phrase there. So, I have to talk to people in those things. By training, I’m an electrical engineer as an undergrad for my bachelor’s degree. And then my graduate degrees are actually in computer engineering. I wanted to design and build really fast computer chips. What I do now was not a thing that really existed when I was going through school. I just got sucked into it. By the time I was doing my PhD, I was working with a genomics center and a chemistry and materials center, being the computing guy for the scientists. It started out as something I did in grad school and then started doing projects with the folks here at TACC and started doing more and more. I came over here as the Deputy Director in early 2009 and took over as Director in 2014.

Jolie Hales:
It’s always interesting to see where people land and what they go through to get to where they are in their careers, because it’s never what we’d think.

Ernest de Leon:
Never. Even in my case where it’s very close, it’s not what I originally thought.

Jolie Hales:
It’s crazy. My mentor when I was getting my master’s in film-

Ernest de Leon:
Peter Jackson?

Jolie Hales:
How did you know? He does have an Academy Award in screenwriting for writing The Sting. But the whole way that he launched his career is he got in a car accident with somebody who worked for a film company. And then he ended up giving a script to that person. That launched his career. So, he was a great mentor. But when it came to actual, “How do you jumpstart your career?” advice, it was not very useful, because I couldn’t figure out who to crash into.

Ernest de Leon:
Yeah, it was literally luck.

Jolie Hales:
Yeah.

Ernest de Leon:
Dan also spoke about the evolution of supercomputing from something that only government and military really had access to because of the cost to where it is today.

Dan Stanzione:
This just wasn’t a thing. The National Science Foundation started investing in this in the mid-80s. The rise of microprocessors and then the cloud, it just made it much more accessible to so many more researchers and types of researchers. These centers have grown and spread. So, it’s been an interesting time.

Ernest de Leon:
What do you enjoy most about what you do?

Dan Stanzione:
We get to work with some really fantastic scientists around the world and do fascinating, fascinating things. I think at almost any job, the thing that is most rewarding is the relationships and the people that you get to deal with. Graduate fellowship deadline for NSF was last week. I was writing reference letters for just remarkable young people who’ve done amazing things we’ve gotten to work with. You learn about their lives and stuff like that, but at the same time, we get to do really impactful science.

Dan Stanzione:
I mean, we’re not the ones necessarily out there doing it, but we’re making sure it can be done. We’ve assisted on Nobel prizes. We’ve assisted on really some groundbreaking discoveries, some that are more theoretical and basic science. Some that are just fun like better spaceship engines. This year, it’s been a huge amount of work around COVID vaccines and the structure of the virus and things that are going to have in the very short term, real impact on people’s lives. I get to play with really big computers and show them off to thousands of people every year. So, there’s lots of things that you enjoy in this kind of job, but it’s got huge variety. Like I said, in the end, it’s the people that make it worth doing.

Ernest de Leon:
It’s always about the people we work with, isn’t it?

Jolie Hales:
Hands down, absolutely. But for our listeners, why don’t we go into more detail on what the TACC is? What do they do?

Ernest de Leon:
Good idea.

Dan Stanzione:
So, TACC or the Texas Advanced Computing Center, we are, in my humble opinion, the best of the academic supercomputing centers and one of the largest in the world. We run the largest university-based supercomputer in the world. We run a bunch of other computing and data systems for folks. But really, we exist to help people do scalable things for the challenges we face in science and society, right? If you’re working on a problem, and almost every scientific problem has a computational piece now, whether it’s simulation or data or AI that you’re dealing with, eventually you’re going to scale off your laptop, and that’s where we get involved. It’s the hardware and the people that make that happen.

Ernest de Leon:
Got it. I already know the answer to this question, but I’m going to ask it anyway for our listeners. Where are you located?

Dan Stanzione:
We are here in Austin, Texas. We’re part of the University of Texas at Austin. We actually live at the J. J. Pickle Research Campus. So, we’re about eight or nine miles north of Downtown Austin, but we really serve users all around the country and around the world.

Jolie Hales:
This might be a dumb question, but are the supercomputers actually physically located there as well?

Dan Stanzione:
We are essentially one of the cloud providers for academic supercomputing. So, most of our users don’t ever actually see or touch the machines, but we have the actual physical data centers and physical machines in the building I’m sitting in right now, where we can supply about 10 million watts of power to keep them going.

Ernest de Leon:
How and when was the TACC founded?

Dan Stanzione:
It was founded in 2001. We really got on the map when we won the Ranger system, which was one of the big National Science Foundation systems. That was the number four machine in the world when it first came up. That was in 2008. That’s when TACC became one of the real leaders in providing things not just here at UT but around the country.

Ernest de Leon:
Now, Dan, we interview a lot of undercover superheroes on this podcast and we think you may know one, Rommie Amaro.

Dan Stanzione:
Oh, yeah. Rommie has been our biggest user the last six months, because of all of her COVID work. When you see her enthusiasm and energy and her ability to change the world, who wouldn’t want to help people like that?

Ernest de Leon:
That’s mighty big praise from the Executive Director of the TACC.

Jolie Hales:
Yeah, I could totally understand that. I mean, after talking to her, we were like, “That woman is amazing.”

Dan Stanzione:
I totally agree.

Jolie Hales:
I mean, props to you for helping her out and getting the research off the ground as quickly as you did. I mean, we did the math on the show. The amount of supercomputing resources, like you said, that she was using, it was quite the chunk of compute power.

Dan Stanzione:
We started those runs at pretty large scale the last week of February. That was our first COVID-related research project. By mid-March, it was the largest one we were running. There’s been 50 something others since then. But we got started at scale quickly, largely because again, it’s back to the people and the relationships, right? I already knew Rommie. I already knew what she did. She’s used our systems for years. So, she knew how to get on and be effective right away, doing what we’re doing. Rather than going through some complicated bureaucratic process when COVID was becoming what was obviously going to be a big thing, she sent an email.

Rommie Amaro:
I said, “Hey, Dan, I don’t know what you guys are up to, but this virus is looking pretty serious. I think we probably need to do something with it.”

Dan Stanzione:
I said, “I know you. I know the work you’re doing is great. So, we can make that happen today.”

Rommie Amaro:
It was amazing. I mean, it was amazing to have that level of support, but that was really key in getting time to solution very quickly for this effort.

Dan Stanzione:
We were off and running. Again, we’ve done 60 something other projects with different people since in the COVID space, but you get off the ground quickly, because the infrastructure is in place and the relationships are in place. The knowledge and the training are in place to make all this stuff happen. That’s why we can start so quickly on that work.

Rommie Amaro:
I was so grateful for them.

Dan Stanzione:
The output of Rommie’s work has been the input to some other work more upstream in sorting through billions of possible compounds for good drug candidates. It was done by some people at Argonne National Labs and the University of Chicago and Rutgers and University College London and just a whole bunch of other people around the world. So, they took that basic structural work that Rommie did and built it into this AI-enabled vaccine discovery pipeline. Of the four billion compounds we started with, they handed about 30 off to medicinal chemists to start fabricating and start clinical trials on by July-

Jolie Hales:
Wow.

Dan Stanzione:
… or August.

Jolie Hales:
Holy cow.

Dan Stanzione:
Getting that work done early was key and she’s kept discovering new things. I’m sure she told you about figuring out that the spikes on the Coronavirus wrap themselves in a sheet of sugar and then they’re going to get into a cell and all of that. That’s what helps it hide from the immune system, because it just looks like sugar molecules, right? So, none of that is stuff that we knew in January and we know it all now.

Jolie Hales:
Also, you said 60 different projects that TACC is involved in when it comes to COVID-19 right now. Did you say 60, six, zero?

Dan Stanzione:
Yes, I think that’s about right. It changes a little bit every week. It might be 59. It might be 63, but we’re in that neighborhood of different projects we’ve supported. Some of them are at the structural level like Rommie, where we’re working at the molecular level. That’s 20 or so projects, 25, 30 in that realm. We have another 10 or 15 that are more on the human side of that, right? You’re modeling societies. You’re using cell phone data to figure out how much people are interacting and where, but there’s other pieces of that, right? What are the causes? What is the projected spread? How does the aerosol spread on a plane? All sorts of things like that.

Dan Stanzione:
And then we have some that are in the middle, looking at the genomic scale, figuring out the evolutionary history of the virus, which helps identify the viruses it’s related to. That gives you some insights into treatments, but also, the people it infects, right? We know beyond any doubt at this point that it infects different people differently. There’s a lot of pre-existing conditions that feature into that, but a lot of it is also genetic, right? What strings of genome do you have that somebody else doesn’t have that make you more or less vulnerable, right?

Dan Stanzione:
The more we can understand that, again, we might be able to isolate and build therapies based on that or figure out who actually needs different kinds of vaccines to do a more personalized approach. We actually started in March with a number of the other supercomputing centers and the cloud providers. Through the White House and the Office of Science and Technology Policy, we put together the COVID-19-

Jolie Hales:
Yeah, the Consortium.

Dan Stanzione:
… HPC Consortium. Yeah. So, at this point, more than half of our projects have come through that mechanism, where they write to the Consortium and then they get stuck here at TACC or at the San Diego Supercomputer Center or maybe on Amazon or Microsoft.

Jolie Hales:
So, it’s interesting to think about how you have multiple projects that are using data that was collected from other projects. So, like Rommie Amaro’s work, right? A lot of the data that she’s been able to gather is now being fed into the supercomputers for different research. I mean, that’s pretty cool. It feels like the supercomputing world now is allowing us this time machine forward.

Jolie Hales:
From supersonic jets to personalized medicine, industry leaders everywhere are accelerating innovation with unprecedented speed and efficiency by using Rescale, the intelligent control plane that allows you to run any app on any infrastructure, totally optimized. As a solution for intelligent full stack automation for big compute and R&D collaboration on hybrid cloud, Rescale empowers IT leaders to deliver high performance computing as a service with software automation, with incredible security, architecture, and financial controls.

Jolie Hales:
As a proud sponsor of the Big Compute Podcast, Rescale would especially like to say thank you to all of the scientists and engineers out there who are working to make a difference for all of us. Rescale, powering science and engineering breakthroughs. Learn how you can modernize HPC at rescale.com/BCpodcast.

Ernest de Leon:
I really love to hear about the great work that scientists and researchers are doing around the world with supercomputers and also love to hear about the tech specs of the machines that they are working on.

Jolie Hales:
I remember when we were talking to Dan, things got pretty technical and it was so fascinating.

Ernest de Leon:
Absolutely. So, I asked Dan, “What are the names, sizes, and the tech specs of the various machines that occupy the TACC?”

Dan Stanzione:
We have about 15 different production platforms at this point, but the biggest one right now is Frontera. That’s our leadership class system. It’s actually a collection of several different kinds of systems, but the biggest piece is Intel Xeon-based, a little over 8,000 compute nodes that can do about 40 petaFLOPS with about 425,000 Xeon cores that make that up. There’s also about 1,000 GPUs attached in various subsystems focused more on the machine learning side of things. It has about 50 petabytes of fast file systems from DataDirect Networks. The network comes from Mellanox. We have a 200 gigabit InfiniBand interconnect for it. And then Dell was our integrator who put all the servers together.

Dan Stanzione:
Although on the GPU side, we also use Green Revolution for some cooling systems. We use CoolIT for the water-cooled parts to the chips. We’re using very high-powered chips. So, we just pump liquid directly across them at this point. The GPU nodes, we immersed in mineral oil. So, IBM and Nvidia and a whole bunch of companies were involved in doing all of that, but that’s our newest and largest system. It debuted fifth in the world. It’s been about a year and a half. So, it’s dropped down to about eight in the world at this point. They age just like any other computer does, but we’ll run that one for another four years or so.

Dan Stanzione:
Our other large scale system is Stampede2. That is also an Intel-based supercomputer, about 6,000 nodes, and has a mix of Xeon and what were called Knights Landing cores, the Xeon Phis. So, it also has around 400,000 cores. It’s about a 20-petaFLOP machine, about 30 petabytes of disk. Frontera does a few dozen of the largest scale projects. So, people get very large allocations, mostly running bigger jobs. Stampede’s our broader mission machine. It’s a couple years older, but it has more than 3,000 projects on it. So, that one has 15,000, 18,000 users competing for time on it. So, those are our traditional supercomputers.
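To put the two machines side by side, here is a small back-of-the-envelope sketch that derives per-node figures from the round numbers quoted in the interview. The inputs are approximations as spoken, not official specifications, and the class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    nodes: int         # approximate compute node count, as quoted
    cores: int         # approximate total CPU cores, as quoted
    petaflops: float   # approximate performance in petaFLOPS, as quoted

    def cores_per_node(self) -> float:
        return self.cores / self.nodes

    def teraflops_per_node(self) -> float:
        # 1 petaFLOPS = 1,000 teraFLOPS
        return self.petaflops * 1000 / self.nodes


# Round figures from the conversation (assumed, not authoritative).
SYSTEMS = [
    System("Frontera", nodes=8000, cores=425_000, petaflops=40),
    System("Stampede2", nodes=6000, cores=400_000, petaflops=20),
]

if __name__ == "__main__":
    for s in SYSTEMS:
        print(f"{s.name}: ~{s.cores_per_node():.0f} cores per node, "
              f"~{s.teraflops_per_node():.1f} teraFLOPS per node")
```

These are averages only. Stampede2 mixes two kinds of processors, so its per-node numbers blend different node types, and the Frontera figures cover the main Xeon partition rather than the GPU subsystems.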

Dan Stanzione:
We have other machines that have different missions. Chameleon is our cloud testbed, but that one is where we focus more on computer science research. We have a whole host of storage systems and data intensive computing systems. We have some more for interactive use. We have some more for visualization. We try and just provide that whole computing ecosystem that we think you need for modern science and engineering.

Ernest de Leon:
Awesome. So, one of the things, not surprising to me, but I know it’s been surprising to a lot of people in this industry, is the rise of the ARM processor and the ARM supercomputers.

Jolie Hales:
Like Fugaku in Japan.

Ernest de Leon:
Like Fugaku in Japan. I’m curious, what are the TACC’s plans right now? What are you looking at in terms of ARM in the future and the deprecation of the traditional x86 platform over, obviously, a very long period of time?

Dan Stanzione:
Yeah. So, it will not surprise you that we have some of those chips along with a whole host of other things and certainly, the AMD chips, the Nvidia GPUs and other ones. There’s a number of interesting things about ARM but specifically the Fugaku machine and what my colleague, Satoshi Matsuoka, has been able to do there is they had a very long term partnership to really purpose design a chip with Fujitsu for this big national supercomputer. So, that machine was many years in the planning and design, because most ARM chips, the kind that are in your cell phones and things like that are conventional processors.

Dan Stanzione:
They have some differences, but fundamentally, they work just like the AMD or Intel processors you have in your laptops and your servers. Architecturally, they’re the same. But what makes Fugaku unique is that the ARM chip they built with Fujitsu is not only a very nice processor, but it has a bunch of very high bandwidth memory that is integrated directly on the package, right? So, you don’t have to go off the pins to a separate memory chip somewhere else on the motherboard. That gives you much higher bandwidth, right? So, it has the memory bandwidth that a GPU would have.

Ernest de Leon:
Interesting. So, the RAM is on die.

Dan Stanzione:
I believe it is actually stacked on there. Not necessarily on die, but it’s on package.

Jolie Hales:
What does that mean? RAM on die, what does that mean?

Ernest de Leon:
So, in a traditional computer, you have a motherboard and then you have a CPU and then you have RAM sticks. And then you have hard drive or whatever else. The new chip from Apple has unified memory and it’s on die. What that means is the RAM is now in the CPU.

Jolie Hales:
Oh, interesting. Instead of a stick that you plug in, actually, RAM is on the CPU.

Ernest de Leon:
It’s inside the CPU. So, now, you don’t have to have that latency of leaving the CPU socket, going across the motherboard, hitting the RAM stick and then coming back.

Jolie Hales:
Okay.

Ernest de Leon:
In Apple’s case, they put the GPU in the CPU also.

Jolie Hales:
Oh, my gosh.

Ernest de Leon:
You don’t have a separate GPU anymore.

Jolie Hales:
All in one unit.

Ernest de Leon:
The entire computer’s in one chip.

Jolie Hales:
That’s crazy.

Ernest de Leon:
Now, to be fair, they had already been doing this in iPads and iPhones, but this was like another evolutionary step. Because even on those, I believe the RAM was still separate, but it was soldered onto the motherboard or the PCB. But now, they just rolled it all in. So, the one chip handles everything.

Jolie Hales:
Are there any supercomputers that actually have the RAM on die, or wouldn’t they mostly be separate pieces?

Ernest de Leon:
They would almost all be separate pieces, but Dan did note right here that in the case of the Fugaku one, it’s actually stacked. So, what they did is like they have the CPU at the bottom layer and then they put the RAM on it. They stack the layer on top of it. So, it’s like 3D, instead of it being flat.

Jolie Hales:
What’s the advantage of doing that? Just extremely available-

Ernest de Leon:
Extremely fast and available.

Jolie Hales:
So, less latency.

Ernest de Leon:
Less latency, it’s much faster, but then there’s a downside in that you can only fit so much in that space. Whereas with a traditional computer, you can put a terabyte or two terabytes of RAM on a node. There’s no way you can fit that much on a CPU, whether it’s stacked or on die, for that matter.

Jolie Hales:
Thanks for letting me know. I remember you asking him that. I also remember not quite understanding what you guys were talking about.

Ernest de Leon:
It’s also very dependent on the software, right? If the software is written to take advantage of the architecture, it could be a lot faster than traditional computing, where the software isn’t as optimized, right?

Dan Stanzione:
I believe the memory bandwidth on Fugaku is something like a terabyte a second per node, right? So, single CPU socket, which is about five times better than we can get out of a current mainline CPU socket, but they’re seeing some fantastic performance out of that on some applications. Now, the downside of that, of course, is that in that particular design, because they’ve squeezed all the memory onto the chip, they have a lot more bandwidth, but they have a lot less capacity.
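
To put the bandwidth figures in perspective, here is a rough back-of-the-envelope sketch. It takes the "terabyte a second" and "five times better" numbers from the conversation at face value; the 32 GB working-set size is an assumption chosen only for illustration, and real applications rarely reach peak bandwidth.

```python
def stream_time_seconds(data_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to read a data set once at a given memory bandwidth."""
    return data_bytes / bandwidth_bytes_per_s

TB = 10**12
GB = 10**9

working_set = 32 * GB                # assumed working-set size, for illustration only
on_package_bw = 1 * TB               # "something like a terabyte a second" per node
conventional_bw = on_package_bw / 5  # "about five times better" implies roughly one fifth

fast = stream_time_seconds(working_set, on_package_bw)
slow = stream_time_seconds(working_set, conventional_bw)

print(f"on-package memory:   {fast * 1000:.0f} ms per pass over the data")
print(f"conventional socket: {slow * 1000:.0f} ms per pass over the data")
```

For codes that spend most of their time moving data rather than doing arithmetic, that ratio is roughly the speedup on offer, as long as the working set fits in the smaller on-package capacity.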

Ernest de Leon:
So, looking to the future, what are TACC’s plans in terms of new supercomputer designs?

Dan Stanzione:
You might imagine, we’re always designing the next machine. Right now, we’re planning what the National Science Foundation will call the Leadership-Class Computing Facility. That will ultimately replace Frontera and some of our other infrastructure in the 2024, 2025 timeframe, Congress permitting. That’s a whole other story. So, we’re in design for that machine now. We’re taking apart the application space. It’s really more than applications, right? You have to understand the field of science, but you also have to understand the method they’re using, right?

Dan Stanzione:
So, when you change algorithms, that can be a big deal for some of these. We’re mapping that to the chip space across GPUs and CPUs and now TPUs and all these coarse-grained arrays that people are building for AI chips and field-programmable gate arrays. But also, how can we get enough memory bandwidth to it? How can we get enough network bandwidth to it, right? Do we want to have a single type of chip on a node? Do we want to have heterogeneous nodes with a mixture of accelerators and conventional processors, right? Trying to find that right mix for a few years out is what we’re spending a lot of our time on now.

Ernest de Leon:
I know we have a ton of cool stuff coming down the pipe right now in the world of silicon. While Intel held a seemingly insurmountable grasp on the x86 chip market for years, AMD has now surpassed them, as has Apple with their new Apple silicon. Although I did use AMD back in the day when they, for a brief period, had more powerful processors, but then they fell behind. Of course, I used Intel until now, when AMD released the more powerful processors. The irony is I waited so long for these things that I just jumped ship entirely and went with Apple silicon.

Jolie Hales:
You’re pretty diehard Apple, and I’m pretty diehard not.

Ernest de Leon:
Yet we coexist.

Jolie Hales:
Somehow, we managed to live on the same planet, Mars.

Ernest de Leon:
Mars, yup. Yeah, it’s one of the things that I love to talk about in general is just the feedback loop that happens, right? So historically, we’ve had the ideas for innovation long before the technology was produced. That was mainly a function of material science and the ability to engineer these things. Now, especially in this sector, there’s a feedback loop, because the same supercomputing machines that are used to run the simulations for chip design and the simulations for material science that go into building these chips are being powered by the generation that was produced before.

Ernest de Leon:
So, the faster we advance our computing ability and AI with predictive analysis, as well as material science in general, it just creates this vortex where it starts going faster and faster and faster. Part of it may not move as fast as another, but each of these feed into themselves at some point.

Dan Stanzione:
Yeah, I think that’s absolutely true. I’ve studied it a little less. It’s often we’re inventing the mathematics and the algorithms for these things decades before they actually become useful, right? The fundamentals of digital circuits are Boolean algebra, right? George Boole designed Boolean algebra in 1854. I don’t think he had Intel chips in mind. The math leads the application sometimes by 50 or 100 years.
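
As a small illustration of the point about Boolean algebra underpinning digital circuits, here is a minimal sketch of binary addition built only from AND, OR, and NOT. The function names are hypothetical and chosen for this example; real hardware implements the same logic with gates rather than Python functions.

```python
def xor(a: bool, b: bool) -> bool:
    """Exclusive OR written using only AND, OR, and NOT."""
    return (a or b) and not (a and b)

def half_adder(a: bool, b: bool):
    """Add two bits; returns (sum_bit, carry_bit)."""
    return xor(a, b), a and b

def full_adder(a: bool, b: bool, carry_in: bool):
    """Add two bits plus an incoming carry; returns (sum_bit, carry_out)."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 or c2

def add_bits(x: int, y: int, width: int = 8) -> int:
    """Ripple-carry addition of two non-negative integers, one bit at a time."""
    carry = False
    result = 0
    for i in range(width):
        a = bool((x >> i) & 1)
        b = bool((y >> i) & 1)
        s, carry = full_adder(a, b, carry)
        result |= int(s) << i
    return result  # any carry out of the top bit is dropped, as in fixed-width hardware

print(add_bits(5, 3))
```

Chaining full adders this way is the classic ripple-carry design; wider or faster adders rearrange the same Boolean building blocks.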

Jolie Hales:
You need stronger computers to create stronger computers. That’s really interesting.

Ernest de Leon:
Exactly. That’s an excellent way to boil it down, I think. So, one of the things we’d love to discuss here on the Big Compute Podcast is cloud computing. More specifically, the intersection of traditional on-premise supercomputing and cloud HPC.

Dan Stanzione:
In many ways, I think our overarching stance on that is just simply that we are the cloud, but we are a very special kind of cloud for academic scientific research, right? So, in some ways, we’re a bit of a specialized cloud, but we’re tuned both in our hardware and in our support and in our software stack around scientific simulation and AI and analytics. Which means for services that aren’t the thing that we’re good at, we tend to rely on commercial clouds, right? But there have been impacts of just the ubiquitous adoption of the commercial cloud that I think have been good for us and technologies that we’ve transferred back and in places where we partner.

Dan Stanzione:
So, in the Frontera project, we have bridges to the major commercial cloud providers. We see our users using both and doing crazy hybrid things in a good way. So, one of them is climate simulation, right? We do a bunch of massive climate forecasts at very high resolution on Frontera that take millions of processor hours, and then they dump out a vast amount of data. They do the simulation piece on Frontera. And then they push the analytics piece to Azure, and we push the data back and forth and publish it.

Dan Stanzione:
I think cloud has changed the expectations of users and the usage model. Part of it is that we see less tolerance to wait in batch queues and more demand for interactive things, but the other part is this shift to almost ubiquitous but persistent web services. So, we wrap RESTful APIs on top of our supercomputers now, right? We still have tons of traditional users who come in, log into a Unix command line, run batch jobs, and work in that environment. But we have more and more where the supercomputer is just an automated resource, living behind an API for data processing for the Large Hadron Collider or for some robotic phenotyping work we do.
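
To make the idea of a supercomputer "living behind an API" a little more concrete, here is a minimal sketch of the pattern. Everything in it is hypothetical: the class name, the field names, and the route shapes mentioned in the comments are invented for illustration and do not describe TACC's actual services. It is an in-memory stand-in, with no network or scheduler involved.

```python
import itertools

class BatchJobService:
    """Hypothetical in-memory stand-in for a job-submission API in front of a batch scheduler."""

    REQUIRED_FIELDS = {"app", "nodes", "input_uri"}  # assumed request schema

    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}

    def submit(self, request: dict) -> dict:
        """Roughly what handling a `POST /jobs` request might do: validate, enqueue, return an ID."""
        missing = self.REQUIRED_FIELDS - set(request)
        if missing:
            return {"status": 400, "error": f"missing fields: {sorted(missing)}"}
        job_id = next(self._ids)
        self._jobs[job_id] = {**request, "state": "QUEUED"}
        return {"status": 201, "job_id": job_id}

    def status(self, job_id: int) -> dict:
        """Roughly what handling a `GET /jobs/<id>` request might do."""
        job = self._jobs.get(job_id)
        if job is None:
            return {"status": 404, "error": "no such job"}
        return {"status": 200, "job_id": job_id, "state": job["state"]}

service = BatchJobService()
created = service.submit({"app": "climate-sim", "nodes": 64, "input_uri": "s3://example/input"})
print(created)
print(service.status(created["job_id"]))
```

The caller never logs in or writes a batch script; an automated pipeline sends a request and polls for the result, which is the usage model being described.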

Dan Stanzione:
There’s just so many different things where it’s becoming HPC as a service. That has grown out of innovations that have happened in the cloud space. At this point, I think we’re one of what will be a set of boutique niche specialty clouds that will exist within the context of the larger clouds but will be linked together by services. A lot of our users will use them synergistically, I would hope. It’s not an either/or. I think whenever it becomes an either/or, it’s a silly conversation, right? It’s both.

Ernest de Leon:
Yeah, and I love to hear that, by the way. That TACC has taken the approach of bringing the two worlds together is really going to make a huge difference in terms of not just the usability and the extensibility of the services you offer, but the ability for others to interact and engage with the data coming out of there in a larger context, a global context.

Dan Stanzione:
Yeah, there will be specialties of the commercial cloud that we want to use, right? I mean, image tagging, right? Things like that, where there’s already a service, language processing. There’s some of these where we can just use a cloud service as part of a larger scientific workflow or use the cloud as the virtual desktop interface to let people get to the supercomputing resources. There’s so many places where I think collaboration is not only possible, it’s just the right thing to do.

Ernest de Leon:
I love Dan’s vision here that the TACC is the cloud, just a special kind of cloud. That interoperability with commercial clouds is just the right thing to do.

Jolie Hales:
I’m curious about the process that researchers go through in order to be able to access TACC supercomputing. So, if I’m a scientist or researcher, I need access to supercomputing. What do I do to get it? Is it completely complimentary?

Dan Stanzione:
There’s a few ways to do it. So, for our big National Science Foundation supported machines, there is a process in conjunction with the NSF. There’s another project called XSEDE that does allocations across the NSF-supported supercomputing centers. So, if you’re a scientist, you write a proposal for time in addition to your proposal for funding. You say, “All right, I have this project, and I need X computing resources to do it.” An independent review panel gets together and looks at the suitability and recommends different machines for them to go on.

Dan Stanzione:
So, essentially, the NSF pays us to build the machine and operate it. And then we make an amount of time available on that machine every quarter. And then this neutral third-party project comes through and hands out those chunks of time to the users. In that model, we’re not charging the users directly, right? All the time is paid for by the NSF in support of what they’re doing.

Jolie Hales:
I see.

Dan Stanzione:
So, it is free to them. It is not free, as I like to remind them. It costs many millions of dollars for the time.

Jolie Hales:
Yes.

Dan Stanzione:
Also, we have some Texas-funded machines that have a similar process, but open to Texas researchers. And then yeah, the third fallback mechanism is yes, some people just pay us for time, right? When they can’t get time through another process, corporate partners and then some academic researchers or labs who want dedicated time will just come in and buy a chunk of time to make sure they have it. So, they don’t have to go through that proposal process every time they need more time.

Jolie Hales:
So, we’re all about undercover superheroes. We consider you to be one of those, Mr. Dan Stanzione, because you’re helping all of these projects move forward. These projects are changing the world. So, what are the most memorable or interesting projects that TACC has been a part of?

Dan Stanzione:
There are so many and very few of the things we do change the world directly. We’re letting other people do that, right? This being 2020, you can’t talk about anything without talking about COVID, right? The fact that we’ve been able to help world class people who do incredible science really fight back against this pandemic and do things that probably less than a year from when we run the computation will turn into therapies or policies or vaccines that will have an immediate impact.

Dan Stanzione:
I call that segment of what we do, urgent computing, but the more I thought about it, the more I realized that a huge amount of what we do is urgent computing, right? We just started a new computational oncology partnership with MD Anderson in various kinds of cancer research, doing personalized dosage levels for particle and proton therapy, the most advanced radiation instead of just eyeballing it.

Jolie Hales:
Wow.

Dan Stanzione:
To people who have cancer, that’s no less urgent than the work we do for COVID. Now, this year, we’ve had 12 major Atlantic hurricanes. We’ve run simulations for all of those. We do the storm surge models that lead to the evacuation orders. We run a ton of that stuff. That within days of doing it becomes part of people’s lives. We see the longer term stuff too.

Dan Stanzione:
When a hurricane comes through, we have teams that we work with from universities around the country to go out in the field and start taking pictures right afterwards, particularly buildings and all the structures that are damaged. And then we bring all that data in and do analysis, one, to understand and make building codes better. There’s a ton of what’s called vorticity, these little vortexes that form right at where a roof corner is. They can lift off the corner because it’s not tacked down right and water can get into the house and everything else bad happens, right?

Jolie Hales:
Wow.

Dan Stanzione:
So, we’re starting to look at potential AI methods where you could go and say, “There’s been an earthquake in a city. You have 50,000 buildings you need to go inspect,” right? Which are the priorities, right? That’s a great AI problem. Because if we can just help the first responders to prioritize, these are the buildings you really need to go look at before you can let people get back in, then that’s a huge help. When a hurricane hits Galveston, we have enough data that we can do a model of the storm surge. We can overlay a GIS model of all the houses in Galveston and all the buildings. We know what the heights of their foundations are. We know that the electrical outlets are 18 inches above the foundation height.

Dan Stanzione:
We can model everyone that would have been inundated to the level of electrical outlets, which means FEMA has to go in and inspect before they can let people go back into those homes or buildings. There’s new battery technologies that ultimately will really change the world. There’s first observations of gravitational waves that were predicted in 1915 and actually first observed in 2015.
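
The inundation check described for Galveston can be sketched in a few lines. This is only an illustration of the comparison involved: the building records and surge elevations are made up, the field names are hypothetical, and a real workflow would pull both from the storm surge model and the GIS layer.

```python
OUTLET_HEIGHT_IN = 18.0  # outlets sit 18 inches above the foundation, per the discussion

def needs_inspection(surge_elevation_ft: float, foundation_elevation_ft: float) -> bool:
    """True if the modeled water level reaches the building's electrical outlets."""
    outlet_elevation_ft = foundation_elevation_ft + OUTLET_HEIGHT_IN / 12.0
    return surge_elevation_ft >= outlet_elevation_ft

# Hypothetical records: foundation elevation and modeled peak surge at each building, in feet.
buildings = [
    {"id": "A", "foundation_ft": 6.0, "surge_ft": 8.2},
    {"id": "B", "foundation_ft": 9.0, "surge_ft": 9.5},
    {"id": "C", "foundation_ft": 4.5, "surge_ft": 6.0},
]

flagged = [b["id"] for b in buildings
           if needs_inspection(b["surge_ft"], b["foundation_ft"])]
print(flagged)
```

The per-building test is trivial; the supercomputing part is producing a trustworthy surge elevation at every building in the first place.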

Dan Stanzione:
The Higgs boson discovery had an enormous amount of computation, right? The subatomic structure of the universe. Food production, how do we do hybrids of corn to increase yield per acre, right? There’s such an incredible variety of things we get to play a part in that it’s just wonderful. So, my favorite one usually is the one I’ve worked on in the last few weeks.

Jolie Hales:
For the layperson, my mom’s a pianist and a semi-pro pickleball player, right? She doesn’t do supercomputing. She doesn’t even know what a supercomputer is. What would you tell somebody like my mom who has no idea what’s going on behind the scenes? If you were to describe what supercomputing is doing for her, what would you say to somebody like that?

Dan Stanzione:
When you think about what supercomputers can do, if you look at your cell phone, the amount of computation that went into the design of the case, so that when you drop it, it bounces and doesn’t crack and rattle the motherboard; of the chips that are inside it and how they work; of the materials that make up the battery, so that we can now take this tiny little device that fits in your pocket and have it run for days while wirelessly communicating. There’s computational design in all these things, right? When you have a problem that’s too big to solve on any other computer, that becomes a supercomputing problem.

Dan Stanzione:
When you have a problem where you just have too many of them, we need to look at every possible design for how we build the drag on a wing for an airplane to make it more fuel efficient. You can’t build a million 747 prototypes. You do most of that computationally. And then finally, when you have deadlines, there’s a hurricane, it’s two days from reaching the coast. I can’t spend six days figuring out where it’s going computationally, right? So, we can predict how things move with Newton’s laws of motion.

Dan Stanzione:
If I can build a mathematical model of a physical process in the universe, if I want to ask a question of that model, I need to run a simulation, right? That’s what a simulation is. That’s what we do in computing: ask questions of these models that we can build. Or we’re looking at the sky and I need to process data, or read all the data off a genome sequencer, right? That’s analytics, and we’re doing that computationally.
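
As a toy version of "asking a question of a model," here is a minimal sketch that uses Newton's laws to answer how long a dropped object takes to hit the ground. It assumes constant gravity and no air resistance, and the step size and drop height are arbitrary choices for illustration; production simulations use far richer models and more careful numerical methods.

```python
import math

def simulate_fall(height_m: float, dt: float = 0.001, g: float = 9.81) -> float:
    """Step a dropped object forward in time and return the seconds until it reaches the ground."""
    y, v, t = height_m, 0.0, 0.0
    while y > 0.0:
        v -= g * dt  # Newton's second law: gravity changes the velocity
        y += v * dt  # the velocity changes the position
        t += dt
    return t

simulated = simulate_fall(20.0)
analytic = math.sqrt(2 * 20.0 / 9.81)  # closed-form answer for this simple case
print(f"simulated: {simulated:.3f} s, analytic: {analytic:.3f} s")
```

This case has a closed-form answer to check against. The problems that need a supercomputer are the ones where no such formula exists and time stepping across billions of interacting pieces is the only way to get an answer.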

Dan Stanzione:
And then finally, now, where we have these huge corpuses of data like everything everyone has ever tweeted or every cell phone position of every person in the country or any of these other vast datasets, the genomic data for everyone who’s ever had COVID, right? We want to crunch through all that data and understand it. We can’t, for things like genomics, build a mathematical model. We can build a statistical model of those. That’s basically what we call artificial intelligence at this point: asking questions of these sets of data. Supercomputers do all those things too.

Ernest de Leon:
I heard JPL was working with TACC on Mars rover tech.

Dan Stanzione:
We have some work doing data crunching with Jet Propulsion Lab, where we’re looking at orbital insertion for future Mars missions and how you compute that. It’s not classified, but it’s considered sensitive code and data that we deal with. It’s export controlled and restricted, because in this case, we’re looking at Mars. But there’s only so many ways to insert a big piece of metal into the atmosphere from orbit. You want to restrict how well distributed that information is. So, we have to protect a lot of the data around that stuff, but I know we’re doing some, “What are good orbits and good landing vectors for getting into the atmosphere?” with JPL.

Ernest de Leon:
What are you most looking forward to as you stare down the future path of technology advances, specifically in supercomputing?

Dan Stanzione:
For us, it’s not just the technology. It’s the science we’re going to do with it. I mean, with the merger of AI into the scientific workflow and the potential for that, I think, we’re going to do problems that we haven’t even thought of yet at scales that we can’t dream of. But for me, the next big step is building on this LCCF machine that comes after Frontera, which will involve us building a bigger data center and everything that goes with it and a bigger training program and picking that next set of technologies that are going to work. So, that’s going to be an awfully daunting process, but an exciting one as well.

Ernest de Leon:
Well said, Dan. What an excellent way to put a cap on the meat of this episode.

Jolie Hales:
Hear, hear.

Ernest de Leon:
Where can our listeners find you on the internet?

Dan Stanzione:
We’re on Twitter and Facebook and all the usual places you would expect to find us. If you go to the website and you want to opt into one of our mailing lists, you can get our annual magazine, Texascale. It’s like Exascale, but everything’s a little bigger in Texas.

Jolie Hales:
That’s cute. I like that.

Dan Stanzione:
Some of our coverage-

Jolie Hales:
Texascale.

Dan Stanzione:
… of the science stories and the new machines that are coming down the pipe.

Ernest de Leon:
We will be sure to link some additional information in the show notes for this podcast.

Jolie Hales:
Thank you so much, Dan. Thank you for being an undercover superhero.

Dan Stanzione:
Thanks very much, Jolie and Ernest. Appreciate the time and the coverage of this.

Ernest de Leon:
Thanks again for listening to the latest episode of the Big Compute Podcast. Believe me when I tell you that I love recording these for our listeners, and I know Jolie does too.

Jolie Hales:
This seriously is such a treat for me. I love learning about all of these interesting subjects and all of this technology that I hadn’t been exposed to. I love sharing it with all of you.

Ernest de Leon:
If anyone out there wants to help us get the word out about the Big Compute Podcast, you can leave us a five-star review and follow us on Apple Podcasts or your favorite-

Jolie Hales:
Or Google Podcast.

Ernest de Leon:
I was about to say it.

Jolie Hales:
Just making sure.

Ernest de Leon:
Or your podcatcher of choice.

Jolie Hales:
Yes. If you have any ideas of what we should talk about on our next podcast episodes, feel free to send those in at bigcompute.org, where you can also find a lot of great information, anything to get you down the rabbit hole of the awesomeness of supercomputing.

Ernest de Leon:
Until next time.

Jolie Hales:
Adios!

Jolie Hales

Jolie Hales is an award-winning filmmaker and host of the Big Compute Podcast. She is a former Disney Ambassador and on-camera spokesperson for the Walt Disney Company, and can often be found performing as an actor, singer, or emcee on stage or in front of her toddler. She currently works as Head of Communications at Rescale.

Ernest de Leon

Ernest de Leon is a futurist and technologist who loves to be at the intersection of technology and the human condition. A long time cybersecurity leader, Ernest also has deep interests in artificial intelligence and theoretical physics. He spends his free time in remote places only accessible by a Jeep. He currently works as Director of Security and Compliance at Rescale, and is a host on the Big Compute Podcast.

Taylore Ratsep

Demand Generation Manager, Rescale

Dr. Dan Stanzione
