A New Chapter: Pursuing HPC research with the DOE CSGF program 🎉

I am greatly honored to announce that I have been selected as a U.S. Department of Energy Computational Science Graduate Fellowship (DOE CSGF) fellow. This opportunity marks a new chapter in my graduate journey to earn my Ph.D. in Computer Science.

As one of thirty new fellows in the fellowship’s 35th class, I am joining a community of more than 700 fellows and alumni who are pioneers in exascale computing research, more than 500 of whom have become leaders in academia, national laboratories, and industry.

About the DOE CSGF program

The DOE CSGF program is a prestigious fellowship established in 1991 by the Krell Institute in partnership with the U.S. Department of Energy. Its mission is to train and support fellows in advancing the application of High-Performance Computing (HPC) across a wide range of scientific and engineering disciplines that are among the U.S. Department of Energy’s highest priorities.

Each year, 30 fellows are selected from across the United States. In the 35th class’s official announcement, Dr. Ceren Susut said, “Each of these incredibly talented people has demonstrated both outstanding academic achievement and tremendous research potential. Their research topics cover some of the highest priorities of the Department of Energy, including quantum computing, artificial intelligence, and science and engineering for energy and nuclear security. Over the last 34 years, CSGF has produced a disproportionate share of high-performance-computing leaders in industry, the national laboratories, and academia, and the Department is proud to continue its support for this critical program.”

The DOE CSGF program provides academic and financial support, as well as professional training through a three-month practicum at a national laboratory. You can learn more about the program at https://www.krellinst.org/csgf/.

CSGF Practicum Site Map

Special Thanks

I would like to express my deepest gratitude to the Krell Institute and the U.S. Department of Energy for providing me with this extraordinary opportunity. Your support will enable me to continue pursuing High-Performance Computing research during my doctoral program at the University of California, Riverside (UCR).

I would like to extend my appreciation to Dr. Bin Tang. During the early stages of my research, while I was pursuing my bachelor’s degree at California State University, Dominguez Hills, Dr. Tang provided pivotal guidance in my development as a researcher. Your mentorship instilled in me the confidence to pursue a Ph.D. in Computer Science and a research career in the field. I am profoundly grateful for your support throughout my research journey.

I am also immensely grateful to Dr. Daniel Wong. Dr. Wong provided me with invaluable learning resources and support for understanding the intricacies of GPU architecture and programming. These skills are key to understanding and optimizing algorithms within the domain of High-Performance Computing. I am deeply grateful for your guidance and your passion for GPU and HPC research.

I am also extremely thankful to my advisor and Principal Investigator, Dr. Zizhong Chen. Dr. Chen gave me the opportunity to join his Supercomputing Laboratory and pursue my doctoral degree in Computer Science at the University of California, Riverside. Throughout my doctoral journey so far, your resources and advice have been my greatest assets in navigating the challenges of being a first-generation doctoral student. I am excited for what the future holds in my HPC research journey under your mentorship.

Last but not least, I would like to express my profound gratitude for the love and support provided by my family and friends, who have been there for me since the start of my journey. I greatly appreciate all that you have done for me. You all helped make this possible, and I am eager to continue this journey with you all.

Future Plans

I plan to delve into several challenges that make it difficult to apply High-Performance Computing to scientific and engineering problems. Through my internship work with Google Cloud’s Fault Tolerance Testing team, I developed a keen interest in reliability and resilience, which is one of the challenges I intend to address in my research.

As systems scale and more CPUs and GPUs are introduced to solve a problem, the likelihood of failure increases dramatically. Failures can range from silent data corruption caused by faulty hardware to unexpected hardware faults or complete shutdowns that disable a compute node and lead to lost data and computation. Although these errors are usually not considered as dangerous in a sequential setting, they can turn hundreds or thousands of compute hours on a large cluster into wasted time and electricity if not addressed.
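To make the scaling argument concrete, here is a minimal sketch, assuming every node fails independently and with the same probability over the course of a job. Both the independence assumption and the per-node probability used below are hypothetical and chosen only for illustration; real clusters see correlated failures and have measured failure rates.

```python
def prob_any_failure(per_node_failure_prob: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes nodes fails during a job.

    Assumes failures are independent and identically likely on every node,
    which is a simplification made for illustration only.
    """
    if not 0.0 <= per_node_failure_prob <= 1.0:
        raise ValueError("per_node_failure_prob must be between 0 and 1")
    if num_nodes < 0:
        raise ValueError("num_nodes must be non-negative")
    # P(at least one failure) = 1 - P(no node fails) = 1 - (1 - p)^N
    return 1.0 - (1.0 - per_node_failure_prob) ** num_nodes


if __name__ == "__main__":
    p = 0.001  # hypothetical per-node failure probability for one job
    for n in (1, 10, 100, 1_000, 10_000):
        print(f"{n:>6} nodes: {prob_any_failure(p, n):.4f}")
```

Under these assumed numbers, a failure that is rare on a single node (0.1%) becomes more likely than not once the job spans a thousand nodes (roughly 63%), which is why fault tolerance matters far more at scale than in a sequential setting.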

As fault tolerance is a key problem in porting sequential algorithms into the High-Performance Computing space, I plan to contribute to the reliability and resilience problem as part of my work.