INTERESTS

Heterogeneous Computing (Huawei Ascend NPU, ARM CPU/GPU, x86 CPU, AMD GPU, NVIDIA GPU, Intel KNC/KNL)

Distributed Computing (parallel optimization experience on 50,000+ GPUs)

Deep Learning Performance Optimization (CV, NLP, LLMs)

TOP500 HPL Heterogeneous Optimization (leading position)

High Performance Math Library Optimization (leading position)

Graph Computing: Semi-supervised Learning (Google Expander*), Graph Embedding (node2vec)

EXPERIENCE

Researcher @ Huawei Noah's Ark Lab June 2019 - Present

The work is about deep learning performance optimization and AI+HPC.

Edge-device optimization: the Bolt inference library, https://github.com/huawei-noah/bolt

Cloud-side optimization: performance optimization of the PanGu LLMs.

Intern/Project Cooperation @ Sugon Mar. 2018 - June 2019

The work involved large-scale heterogeneous parallel optimization. I was responsible for optimizing HPL, the linear-system benchmark that defines the TOP500 supercomputer ranking. We proposed Heterogeneous HPL (HHPL) to attack the data-transmission bottlenecks (memory access, PCI-E transfers, inter-node network). HHPL reaches 90.7% efficiency, surpassing NVIDIA's official closed-source binary by 9%, and is cross-platform (AMD GPU, NVIDIA GPU, Hygon DCU, CPU, etc.). The Sugon E-prototype supercomputer (512 nodes, 2 CPUs + 1 GPU per node) achieved 71% HPL efficiency and ranked 9th in the 2018 China TOP100. The Sugon advanced computing supercomputer (Sugon8000, Computer Network Information Center Supercomputer Center Nebula AI) surpassed the TOP500 No. 1 Summit system in June 2019.
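
For context, HPL efficiency is measured performance over theoretical peak. With problem size N and wall time T, the standard HPL operation count gives:

    \text{efficiency} = \frac{R_{\max}}{R_{\text{peak}}},
    \qquad
    R_{\max} = \frac{\tfrac{2}{3}N^3 + 2N^2}{T}

So 90.7% efficiency means the heterogeneous pipeline sustains over nine-tenths of the machine's peak double-precision throughput.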

Intern @ TensorStack (Beijing YunGe Technology) Mar. 2017 - Aug. 2017

Mainly responsible for distributed data preprocessing and graph computing. Built a data-preprocessing pipeline for deep learning using Apache Spark and the tf.Transform framework. Used CNNs for sentiment analysis: Chinese word segmentation, word-to-vector embedding, and a convolutional neural network. Applied the label propagation algorithm to semi-supervised learning problems, implementing the Google Expander* algorithm (see the sketch below). Also worked on encoding graphs as vectors, studying graph embedding methods (DeepWalk, SDNE, node2vec).
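
As a rough illustration of the label propagation idea (a minimal sketch in Python, not the Expander implementation; all names are illustrative):

    import numpy as np

    def label_propagation(W, Y, labeled, alpha=0.99, iters=100):
        # W: (n, n) symmetric affinity matrix of the graph
        # Y: (n, c) one-hot label matrix, all-zero rows for unlabeled nodes
        # labeled: (n,) boolean mask of nodes with known labels
        d = W.sum(axis=1)
        S = W / np.sqrt(np.outer(d, d))   # symmetric normalization D^-1/2 W D^-1/2
        F = Y.astype(float).copy()
        for _ in range(iters):
            F = alpha * (S @ F) + (1 - alpha) * Y   # spread labels along edges
            F[labeled] = Y[labeled]                 # clamp the known labels
        return F.argmax(axis=1)                     # predicted class per node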

Intern @ Torray Networks Oct. 2015 - Aug. 2016

Mainly worked on high-performance computing optimization: accelerating Relion, a cryo-EM image-reconstruction program, on NVIDIA GPUs. Proposed an adaptive adjustment algorithm to cope with limited GPU memory capacity. A single NVIDIA Titan 1080 GPU matched the performance of 56 CPU cores.
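
The adaptive idea, in spirit (a hypothetical Python sketch, not Relion's actual code; all names are illustrative):

    def fit_chunk_size(n_images, bytes_per_image, free_gpu_bytes, headroom=0.9):
        # Pick the largest chunk whose working set fits in available GPU
        # memory, leaving headroom for the runtime and temporary buffers.
        budget = int(free_gpu_bytes * headroom)
        chunk = min(n_images, budget // bytes_per_image)
        if chunk < 1:
            raise MemoryError("a single image exceeds GPU memory; tile within the image")
        return chunk

    def gpu_batches(images, chunk):
        # Stream the dataset through the GPU in memory-sized chunks.
        for i in range(0, len(images), chunk):
            yield images[i:i + chunk]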

EDUCATION

Master, Computer System Architecture @ Institute of Computing Technology 2016 - 2019

High Performance Computing Center, Institute of Computing Technology, University of Chinese Academy of Sciences
GPA: 89.25/100
Major: Deep Learning Performance Optimization/Computer Architecture Optimization
Advisor: Prof. Guangming Tan
Thesis: Design and Implementation of Large Scale HPL Algorithm for Exascale Computing

Bachelor, Computer Science and Technology @ Shandong University 2012 - 2016

Computer Science and Technology Elite Class
GPA: 92/100
Major: Heterogeneous Computing/Distributed Computing
Advisor: Prof. Weiguo Liu (2017 Gordon Bell Prize winner)
Thesis: Accelerating Biological Big Data Computation with OpenCL, a Unified Framework for Heterogeneous High-Performance Computing

SCHOLARSHIPS AND AWARDS

  • National Scholarship (5%) Oct. 2018
  • Schlumberger Master Scholarship (5%) Oct. 2017
  • University of Chinese Academy of Sciences Scholarship (2%) Oct. 2016
  • Shandong University First Prize Scholarship (2%) Oct. 2013
  • National Scholarship (5%) Oct. 2013
  • University of Chinese Academy of Sciences Excellent Student (5%) Oct. 2016
  • Shandong University Excellent Student (5%) Oct. 2013
  • PAC National Parallel Application Contest (AI), Third Prize Oct. 2017
  • PAC National Parallel Application Contest (optimization), Third Prize Oct. 2016
  • PAC National Parallel Application Contest (application), Third Prize Oct. 2015
  • ChinaVis Data Challenge, Excellent Prize July 2015
  • ASC Student Supercomputer Challenge, First Prize May 2015
  • China Undergraduate Mathematical Contest in Modeling, First Prize Nov. 2014
  • Certification Cup Mathematical Modeling Preliminary Competition, Second Prize May 2014
  • National College Mathematical Modeling Competition (MCM/ICM), Third Prize Feb. 2014
  • National College Mathematics Competition, Third Prize Oct. 2013

RESEARCH PROJECTS

Chinese Academy of Sciences Advanced Computing Project (Sugon8000, Computer Network Information Center Supercomputer Center Nebula AI) June 2018 - June 2019

TOP500 HPL heterogeneous optimization lead
The advanced computing supercomputer has a theoretical peak of 300 PFLOPS across 10,000+ nodes (1 Hygon CPU + 4 AMD Vega20 GPUs per node), the largest heterogeneous supercomputer at the time. I was responsible for the large-scale heterogeneous parallel optimization of HPL, the TOP500 linear-system benchmark, solving the scalability problems of a parallel application running on 50,000 GPUs. We proposed a platform-oriented parallel algorithm to address the scalability issue and distilled the experience into a performance-optimization toolkit for other supercomputer applications. The Sugon8000 achieved 64% HPL efficiency.
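
For scale, the standard definitions of parallel efficiency make the difficulty concrete:

    E_{\text{strong}}(p) = \frac{T(1)}{p\,T(p)},
    \qquad
    E_{\text{weak}}(p) = \frac{T(1)}{T(p)} \quad \text{(problem size grown with } p\text{)}

By Amdahl's law, at p = 50,000 even a 0.01% serial fraction caps parallel efficiency near 17%, which is why the algorithm centers on removing serialization and hiding communication.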

    National "13th Five-Year" High Performance Computing Project - Sugon Exascale prototype supercomputer(Sugon7000) June 2016 - June 2018

TOP500 HPL heterogeneous optimization lead
The Sugon E-prototype supercomputer has a theoretical peak of 3 PFLOPS across 512 nodes (2 Hygon CPUs + 1 AMD Vega20 GPU per node). I was responsible for the large-scale heterogeneous parallel optimization of HPL, the TOP500 linear-system benchmark. We proposed Heterogeneous HPL (HHPL) to attack the data-transmission bottlenecks (memory access, PCI-E transfers, inter-node network); a sketch of the overlap idea follows below. HHPL reaches 90.7% efficiency, surpassing NVIDIA's official closed-source binary by 9%, and is cross-platform (AMD GPU, NVIDIA GPU, Hygon DCU, CPU, etc.). The E-prototype achieved 71% HPL efficiency.
Building on the E-prototype optimization experience, we built a model to analyze HPL performance on exascale machines and derived system-design advice for exascale supercomputers.
The HPL optimization experience also yields parallelization guidance for other supercomputer applications.
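
The kind of overlap HHPL depends on, in miniature (an illustrative double-buffering sketch in Python, not the HHPL source; transfer and compute stand in for PCI-E copies and GPU kernels):

    from concurrent.futures import ThreadPoolExecutor

    def process_panels(panels, transfer, compute):
        # While the GPU computes on panel k, the PCI-E transfer of panel
        # k+1 runs in the background, hiding transfer time behind compute.
        with ThreadPoolExecutor(max_workers=1) as io:
            pending = io.submit(transfer, panels[0])
            for k in range(len(panels)):
                on_device = pending.result()      # wait for panel k's copy
                if k + 1 < len(panels):
                    pending = io.submit(transfer, panels[k + 1])  # prefetch k+1
                compute(on_device)                # overlaps with the prefetch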

Deep learning optimization June 2017 - Dec. 2017

Matrix multiplication and convolution optimization
Using knowledge of the accelerator architecture (AMD Vega10 GPU), optimized deep learning operators (matrix multiplication and convolution) at the assembly level, guided by a hardware concurrency analysis model and a register-pipelining strategy. Small convolutions (1x1, 5x5, 7x7) exceed AMD's official deep learning library MIOpen by 5-10%.
Half-precision GEMM (hgemm) achieves 95% of theoretical peak performance.
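
A GEMM of shape (M, N, K) costs 2*M*N*K floating-point operations, so claims like "95% of peak" can be sanity-checked with a timing loop (a minimal numpy sketch; the actual work was hand-written GPU assembly):

    import time
    import numpy as np

    def gemm_gflops(M=4096, N=4096, K=4096, reps=5):
        A = np.random.rand(M, K).astype(np.float32)
        B = np.random.rand(K, N).astype(np.float32)
        A @ B                                  # warm-up run
        t0 = time.perf_counter()
        for _ in range(reps):
            A @ B
        dt = (time.perf_counter() - t0) / reps
        return 2.0 * M * N * K / dt / 1e9      # GFLOPS: 2*M*N*K flops per GEMM

    # efficiency = gemm_gflops() / theoretical peak GFLOPS of the device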

Sentiment Analysis July 2017 - Aug. 2017

Used a convolutional neural network for sentiment analysis of Chinese consumer reviews (word segmentation + word2vec + CNN), mining user attitudes toward payment methods. The model was entered in the PAC National Parallel Application Challenge (AI) and won a third prize.
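
The classifier in outline (a minimal PyTorch-style sketch of a word-level CNN; the layer sizes are illustrative, not the contest model):

    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        # Word-embedding CNN for sentence classification (Kim, 2014 style).
        def __init__(self, vocab_size, embed_dim=128, n_classes=2,
                     kernel_sizes=(3, 4, 5), n_filters=100):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.convs = nn.ModuleList(
                nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes)
            self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

        def forward(self, tokens):                   # tokens: (batch, seq_len) ids
            x = self.embed(tokens).transpose(1, 2)   # (batch, embed_dim, seq_len)
            pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
            return self.fc(torch.cat(pooled, dim=1)) # class logits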

Bioinformatics algorithms heterogeneous optimization July 2014 - July 2016

Cryo-EM image reconstruction: Relion optimization
As a summer visiting student at the Institute for Interdisciplinary Information Sciences, Tsinghua University, under the guidance of Assistant Prof. Wei Xu, I worked on performance optimization of Relion for cryo-EM image reconstruction, including efficient task distribution, MPI communication optimization, load balancing, and fault tolerance.

Smith-Waterman algorithm cross-platform optimization
Implemented cross-platform (Intel KNC, NVIDIA GPU) acceleration of the Smith-Waterman algorithm in OpenCL, the fastest cross-platform implementation of sequence alignment at the time. Compared the OpenCL implementation against the fastest CUDA implementation to draw conclusions about how to do heterogeneous optimization. The work was extended during my master's with low-level optimization for the AMD Vega architecture: 270 GCUPS on a 12.5 TFLOPS AMD Vega10, versus 100 GCUPS on a 4 TFLOPS NVIDIA GeForce GTX 780.
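
For reference, the recurrence all of these ports accelerate, with a linear gap penalty (plain Python; GCUPS means billions of these cell updates per second):

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        # H[i][j] = best local alignment score ending at a[i-1], b[j-1]
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best = 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                H[i][j] = max(0,
                              H[i - 1][j - 1] + s,   # match / mismatch
                              H[i - 1][j] + gap,     # gap in b
                              H[i][j - 1] + gap)     # gap in a
                best = max(best, H[i][j])
        return best

The GPU and KNC versions parallelize along anti-diagonals of H, since all cells on one anti-diagonal are mutually independent.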

Heuristic gene sequence alignment: BLAST parallelization on Intel Xeon Phi
Accelerated biological gene sequence alignment, including the Smith-Waterman and BLAST algorithms, on the Intel KNC coprocessor (many cores + SIMD + cache tuning). The parallel algorithm exhibits good weak scalability on multi-core CPUs.

PUBLICATIONS

Google Scholar