PyTorch Hangzhou Meetup Recap: Exploring the AI Open Source Ecosystem and Cutting-Edge Technology Practices

On May 17, a PyTorch Meetup was held in Hangzhou, drawing nearly 60 developers and industry experts from companies including Huawei, Tencent, Ant Group, and ByteDance. The event focused on the development of the PyTorch ecosystem, AI acceleration technologies, and industry practices; its keynote speeches and technical sessions sparked in-depth discussion among participants and provided a valuable platform for exchange and collaboration.

Session Highlights:

Latest Developments in the PyTorch Community and Ecosystem Outlook

Yikun Jiang, a member of the PyTorch Technical Advisory Council (TAC), shared the latest updates from the PyTorch community. Topics included the general progress of PyTorch, the PyTorch Foundation's expansion into an umbrella foundation, the Ambassador Program, and planning for the PyTorch Conference. He emphasized how PyTorch continues to drive innovation and real-world adoption of AI open source technologies through technical iteration, ecosystem expansion, and global collaboration, and he called on developers to actively engage in community building and help shape the future of the AI open source ecosystem.

Torchair: A torch.compile Backend Optimized for Ascend NPU

Peng Xue, Senior Engineer at Huawei, presented technical practices around graph-mode optimization on Ascend NPUs. He introduced the two Torchair modes, reduce-overhead and max-autotune, and detailed deep optimizations in memory management, dynamic shapes, multi-stream parallelism, and compile-time caching. These improvements aim to enhance model training and inference performance while maintaining ease of use.
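The compile-time caching idea mentioned above can be sketched in a few lines: compiled graphs are keyed by input shape, so repeated calls with an already-seen shape skip recompilation, while a genuinely new dynamic shape triggers one more compile. This is an illustrative stand-in, not Torchair's actual implementation; `compile_graph` and `ShapeKeyedCache` are hypothetical names.

```python
# Hypothetical stand-in for an expensive graph-compilation step.
# In a real backend this would lower the model to an NPU graph.
def compile_graph(shape: tuple) -> str:
    return "graph<" + "x".join(map(str, shape)) + ">"

class ShapeKeyedCache:
    """Cache compiled graphs by input shape so repeated calls with
    the same shape skip recompilation (the compile-time caching idea)."""
    def __init__(self):
        self._cache = {}
        self.compile_count = 0

    def get(self, shape):
        if shape not in self._cache:
            self.compile_count += 1
            self._cache[shape] = compile_graph(shape)
        return self._cache[shape]

cache = ShapeKeyedCache()
g1 = cache.get((8, 128))   # first call: compiles
g2 = cache.get((8, 128))   # cache hit: no recompilation
g3 = cache.get((16, 128))  # new shape: compiles again
```

In stock PyTorch the analogous switches are `torch.compile(model, mode="reduce-overhead")` and `mode="max-autotune"`; Torchair plugs in as a `torch.compile` backend for Ascend.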

PyTorch Ecosystem on Ascend

Yuanhao Ji, Software Engineer at Huawei, discussed support for PyTorch ecosystem projects on Ascend NPUs. Focusing on model training, fine-tuning, and inference, he introduced TorchTitan, TorchTune, and vLLM as case studies. He explained their core features and adaptation strategies for Ascend, offering practical guidance for deploying PyTorch projects on this hardware.

Production Prefill/Decode Disaggregation Based on vLLM at Tencent

Chao Zhang, Senior Engineer at Tencent, presented the practice of Prefill/Decode (PD) separation in large model inference. This technique decouples the compute-intensive prefill stage from the memory-intensive decode stage, significantly improving system throughput and resource utilization. His talk covered key technical implementations such as KV cache transmission optimization, intelligent load balancing, and multi-turn dialogue caching. Real-world deployments on both homogeneous GPUs and heterogeneous setups like Ascend A2 + H20 showed performance improvements of 20%–50%. Tencent has further optimized the vLLM framework for both CPUs and GPUs, using pipeline decomposition, low-precision KV caches, and graph compilers to enhance adaptability and performance across hardware platforms.
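The PD-separation flow described above can be sketched minimally: a prefill worker processes the whole prompt once and produces a KV cache, which is then handed to a decode worker that reuses it token by token. This is a conceptual sketch, not vLLM's API; `KVCache`, `prefill`, and `decode` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Stand-in for the per-request key/value tensors a prefill step produces.
    entries: list = field(default_factory=list)

def prefill(prompt_tokens):
    """Compute-bound stage: process the whole prompt once and
    return the KV cache plus the first generated token."""
    cache = KVCache(entries=[f"kv({t})" for t in prompt_tokens])
    first_token = f"tok0<{prompt_tokens[-1]}>"
    return cache, first_token

def decode(cache, first_token, steps):
    """Memory-bound stage: reuse the transferred KV cache and
    append one entry per generated token."""
    out = [first_token]
    for i in range(1, steps):
        cache.entries.append(f"kv(step{i})")
        out.append(f"tok{i}")
    return out

# Prefill on one pool of devices, then hand the cache to a decode pool.
cache, tok = prefill(["the", "quick", "fox"])
tokens = decode(cache, tok, steps=4)
```

The point of the split is that the two stages can be scheduled on different hardware pools sized to their bottlenecks, with the KV cache transfer as the only handoff, which is why its transmission is worth optimizing.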

Key Reinforcement Learning (RL) Acceleration Techniques and Training Practices

Chenyi Pan, Senior Engineer at Huawei, shared Ascend’s breakthroughs in reinforcement learning and ecosystem development. To address the challenge of low resource utilization in RL systems, he introduced a training-inference co-card solution that allows efficient switching between the two tasks. This approach not only saves 50% in compute resources but also doubles single-card throughput and improves inference memory availability by 80%. To enrich the technical ecosystem, Ascend also launched TransferDock, a streaming data engine whose dynamic load-balancing strategies improve task efficiency by over 10% compared with traditional caching mechanisms.
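The dynamic load-balancing idea behind a streaming data engine like TransferDock can be illustrated with a least-loaded dispatch policy: each incoming item goes to whichever consumer currently has the smallest outstanding load, rather than being pushed through a fixed cache. This is a hedged, simplified sketch; `LeastLoadedDispatcher` is a hypothetical name and not TransferDock's implementation.

```python
import heapq

class LeastLoadedDispatcher:
    """Route each incoming work item to the consumer with the
    smallest outstanding load (a simple dynamic-balancing policy)."""
    def __init__(self, workers):
        # Min-heap of (current_load, worker_id).
        self._heap = [(0, w) for w in workers]
        heapq.heapify(self._heap)
        self.assignments = {w: [] for w in workers}

    def dispatch(self, item, cost=1):
        load, worker = heapq.heappop(self._heap)
        self.assignments[worker].append(item)
        heapq.heappush(self._heap, (load + cost, worker))
        return worker

d = LeastLoadedDispatcher(["w0", "w1"])
# Uneven costs: balancing steers later cheap items to the idler worker.
d.dispatch("a", cost=3)
d.dispatch("b", cost=1)
d.dispatch("c", cost=1)
```

With variable-cost RL samples (e.g. rollouts of very different lengths), this kind of policy keeps consumers evenly busy, which is where the reported efficiency gain over static caching comes from.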

On the framework side, MindSpeed-RL combines the MindSpeed training backend with the vLLM inference engine, supporting dynamic weight partitioning and time-sharing of cluster resources while maintaining compatibility with mainstream open source ecosystems. Benchmarks using the Qwen2.5-32B model showed that this setup outperformed the SimpleRL-Zoo baseline on evaluations such as MATH500, demonstrating its technical leadership.

Ray’s Practice and Exploration in Ant Group’s AI Infra Ecosystem

Senlin Zhu, Senior Technical Expert at Ant Group and Head of Ant Ray, shared the practice and exploration of Ray within Ant’s AI Infra ecosystem. He outlined Ray’s architectural design and programming paradigm. Over time, Ray has evolved into critical infrastructure for AI systems, supporting training, inference, hyperparameter tuning, and reinforcement learning.
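Ray's programming paradigm centers on turning ordinary functions into remote tasks that return futures, which are later resolved with a blocking get. The stand-in below imitates that shape with a local thread pool; it is a conceptual sketch only, since real Ray uses `@ray.remote`, `f.remote(...)`, and `ray.get(...)` and schedules across a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Tiny local stand-in for Ray's task model: `remote` makes a function
# submit work asynchronously and return a future; `get` blocks on it.
_pool = ThreadPoolExecutor(max_workers=4)

def remote(fn):
    def submit(*args, **kwargs):
        return _pool.submit(fn, *args, **kwargs)
    return submit

def get(future):
    return future.result()

@remote
def square(x):
    return x * x

# Launch tasks in parallel, then gather results.
futures = [square(i) for i in range(4)]
results = [get(f) for f in futures]
```

The same task-and-future shape is what lets one framework serve training, inference, tuning, and RL workloads alike: all of them reduce to scheduling many such tasks across a pool of resources.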

Since 2017, Ant Group has continuously invested in Ray, which now supports applications at the scale of 2 million cores. Ant has also contributed key features to the community, such as multi-tenancy support and the Flow Insight visual debugging tool. Flow Insight, in particular, has alleviated “black box” issues in complex AI systems and significantly improved observability and deployment efficiency at scale.

Challenges and Standardization in PyTorch Ecosystem Accelerator Development

Zesheng Zong, a community developer from Huawei, provided a systematic overview of the challenges, solutions, and case studies in developing accelerators for the PyTorch ecosystem. Developers integrating out-of-tree hardware face version compatibility issues and a lack of standardized quality benchmarks, making it hard to quantify new device support. In early 2025, a new exploration group was formed in the PyTorch community to tackle these challenges.

Key improvements include: establishing a standardized testing framework that uses the public pytorch-fdn/oota repository for daily plugin testing; developing the OpenReg module to simulate backend behavior and validate it with test cases; optimizing the PrivateUse1 plugin mechanism to reduce integration complexity; supporting automatic plugin loading to simplify device access; and improving the device-agnostic torch.accelerator API for broader compatibility.

Intel’s community developer Chuanqi Wang followed up with a case study on integrating and running CI infrastructure using Intel Gaudi. He described how to leverage CI from code compilation and unit testing to TorchBench automated benchmarking, ensuring quality for new backend integrations. He also noted plans to reduce testing time, clarify required test items, and define quality standards to improve ecosystem compatibility and development efficiency.

This PyTorch Meetup served as a technical bridge for in-depth developer exchanges and demonstrated the vibrant energy of the PyTorch ecosystem in AI’s cutting-edge domains. Through diverse perspectives, the attendees sketched a picture of how open source collaboration drives technological progress. We look forward to more developers joining this open and thriving wave of innovation, where each exchange can spark new ideas in the age of intelligence.