Accelerating 2D Dynamic Block Quantized Float8 GEMMs in Triton
By Less Wright (Meta) and Adnan Hoque (IBM) | December 6, 2024
2D block quantization for Float8 (FP8) holds the promise of improving the accuracy of Float8…

HadaCore: Tensor Core Accelerated Hadamard Transform Kernel
By IBM and Meta (IBM: Krish Agarwal, Rishi Astra, Adnan Hoque, Mudhakar Srivatsa, Raghu Ganti; Meta: Less Wright, Sijia Chen…) | December 2, 2024

Supercharging Training using float8 and FSDP2
By IBM and Meta (IBM: Tuan Hoang Trong, Alexei Karve, Yan Koyfman, Linsong Chu, Divya Kumari, Shweta Salaria, Robert…) | November 25, 2024

Distilling Llama3.1 8B into 1B in torchtune
By Linda Wang, Evan Smothers, and Kartikay Khandelwal | November 18, 2024
In this blog, we present a case study on distilling a Llama 3.1 8B model…

Deep Dive on CUTLASS Ping-Pong GEMM Kernel
By Less Wright and Adnan Hoque | November 1, 2024
In this post, we provide…

Deploying LLMs with TorchServe + vLLM
By Matthias Reso, Ankith Gunapal, Simon Mo, Li Ning, and Hamid Shojanazeri | October 31, 2024
The vLLM engine is currently one of the top-performing ways to execute large language models…

Triton Kernel Compilation Stages
By Sara Kokkila-Schumacher*, Brian Vaughan*, Raghu Ganti*, and Less Wright+ (*IBM Research, +Meta) | October 30, 2024
The Triton open-source programming language and compiler offers a high-level, Python-based approach to create efficient…

Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch and KleidiAI
By Gian Marco Iodice (Arm) and Digant Desai (Meta) | October 28, 2024
At the recent PyTorch Conference, Arm highlighted the widespread impact of its technology, spanning from…

Getting started with PyTorch, ExecuTorch, and Ethos-U85 in three easy steps
By Robert Elliott, Fredrik Knutsson, and Mark Quartermain | October 28, 2024
In the rapidly evolving landscape of machine learning, PyTorch has emerged…

Intel GPU Support Now Available in PyTorch 2.5
By the PyTorch Team at Intel | October 25, 2024
Support for Intel GPUs is now available in PyTorch® 2.5, providing improved functionality and performance…

ExecuTorch Beta: On-Device AI and LLMs, Stability, and Acceleration with Partners
By the PyTorch Foundation | October 24, 2024
ExecuTorch has achieved Beta status with the release of v0.4, providing stable APIs and runtime,…

TorchRec and FBGEMM 1.0 Stable Release
By Paul Zhang, Zain Huda, Sarunya Pumma, Shintaro Iwasaki, Supadchaya Puangpontip, and Benson Ma | October 23, 2024
We are happy to announce the stable release, 1.0, for TorchRec and FBGEMM. TorchRec is the PyTorch native…

PyTorch 2.5 Release Blog
By the PyTorch Foundation | October 17, 2024
We are excited to announce the release of PyTorch® 2.5 (release note)! This release features…

The Path to Achieve PyTorch Performance Boost on Windows CPU
By Intel Corporation | October 15, 2024
The challenge of PyTorch’s lower CPU performance on Windows compared to Linux has been a…

PyTorch Foundation Technical Advisory Council Elects New Leadership
By the PyTorch Foundation | October 8, 2024
We are pleased to announce the first-ever Chair and Vice Chair of the PyTorch Foundation’s…

Challenges and Efforts in PyTorch Multi-Device Integration: Compatibility, Portability, and Integration Efficiencies
By Zesheng Zong (Huawei) and Jiawei Li (Huawei), with co-authors Jiong Gong (Intel), Bartosz Sochacki (Intel), and Eikan Wang (Intel) | September 18, 2024
As the demand for diverse hardware accelerators grows, the need for a robust and…

CUDA-Free Inference for LLMs
By Adnan Hoque, Less Wright, Raghu Ganti, and Mudhakar Srivatsa | September 4, 2024
In this blog, we discuss the methods we used to achieve FP16 inference with popular…

Accelerate Your AI: PyTorch 2.4 Now Supports Intel GPUs for Faster Workloads
By the PyTorch Team at Intel | August 29, 2024
We have exciting news! PyTorch 2.4 now supports Intel® Data Center GPU Max Series and…

Enabling Fast Gradient Clipping and Ghost Clipping in Opacus
By Enayat Ullah, Huanyu Zhang, Will Bullock, and Ilya Mironov | August 20, 2024
Differentially Private Stochastic Gradient Descent (DP-SGD) is the canonical method for training machine…

FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention
By Team PyTorch: Driss Guessous, Yanbo Liang, Joy Dong, and Horace He | August 7, 2024
In theory, Attention is All You Need. In practice, however, we also need optimized attention…