
Large Language Model Inference

In recent years, large language models (LLMs) such as GPT, BERT, and Llama have dramatically advanced natural language processing (NLP), enabling tasks ranging from text generation to sophisticated semantic understanding. Central to leveraging these models effectively is inference: the process of generating predictions from a trained LLM. Efficient inference is critical to harnessing the power of LLMs in real-world applications, especially given their substantial computational demands.

 

This workshop provides an in-depth introduction to Large Language Model Inference, beginning with fundamental concepts such as the transformer architecture, attention mechanisms, and decoding strategies like top-k and top-p sampling. Participants will explore how inference operates at scale, learn methods to optimize model speed and efficiency, and examine best practices for deploying models in production environments. We will highlight popular frameworks, including Hugging Face Transformers, and inference optimization techniques such as model quantization.
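As a taste of the decoding strategies covered, top-k and top-p (nucleus) filtering can be sketched in a few lines of NumPy. This is an illustrative sketch of the general technique, not the Hugging Face Transformers implementation; the function names `top_k_top_p_filter` and `softmax` are our own.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def top_k_top_p_filter(logits, k=0, p=1.0):
    """Mask logits so that sampling is restricted to the top-k tokens
    and/or the smallest set of tokens whose cumulative probability
    reaches p (nucleus sampling). Masked entries become -inf."""
    logits = np.asarray(logits, dtype=float).copy()
    if k > 0:
        # Keep only the k highest-scoring tokens.
        kth_largest = np.sort(logits)[-k]
        logits[logits < kth_largest] = -np.inf
    if p < 1.0:
        # Sort tokens by descending probability and keep the prefix
        # whose cumulative probability first reaches p.
        order = np.argsort(logits)[::-1]
        cum = np.cumsum(softmax(logits[order]))
        cutoff = np.searchsorted(cum, p) + 1  # number of tokens kept
        logits[order[cutoff:]] = -np.inf
    return logits
```

In practice, libraries apply such a filter to the model's final-layer logits at each decoding step and then sample from the softmax of the surviving entries.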

 

The workshop emphasizes hands-on applications, showcasing practical inference scenarios such as text generation and question-answering. Attendees will engage with live demonstrations using Hugging Face Transformers and interactive coding sessions designed to illustrate inference optimization strategies, including GPU acceleration, quantization, and batching for maximum throughput.
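To give a flavor of the batching topic, the sketch below left-pads variable-length token-id sequences into a single rectangular batch with an attention mask, the way decoder-only models are typically batched for generation. The pad id of 0 and the helper name `pad_batch` are illustrative assumptions; real tokenizers expose their own `pad_token_id`.

```python
import numpy as np

PAD_ID = 0  # assumed pad token id; real tokenizers expose tokenizer.pad_token_id

def pad_batch(sequences, pad_id=PAD_ID):
    """Left-pad variable-length token-id sequences into one batch.

    Decoder-only models are usually left-padded so that generation
    continues from the last real token in every row. Returns
    (input_ids, attention_mask) as int64 arrays of shape
    (batch, max_len), where the mask is 1 on real tokens."""
    max_len = max(len(s) for s in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        input_ids[i, max_len - len(seq):] = seq
        attention_mask[i, max_len - len(seq):] = 1
    return input_ids, attention_mask
```

Batching amortizes per-request overhead and keeps the GPU busy: one forward pass over the padded batch serves several requests at once, which is the main lever behind the throughput gains discussed in the session.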

 

By the end of this workshop, participants will have:

 

- A solid grasp of inference fundamentals and optimization techniques for LLMs.

- Hands-on experience deploying and accelerating inference tasks using popular tools.

- Insights into practical considerations for deploying LLM inference solutions at scale.

 

Prerequisites: Basic knowledge of machine learning, deep learning, PyTorch, and transformers.  

Length: 2 Hours  

Level: Intermediate/Advanced

 


Thursday, May 22, 2025 - 14:00 to 16:00