# Part 1: Scaling LLM Training

In this section, we will delve into the essential techniques for scaling the training of large language models (LLMs) using the capabilities of the Azure Machine Learning platform. The focus is on optimizing infrastructure choices and leveraging Azure's distributed training features to address the computational demands of fine-tuning large-scale models. Key topics include selecting the most suitable GPU compute type and configuration for specific fine-tuning scenarios, as well as configuring different distributed training strategies to maximize efficiency and scalability.

## 1. Selection of GPU Compute Type

Choosing the right Azure hardware for training large language models is a critical step that directly impacts performance, cost efficiency, and training time. The optimal choice depends on several factors:

- **Model Size:** Larger models with billions of parameters may require high-memory GPUs or multi-node setups to fit the model and training data into memory.
- **Training Approach:** Depending on whether you're performing full fine-tuning or parameter-efficient fine-tuning (e.g., LoRA, adapters), the compute requirements can vary significantly.
- **Training Configuration:** Whether training will run on a single GPU, a single node with multiple GPUs, or across multiple distributed nodes influences the hardware selection.

Azure Machine Learning offers a range of GPU options, from entry-level GPUs for smaller workloads to cutting-edge GPUs like NVIDIA A100 and H100 for high-performance training. This guide will help you navigate these options, providing recommendations tailored to different use cases to ensure optimal performance and cost-effectiveness. Below is a breakdown of Azure GPU options, starting with entry-level configurations and progressing to the most powerful setups for large-scale distributed training.

### Entry-Level GPU Options

For smaller models or use cases with limited computational requirements, Azure offers cost-effective GPU options that are ideal for experimentation and development:

- **NCas_T4_v3 Series**
  - GPU: NVIDIA Tesla T4
  - Memory: 16 GB per GPU
  - Use Case: Suitable for small-scale fine-tuning tasks, experimentation, and parameter-efficient fine-tuning. Best for models with fewer parameters or when training is performed on a single GPU.
  - Advantages: Low cost and energy-efficient.
- **NCs_v3 Series**
  - GPU: NVIDIA Tesla V100
  - Memory: 16 GB per GPU
  - Use Case: A step up from the T4, this series is ideal for medium-scale fine-tuning tasks or small-scale distributed training.
  - Advantages: Higher performance than T4 GPUs, with support for more memory-intensive tasks.

### Intermediate GPU Options

As training demands increase, GPUs with higher memory and support for distributed training become essential:

1. **ND_v2 Series**
   - GPU: NVIDIA Tesla V100 with NVLink
   - Memory: 32 GB per GPU
   - Features: Includes InfiniBand for high-speed GPU-to-GPU communication in multi-node distributed setups.
   - Use Case: Suitable for distributed training of medium- to large-sized models, particularly when scaling across multiple nodes is required.
   - Advantages: InfiniBand significantly reduces communication overhead, making it effective for data-parallel and model-parallel training strategies.
2. **NC_A100_v4 Series**
   - GPU: NVIDIA A100
   - Memory: 40 GB per GPU
   - Use Case: Designed for larger-scale fine-tuning tasks, such as training multi-billion-parameter models.
   - Advantages: A100 GPUs offer Tensor Cores for accelerated mixed-precision training, improving both speed and efficiency.
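Before moving on to the high-performance tiers, it helps to have a rough sense of how much GPU memory a given model needs. The sketch below uses a common rule of thumb for full fine-tuning with an Adam-style optimizer in mixed precision (roughly 16 bytes per parameter for weights, gradients, and optimizer states, before activations and framework overhead); the figures are ballpark estimates, not Azure-specific guidance.

```python
# Rough memory estimate for full fine-tuning with an Adam-style optimizer in mixed precision.
# Rule of thumb: ~2 bytes (fp16/bf16 weights) + 2 bytes (grads) + 12 bytes (fp32 master
# weights + Adam moments) = ~16 bytes per parameter, before activations and overhead.

def estimate_training_memory_gb(num_params_billions: float,
                                bytes_per_param: int = 16) -> float:
    """Approximate memory (GB) needed for model states during full fine-tuning."""
    return num_params_billions * 1e9 * bytes_per_param / 1024**3

for size_b in (1, 7, 14, 70):
    print(f"{size_b}B params -> ~{estimate_training_memory_gb(size_b):.0f} GB of model state")
```

Even a 7B-parameter model needs on the order of 100 GB of model state for full fine-tuning, which is why larger models call for the 80 GB-class GPUs below, parameter-efficient methods, or the sharding techniques covered later in this post.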
### High-Performance GPU Options

For large-scale distributed training of extremely large models, Azure provides cutting-edge GPU configurations with advanced interconnect capabilities:

1. **NDasrA100_v4 Series**
   - GPU: NVIDIA A100 with Mellanox HDR InfiniBand
   - Memory: 80 GB per GPU
   - Features: HDR InfiniBand enables high-bandwidth, low-latency communication, making it ideal for distributed training of very large models like LLAMA 70B.
   - Use Case: Designed for highly scalable distributed training, especially for large transformer-based models.
   - Advantages: Supports seamless clustering of GPUs for efficient scaling across nodes.
2. **NCads_H100_v5 Series**
   - GPU: NVIDIA H100 NVL
   - Memory: 94 GB per GPU
   - Use Case: Geared toward ultra-large models and demanding fine-tuning tasks requiring extensive computational resources.
   - Advantages: NVIDIA H100 GPUs offer the latest generation of Tensor Cores, delivering unmatched performance for both training and inference.
3. **ND-H100-v5 Series**
   - GPU: NVIDIA H100 with scale-out InfiniBand interconnect
   - Memory: 80 GB per GPU
   - Features: Supports a large set of existing AI and HPC tools built on NVIDIA's NCCL communication libraries, enabling seamless clustering of GPUs for distributed training.
   - Use Case: Designed for training the largest models at scale, such as GPT-4-sized architectures, with exceptional performance for both data and model parallelism.
   - Advantages: The scale-out InfiniBand interconnect delivers unmatched communication speed, ensuring minimal bottlenecks in distributed training setups.

### Azure ML GPU Options by Scenario

| Scenario | Model Size | Training Approach | Recommended Azure Compute |
| --- | --- | --- | --- |
| Small-scale fine-tuning | < 1B parameters | Parameter-efficient tuning | NCas_T4_v3 (Tesla T4, 16 GB) |
| Medium-scale fine-tuning | 1–5B parameters | Full or parameter-efficient | NCs_v3 (Tesla V100, 16 GB) |
| Distributed training for medium models | 5–10B parameters | Full fine-tuning | ND_v2 (Tesla V100 NVLink, 32 GB, InfiniBand) |
| Large-scale fine-tuning | 10–20B parameters | Full or parameter-efficient | NC_A100_v4 (A100, 40 GB) |
| Distributed training for large models | 20–70B parameters | Full fine-tuning | NDasrA100_v4 (A100, 80 GB, HDR InfiniBand) |
| Ultra-large model training | > 70B parameters | Full fine-tuning | NCads_H100_v5 (H100 NVL, 94 GB) |
| Massive-scale distributed training | > 100B parameters | Full fine-tuning | ND-H100-v5 (H100, 80 GB, scale-out InfiniBand) |
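Whichever SKU you choose, it is provisioned in Azure ML as a compute cluster that the training jobs in the next section target. Below is a minimal sketch using the Azure ML Python SDK v2; the workspace identifiers, cluster name, and VM size are placeholders to adapt to your own environment.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

# Workspace identifiers below are placeholders.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

gpu_cluster = AmlCompute(
    name="a100-cluster",              # placeholder cluster name
    size="Standard_NC24ads_A100_v4",  # pick a SKU that matches the table above
    min_instances=0,                  # scale to zero between jobs to control cost
    max_instances=2,                  # upper bound on nodes for distributed jobs
    idle_time_before_scale_down=180,  # seconds before idle nodes are released
)
ml_client.begin_create_or_update(gpu_cluster).result()
```

Setting `min_instances` to 0 lets the cluster scale down to zero between jobs, which keeps the cost of the larger GPU SKUs manageable.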
## 2. Distributed Fine-Tuning Techniques

Scaling large language models (LLMs) requires moving beyond single-node, single-GPU setups to leverage the power of multi-GPU and multi-node distributed training. Distributed fine-tuning allows workloads to be split across multiple GPUs or nodes, significantly speeding up training while overcoming hardware memory limitations. In this section, we will explore key distributed fine-tuning techniques and best practices for implementing them on Azure Machine Learning (Azure ML).

Azure ML provides seamless integration with popular deep learning frameworks like PyTorch and TensorFlow, as well as advanced libraries like DeepSpeed and Fully Sharded Data Parallel (FSDP), which are designed to maximize scalability and efficiency. Below, we break down the main components of distributed fine-tuning, including Distributed Data Parallel (DDP), model parallelism, and memory optimization techniques.

### Distributed Data Parallel (DDP)

Distributed Data Parallel (DDP) is the most widely used approach for scaling model training across multiple GPUs and nodes. It replicates the model on each GPU, splits the training data, and synchronizes gradients to ensure consistent updates.

Key features of DDP:

- **Multi-GPU, Single Node:** Multiple GPUs within a single machine process different mini-batches in parallel.
- **Multi-Node, Multi-GPU:** Extends DDP across multiple machines, each with multiple GPUs. For this configuration, an ND-series machine in Azure is recommended, as it supports InfiniBand for high-speed interconnects.
- **Efficient Communication:** Uses collective communication libraries like NVIDIA's NCCL or MPI for gradient synchronization.

To set up DDP training in Azure ML for LLM fine-tuning, you can leverage Azure ML's built-in PyTorch distribution support. Follow the official Azure ML DDP guide (Azure ML Distributed Training Documentation) to understand the fundamental concepts of DDP in Azure ML.

#### Steps to Configure DDP Training

The following is an Azure ML job YAML definition for setting up DDP training using PyTorch. This configuration defines the compute cluster, node allocation, and distribution settings required for running distributed training on Azure ML.

**Azure ML Job YAML Definition**

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml/models/mistralai-Mistral-7B-v01/versions/19
command: >
  accelerate launch
  --num_processes 4
  --num_machines 2
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
compute: azureml:nc12s-cluster
resources:
  instance_count: 2  # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1  # Number of processes per node
```

#### Special Note for Using accelerate launch in Azure ML

When using the `accelerate launch` command for distributed training in Azure ML, as in the example above, the configuration must account for the specific way `accelerate` handles distributed processes across nodes and GPUs.

**process_count_per_instance must be set to 1.** The `accelerate launch` command internally manages the distribution of processes across nodes and GPUs, so `process_count_per_instance` must remain set to 1 in the Azure ML YAML configuration. The total number of GPUs is passed explicitly using the `--num_processes` parameter of the `accelerate launch` command.

**Parameters for accelerate launch:**

- `--num_processes`: Specifies the total number of GPUs across all nodes. For example, if each node has 4 GPUs and there are 2 nodes, set `--num_processes 8`. In the example, `--num_processes 4` assumes 2 nodes with 2 GPUs each.
- `--num_machines`: Specifies the total number of nodes (e.g., 2 in this case).
- `--machine_rank`: Indicates the rank (index) of the current node in the cluster. Azure ML automatically sets this value via the `$NODE_RANK` environment variable.
- `--main_process_ip`: The IP address of the master node. Azure ML automatically sets this value via the `$MASTER_ADDR` environment variable.
- `--main_process_port`: The communication port for the master process. Azure ML automatically sets this value via the `$MASTER_PORT` environment variable.
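For context, the `train.py` entry point launched by this job typically delegates the distributed setup to the accelerate library rather than calling DDP APIs directly. The following is a minimal sketch, not the script used in our experiments; the model id, learning rate, and `tokenized_dataset` are placeholders.

```python
from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM

accelerator = Accelerator()  # reads the distributed settings injected by `accelerate launch`

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # illustrative model id
optimizer = AdamW(model.parameters(), lr=2e-5)
train_loader = DataLoader(tokenized_dataset, batch_size=4, shuffle=True)   # tokenized_dataset: prepared elsewhere

# prepare() wraps the model for distributed training and shards the dataloader across processes.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for batch in train_loader:
    loss = model(**batch).loss
    accelerator.backward(loss)  # synchronizes gradients across all GPUs and nodes
    optimizer.step()
    optimizer.zero_grad()
```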
#### Our Experiments: Distributed Training on Multiple Nodes Using DDP

We conducted an experiment to fine-tune the Llama-3.1-8B model using LoRA (Low-Rank Adaptation) on Azure ML NDv2-V100 nodes. The goal was to evaluate the efficiency of fine-tuning across different numbers of nodes (1, 2, and 3) and observe the impact on training time and throughput.

**Experiment Settings**

- Model: Llama-3.1-8B
- Fine-Tuning Method: QLoRA (Quantized Low-Rank Adaptation)
- Compute Resources: Azure ML NDv2-V100 nodes
- Number of Nodes: 1, 2, and 3
- Dataset: Private dataset

**Results**

| Number of Nodes | Train Runtime (seconds) | Train Samples per Second |
| --- | --- | --- |
| 1 | 929.0336 | 8.319 |
| 2 | 474.2971 | 16.296 |
| 3 | 327.3021 | 23.614 |

**Observations**

- **Training Time:** As the number of nodes increased, the training time decreased significantly, showing that the distributed setup was effective in reducing overall training time.
- **Throughput:** Throughput, measured in train samples per second, increased almost linearly with the number of nodes. This close-to-linear scaling from one to three nodes indicates that the distributed training setup was well optimized, allowing effective parallel processing across multiple nodes.

Reference implementation: https://github.com/james-tn/llm-fine-tuning/blob/james-dev/opensource_llm/single_step/run_multigpu.yml

### Model Parallelism and Memory Optimization

Training large language models (LLMs) often exceeds the memory capacity of a single GPU, requiring techniques that distribute both computation and memory across multiple GPUs or nodes. Advanced frameworks like Fully Sharded Data Parallel (FSDP) and DeepSpeed are designed to address these challenges by combining model parallelism and memory optimization strategies.

#### Efficient Scaling with FSDP and DeepSpeed

Model parallelism involves splitting a model into smaller components and distributing the workload across GPUs to reduce memory usage on individual devices. Both FSDP and DeepSpeed excel at implementing advanced forms of model parallelism and memory optimization.

**Key Features of FSDP:**

- **Full Parameter Sharding:** FSDP shards model parameters, gradients, and optimizer states across GPUs, ensuring no single GPU holds the entire model.
- **Dynamic Shard Loading:** Parameters are loaded and unloaded on demand during training, enabling the handling of models that exceed GPU memory limits.
- **Seamless Scaling:** FSDP supports multi-GPU and multi-node setups, making it ideal for distributed training environments like Azure ML.
- **Native PyTorch Integration:** FSDP is tightly coupled with PyTorch, offering straightforward integration into existing workflows.

**Key Features of DeepSpeed:**

- **ZeRO Optimization:** DeepSpeed's ZeRO (Zero Redundancy Optimizer) operates in stages to shard optimizer states (Stage 1), gradients (Stage 2), and model parameters (Stage 3), maximizing memory efficiency.
- **Hybrid Parallelism:** Combines data parallelism and tensor parallelism for training extremely large models.
- **Quantization and Offloading:** Supports INT8 quantization to reduce memory usage and computational requirements, and offloads model states to CPU or NVMe storage for models that exceed GPU memory.

#### Memory Optimization Techniques

Both FSDP and DeepSpeed implement advanced memory optimization techniques to further enhance efficiency:

- **Gradient Checkpointing:** Reduces memory by recomputing activations during the backward pass, trading memory for additional computation. Supported by both FSDP and DeepSpeed.
- **Mixed Precision Training:** Reduces memory usage by using FP16 or BF16 instead of FP32, accelerating training while maintaining numerical stability. Supported by both frameworks.
- **Quantization (DeepSpeed exclusive):** Uses INT8 precision for weights and activations, dramatically reducing memory and compute requirements.
- **Offloading (DeepSpeed exclusive):** Offloads optimizer states and model parameters to CPU or NVMe, freeing up GPU memory for computation.
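To make this concrete, the sketch below shows how gradient checkpointing and mixed precision are typically enabled in a Hugging Face training script. The model id, hyperparameters, and `train_dataset` are illustrative placeholders rather than settings from the experiments described later.

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4")  # illustrative model id
model.gradient_checkpointing_enable()   # recompute activations during the backward pass

args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=2,
    gradient_checkpointing=True,         # same switch exposed at the Trainer level
    bf16=True,                           # mixed precision (use fp16=True on V100-class GPUs)
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # train_dataset: prepared elsewhere
trainer.train()
```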
### Configuring FSDP and DeepSpeed in Azure ML for LLM Training

Azure Machine Learning (Azure ML) offers seamless integration for training large models using frameworks like Hugging Face's accelerate, which provides built-in support for DeepSpeed and Fully Sharded Data Parallel (FSDP). This integration simplifies distributed training setups by allowing users to configure and launch jobs with minimal manual effort. Below, we explore the configurations for both DeepSpeed and FSDP in Azure ML.

#### Using Accelerate with DeepSpeed in Azure ML

DeepSpeed's flexibility and advanced optimizations make it an excellent choice for training large-scale models. By leveraging Hugging Face's accelerate library, users can easily configure distributed training jobs for DeepSpeed in Azure ML. Below is an example of a YAML configuration for a training job using DeepSpeed:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml/models/Phi-4/versions/2
command: >
  accelerate launch
  --config_file "configs/deepspeed_config_zero3.yaml"
  --num_processes 2
  --num_machines 1
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
```

In this configuration:

- **accelerate launch:** Initializes the distributed training setup using the accelerate library.
- **DeepSpeed config file:** The configs/deepspeed_config_zero3.yaml file contains the DeepSpeed configuration, including optimizations like ZeRO Stage 3 for efficient memory usage.
- **Azure ML environment variables:** $NODE_RANK, $MASTER_ADDR, and $MASTER_PORT are automatically populated by Azure ML to manage distributed training across multiple nodes and GPUs.
- **Inputs:** The model_dir input specifies the location of the model within the Azure ML model registry.

#### Using Accelerate with FSDP in Azure ML

FSDP, tightly integrated with PyTorch, offers robust sharding capabilities to handle extremely large models. The following YAML configuration demonstrates how to set up an Azure ML training job using FSDP:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml-meta/models/Llama-3.3-70B-Instruct/versions/4
command: >
  accelerate launch
  --config_file "configs/fsdp_config.yaml"
  --num_processes 8
  --num_machines 2
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
```
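Under the hood, accelerate reads configs/fsdp_config.yaml and wraps the model with PyTorch's FullyShardedDataParallel. The sketch below shows a roughly equivalent raw PyTorch wrapping as an illustration only; it assumes the process group has already been initialized by the launcher, and the model id, wrap policy, and dtypes are placeholders rather than the settings used in the job above.

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Assumes torch.distributed is already initialized by the launcher (accelerate/torchrun).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",  # illustrative model id
    torch_dtype=torch.bfloat16,
)

# Shard at the transformer-decoder-layer boundary.
wrap_policy = functools.partial(transformer_auto_wrap_policy,
                                transformer_layer_cls={LlamaDecoderLayer})

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)
```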
### Experiment: Scaling LLM Training Using FSDP and DeepSpeed

We conducted an experiment to fine-tune the following models: Mistral-7B, LLAMA3_8B_Instruct, Phi4 (14B), and LLAMA3.3-70B-Instruct. The goal was to evaluate the impact of different configurations, hardware setups, and memory optimization techniques on training performance. The experiments utilized FSDP and DeepSpeed, alongside Distributed Data Parallel (DDP), to test different fine-tuning and optimization settings on Azure ML. Below are the configurations.

**Configurations**

| Model | Hardware Configuration | Framework | Fine-Tuning Method | Batch Size | Optimization Techniques |
| --- | --- | --- | --- | --- | --- |
| Mistral-7B | 2 nodes of NC12_v2 (4 GPUs, NVIDIA V100, 16 GB) | DDP | LoRA | 5 | Gradient checkpointing, mixed precision |
| LLAMA3_8B_Instruct | 1 node of NC48_A100 (2 GPUs, NVIDIA A100, 40 GB) | FSDP | Full-weight fine-tuning | 5 | FSDP, mixed precision, gradient checkpointing |
| Phi4 (14B) | 1 node of NC96_A100 (4 GPUs, NVIDIA A100, 80 GB) | FSDP | Full-weight fine-tuning | 2 | DeepSpeed, mixed precision, gradient checkpointing |
| LLAMA3.3-70B-Instruct | 4 nodes of ND40_v2 (8 GPUs, NVIDIA V100, 32 GB) | FSDP | LoRA | 1 | FSDP, mixed precision, gradient checkpointing |
| LLAMA3.3-70B-Instruct | 1 node of NC80_H100 (2 GPUs, NVIDIA H100, 80 GB) | FSDP | LoRA | 1 | FSDP, mixed precision, gradient checkpointing |

**Key Observations**

- **Memory Utilization:**
  - FSDP: Sharding parameters and gradients reduced peak memory usage across all configurations, allowing large models (e.g., 70B parameters) to fit within GPU memory.
  - DeepSpeed ZeRO-3: Offloading optimizer states to CPU memory further reduced memory requirements, especially for full-weight fine-tuning tasks.
- **Optimization Techniques:**
  - Mixed Precision: Enabled faster training and reduced memory usage across all models and configurations.
  - Gradient Checkpointing: Was critical for reducing activation memory, especially for larger models like LLAMA3.3-70B-Instruct.
  - LoRA: Provided a parameter-efficient alternative to full fine-tuning for models like Mistral-7B and LLAMA3.3-70B-Instruct.
- **Pushing the Limit:**
  - Fine-tuning LLAMA3.3-70B-Instruct using Parameter-Efficient Fine-Tuning (PEFT) is achievable on 2 nodes of ND40_v2 with a total of 32 V100 GPUs using FSDP full sharding.
  - Full-weight fine-tuning of a 14B-parameter model (Phi-4) is possible on a single NC96_A100 machine equipped with 4 A100 GPUs, leveraging the memory efficiency of FSDP.
  - See the details in the sample repo.

**Detailed Configuration Insights**

| Optimization Technique | Description | Benefit | Applicable Models |
| --- | --- | --- | --- |
| FSDP | Fully shards model parameters, gradients, and optimizer states across GPUs. | Reduced memory usage and enabled training of large models. | LLAMA3_8B_Instruct, Phi4, LLAMA3.3-70B |
| DeepSpeed ZeRO-3 | Shards optimizer states, gradients, and parameters with additional CPU memory offloading. | Reduced memory requirements for full-weight fine-tuning. | Phi4 (14B) |
| Gradient Checkpointing | Saves memory by recomputing activations during backpropagation. | Critical for fitting larger models into GPU memory. | All models |
| Mixed Precision | Uses FP16 or BF16 for computations instead of FP32. | Accelerated training and reduced memory consumption. | All models |
| LoRA | Updates a small subset of model weights instead of all parameters. | Enabled parameter-efficient fine-tuning with minimal computational overhead. | Mistral-7B, LLAMA3.3-70B-Instruct |
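As a concrete illustration of the LoRA technique referenced in the tables above, here is a minimal sketch using the Hugging Face peft library; the model id, rank, and target modules are illustrative defaults rather than the values used in these experiments.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # illustrative model id

lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the base model's weights
```

Because only the adapter weights receive gradients, the optimizer state is a fraction of what full fine-tuning requires, which is what makes the 70B LoRA runs above feasible on far less memory.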
## Conclusion: Scaling LLM Training with Azure ML

Scaling the training of large language models (LLMs) is both a challenge and an opportunity for organizations aiming to harness the full potential of AI. Azure Machine Learning, with its robust infrastructure and seamless integration of distributed training frameworks like DeepSpeed, FSDP, and DDP, provides an ideal platform to tackle this complexity. From selecting the right GPU configurations to optimizing memory usage and leveraging distributed fine-tuning techniques, Azure ML empowers users to train models efficiently, from small-scale fine-tuning to massive-scale distributed setups.

The experiments and configurations discussed in this blog highlight the importance of choosing the right hardware, optimization techniques, and frameworks based on the model size and training requirements. With advancements in parameter-efficient fine-tuning methods like LoRA and tools for memory optimization, even ultra-large models such as LLAMA 70B can be fine-tuned effectively.

As organizations continue to push the boundaries of AI, Azure ML's scalable, cost-efficient, and high-performance ecosystem ensures that the journey of training next-generation language models is as seamless as it is innovative. Whether you're fine-tuning smaller models or scaling up to billions of parameters, Azure ML is ready to support your AI ambitions.

Ready to start scaling your LLM training? Explore the resources and examples provided to implement these techniques and unlock new possibilities in AI development.