Fine-Tuning at Scale: Single Device vs Distributed Training

6 min read · May 19, 2025

Objective:

Evaluating the efficiency and effectiveness of single-device fine-tuning versus distributed fine-tuning of LLAMA3-8b-Instruct on Vertex AI.

This comparison aims to analyze how fine tuning a model on a single device differs from leveraging multiple distributed resources. Key aspects include training speed, scalability, memory utilization, and overall performance improvements when using Vertex AI’s distributed infrastructure.

Resources:

Vertex AI: Vertex AI provides a fully managed service that eliminates the need for manual resource provisioning and infrastructure management, enabling both single-device and distributed training. It integrates with GPUs and TPUs for large-scale training. With Vertex AI Pipelines, the entire fine-tuning workflow (data preprocessing, model training, evaluation, and deployment) can be automated, which helps with reproducibility and scalability in a serverless, distributed training setup.
Axolotl: A framework for post-training and fine-tuning of AI models, supporting full fine-tuning, LoRA, QLoRA, and other parameter-efficient techniques. It is configured through YAML files, which let users preprocess datasets, train models, and run inference. Axolotl supports multiple Hugging Face architectures such as LLaMA, Falcon, and MPT, and integrates FSDP, DeepSpeed, and optimizations like Flash Attention for efficient multi-GPU training.
Dataset:
NoteChat: A Dataset of Synthetic Doctor-Patient Conversations Conditioned on Clinical Notes. Because the conversations are generated from clinical notes, the dataset is a useful resource for training medical dialogue models. For our fine-tuning experiments, we used the first 10,000 records, which covers a range of patient interactions while keeping the compute cost manageable.
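To make the YAML-driven setup more concrete, here is a minimal sketch of what an Axolotl LoRA configuration for this kind of run might look like. It is written as a Python dict and rendered as JSON (which is also valid YAML). The key names follow Axolotl's config options as we understand them; every value (LoRA rank, batch sizes, learning rate, dataset path and type, output directory) is an illustrative assumption and not the configuration used in these experiments.

```python
import json

# All values below are hypothetical placeholders for illustration only.
axolotl_config = {
    "base_model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "adapter": "lora",                  # parameter-efficient fine-tuning via LoRA
    "lora_r": 16,                       # assumed rank
    "lora_alpha": 32,                   # assumed scaling factor
    "lora_dropout": 0.05,
    "lora_target_linear": True,         # attach adapters to all linear layers
    "datasets": [
        # Placeholder path and type: point these at your own export of the
        # first 10,000 NoteChat records and the matching prompt format.
        {"path": "path/to/notechat_first_10k.jsonl", "type": "alpaca"},
    ],
    "sequence_len": 2048,
    "micro_batch_size": 2,              # per-GPU batch size (assumed)
    "gradient_accumulation_steps": 4,   # assumed
    "num_epochs": 10,                   # the loss curves in this post span 10 epochs
    "learning_rate": 2e-4,              # assumed
    "bf16": True,
    "flash_attention": True,
    "output_dir": "./outputs/llama3-8b-notechat-lora",
}


def render_config(config: dict) -> str:
    """Serialize the config; JSON output can be saved as a .yml file as-is."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render_config(axolotl_config))
```

Saved to a file, a config along these lines would be passed to Axolotl's training command. Keeping everything in one file is what makes it easy to reuse the same settings across the single-device and distributed runs.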

Experiments:

1)Single Device Training: The a2-highgpu-1g machine type with a single A100 GPU was used for Parameter-Efficient Fine-Tuning (PEFT) on 10,000 records from the NoteChat dataset.

a)Environment configuration:

Below is a detailed description of the A2 Standard machine types:

b)Config Hyperparameters:

2)Distributed Training: The a2-highgpu-4g machine type with 4 A100 GPUs was used for Parameter-Efficient Fine-Tuning (PEFT) on the same 10,000 records from the NoteChat dataset.

a)Environment configuration:
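As a rough illustration of how the two environments differ, the sketch below builds a Vertex AI style worker pool specification for each A2 machine type. The machine figures (GPU count, 40GB per A100, 85GB and 340GB of RAM) match the numbers quoted in this post. The helper name `worker_pool_spec` and the container image URI are hypothetical, and we are assuming a custom training job is the entry point, which may differ from the workflow actually used here.

```python
# Specs for the two A2 machine types discussed in this post.
A2_MACHINES = {
    "a2-highgpu-1g": {"gpus": 1, "gpu_mem_gb": 40, "ram_gb": 85},
    "a2-highgpu-4g": {"gpus": 4, "gpu_mem_gb": 40, "ram_gb": 340},
}


def worker_pool_spec(machine_type: str, image_uri: str) -> list[dict]:
    """Build a single-replica worker pool spec for the given A2 machine type.

    `image_uri` is a hypothetical training container; replace it with your own.
    """
    spec = A2_MACHINES[machine_type]
    return [
        {
            "machine_spec": {
                "machine_type": machine_type,
                "accelerator_type": "NVIDIA_TESLA_A100",
                "accelerator_count": spec["gpus"],
            },
            "replica_count": 1,  # one VM; the 4g type holds all four GPUs
            "container_spec": {"image_uri": image_uri},
        }
    ]


if __name__ == "__main__":
    image = "us-docker.pkg.dev/my-project/my-repo/axolotl-train:latest"  # placeholder
    for name in A2_MACHINES:
        print(name, worker_pool_spec(name, image)[0]["machine_spec"])
```

With the `google-cloud-aiplatform` SDK, a list like this is what would typically be passed as `worker_pool_specs` when creating a custom job. Note that both experiments run on a single VM, so the distributed case is multi-GPU on one node rather than multi-node.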

b)Config Hyperparameters:

The distributed training experiment kept the same hyperparameters and values as the Single Device Training setup and additionally used DeepSpeed ZeRO-1 (ZeRO stage 1) optimization. ZeRO-1 partitions the optimizer states across the GPUs so that each GPU stores only its own shard. This reduces per-GPU memory for optimizer state, which can allow larger batch sizes and improves training scalability.
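For readers who have not used it, enabling ZeRO-1 mostly comes down to a small DeepSpeed JSON config. The sketch below shows one plausible minimal version; it is not the file used in this experiment. The `"auto"` values assume the Hugging Face Trainer integration (which Axolotl builds on) fills them in from the training config, and the helper name `render_deepspeed_config` is ours.

```python
import json

# Minimal ZeRO stage 1 config: only optimizer states are partitioned across GPUs.
# "auto" values are an assumption that the trainer integration resolves them.
zero1_config = {
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}


def render_deepspeed_config(config: dict) -> str:
    """Return the config as a JSON string, ready to be written to a file."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render_deepspeed_config(zero1_config))
```

In Axolotl, the resulting JSON file would be referenced from the YAML config through the `deepspeed:` key, leaving the rest of the hyperparameters untouched.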

Results

1)SINGLE DEVICE TRAINING

a)Training Loss Curve: The training loss starts at 1.96 at the beginning of the 0th epoch and gradually decreases, plateauing at 0.03 by the end of the 10th epoch, indicating that the model has fit the training data closely.

b)CPU Performance:

1)CPU Utilisation: CPU utilization averaged 6.42%. In a single-GPU setup, the CPU mainly handles data preprocessing, batch loading, and checkpointing.

2)RAM Consumption: Peak RAM consumption during training on the a2-highgpu-1g instance reached 3.58% of the available 85GB. At this level there is ample headroom, so training is not at risk of slowing down due to swapping or memory exhaustion.

c)GPU Performance:

1)GPU Utilisation: Peak GPU utilization reached 94%, suggesting that the GPU was actively engaged in training most of the time. The fine-tuning workload was compute-intensive and used the GPU's compute resources without significant underutilization or bottlenecks.

2)GPU Memory Utilisation: Peak GPU memory utilization reached 76.3% of the available 40GB on the a2-highgpu-1g. This memory holds model parameters, activations, gradients, and input batches. The remaining memory left headroom for larger batch sizes or other memory-intensive operations.

d)Time Taken: The PEFT fine-tuning process on a single device took 1 day and 20 hours (44 hours) to complete, which shows the time investment required in a non-distributed setup.

2)DISTRIBUTED TRAINING

a)Training Loss Curve: The training loss starts at 1.99 at the beginning of the 0th epoch and steadily decreases, plateauing at 0.13 by the end of the 10th epoch. The curve follows a similar trend to that of the single-device training setup, indicating consistent convergence behavior across both approaches, although the final loss is higher than the 0.03 reached on a single device.
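One factor that may contribute to the difference in final loss (0.13 here versus 0.03 on a single device) is the effective batch size. In data-parallel training, if the per-GPU batch size and gradient accumulation steps are left unchanged, the global batch grows with the number of GPUs and the number of optimizer steps per epoch shrinks accordingly. The sketch below illustrates that arithmetic. The micro-batch size and accumulation steps are hypothetical, and this is only one possible explanation, not a confirmed cause.

```python
import math


def optimizer_steps_per_epoch(
    num_records: int,
    micro_batch_size: int,
    grad_accum_steps: int,
    num_gpus: int,
) -> int:
    """Approximate optimizer steps per epoch under plain data parallelism."""
    global_batch = micro_batch_size * grad_accum_steps * num_gpus
    return math.ceil(num_records / global_batch)


if __name__ == "__main__":
    records = 10_000                    # from the experiment setup
    micro_batch, grad_accum = 2, 4      # hypothetical values

    single = optimizer_steps_per_epoch(records, micro_batch, grad_accum, num_gpus=1)
    distributed = optimizer_steps_per_epoch(records, micro_batch, grad_accum, num_gpus=4)
    print(f"1 GPU:  {single} steps/epoch")
    print(f"4 GPUs: {distributed} steps/epoch")
```

With these assumed values, the four-GPU run takes roughly a quarter as many optimizer steps per epoch, so matching the single-device loss could require adjusting the learning rate or the accumulation steps.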

b) CPU Performance:

1)CPU Utilisation: Average CPU utilization was slightly lower than in Single Device Training, at 5.18% versus 6.42%. The difference is small, suggesting that CPU-bound work such as data loading and orchestration is similar across both setups. Because the two machine types have different CPU counts, the percentages are not directly comparable in absolute terms.

2)RAM Consumption: Peak RAM consumption during distributed training on the a2-highgpu-4g instance reached 15.61% of the available 340GB, significantly higher than the 3.58% of the available 85GB observed in single-device training. This increase likely reflects the overhead of distributed training, including inter-GPU communication and larger batch processing. Memory consumption nevertheless remains well within available limits, so training stays stable without memory bottlenecks.

c)GPU Performance:

1)GPU Utilisation: Peak GPU utilization in the distributed training setup also reached 94%, matching the single-device experiment. This suggests that the distributed workload used the available GPUs without significant idle time, and that the compute workload was well balanced across devices while maintaining high throughput.

2)GPU Memory Utilisation: In the distributed training setup, peak GPU memory utilization reached 90.67% of the available 160GB (4 x 40GB) on the a2-highgpu-4g instance. This memory holds model parameters, activations, gradients, and input batches across the four GPUs. The higher share compared to the single-device experiment (76.3% of 40GB) suggests that distributed training needed additional memory for inter-GPU communication, optimizer states, and larger batch sizes.
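Since the two machines have different capacities, the percentages are easier to compare once converted to absolute figures. The short sketch below does that conversion for the GPU memory and RAM numbers reported in this post. It assumes the 90.67% figure is an aggregate over all four GPUs, so the per-GPU value is an average and not a measurement.

```python
def used_gb(percent_used: float, capacity_gb: float) -> float:
    """Convert a utilization percentage into gigabytes used."""
    return capacity_gb * percent_used / 100.0


# GPU memory (peak)
single_gpu_mem = used_gb(76.3, 40)         # about 30.5 GB on one A100
dist_gpu_mem = used_gb(90.67, 160)         # about 145.1 GB across four A100s
dist_gpu_mem_per_gpu = dist_gpu_mem / 4    # about 36.3 GB per GPU on average

# Host RAM (peak)
single_ram = used_gb(3.58, 85)             # about 3.0 GB
dist_ram = used_gb(15.61, 340)             # about 53.1 GB

if __name__ == "__main__":
    print(f"GPU memory, single device: {single_gpu_mem:.1f} GB")
    print(f"GPU memory, distributed:   {dist_gpu_mem:.1f} GB total, "
          f"{dist_gpu_mem_per_gpu:.1f} GB per GPU")
    print(f"RAM, single device:        {single_ram:.1f} GB")
    print(f"RAM, distributed:          {dist_ram:.1f} GB")
```

Seen this way, per-GPU memory rises by roughly 6GB in the distributed run, while host RAM usage grows much more sharply in absolute terms.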

d)Time Taken: In the distributed training setup, the PEFT fine-tuning process completed in 11 hours and 8 minutes, roughly a 4x reduction in training time compared to the single-device setup (1 day and 20 hours). This improvement shows how distributed training accelerates fine-tuning by parallelizing computation across multiple GPUs.
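The speedup can be quantified directly from the two reported durations. The sketch below computes the speedup, the scaling efficiency relative to an ideal 4x, and the total GPU-hours for each run. Only the durations and the GPU count come from the experiments; the variable names are ours.

```python
from datetime import timedelta

single = timedelta(days=1, hours=20)           # single-device run
distributed = timedelta(hours=11, minutes=8)   # 4-GPU run
num_gpus = 4

speedup = single / distributed                 # timedelta / timedelta gives a float
efficiency = speedup / num_gpus                # 1.0 would be perfect linear scaling

gpu_hours_single = single.total_seconds() / 3600 * 1
gpu_hours_distributed = distributed.total_seconds() / 3600 * num_gpus

if __name__ == "__main__":
    print(f"Speedup:            {speedup:.2f}x")
    print(f"Scaling efficiency: {efficiency:.1%}")
    print(f"GPU-hours:          {gpu_hours_single:.1f} (single) vs "
          f"{gpu_hours_distributed:.1f} (distributed)")
```

The two runs consume nearly the same number of GPU-hours, so the distributed setup mainly buys wall-clock time without spending much more accelerator time.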

Conclusions

The results highlight the efficiency gains of distributed training, which cut fine-tuning time from 1 day and 20 hours to 11 hours and 8 minutes by parallelizing computation across four GPUs. GPU utilization remained consistently high across both setups (94% peak), while distributed training used a larger share of GPU memory, which we attribute to inter-GPU communication and larger batch processing. RAM consumption was also notably higher, reflecting the additional memory overhead of multi-GPU coordination. In conclusion, distributed training made effective use of the available compute and completed training much faster without significant bottlenecks, although it finished at a higher training loss (0.13 versus 0.03).

Note: Currently, only A2 machine types can be provisioned, but A3 machine types will be supported soon in the fine-tuning workflow on Vertex AI.

Written by Kkshitiz
