Multi-GPU Fine-tuning for Llama 3.1 70B with FSDP and QLoRA

What you can do with only 2×24 GB GPUs and a lot of CPU RAM

Generated with DALL-E

Fine-tuning large language models (LLMs) with up to 35B parameters is relatively easy and cheap since it can be done with a single consumer GPU. Fine-tuning larger models with a single consumer GPU is, in theory, not impossible, as we can offload parts of the model to CPU memory. However, it would be extremely slow, even with high-end CPUs.
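As a minimal sketch of what such offloading looks like (not the article's code; the model ID and memory limits are illustrative assumptions), Hugging Face Accelerate can split a model between GPU and CPU memory at load time:

```python
# Sketch: loading a model too large for one GPU by spilling layers to CPU RAM.
# Model name and memory budgets are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B",           # assumed model ID
    torch_dtype=torch.bfloat16,
    device_map="auto",                          # let Accelerate place layers
    max_memory={0: "22GiB", "cpu": "200GiB"},   # offload the rest to CPU RAM
)
# This loads, but every forward/backward pass shuttles offloaded weights
# over PCIe, which is why fine-tuning this way is extremely slow.
```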

Using multiple GPUs is the only alternative for keeping fine-tuning fast enough. A configuration with 2×24 GB GPUs opens up a lot of possibilities: 48 GB of GPU memory is enough to fine-tune 70B models such as Llama 3 70B and Qwen2 72B.

In this article, I explain how to fine-tune 70B LLMs using only two GPUs thanks to FSDP and QLoRA.

I first explain what FSDP is, and then we will see how to modify a standard QLoRA fine-tuning code to run it on multiple GPUs (a sketch of the key change follows below). For the experiments and demonstrations, I use Llama 3.1 70B, but it would work similarly for other LLMs. For the hardware, I relied on 2 RTX 3090 GPUs provided by RunPod (referral link). Using 2 RTX 4090 GPUs would be faster but more expensive.
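To preview the kind of modification involved, here is a minimal, hedged sketch (not the article's full code; the model ID and LoRA hyperparameters are assumptions): for FSDP to shard a 4-bit quantized model, the quantized weights must be stored in the same dtype used for computation, which is set with `bnb_4bit_quant_storage`.

```python
# Sketch of the QLoRA-side change for FSDP; hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # storage dtype must match for FSDP sharding
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B",  # assumed model ID
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)

# The training script is then launched on both GPUs with an FSDP-enabled
# Accelerate configuration, e.g.:
#   accelerate launch --config_file fsdp_config.yaml train.py
```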

I also made a notebook implementing the code described in this article. It is available here: