Fine-tuning large language models (LLMs) has become an essential yet resource-intensive task, demanding considerable GPU memory, especially when using the AdamW optimizer, which can quickly consume the available resources. For each model parameter, AdamW stores two additional optimizer states in memory, each typically in float32 format. This translates to an extra 8 bytes per parameter, meaning that for a model with 8 billion parameters, such as Llama 3.1, roughly 64 GB of memory goes solely toward managing optimizer states.
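To make the arithmetic concrete, here is a quick back-of-the-envelope calculation in Python (the parameter count is illustrative and ignores quantization constants and other overhead):

```python
# Rough estimate of AdamW optimizer-state memory:
# two float32 states (first and second moments) per parameter.
num_params = 8e9          # e.g. an 8B-parameter model
states_per_param = 2      # exponential moving averages m and v
bytes_per_state = 4       # float32

optimizer_state_gb = num_params * states_per_param * bytes_per_state / 1e9
print(f"AdamW optimizer states: ~{optimizer_state_gb:.0f} GB")  # ~64 GB
```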
Using quantized and paged optimizers can significantly reduce this memory overhead. Libraries like bitsandbytes have made these memory-efficient approaches easy to adopt, and they have become increasingly popular.
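As a minimal sketch, and assuming a standard PyTorch training setup, the quantized and paged variants provided by bitsandbytes can be used as drop-in replacements for `torch.optim.AdamW` (the model and learning rate below are placeholders):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096)  # placeholder for an actual LLM

# Standard 32-bit AdamW states: 8 bytes of optimizer state per parameter
optimizer_32bit = bnb.optim.AdamW32bit(model.parameters(), lr=1e-4)

# 8-bit AdamW: quantizes both states, roughly 2 bytes per parameter
optimizer_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

# Paged 8-bit AdamW: can additionally evict optimizer states to CPU RAM
# when GPU memory runs short
optimizer_paged = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-4)
```

With the Hugging Face Trainer, the same choice is typically made through the `optim` field of `TrainingArguments` (e.g. `"paged_adamw_8bit"`) rather than by constructing the optimizer manually.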
In this article, we present a comparative analysis of AdamW-32bit, its 8-bit counterpart, and paged AdamW optimizers, examining their impact on memory consumption, learning curves, and training time. Our goal is to identify when memory-efficient optimizers are essential and to evaluate their trade-offs in training speed and model accuracy. In the first section, we review AdamW 8-bit and its paged variant. Then, we benchmark…