There are numerous methods to align LLMs with human preferences. With reinforcement learning from human feedback (RLHF) often seen as too resource-intensive to apply consistently to newly fine-tuned models, Direct Preference Optimization (DPO) has become one of the most popular alternatives for LLM alignment.
Although DPO is significantly cheaper than RLHF, it still requires a reference model in addition to the "policy" model (i.e., the model being actively trained). This means both models must be loaded into GPU memory simultaneously, which can be challenging for single-GPU configurations, especially with large models.
A more memory-efficient approach is to use LoRA for DPO training. Instead of training the entire model, we freeze its parameters and train a small adapter. This method becomes even more efficient when the policy and reference models share the same base model: we load the base model once, then load a frozen adapter for the reference model and a trainable adapter for the policy model, significantly reducing memory requirements.
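To make this concrete, below is a minimal sketch of single-adapter DPO training with TRL and PEFT. It assumes that passing `ref_model=None` together with a `peft_config` lets the trainer reuse the frozen base model (with the adapter disabled) as the reference; the model name, dataset, and hyperparameters are placeholders, and argument names may differ slightly across TRL versions.

```python
# Minimal sketch: DPO with LoRA, sharing one base model between policy and reference.
# Model, dataset, and hyperparameters are placeholders chosen for illustration.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model

# Load the base model once; its weights stay frozen during LoRA training.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Trainable LoRA adapter for the policy model.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Preference dataset with "prompt", "chosen", and "rejected" columns (placeholder).
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="./dpo-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    beta=0.1,
)

# With ref_model=None and a peft_config, TRL computes the reference log-probabilities
# by disabling the adapter, so the frozen base model acts as the reference without
# loading a second full copy of the model into GPU memory.
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
    peft_config=peft_config,
)
trainer.train()
```

In this setup, only the LoRA parameters receive gradients, and the reference log-probabilities come from the same base weights with the adapter switched off, which is what makes the single-GPU scenario described above practical.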
However, the effect of LoRA on DPO's performance remains understudied, in my opinion. While LoRA can closely approximate full training, its performance…