Recent text-to-image generation (T2I) models, such as Stable Diffusion and Imagen, have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues like artifacts (e.g., distorted objects, text, and body parts), misalignment with text descriptions, and low aesthetic quality. For example, the prompt in the image below says, “A panda riding a motorcycle”, but the generated image shows two pandas, along with other undesired artifacts, including distorted panda noses and wheel spokes.
Inspired by the success of reinforcement learning from human feedback (RLHF) for large language models (LLMs), we explore whether learning from human feedback (LHF) can help improve image generation models. When applied to LLMs, human feedback can range from simple preference ratings (e.g., “thumb up or down”, “A or B”) to more detailed responses like rewriting a problematic answer. However, current work on LHF for T2I mainly focuses on simple responses like preference ratings, since fixing a problematic image often requires advanced skills (e.g., editing), making it too difficult and time consuming.
In “Rich Human Feedback for Text-to-Image Generation”, we design a process to obtain rich human feedback for T2I that is both specific (e.g., telling us what is wrong with the image and where) and easy to obtain. We demonstrate the feasibility and benefits of LHF for T2I. Our main contributions are threefold:
- We curate and release RichHF-18K, a human feedback dataset covering 18K images generated by Stable Diffusion variants.
- We train a multimodal transformer model, Rich Automatic Human Feedback (RAHF), to predict different types of human feedback, such as implausibility scores, heatmaps of artifact locations, and missing or misaligned text/keywords.
- We show that the predicted rich human feedback can be leveraged to improve image generation and that the improvements generalize to models (such as Muse) beyond those used for data collection (Stable Diffusion variants).
To the best of our knowledge, this is the first rich feedback dataset and model for state-of-the-art text-to-image generation.
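To make the three kinds of feedback listed above more concrete, here is a minimal Python sketch of how one image's rich feedback might be represented and used. Everything in it is an illustrative assumption rather than the RichHF-18K schema or the RAHF interface: the `RichFeedback` class, its field names, the [0, 1] value ranges, the helper functions, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RichFeedback:
    """Hypothetical record for one generated image (not the RichHF-18K format)."""
    prompt: str
    # Assumed to lie in [0, 1]; higher means the image looks less plausible.
    implausibility_score: float
    # Assumed coarse grid of artifact intensities in [0, 1], one value per cell.
    artifact_heatmap: List[List[float]]
    # Prompt words judged missing from, or misrepresented in, the image.
    misaligned_keywords: List[str] = field(default_factory=list)


def artifact_mask(feedback: RichFeedback, threshold: float = 0.5) -> List[List[bool]]:
    """Binarize the heatmap: True where artifact intensity meets the threshold."""
    return [[value >= threshold for value in row] for row in feedback.artifact_heatmap]


def keep_plausible(records: List[RichFeedback],
                   max_implausibility: float = 0.3) -> List[RichFeedback]:
    """Keep only records whose implausibility score is at or below the cutoff."""
    return [r for r in records if r.implausibility_score <= max_implausibility]


# Made-up values loosely modeled on the panda example; not real annotations.
example = RichFeedback(
    prompt="A panda riding a motorcycle",
    implausibility_score=0.8,
    artifact_heatmap=[[0.1, 0.9], [0.6, 0.2]],
    misaligned_keywords=["A panda"],
)
mask = artifact_mask(example)
kept = keep_plausible([example])
```

In this sketch, the binarized mask marks regions that a downstream step could target for correction, and the score filter drops images above an implausibility cutoff. Both are simply one plausible way to consume such feedback, not a description of the method used in the paper.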