Preference-conditioned image generation aims to produce images that reflect a user's aesthetics in addition to prompt semantics. PrefGen leverages a fine-tuned multimodal LLM to extract preference-aware embeddings from a few liked/disliked images, decomposes the user signal into identity and semantic components, aligns the semantic component distributionally to diffusion text embeddings via an MMD loss, and injects a fused user representation into a diffusion backbone to guide generation. The pipeline yields improved image fidelity and stronger alignment to user preferences across automatic metrics and human studies.
Figure 1: Overview of our MLLM-based preference learning framework.
- e_sem: extracted from the top 4 MLLM layers with last-token pooling; captures like/dislike semantics and styles, and is the component aligned to the text space.
- e_core: embeddings from middle-to-upper layers; captures stable user identity signals that generalize across prompts.
- Alignment: e_sem is passed through a small MLP to obtain ĕ_sem, and an MMD loss is minimized between ĕ_sem and paired CLIP text embeddings (attributes). This distribution-level loss preserves diversity and avoids the collapse that point-wise losses can cause.
- Fusion and injection: the user representation e_u = [ĕ_sem; e_core; e_img], where e_img is a CLIP image embedding of a liked example, is injected into the diffusion UNet via a parallel user cross-attention branch.

Agent dataset (simulated users): ~990,998 generated images from 50,153 simulated users; each agent has ~50 attributes, and we sample like/dislike pairs per agent. This large-scale synthetic dataset provides controlled and diverse preference signals, enabling robust training of our preference extraction and alignment modules.
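The following is a minimal PyTorch sketch of the embedding extraction, MMD alignment, and parallel user cross-attention described above. The layer indices, tensor shapes, kernel bandwidth, and module names (pool_preference_embeddings, mmd_rbf, UserCrossAttention) are illustrative assumptions, not PrefGen's exact implementation.

```python
# Illustrative sketch of the preference pipeline; layer choices, dimensions,
# and hyper-parameters are assumptions, not the exact PrefGen configuration.
import torch
import torch.nn as nn


def pool_preference_embeddings(hidden_states, last_token_idx):
    """hidden_states: tuple of per-layer MLLM activations, each (B, T, D).

    e_sem : last-token pooling over the top 4 layers (like/dislike semantics, style).
    e_core: mean over a middle-to-upper band of layers (stable user identity).
    """
    batch = torch.arange(hidden_states[-1].size(0))
    top4 = torch.stack(hidden_states[-4:])                 # (4, B, T, D)
    e_sem = top4[:, batch, last_token_idx].mean(dim=0)     # (B, D)
    mid = torch.stack(hidden_states[len(hidden_states) // 2:-4])
    e_core = mid.mean(dim=(0, 2))                          # (B, D)
    return e_sem, e_core


def mmd_rbf(x, y, sigma=1.0):
    """Squared MMD with an RBF kernel between two embedding batches (B, D).

    Matches the batch of projected user-semantic embeddings to the batch of
    CLIP text embeddings at the distribution level, rather than forcing a
    point-wise match, which helps avoid collapse.
    """
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()


class UserCrossAttention(nn.Module):
    """Parallel cross-attention branch conditioned on the fused user embedding.

    Its output is added to the block's existing text cross-attention output,
    so the original text-conditioning path is left untouched.
    """
    def __init__(self, dim, user_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=user_dim,
                                          vdim=user_dim, batch_first=True)

    def forward(self, x, e_u):
        # x: (B, N, dim) spatial tokens; e_u: (B, 1, user_dim) fused user token
        out, _ = self.attn(x, e_u, e_u)
        return out


# Training-time composition (shapes illustrative):
# e_sem, e_core = pool_preference_embeddings(mllm_hidden_states, last_token_idx)
# e_sem_hat     = proj_mlp(e_sem)                       # small MLP into text space
# loss_align    = mmd_rbf(e_sem_hat, clip_text_emb)     # distribution-level alignment
# e_u = torch.cat([e_sem_hat, e_core, e_img], dim=-1).unsqueeze(1)
# h   = h + text_cross_attn(h, text_emb) + user_cross_attn(h, e_u)
```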
Figure 2: Examples from the agent (synthetic) dataset.
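As a hedged illustration of how like/dislike pairs could be sampled from such simulated agents: the attribute vocabulary, overlap score, and threshold below are our assumptions; only the agent size (~50 attributes) comes from the description above.

```python
# Hedged sketch of like/dislike sampling for a simulated user ("agent").
import random

ATTRIBUTE_POOL = [
    "watercolor", "neon palette", "minimalist", "soft lighting", "oil painting",
    "cyberpunk", "pastel colors", "high contrast", "vintage film", "isometric",
    # in practice a much larger attribute vocabulary
]


def make_agent(num_attrs=50):
    """A simulated user is a set of ~50 preferred attributes."""
    return set(random.sample(ATTRIBUTE_POOL, min(num_attrs, len(ATTRIBUTE_POOL))))


def label_image(agent, image_tags, threshold=0.3):
    """Mark an image as liked if enough of its tags match the agent's attributes."""
    overlap = len(agent & set(image_tags)) / max(len(image_tags), 1)
    return "like" if overlap >= threshold else "dislike"


# Example with two hypothetical tagged images
agent = make_agent()
for tags in (["watercolor", "soft lighting"], ["cyberpunk", "high contrast"]):
    print(tags, "->", label_image(agent, tags))
```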
Processed Pick-a-Pic (real users): We additionally build a cleaned and standardized split from the Pick-a-Pic dataset, which contains preference choices from real human users over images generated by multiple diffusion models. Because raw Pick-a-Pic consists of noisy, single-step preferences, we apply a multi-stage processing pipeline to clean and aggregate them into per-user preference histories.
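The exact stages of that pipeline are not listed here; the sketch below only illustrates the kind of cleaning such a pipeline might perform. The field names (user_id, caption, label_0/label_1, image_0/image_1), the tie-dropping rule, and the minimum-history threshold are all assumptions on our part, not the paper's procedure.

```python
# Illustrative cleaning sketch for pairwise preference data (field names assumed).
from collections import defaultdict


def clean_pickapic(rows, min_history=10):
    """Group raw pairwise choices into per-user preference histories.

    rows: iterable of dicts with user_id, caption, label_0, label_1, image_0, image_1.
    Drops ties (no clear preference) and users with too few decisive choices.
    """
    by_user = defaultdict(list)
    for r in rows:
        if r["label_0"] == r["label_1"]:          # tie / no preference -> drop
            continue
        if r["label_0"] > r["label_1"]:
            liked, disliked = r["image_0"], r["image_1"]
        else:
            liked, disliked = r["image_1"], r["image_0"]
        by_user[r["user_id"]].append(
            {"prompt": r["caption"], "liked": liked, "disliked": disliked})
    return {u: h for u, h in by_user.items() if len(h) >= min_history}
```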
Figure 3: Examples from the processed Pick-a-Pic dataset (real users).
Figure 4: Examples include product design, character design, creative ideation, and personalized avatar / character cloning — PrefGen can adapt to both stylistic and semantic preferences from few-shot histories.
Figure 5: Images generated by PrefGen. Each example shows the user's preference history on the left, the text prompt in the middle, and results from PrefGen on the right. Our approach adapts to user-specific aesthetic signals, generating outputs that more faithfully reflect the preference history.
Figure 6: Qualitative comparison with different methods. Each row shows the user’s preference and outputs from different approaches. PrefGen consistently captures both stylistic and semantic aspects of user preference, while others often fail to balance preference alignment and prompt fidelity.
Table 1: PrefGen achieves the best combination of image quality and preference alignment.