PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation

1Rutgers University, 2iN2X, 3MIT-IBM Watson AI Lab, 4Red Hat AI Innovation

Abstract

Preference-conditioned image generation aims to produce images that reflect a user's aesthetics in addition to prompt semantics. PrefGen leverages a fine-tuned multimodal LLM to extract preference-aware embeddings from a few liked/disliked images, decomposes the user signal into identity and semantic components, aligns the semantic component to the diffusion model's text-embedding distribution via an MMD loss, and injects a fused user representation into a diffusion backbone to guide generation. The pipeline yields improved image fidelity and stronger alignment with user preferences across automatic metrics and human studies.

Method

Figure 1: Overview of our MLLM-based preference learning framework.

  1. MLLM training: Fine-tune an MLLM on a preference-oriented VQA dataset so that the model learns multi-image preference reasoning.
  2. Layer probing & embedding extraction:
    • e_sem: last-token-pooled embeddings from the top 4 layers; they capture like/dislike semantics and styles and are used for alignment to the text space.
    • e_core: embeddings from middle-to-upper layers; they capture stable user-identity signals that generalize across prompts.
  3. Distribution alignment: Map e_sem through a small MLP and minimize an MMD loss against paired CLIP text embeddings of the preference attributes. This distribution-level loss preserves diversity and avoids the collapse that point-wise losses can cause (see the sketch after this list).
  4. Fusion & conditioning: Form the final user vector e_u = [ê_sem; e_core; e_img], where ê_sem is the MLP-projected semantic embedding and e_img is a CLIP image embedding of a liked example. Inject e_u into the diffusion UNet via a parallel user cross-attention branch (also sketched below).
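
To make steps 2 and 3 concrete, below is a minimal PyTorch sketch of last-token pooling, top-layer extraction, and the MMD alignment loss. The hidden sizes (4096 for the MLLM, 768 for CLIP text) and the averaging of the top-4 layer embeddings are assumptions, not details confirmed above.

```python
import torch
import torch.nn as nn

def last_token_pool(hidden, mask):
    """Pool each sequence to the embedding of its last non-padded token."""
    # hidden: (batch, seq, dim); mask: (batch, seq), 1 for real tokens
    idx = mask.sum(dim=1) - 1
    return hidden[torch.arange(hidden.size(0), device=hidden.device), idx]

def extract_e_sem(all_hidden_states, mask, top_k=4):
    """e_sem from the top-k MLLM layers; averaging them is an assumption."""
    pooled = [last_token_pool(h, mask) for h in all_hidden_states[-top_k:]]
    return torch.stack(pooled, dim=0).mean(dim=0)  # (batch, dim)

def mmd_rbf(x, y, sigma=1.0):
    """Biased RBF-kernel MMD^2 between two embedding batches."""
    k = lambda a, b: torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Small MLP mapping e_sem into the CLIP text space (dims are hypothetical).
proj = nn.Sequential(nn.Linear(4096, 1024), nn.GELU(), nn.Linear(1024, 768))

# clip_text: (batch, 768) CLIP embeddings of the paired preference attributes
# loss_align = mmd_rbf(proj(e_sem), clip_text)
```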
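
For step 4, one plausible wiring of the parallel user cross-attention branch is a gated residual module that attends UNet features to a single context token projected from e_u. The class below is a sketch under that assumption; all dimensions and the zero-initialized gate are hypothetical choices.

```python
import torch
import torch.nn as nn

class UserCrossAttention(nn.Module):
    """Hypothetical parallel branch: attend UNet features to the user vector."""
    def __init__(self, dim=320, user_dim=4096 + 4096 + 768, num_heads=8):
        super().__init__()
        self.to_ctx = nn.Linear(user_dim, dim)                # project e_u
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))              # no-op at init

    def forward(self, hidden, e_u):
        # hidden: (batch, tokens, dim) UNet features after text cross-attention
        # e_u:    (batch, user_dim) fused user vector -> one context token
        ctx = self.to_ctx(e_u).unsqueeze(1)                   # (batch, 1, dim)
        out, _ = self.attn(hidden, ctx, ctx)
        return hidden + self.gate * out                       # gated residual

# Fused user vector: concatenation of the three components.
# e_u = torch.cat([e_sem_hat, e_core, e_img], dim=-1)
```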

Data & Benchmarks

Agent dataset: 990,998 generated images from 50,153 simulated users (each agent has ~50 preference attributes, from which we sample like/dislike pairs). This large-scale synthetic dataset provides controlled, diverse preference signals, enabling robust training of our preference-extraction and alignment modules.
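
For illustration only, a simulated agent's like/dislike pairs could be sampled by ranking candidate images by overlap with the agent's attribute set; the actual dataset construction may differ, and all names below are hypothetical.

```python
def sample_like_dislike(agent_attrs, image_pool):
    """Hypothetical sketch: rank images by attribute overlap with the agent.

    agent_attrs: set of the agent's ~50 preference attributes
    image_pool:  dict mapping image id -> set of attribute tags
    """
    ranked = sorted(image_pool.items(),
                    key=lambda kv: len(agent_attrs & kv[1]), reverse=True)
    liked, disliked = ranked[0][0], ranked[-1][0]  # best and worst matches
    return liked, disliked
```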

Figure 2: Examples from the agent (synthetic) dataset.

Processed Pick-a-Pic (real users): We additionally build a cleaned and standardized split from the Pick-a-Pic dataset, which contains preference choices from real human users over images generated by multiple diffusion models. Raw Pick-a-Pic contains noisy, single-step preferences, so we run a multi-stage processing pipeline (a minimal sketch follows the list):

  • remove low-quality or inconsistent entries,
  • group samples by real user ID to construct reliable preference histories,
  • aggregate multiple pairwise choices into like/dislike sets,
  • ensure each user has enough images to form a meaningful preference profile.
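
A minimal sketch of the grouping and aggregation steps, assuming each raw record carries a user ID plus chosen/rejected image IDs after quality filtering; the field names and the minimum-history threshold are hypothetical.

```python
from collections import defaultdict

def build_preference_profiles(records, min_images=8):
    """Aggregate raw pairwise choices into per-user like/dislike sets.

    records: iterable of dicts with keys 'user_id', 'chosen_image',
    'rejected_image', already filtered for low-quality/inconsistent entries.
    """
    profiles = defaultdict(lambda: {"liked": set(), "disliked": set()})
    for r in records:
        profiles[r["user_id"]]["liked"].add(r["chosen_image"])
        profiles[r["user_id"]]["disliked"].add(r["rejected_image"])
    # Keep only users with enough history to define a meaningful profile.
    return {
        uid: p for uid, p in profiles.items()
        if len(p["liked"]) + len(p["disliked"]) >= min_images
    }
```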

Figure 3: Examples from the processed Pick-a-Pic dataset (real users).

Experiment Results

Figure 4: Example applications include product design, character design, creative ideation, and personalized avatar/character cloning; PrefGen adapts to both stylistic and semantic preferences from few-shot histories.

Figure 5: Images generated by PrefGen. Each example shows the user's preference history on the left, the text prompt in the middle, and PrefGen's results on the right. Our approach adapts to user-specific aesthetic signals, generating outputs that more faithfully reflect the preference history.

Figure 6: Qualitative comparison with baseline methods. Each row shows the user's preference and outputs from the different approaches. PrefGen consistently captures both stylistic and semantic aspects of user preference, while the baselines often fail to balance preference alignment with prompt fidelity.

Table 1: PrefGen achieves the best combination of image quality and preference alignment.