Abstract
With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and of comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotation is high-quality but does not scale, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces complex multi-tool chains with an end-to-end editing model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research. The dataset, benchmark, and code will be released.
Core Contributions
UnicEdit-10M Dataset
We propose a scalable, quality-aware data curation pipeline that yields UnicEdit-10M, a 10M-scale dataset extending beyond basic edits to cover complex spatial, viewpoint-transformation, and reasoning edits, and achieving state-of-the-art aesthetic quality compared with existing editing datasets.
Qwen-Verify Expert Model
We introduce Qwen-Verify, a 7B dual-task expert model that performs failure detection and instruction recaptioning. It achieves this with high efficiency, outperforming Qwen2.5-VL-72B at a fraction of the computational and economic cost.
UnicBench Benchmark
We construct UnicBench, a comprehensive benchmark with novel metrics to assess complex reasoning and spatial capabilities beyond simple edits. Using UnicBench, we analyze mainstream models, providing a detailed diagnosis of current limitations and a clear roadmap for future research.
Dataset Showcase
Figure 1: Representative examples of all sub-tasks from UnicEdit-10M.
Data Pipeline
Figure 2: Data curation pipeline with three stages: (1) data preparation, (2) image editing, and (3) post-verification, which filters failed edits and recaptions instructions.
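As a rough illustration of how the three stages compose, below is a minimal sketch of the curation loop. The helper names `prepare_samples`, `edit_image`, and `qwen_verify`, and the pass/recaption dictionary returned by the verifier, are hypothetical stand-ins for the data-preparation step, the end-to-end editing model, and the Qwen-Verify post-verification model; they are not taken from the released code.

```python
# Minimal sketch of the three-stage curation loop (all helper names are hypothetical).
from dataclasses import dataclass


@dataclass
class VerifiedSample:
    source_path: str
    edited_path: str
    instruction: str  # possibly rewritten by the recaptioning step


def curate(raw_pairs, prepare_samples, edit_image, qwen_verify):
    """raw_pairs: iterable of (source image path, draft instruction).

    prepare_samples -> stage 1: data preparation (filtering, instruction drafting)
    edit_image      -> stage 2: end-to-end editing model
    qwen_verify     -> stage 3: dual-task post-verification
                       (failure detection + instruction recaptioning)
    """
    kept = []
    for src, instruction in prepare_samples(raw_pairs):
        edited = edit_image(src, instruction)            # stage 2: produce the edit
        verdict = qwen_verify(src, edited, instruction)  # stage 3: verify the edit
        if not verdict["pass"]:                          # drop failed edits
            continue
        kept.append(VerifiedSample(
            source_path=src,
            edited_path=edited,
            # keep the recaptioned instruction when the verifier rewrites it
            instruction=verdict.get("recaption") or instruction,
        ))
    return kept
```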
Qwen-Verify: Expert Model Capability
Figure 3: Post-verification examples from the expert model. Base denotes Qwen2.5-VL-7B; SFT denotes the Base model after Stage-1 SFT; Ours denotes the dual-task expert model Qwen-Verify.
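Because Qwen-Verify is built on Qwen2.5-VL-7B, the sketch below shows how such a dual-task verifier could be queried using the standard Qwen2.5-VL usage in Hugging Face Transformers. The checkpoint path, prompt wording, and JSON output schema (a pass flag plus an optional recaption) are assumptions, not the released interface.

```python
# Sketch of running a Qwen2.5-VL-style verifier on one (source, edited, instruction) triple.
import json

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # standard Qwen2.5-VL preprocessing helper

MODEL_PATH = "path/to/qwen-verify"  # hypothetical checkpoint location

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)


def verify(source_path: str, edited_path: str, instruction: str) -> dict:
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": source_path},
            {"type": "image", "image": edited_path},
            {"type": "text", "text": (
                "Editing instruction: " + instruction + "\n"
                "Decide whether the second image is a successful edit of the first, "
                "and rewrite the instruction to match the actual edit if needed. "
                'Reply as JSON: {"pass": true/false, "recaption": "..."}'
            )},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    out_ids = model.generate(**inputs, max_new_tokens=256)
    out_ids = out_ids[:, inputs["input_ids"].shape[1]:]  # strip prompt tokens
    answer = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
    return json.loads(answer)  # {"pass": ..., "recaption": ...} under the assumed schema
```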
UnicBench: Benchmark Overview
UnicBench is a comprehensive benchmark designed to evaluate image editing models across diverse tasks. Beyond basic editing capabilities, UnicBench explicitly assesses spatial understanding and knowledge-driven reasoning abilities. We introduce four metrics:
- Instruction Following: Measures how well the edited image satisfies the instruction via a VLM-based cross-modal alignment score.
- Non-edit Consistency: Assesses preservation of non-target regions, penalizing unintended changes outside the specified edit area.
- Visual Quality: Instruction-conditioned assessment of naturalness, coherence, and adherence to the intended visual style.
- Reasoning Accuracy: Targets knowledge-intensive edits. A VLM first derives an intended-outcome specification from the instruction and original image. To ground this inference, each sample provides a list of reasoning points (targets, operations, expected visual changes) that guides the verifier's attention. The edited image is then checked against this specification (see the sketch after this list).
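Since all reported results are evaluated with GPT-4.1, a Reasoning Accuracy query could look roughly like the sketch below. It assumes the OpenAI Chat Completions API with base64-encoded images and a simplified 0-10 scoring prompt; the actual evaluation prompt and score format used by UnicBench are not reproduced here.

```python
# Minimal sketch of a Reasoning Accuracy query (prompt and score format are assumptions).
import base64

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def _image_part(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}


def reasoning_accuracy(original: str, edited: str, instruction: str, reasoning_points: list) -> str:
    prompt = (
        "You evaluate a knowledge-intensive image edit.\n"
        f"Editing instruction: {instruction}\n"
        "Reasoning points (targets, operations, expected visual changes):\n"
        + "\n".join(f"- {p}" for p in reasoning_points) + "\n"
        "Step 1: From the instruction and the FIRST image, state the intended outcome.\n"
        "Step 2: Check whether the SECOND image matches that outcome.\n"
        "Answer with a score from 0 to 10 and a one-sentence justification."
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            _image_part(original),   # original image
            _image_part(edited),     # edited image
        ]}],
    )
    return resp.choices[0].message.content
```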
Using UnicBench, we analyze mainstream models to provide detailed diagnostics of current limitations and clear directions for future research.
Experimental Results
Overall Performance Comparison
Figure 4: Overall score of each model on the sub-tasks in UnicBench, for EN (left) and CN (right) instructions. All results are evaluated by GPT-4.1.
Main Results
Table 1: Overall performance of different models on UnicBench. Open-source and closed-source models are marked separately, with the best performance in bold and the second best underlined.
Detailed Task Results (English)
Table 2: Detailed performance across different editing tasks (EN). Open-source and closed-source models are marked separately, with the best performance in bold and the second best underlined. All results are evaluated by GPT-4.1.
Detailed Task Results (Chinese)
Table 3: Detailed performance across different editing tasks (CN). Open-source and closed-source models are marked separately, with the best performance in bold and the second best underlined. All results are evaluated by GPT-4.1.
Benchmark Qualitative Results
Visual comparison of different models' editing results on UnicBench test cases.
Qualitative results for Attribute Editing tasks on UnicBench (EN).
Qualitative results for Object Editing tasks on UnicBench (EN).
Qualitative results for Scene Editing tasks on UnicBench (EN).
Qualitative results for Reasoning Editing tasks on UnicBench (EN).
Figure 5: Qualitative results comparing mainstream image editing models on various UnicBench test cases.
BibTeX
@inproceedings{Ye2025UnicEdit10MAD,
title={UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits},
author={Keming Ye and Zhipeng Huang and Canmiao Fu and Qingyang Liu and Jiani Cai and Zheqi Lv and Chen Li and Jing Lyu and Zhou Zhao and Shengyu Zhang},
year={2025},
url={https://api.semanticscholar.org/CorpusID:283458518}
}