Mini-DALL•E 3:

Interactive Text to Image by Prompting Large Language Models

1Beijing Institute of Technology, 2Shanghai AI Laboratory, 3Tsinghua University, 4CUHK
(Work in Progress)

Abstract

The revolution in artificial content generation has been rapidly accelerated by booming text-to-image (T2I) diffusion models. After less than two years of development, state-of-the-art models can generate images of unprecedented quality, diversity, and creativity. Unfortunately, most existing T2I models, e.g., Midjourney and Stable Diffusion, still cannot be communicated with effectively in natural language. This typically makes an engaging image hard to obtain without expertise in complex prompt tuning with weird word compositions, magic tags, and annotations. Inspired by the recently released DALLE3, a T2I model built directly into ChatGPT that talks human language, we revisit existing techniques that endeavor to align with human intent and introduce a new task, interactive text to image (iT2I), where people can interact with an LLM for interleaved high-quality image generation and question answering with stronger image-text correspondence using natural language. To address the iT2I problem, we also present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach for iT2I in a variety of common usage scenarios under different LLMs, e.g., ChatGPT, LLAMA, and Baichuan. We demonstrate that our approach can be a convenient and low-cost way to introduce the iT2I ability to any existing LLM and any text-to-image model without any training, while bringing little degradation to the LLMs' inherent capabilities in, e.g., question answering and code generation. We hope this work can draw broader attention and provide inspiration for boosting user experience in human-machine interaction alongside the image quality of next-generation T2I models.

Interactive Text to Image


We introduce a new task, interactive text to image (iT2I), where people can interact with an LLM for interleaved high-quality image generation, editing, and refinement, as well as question answering, with stronger image-text correspondence using natural language.


Types of Interactions


An iT2I system can receive various kinds of instructions, such as generation, editing, selecting, and refinement.
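To make the four instruction types concrete, here is a toy sketch of how a session might track them. Everything in it is an assumption made for illustration: the names `Interaction` and `Session` are hypothetical, image descriptions stand in for real images, and editing and refinement are both modeled as "modify the currently selected image", which the source does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Interaction(Enum):
    """The four instruction types named in the text."""
    GENERATION = "generation"
    EDITING = "editing"
    SELECTING = "selecting"
    REFINEMENT = "refinement"


@dataclass
class Session:
    """Hypothetical conversation state; strings stand in for real images."""
    images: List[str] = field(default_factory=list)
    selected: int = -1  # index of the current image, -1 when there is none

    def apply(self, kind: Interaction, argument: str) -> str:
        """Apply one instruction and return the currently selected image."""
        if kind is Interaction.GENERATION:
            # A new image is created and becomes the current one.
            self.images.append(argument)
            self.selected = len(self.images) - 1
        elif kind is Interaction.SELECTING:
            # The argument is the index of an earlier image.
            index = int(argument)
            if not 0 <= index < len(self.images):
                raise IndexError(f"no image at index {index}")
            self.selected = index
        else:
            # EDITING and REFINEMENT both modify the selected image (assumption).
            if self.selected < 0:
                raise ValueError("nothing to modify yet")
            self.images[self.selected] = f"{self.images[self.selected]}, {argument}"
        return self.images[self.selected]
```

The point of the sketch is that selecting and editing only make sense relative to earlier turns, so an iT2I system has to keep some multi-turn state rather than treat each request as an isolated prompt.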


Mini DALL•E 3


We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
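One way to picture this kind of augmentation is the minimal sketch below. It assumes the LLM has been prompted to wrap image descriptions in `<image>...</image>` tags; that tag format, the `render_response` helper, and the `fake_t2i` stand-in are all hypothetical and not taken from the source. A real system would pass a callable that invokes an actual T2I model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Union

# Assumed tag format; the text above does not say how image requests are marked.
IMAGE_TAG = re.compile(r"<image>(.*?)</image>", re.DOTALL)


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    prompt: str
    image: object  # whatever the T2I backend returns


def render_response(
    llm_output: str, t2i: Callable[[str], object]
) -> List[Union[TextPart, ImagePart]]:
    """Split LLM output into text and image parts, calling t2i for each tag."""
    parts: List[Union[TextPart, ImagePart]] = []
    cursor = 0
    for match in IMAGE_TAG.finditer(llm_output):
        # Keep any plain text that precedes the tag.
        text = llm_output[cursor:match.start()].strip()
        if text:
            parts.append(TextPart(text))
        # Hand the tagged description to the off-the-shelf T2I model.
        prompt = match.group(1).strip()
        parts.append(ImagePart(prompt, t2i(prompt)))
        cursor = match.end()
    # Keep any plain text after the last tag.
    tail = llm_output[cursor:].strip()
    if tail:
        parts.append(TextPart(tail))
    return parts


def fake_t2i(prompt: str) -> str:
    """Placeholder for a real T2I model; returns a string instead of an image."""
    return f"[image for: {prompt}]"
```

Because the LLM and the T2I model only meet through plain text, neither needs retraining in this sketch: swapping either component means changing the system prompt or the `t2i` callable.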


Results


Related Links

You may refer to the related work that serves as the foundation for our framework and code repository: Stable Diffusion XL, NExT-GPT, DALL•E 3, ChatGPT, and IP-Adapter.

BibTeX

@misc{minidalle3,
    author={Lai, Zeqiang and Zhu, Xizhou and Dai, Jifeng and Qiao, Yu and Wang, Wenhai},
    title={Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models},
    year={2023},
    url={https://github.com/Zeqiang-Lai/Mini-DALLE3},
}