Flexible and Efficient Multimodal Control for Text-to-Image Generation (2024)

Xuehai He¹ Jian Zheng² Jacob Zhiyuan Fang² Robinson Piramuthu²
Mohit Bansal³ Vicente Ordonez⁴ Gunnar A Sigurdsson² Nanyun Peng⁵ Xin Eric Wang¹
¹UC Santa Cruz, ²Amazon, ³UNC Chapel Hill, ⁴Rice University, ⁵University of California, Los Angeles
{xhe89,xwang366}@ucsc.edu

Abstract

Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.

1 Introduction

In the realm of text-to-image (T2I) generation, diffusion models exhibit exceptional performance in transforming textual descriptions into visually accurate images. Such models show great potential across a range of applications, including content creation [51, 55, 43, 47, 65, 1, 9], image editing [4, 31, 12, 70, 59, 43, 23, 5, 41], and fashion design [7]. We propose a new unified method that tackles two problems in text-to-image generation: improving the training efficiency of T2I models with respect to memory usage, computational requirements, and the need for extensive datasets [54, 51, 48]; and improving their controllability, especially under multimodal conditioning, e.g., following multiple edge maps and the guidance of text prompts at the same time, as shown in Figure 1 (c).

Controllable text-to-image generation models [42] often come at a significant training cost, with cost and model size growing linearly as different conditions are added. Our approach improves the training efficiency of existing text-to-image diffusion models and handles different structural input conditions flexibly within a single unified model. We take cues from the efficient parameterization strategies prevalent in the NLP domain [44, 27, 66, 26] and the computer vision literature [20]. The key idea is to learn shared decomposed weights for varied input conditions, ensuring that their intrinsic characteristics are conserved. Our method has several benefits: it not only achieves greater compactness [51], but also retains the full representation capacity to handle input conditions of various modalities; sharing weights across different conditions contributes to data efficiency; and the streamlined parameter space helps mitigate overfitting to singular conditions, thereby reinforcing the flexible control aspect of our model.

Meanwhile, generating images from multiple homogeneous conditional inputs, especially when they present conflicting conditions or need to align with specific text prompts, is challenging. To further augment our model's capability to handle multiple inputs from either the same or diverse modalities, as shown in Figure 1, we introduce a new training strategy with two new loss functions that strengthen the guidance of the corresponding conditions. This approach, combined with our compact parameter optimization space, empowers the model to learn and manage multiple controls efficiently, even within the same category (e.g., handling two distinct segmentation maps and two separate edge maps). Our primary contributions are summarized below:

• We propose FlexEControl, a novel text-to-image generation model for efficient controllable image generation that substantially reduces training memory overhead and model parameters through decomposition of weights shared across different conditions.
• We introduce a new training strategy to improve the flexible controllability of FlexEControl. Compared with previous works, FlexEControl can generate new images conditioned on multiple inputs from diverse compositions of multiple modalities.
• FlexEControl shows on-par performance with Uni-ControlNet [71] on controllable text-to-image generation with 41% fewer trainable parameters and 30% less training memory. Furthermore, FlexEControl exhibits enhanced data efficiency, effectively doubling it by reaching this performance with only half the amount of training data.

2 Method

An overview of our method is shown in Figure 2. In general, we use a copied Stable Diffusion encoder that accepts structural conditional input, and we train it efficiently by reducing parameters: we first apply Kronecker decomposition [67] and then low-rank decomposition to the updated weights of the copied Stable Diffusion encoder. To enhance the control from language and different input conditions, we propose a new training strategy with two newly designed loss functions. The details are given below.

2.1 Preliminary

We use Stable Diffusion 1.5 [51] in our experiments. This model falls under the category of Latent Diffusion Models (LDM) that encode input images $x$ into a latent representation $z$ via an encoder $\mathcal{E}$, such that $z=\mathcal{E}(x)$, and subsequently carry out the denoising process within the latent space $\mathcal{Z}$. An LDM is trained with a denoising objective as follows:

$$\mathcal{L}_{\text{ldm}}=\mathbb{E}_{z,c,\epsilon,t}\left[\left\lVert\hat{\epsilon}_{\theta}(z_{t}\mid c,t)-\epsilon\right\rVert^{2}\right] \tag{1}$$

where $(z,c)$ constitute data-conditioning pairs (comprising image latents and text embeddings), $\epsilon\sim\mathcal{N}(0,I)$, $t\sim\text{Uniform}(1,T)$, and $\theta$ denotes the model parameters.
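As a rough illustration of Eq. (1), the sketch below draws a single Monte Carlo sample of the denoising objective. It assumes the standard DDPM forward process $z_t=\sqrt{\bar{\alpha}_t}\,z+\sqrt{1-\bar{\alpha}_t}\,\epsilon$ with a linear noise schedule, which this section does not specify. The function `toy_denoiser` is a hypothetical placeholder for the U-Net, and plain Python lists stand in for latent tensors.

```python
import math
import random

def alpha_bar(t, T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s), s = 1..t, for a linear schedule (assumed)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod

def ldm_loss_sample(z, c, denoiser, T=1000, rng=random):
    """One Monte Carlo sample of || eps_hat(z_t | c, t) - eps ||^2."""
    t = rng.randint(1, T)                      # t ~ Uniform(1, T)
    eps = [rng.gauss(0.0, 1.0) for _ in z]     # eps ~ N(0, I)
    a = alpha_bar(t, T)
    # Forward noising of the clean latent z to timestep t.
    z_t = [math.sqrt(a) * zi + math.sqrt(1.0 - a) * ei for zi, ei in zip(z, eps)]
    eps_hat = denoiser(z_t, c, t)
    return sum((p - e) ** 2 for p, e in zip(eps_hat, eps))

def toy_denoiser(z_t, c, t):
    """Hypothetical stand-in for the U-Net: always predicts zero noise."""
    return [0.0 for _ in z_t]

rng = random.Random(0)
z = [0.5, -1.0, 0.25, 2.0]                     # a flattened toy latent
loss = ldm_loss_sample(z, c=None, denoiser=toy_denoiser, rng=rng)
```

Averaging this quantity over many samples of $(z, c, \epsilon, t)$ approximates the expectation in Eq. (1).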

2.2 Efficient Training for Controllable Text-to-Image (T2I) Generation

Our approach is motivated by empirical evidence that Kronecker decomposition [67] effectively preserves critical weight information. We employ this technique to capture the relational structures shared among different input conditions. Our hypothesis is that by combining diverse conditions under a common set of weights, data utilization can be optimized and training efficiency can be improved. We focus on decomposing and fine-tuning only the cross-attention weight matrices within the U-Net [52] of the diffusion model, which recent work [33] shows to be dominant when customizing the diffusion model. As depicted in Figure 2, the encoder copied from Stable Diffusion accepts conditional input from different modalities. During training, we posit that these modalities, being transformations of the same underlying image, share common information. Consequently, we hypothesize that the updated weights of this copied encoder, $\Delta\boldsymbol{W}$, can be efficiently adapted within a shared decomposed low-rank subspace. This leads to:

$$\Delta\boldsymbol{W}=\sum_{i=1}^{n}\boldsymbol{H}_{\boldsymbol{i}}\otimes\left(u_{i}v_{i}^{\top}\right) \tag{2}$$

where $n$ is the number of decomposed matrices, $u_{i}\in\mathbb{R}^{\frac{k}{n}\times r}$ and $v_{i}\in\mathbb{R}^{\frac{d}{n}\times r}$ are low-rank factors (so that $u_{i}v_{i}^{\top}\in\mathbb{R}^{\frac{k}{n}\times\frac{d}{n}}$), $r$ is the rank, which is a small number, $\boldsymbol{H}_{\boldsymbol{i}}$ are the decomposed learnable matrices shared across different conditions, and $\otimes$ is the Kronecker product operation. The low-rank factorization $u_{i}v_{i}^{\top}$ applies the same rank-$r$ parameterization to every Kronecker factor. This approach substantially reduces the number of trainable parameters, allowing efficient fine-tuning on downstream text-to-image generation tasks.
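To make the parameter saving of Eq. (2) concrete, the following sketch builds $\Delta\boldsymbol{W}$ for a toy $6\times 6$ case and counts parameters. It rests on two assumptions not stated explicitly in the text: each $\boldsymbol{H}_{\boldsymbol{i}}$ has shape $n\times n$ (as required for $\Delta\boldsymbol{W}$ to be $k\times d$), and each $v_i$ is stored with shape $\frac{d}{n}\times r$ so that $u_i v_i^{\top}$ is well defined. The values $n=4$, $r=2$, and $k=d=768$ in the second count are purely illustrative, and plain lists are used only to keep the sketch dependency-free.

```python
def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def kron(a, b):
    """Kronecker product: block (i, j) of the result equals a[i][j] * b."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def delta_w(H, U, V):
    """Sum over i of H_i kron (u_i v_i^T), as in Eq. (2)."""
    total = None
    for H_i, u_i, v_i in zip(H, U, V):
        term = kron(H_i, matmul(u_i, transpose(v_i)))
        if total is None:
            total = term
        else:
            total = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(total, term)]
    return total

def param_counts(k, d, n, r):
    """Parameters of a dense k x d update vs. the decomposed form."""
    full = k * d
    decomposed = n ** 3 + r * (k + d)   # n matrices H_i (n x n) plus all u_i and v_i
    return full, decomposed

# Toy instance: k = d = 6, n = 2, r = 1.
H = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]   # n matrices, each n x n
U = [[[1.0], [2.0], [3.0]], [[0.5], [0.5], [0.5]]]          # n factors, each (k/n) x r
V = [[[1.0], [0.0], [-1.0]], [[2.0], [2.0], [2.0]]]         # n factors, each (d/n) x r
dW = delta_w(H, U, V)

small = param_counts(6, 6, 2, 1)
large = param_counts(768, 768, 4, 2)   # illustrative sizes, not taken from the paper
```

In the toy case the dense update has 36 entries while the decomposed form stores 20 values; the gap widens quickly as $k$ and $d$ grow, since the decomposed count scales with $r(k+d)$ rather than $kd$.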

The intuition for why Kronecker decomposition works for fine-tuning is partly rooted in the findings of [67, 40, 20]. These studies highlight how the model weights can be broken down into a series of matrix products, thereby saving parameter space. As shown in Figure 2, the original weight matrix is 6×6 and is decomposed into a series of matrix products. When adapting this decomposition-based training approach to controllable T2I, the key lies in the shared weights, which, while being common across various conditions, retain most of the semantic information. For instance, the shared "slow" weights [61] of an image, combined with another set of "fast" low-rank weights, can preserve the original image's distribution without a loss in semantic integrity, as illustrated in Figure 3. This observation implies that updating the slow weights is crucial for adapting to diverse conditions. Following this insight, it becomes logical to learn a set of condition-shared decomposed weights in each layer, ensuring that these weights remain consistent across different scenarios. Data utilization and parameter efficiency are also improved.

2.3 Enhanced Training for Conditional Inputs

We next discuss how to improve control under multiple input conditions of varying modalities while retaining the efficient training approach.

Dataset Augmentation with Text Parsing and Segmentation

To optimize the model for scenarios involving multiple homogeneous (same-type) conditional inputs, we first augment our dataset. We use a large language model (gpt-3.5-turbo) to parse prompts containing multiple object entities. The parsing query is structured as: "Given a sentence, analyze the objects in this sentence, give me the objects if there are multiple." Following this, we apply CLIPSeg [39] (clipseg-rd64-refined version) to segment the corresponding regions in the images, allowing us to divide structural conditions into separate sub-feature maps tailored to the parsed objects.
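The last step, dividing a structural condition into per-object sub-feature maps, might look like the sketch below. It assumes that the segmentation model yields a per-pixel probability map for each parsed object, that the map is binarized at a threshold of 0.5, and that the split is an element-wise product with the condition; none of these details are given in the text. The object names and probability values are hypothetical stand-ins for real parser and CLIPSeg outputs.

```python
def binarize(prob_map, threshold=0.5):
    """Turn a per-pixel probability map into a binary mask (threshold is an assumption)."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_map]

def split_condition(condition, masks):
    """Return {object name: sub-feature map}, zeroing the condition outside each mask."""
    return {
        name: [[c * m for c, m in zip(c_row, m_row)]
               for c_row, m_row in zip(condition, mask)]
        for name, mask in masks.items()
    }

# Toy 2x4 edge map and hypothetical segmentation probabilities for two parsed objects.
condition = [[1, 0, 2, 0],
             [0, 3, 0, 4]]
probs = {
    "cat": [[0.9, 0.8, 0.1, 0.2], [0.7, 0.6, 0.3, 0.1]],
    "dog": [[0.1, 0.2, 0.9, 0.8], [0.3, 0.4, 0.7, 0.9]],
}
masks = {name: binarize(p) for name, p in probs.items()}
sub_maps = split_condition(condition, masks)
```

Each sub-map keeps only the part of the condition that falls inside the corresponding object region, so it can be paired with that object's text tokens during training.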

Cross-Attention Supervision

For each identified segment $i$, we calculate a unified attention map, $\boldsymbol{A}_{i}$, by aggregating attention over the $N$ text tokens relevant to that segment and averaging across the $L$ layers:

$$\boldsymbol{A}_{i}=\frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{N}\llbracket T_{k}\in\mathcal{T}_{i}\rrbracket\,\mathbf{CA}_{k}^{l}, \tag{3}$$

where $\llbracket\cdot\rrbracket$ is the Iverson bracket, $\mathbf{CA}_{k}^{l}$ is the cross-attention map for token $T_{k}$ in layer $l$, and $\mathcal{T}_{i}$ denotes the set of tokens associated with the $i$-th segment.
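The aggregation in Eq. (3) can be sketched as follows: for one segment, the cross-attention maps of the tokens that belong to that segment are summed, and the result is averaged over layers. The sketch assumes that the maps from all layers have already been resized to a common resolution, which the text does not state; the nested-list layout `ca[layer][token]` is likewise an illustrative choice.

```python
def unified_attention(ca, segment_tokens):
    """Aggregate cross-attention for one segment.

    ca[l][k] is the 2D cross-attention map of token k in layer l.
    segment_tokens is the set of token indices associated with the segment.
    """
    num_layers = len(ca)
    h, w = len(ca[0][0]), len(ca[0][0][0])
    out = [[0.0] * w for _ in range(h)]
    for layer in ca:
        for k, attn in enumerate(layer):
            if k in segment_tokens:                 # Iverson bracket: token in segment
                for y in range(h):
                    for x in range(w):
                        out[y][x] += attn[y][x] / num_layers
    return out

# Two layers, three tokens, 1x2 attention maps (toy values).
ca = [
    [[[0.2, 0.8]], [[0.5, 0.5]], [[1.0, 0.0]]],
    [[[0.4, 0.6]], [[0.1, 0.9]], [[0.0, 1.0]]],
]
A = unified_attention(ca, segment_tokens={0, 2})
```

Tokens outside the segment (token 1 here) contribute nothing, which is exactly the role of the Iverson bracket.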

The model is trained to predict noise for image-text pairs concatenated based on the parsed and segmented results. An additional loss term, designed to ensure focused reconstruction in areas relevant to each text-derived concept, is introduced. Inspired by [2], this loss is calculated as the Mean Squared Error (MSE) deviation from predefined masks corresponding to the segmented regions:

$$\mathcal{L}_{\text{ca}}=\mathbb{E}_{z,t}\left[\left\|\boldsymbol{A}_{i}(v_{i},z_{t})-M_{i}\right\|_{2}^{2}\right], \tag{4}$$

where $\boldsymbol{A}_{i}(v_{i},z_{t})$ is the cross-attention map between token $v_{i}$ and noisy latent $z_{t}$, and $M_{i}$ represents the mask for the $i$-th segment, which is derived from the segmented regions in our augmented dataset and resized to match the dimensions of the cross-attention maps.
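A single-sample version of the cross-attention loss in Eq. (4) is sketched below. It assumes nearest-neighbour resizing of the mask to the attention resolution (the text only says the mask is resized), and it computes the squared L2 norm exactly as written in the equation; dividing by the number of pixels would give the mean squared error and differs only by a constant factor.

```python
def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2D mask (the resizing method is an assumption)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

def cross_attention_loss(attn, mask):
    """Squared L2 distance between an attention map and the resized segment mask."""
    m = resize_nearest(mask, len(attn), len(attn[0]))
    return sum((a - b) ** 2
               for attn_row, mask_row in zip(attn, m)
               for a, b in zip(attn_row, mask_row))

# 4x4 segment mask covering the top-left quadrant, 2x2 attention map (toy values).
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
attn = [[0.9, 0.1],
        [0.0, 0.0]]
loss_ca = cross_attention_loss(attn, mask)
```

The loss is small when attention concentrates inside the segment and grows as attention leaks into regions outside the mask.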

Masked Noise Prediction

To ensure fidelity to the specified conditions, we apply a condition-selective diffusion loss that concentrates the denoising effort on conceptually significant regions. This focused loss function is applied solely to pixels within the regions delineated by the concept masks, which are derived from the non-zero features of the input structural conditions. Specifically, we use binary masks in which areas with non-zero features are assigned a value of one [21] and areas lacking features are set to zero. Because pose features are sparse, we use an all-ones mask for the pose condition. These masks serve to underscore the regions referenced in the corresponding text prompts:

$$\mathcal{L}_{\text{mask}}=\mathbb{E}_{z,\epsilon,t}\left[\left\|(\epsilon-\epsilon_{\theta}(z_{t},t))\odot M\right\|_{2}^{2}\right], \tag{5}$$

where $M$ represents the union of the binary masks obtained from the input conditions, $z_{t}$ denotes the noisy latent at timestep $t$, $\epsilon$ the injected noise, and $\epsilon_{\theta}$ the noise estimated by the denoising network (U-Net).
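The masked noise prediction loss in Eq. (5) can be illustrated with a small single-sample sketch. It assumes single-channel condition maps that are already at the resolution of the noise tensors; how the masks are brought to latent resolution is not described in the text, and the toy values below are illustrative only.

```python
def binary_mask(condition):
    """1 where the condition has a non-zero feature, 0 elsewhere."""
    return [[1 if v != 0 else 0 for v in row] for row in condition]

def union(masks):
    """Pixel-wise union of several binary masks of the same shape."""
    h, w = len(masks[0]), len(masks[0][0])
    return [[1 if any(m[y][x] for m in masks) else 0 for x in range(w)]
            for y in range(h)]

def masked_noise_loss(eps, eps_pred, mask):
    """Squared L2 norm of the noise prediction error restricted to the mask."""
    return sum(((e - p) * m) ** 2
               for eps_row, pred_row, mask_row in zip(eps, eps_pred, mask)
               for e, p, m in zip(eps_row, pred_row, mask_row))

# Two toy conditions with features in different pixels.
edge = [[0, 3], [0, 0]]
depth = [[0, 0], [5, 0]]
M = union([binary_mask(edge), binary_mask(depth)])

eps = [[1.0, 1.0], [1.0, 1.0]]        # injected noise
eps_pred = [[0.0, 0.5], [0.5, 1.0]]   # hypothetical network prediction
loss_mask = masked_noise_loss(eps, eps_pred, M)
```

The large error at the top-left pixel is ignored because no condition has a feature there, whereas errors inside the union mask are penalized. For a pose condition, `M` would simply be all ones.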

The total loss function employed is:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{ldm}}+\lambda_{\text{ca}}\mathcal{L}_{\text{ca}}+\lambda_{\text{mask}}\mathcal{L}_{\text{mask}}, \tag{6}$$

with $\lambda_{\text{ca}}$ and $\lambda_{\text{mask}}$ both set to 0.01. Together, $\mathcal{L}_{\text{ca}}$ and $\mathcal{L}_{\text{mask}}$ ensure that the model focuses on reconstructing the conditional regions and attends to the guided regions during generation.
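As a quick numeric illustration of Eq. (6), the sketch below combines the three terms with both weights at the stated value of 0.01. The individual loss values are made up for the example.

```python
def total_loss(l_ldm, l_ca, l_mask, lambda_ca=0.01, lambda_mask=0.01):
    """Weighted sum of the denoising, cross-attention, and masked noise losses."""
    return l_ldm + lambda_ca * l_ca + lambda_mask * l_mask

# Illustrative loss values only.
loss_total = total_loss(l_ldm=1.0, l_ca=2.0, l_mask=0.5)
```

With these weights the two auxiliary terms act as small corrections on top of the standard denoising objective rather than dominating it.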

3 Experiments

3.1 Datasets

To train our model for controllable text-to-image (T2I) generation, we use the LAION improved_aesthetics_6plus [57] dataset. Specifically, we curate a subset of 5,082,236 instances by removing duplicates and filtering on criteria such as resolution and NSFW score. Because our tasks target controlled generation, the training data also includes additional input conditions, namely edge maps, sketch maps, depth maps, segmentation maps, and pose maps. These features are extracted following the methodology of [68].

3.2 Evaluation Metrics

We employ a comprehensive benchmark suite of metrics including mIoU [50], SSIM [60], mAP, MSE, FID [25], and CLIP Score [24, 46] (https://github.com/jmhessel/clipscore). The details are given in the Appendix.
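For readers unfamiliar with the simpler metrics, the sketch below shows the standard textbook definitions of mIoU and MSE on flattened label and value lists. This is an illustration only: the exact evaluation protocol is deferred to the Appendix, and the other metrics (SSIM, mAP, FID, CLIP Score) require dedicated implementations that are not sketched here.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        uni = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if uni:                      # skip classes absent from both maps
            ious.append(inter / uni)
    return sum(ious) / len(ious)

def mse(a, b):
    """Mean squared error between two equally sized value lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy flattened segmentation maps with two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
miou = mean_iou(pred, target, num_classes=2)

depth_error = mse([1.0, 2.0], [1.0, 4.0])
```

In the toy segmentation example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.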

3.3 Experimental Setup

Following the configuration of Uni-ControlNet, we use Stable Diffusion 1.5 (https://huggingface.co/runwayml/stable-diffusion-v1-5) as the base model. We train for a single epoch using the AdamW optimizer [32] with a learning rate of $10^{-5}$. In all experiments, input and conditional images are resized to $512\times 512$. Fine-tuning was run on P3 AWS EC2 instances equipped with 64 NVIDIA V100 GPUs.

For quantitative assessment, we use a subset of 10,000 high-quality images from the LAION improved_aesthetics_6.5plus dataset. Input conditions are resized to $512\times 512$ at inference time.

3.3.1 Structural Input Condition Extraction

We first describe how the local conditions used in our experiments are processed. To enable a comprehensive evaluation, we incorporate a diverse range of structural conditions, each processed using a specialized technique:

• Edge Maps: For generating edge maps, we use three distinct techniques:
  – Canny Edge Detector [6]: a widely used method for edge detection in images.
  – HED Boundary Extractor [63]: Holistically-Nested Edge Detection, an advanced technique for identifying object boundaries.
  – MLSD [17]: a method designed for detecting multi-scale line segments in images.
• Sketch Maps: We adopt the sketch extraction technique detailed in [58] to convert images into their sketch representations.
• Pose Information: OpenPose [8] is employed to extract human pose information from images, providing detailed body joint and keypoint information.
• Depth Maps: For depth estimation, we use MiDaS [49], a robust method for predicting depth from single images.
• Segmentation Maps: Images are segmented using the method outlined in [62], which focuses on accurately segmenting various objects within an image.

3.4 Baselines

In our comparative evaluation, we assess T2I-Adapter [42], PHM [67], Uni-ControlNet [71], and LoRA [27].

3.5 Quantitative Results

Table 1: Training efficiency comparison.
Models	Memory Cost $\downarrow$	# Params. $\downarrow$	Training Time $\downarrow$
Uni-ControlNet [71]	20.47 GB	1271M	5.69 $\pm$ 1.33 s/it
LoRA [27]	17.84 GB	1074M	3.97 $\pm$ 1.27 s/it
PHM [67]	15.08 GB	819M	3.90 $\pm$ 2.01 s/it
FlexEControl (ours)	14.33 GB	750M	2.15 $\pm$ 1.42 s/it

Table 2: Comparison across input conditions.
Models	Canny (SSIM) $\uparrow$	MLSD (SSIM) $\uparrow$	HED (SSIM) $\uparrow$	Sketch (SSIM) $\uparrow$	Depth (MSE) $\downarrow$	Segmentation (mIoU) $\uparrow$	Poses (mAP) $\uparrow$	FID $\downarrow$	CLIP Score $\uparrow$
T2IAdapter [42]	0.4480	-	-	0.5241	90.01	0.6983	0.3156	27.80	0.4957
Uni-Control [45]	0.4977	0.6374	0.4885	0.5509	90.04	0.7143	0.2083	27.80	0.4899
Uni-ControlNet [71]	0.4910	0.6083	0.4715	0.5901	90.17	0.7084	0.2125	27.74	0.4890
PHM [67]	0.4365	0.5712	0.4633	0.4878	91.38	0.5534	0.1664	27.91	0.4961
LoRA [27]	0.4497	0.6381	0.5043	0.5097	89.09	0.5480	0.1538	27.99	0.4832
FlexEControl (ours)	0.4990	0.6385	0.5041	0.5518	90.93	0.7496	0.2093	27.55	0.4963

Table 3: Ablation with models trained on the same 100,000-sample subset.
Setting	Models	Canny (SSIM) $\uparrow$	MLSD (SSIM) $\uparrow$	HED (SSIM) $\uparrow$	Sketch (SSIM) $\uparrow$	Depth (MSE) $\downarrow$	Segmentation (mIoU) $\uparrow$	Poses (mAP) $\uparrow$	FID $\downarrow$	CLIP Score $\uparrow$
Single Conditioning	Uni-ControlNet	0.3268	0.4097	0.3177	0.4096	98.80	0.4075	0.1433	29.43	0.4844
Single Conditioning	FlexEControl (w/o $L_{ca}$)	0.3698	0.4905	0.3870	0.4855	94.90	0.4449	0.1432	28.03	0.4874
Single Conditioning	FlexEControl (w/o $L_{mask}$)	0.3701	0.4894	0.3805	0.4879	94.30	0.4418	0.1432	28.19	0.4570
Single Conditioning	FlexEControl	0.3711	0.4920	0.3871	0.4869	94.83	0.4479	0.1432	28.03	0.4877
Multiple Conditioning	Uni-ControlNet	0.3078	0.3962	0.3054	0.3871	98.84	0.3981	0.1393	28.75	0.4828
Multiple Conditioning	FlexEControl (w/o $L_{ca}$)	0.3642	0.4901	0.3704	0.4815	94.95	0.4368	0.1405	28.50	0.4870
Multiple Conditioning	FlexEControl (w/o $L_{mask}$)	0.3666	0.4834	0.3712	0.4831	94.89	0.4400	0.1406	28.68	0.4542
Multiple Conditioning	FlexEControl	0.3690	0.4915	0.3784	0.4849	92.90	0.4429	0.1411	28.24	0.4873

Table 1 highlights FlexEControl's superior efficiency compared to Uni-ControlNet. It achieves a 30% reduction in memory cost, lowers trainable parameters by 41% (from 1271M to 750M), and significantly reduces training time per iteration from 5.69s to 2.15s.
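The percentages quoted above follow directly from the reported measurements, as the short check below shows. The helper name `reduction` is illustrative only.

```python
def reduction(baseline, ours):
    """Relative reduction in percent with respect to the baseline."""
    return 100.0 * (baseline - ours) / baseline

memory_reduction = reduction(20.47, 14.33)   # memory cost in GB
param_reduction = reduction(1271, 750)       # trainable parameters in millions
speedup = 5.69 / 2.15                        # ratio of mean time per iteration
```

The memory and parameter reductions round to 30% and 41% respectively, and the mean time per iteration drops by a factor of roughly 2.6.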

Table 2 compares FlexEControl with Uni-ControlNet and T2IAdapter across diverse input conditions. After training on a dataset of 5M text-image pairs, FlexEControl achieves comparable, if not superior, performance relative to Uni-ControlNet and T2IAdapter. Note that Uni-ControlNet is trained on a much larger dataset (10M text-image pairs from the LAION dataset). Although SSIM scores for sketch maps and mAP scores for poses decrease marginally, FlexEControl surpasses Uni-ControlNet and T2IAdapter on most other metrics. This underscores our method's proficiency in enhancing efficiency while maintaining overall quality and accuracy in controllable text-to-image generation tasks.

To verify that FlexEControl improves training efficiency while maintaining model performance, and to ensure a fair comparison, we conducted an ablation study in which all models are trained on an identical dataset. We trained FlexEControl, along with its variants, and Uni-ControlNet on a subset of 100,000 training samples from LAION improved_aesthetics_6plus. The outcomes are presented in Table 3. When trained on the same data, FlexEControl exhibits substantial improvements over Uni-ControlNet. This underscores the effectiveness of our approach in optimizing data utilization while reducing computational costs and improving efficiency in text-to-image generation.

To validate FlexEControl's effectiveness in handling multiple structural conditions, we compared it with Uni-ControlNet through human evaluations. Two scenarios were considered: multiple homogeneous input conditions (300 images, each generated with 2 Canny edge maps) and multiple heterogeneous input conditions (500 images, each generated with 2 randomly selected conditions). Results, summarized in Table 4, show that in the homogeneous setting FlexEControl was preferred by 64.00% of annotators, significantly outperforming Uni-ControlNet (23.67%). This underscores FlexEControl's proficiency with complex, homogeneous inputs. Additionally, FlexEControl demonstrated superior alignment with input conditions (67.33%) compared to Uni-ControlNet (23.00%). In scenarios with random heterogeneous conditions, most comparisons were judged as ties (87.40% for preference and 89.49% for alignment), but FlexEControl was still preferred more often than Uni-ControlNet for both overall quality (9.80% vs. 2.80%) and alignment (6.60% vs. 4.00%).

In addition to our primary comparisons, we conducted a further quantitative evaluation of FlexEControl and Uni-ControlNet. This evaluation assesses image quality in scenarios involving multiple conditions from both homogeneous and heterogeneous modalities. The findings are summarized in Table 5. FlexEControl consistently outperforms Uni-ControlNet in both categories, with lower FID scores indicating better image quality and higher CLIP scores indicating improved alignment with text prompts.

Table 4: Human evaluation of FlexEControl against Uni-ControlNet.
Condition Type	Metric	Win	Tie	Lose
Homogeneous	Human Preference (%)	64.00	12.33	23.67
Homogeneous	Condition Alignment (%)	67.33	9.67	23.00
Heterogeneous	Human Preference (%)	9.80	87.40	2.80
Heterogeneous	Condition Alignment (%)	6.60	89.49	4.00

Table 5: Image quality under multiple conditions.
Condition Type	Model	FID $\downarrow$	CLIP Score $\uparrow$
Heterogeneous	Uni-ControlNet	27.81	0.4869
Heterogeneous	FlexEControl	27.47	0.4981
Homogeneous	Uni-ControlNet	28.98	0.4858
Homogeneous	FlexEControl	27.65	0.4932

3.6 Qualitative Results

We present qualitative results of FlexEControl under three different settings: single input condition, multiple heterogeneous conditions, and multiple homogeneous conditions, illustrated in Figure 5, Figure 4, and Figure 6, respectively. The results indicate that FlexEControl is comparable to baseline models when a single condition is given. With multiple conditions, however, FlexEControl consistently and noticeably outperforms other models. In particular, under multiple homogeneous conditions, FlexEControl generates higher-quality images that align more closely with the input conditions.

4 Related Work

FlexEControl is an instance of both efficient training and controllable text-to-image generation. Here, we review prior work on efficient training aimed at reducing parameters and memory cost, and on controllable T2I generation.

Efficient Training

Prior work has proposed efficient training methodologies for both pretraining and fine-tuning. These methods have established their efficacy across an array of language and vision tasks. One such strategy is Prompt Tuning [35], where trainable prompt tokens are appended to pretrained models [56, 30, 29, 22]. These tokens can be added exclusively to the input embeddings or to all intermediate layers [37], allowing for nuanced model control and performance optimization. Low-Rank Adaptation (LoRA) [27] is another approach, which introduces trainable rank decomposition matrices for the parameters of each layer. LoRA has exhibited promising fine-tuning ability on large generative models, including diffusion models [19], indicating its potential for broader application. Furthermore, Adapters insert lightweight adaptation modules into each layer of a pretrained transformer [26, 53]. This method has been successfully extended across various setups [69, 16, 42], demonstrating its adaptability and practicality. Other approaches, including post-training model compression [14], facilitate the transition from a fully optimized model to a compressed version that is sparse [15], quantized [36, 18], or both. This methodology has been particularly helpful for parameter quantization [13]. Different from these methodologies, our work puts forth a new unified strategy that makes the training of text-to-image diffusion models more efficient by leveraging low-rank structure. Our proposed method integrates principles from these established techniques to offer a fresh perspective on training efficiency.

Controllable Text-to-Image Generation

Recent developments in text-to-image generation strive for more control over image generation, enabling more targeted, stable, and accurate visual outputs. Several models, such as T2I-Adapter [42] and Composer [28], have emerged to enhance image generation following the semantic guidance of text prompts and multiple different structural conditional controls. However, existing methods struggle with multiple conditions from the same modality, especially when those conditions conflict, e.g., following multiple segmentation maps and the guidance of text prompts at the same time. Recent studies also highlight challenges in controllable text-to-image (T2I) generation, such as omission of objects in text prompts and mismatched attributes [34, 3], showing that current models struggle to handle controls from different conditions. To address these issues, the Attend-and-Excite method [10] refines attention regions to ensure distinct attention across separate image regions. ReCo [64], GLIGEN [38], and Layout-Guidance [11] allow for image generation informed by bounding boxes and regional descriptions. Our work improves the model's controllability by proposing a new training strategy.

5 Conclusion

This work introduces a unified approach that improves both the flexibility and efficiency of diffusion-based text-to-image generation. Our experimental results demonstrate a substantial reduction in memory cost and trainable parameters without compromising inference time or performance. Future work may explore more sophisticated decomposition techniques, furthering the pursuit of an optimal balance between model efficiency, complexity, and expressive power.

References

[1]Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski.Break-a-scene: Extracting multiple concepts from a single image.In SIGGRAPH Asia 2023 Conference Papers, pages 1–12, 2023.
[2]Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski.Break-a-scene: Extracting multiple concepts from a single image.In SIGGRAPH Asia 2023 Conference Papers, pages 1–12, 2023.
[3]EslamMohamed Bakr, Pengzhan Sun, Xiaogian Shen, FaizanFarooq Khan, LiErran Li, and Mohamed Elhoseiny.HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20041–20053, October 2023.
[4]Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, etal.ediffi: Text-to-image diffusion models with an ensemble of expert denoisers.arXiv preprint arXiv:2211.01324, 2022.
[5]Tim Brooks, Aleksander Holynski, and AlexeiA Efros.Instructpix2pix: Learning to follow image editing instructions.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
[6]John Canny.A computational approach to edge detection.IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
[7]Shidong Cao, Wenhao Chai, Shengyu Hao, Yanting Zhang, Hangyue Chen, and Gaoang Wang.Difffashion: Reference-based fashion design with structure-aware transfer by diffusion models.IEEE Transactions on Multimedia, 2023.
[8]Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh.Realtime multi-person 2d pose estimation using part affinity fields.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
[9]Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, WilliamT Freeman, Michael Rubinstein, etal.Muse: Text-to-image generation via masked generative transformers.arXiv preprint arXiv:2301.00704, 2023.
[10]Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or.Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
[11]Minghao Chen, Iro Laina, and Andrea Vedaldi.Training-free layout control with cross-attention guidance.arXiv preprint arXiv:2304.03373, 2023.
[12]Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.Diffedit: Diffusion-based semantic image editing with mask guidance.arXiv preprint arXiv:2210.11427, 2022.
[13]Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer.Qlora: Efficient finetuning of quantized llms.arXiv preprint arXiv:2305.14314, 2023.
[14]Gongfan Fang, Xinyin Ma, and Xinchao Wang.Structural pruning for diffusion models.arXiv preprint arXiv:2305.10924, 2023.
[15]Elias Frantar and Dan Alistarh.Massive language models can be accurately pruned in one-shot.arXiv preprint arXiv:2301.00774, 2023.
[16]Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao.Clip-adapter: Better vision-language models with feature adapters.arXiv preprint arXiv:2110.04544, 2021.
[17]Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin.Towards light-weight and real-time line segment detection.In Proceedings of the AAAI Conference on Artificial Intelligence, volume36, pages 726–734, 2022.
[18]Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo.Vector quantized diffusion model for text-to-image synthesis.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
[19]Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, WilliamYang Wang, and XinEric Wang.Discriminative diffusion models as few-shot vision and language learners.arXiv preprint arXiv:2305.10722, 2023.
[20]Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, and XinEric Wang.Parameter-efficient fine-tuning for vision transformers.arXiv preprint arXiv:2203.16329, 2022.
[21]Xuehai He and XinEric Wang.Multimodal graph transformer for multimodal question answering.arXiv preprint arXiv:2305.00581, 2023.
[22]Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, WilliamYang Wang, and XinEric Wang.Cpl: Counterfactual prompt learning for vision and language models.arXiv preprint arXiv:2210.10362, 2022.
[23]Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or.Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022.
[24]Jack Hessel, Ari Holtzman, Maxwell Forbes, RonanLe Bras, and Yejin Choi.CLIPScore: a reference-free evaluation metric for image captioning.In EMNLP, 2021.
[25]Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017.
[26]Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly.Parameter-Efficient Transfer Learning for NLP.arXiv:1902.00751 [cs, stat], June 2019.
[27]EdwardJ. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.LoRA: Low-Rank Adaptation of Large Language Models.arXiv:2106.09685 [cs], Oct. 2021.
[28] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
[29] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. arXiv preprint arXiv:2203.12119, 2022.
[30] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. arXiv preprint arXiv:2112.04478, 2021.
[31] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
[32] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[33] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
[34] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
[35] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[36] Xiuyu Li, Long Lian, Yijiang Liu, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-diffusion: Quantizing diffusion models. arXiv preprint arXiv:2302.04304, 2023.
[37] Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190 [cs], Jan. 2021.
[38] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[39] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7086–7096, June 2022.
[40] Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. arXiv:2106.04647 [cs], June 2021.
[41] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
[42] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
[43] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[44] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018.
[45] Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild. arXiv preprint arXiv:2305.11147, 2023.
[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[47] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
[48] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[49] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
[50] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019.
[51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[52] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[53] Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. AdapterDrop: On the Efficiency of Adapters in Transformers. arXiv:2010.11918 [cs], Oct. 2021.
[54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.
[55] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[56] Timo Schick and Hinrich Schütze. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676, 2020.
[57] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[58] Edgar Simo-Serra, Satoshi Iizuka, Kazuma Sasaki, and Hiroshi Ishikawa. Learning to simplify: fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics (TOG), 35(4):1–11, 2016.
[59] Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. Unitune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477, 2022.
[60] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[61] Yeming Wen, Dustin Tran, and Jimmy Ba. Batchensemble: an alternative approach to efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715, 2020.
[62] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018.
[63] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
[64] Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, et al. Reco: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14246–14255, 2023.
[65] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
[66] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. arXiv:2106.10199 [cs], June 2021.
[67] Aston Zhang, Yi Tay, Shuai Zhang, Alvin Chan, Anh Tuan Luu, Siu Cheung Hui, and Jie Fu. Beyond fully-connected layers with quaternions: Parameterization of hypercomplex multiplications with $1/n$ parameters. arXiv preprint arXiv:2102.08597, 2021.
[68] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[69] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. arXiv:2111.03930 [cs], Nov. 2021.
[70] Zhongping Zhang, Jian Zheng, Jacob Zhiyuan Fang, and Bryan A Plummer. Text-to-image editing by image information removal. arXiv preprint arXiv:2305.17489, 2023.
[71] Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. Uni-controlnet: All-in-one control to text-to-image diffusion models. arXiv preprint arXiv:2305.16322, 2023.