Xuehai He1 Jian Zheng2 Jacob Zhiyuan Fang2 Robinson Piramuthu2
Mohit Bansal3 Vicente Ordonez4 Gunnar A Sigurdsson2 Nanyun Peng5 Xin Eric Wang1
1UC Santa Cruz, 2Amazon, 3UNC Chapel Hill, 4Rice University, 5University of California, Los Angeles
{xhe89,xwang366}@ucsc.edu
Abstract
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.
1 Introduction
In the realm of text-to-image (T2I) generation, diffusion models exhibit exceptional performance in transforming textual descriptions into visually accurate images. Such models show great potential across many applications, including content creation[51, 55, 43, 47, 65, 1, 9], image editing[4, 31, 12, 70, 59, 43, 23, 5, 41], and fashion design[7]. We propose a new unified method that tackles two problems in text-to-image generation: improving the training efficiency of T2I models with respect to memory usage, computational requirements, and the need for extensive datasets[54, 51, 48]; and improving their controllability, especially under multimodal conditioning, e.g., following multiple edge maps and the guidance of text prompts at the same time, as shown in Figure 1 (c).
Controllable text-to-image generation models[42] often incur a significant computational cost for training, with cost and size growing linearly as more conditions are added. Our approach improves the training efficiency of existing text-to-image diffusion models and handles different structural input conditions flexibly within one unified model. We take cues from the efficient parameterization strategies prevalent in the NLP domain[44, 27, 66, 26] and computer vision literature[20]. The key idea is to learn shared decomposed weights for varied input conditions, ensuring their intrinsic characteristics are conserved. Our method has several benefits: (i) it not only achieves greater compactness[51], but also retains the full representation capacity to handle input conditions of various modalities; (ii) sharing weights across different conditions contributes to data efficiency; and (iii) the streamlined parameter space helps mitigate overfitting to singular conditions, thereby reinforcing the flexible control aspect of our model.
Meanwhile, generating images from multiple homogeneous conditional inputs, especially when they present conflicting conditions or need to align with specific text prompts, is challenging. To further augment our model’s capability to handle multiple inputs from either the same or diverse modalities, as shown in Figure 1, we introduce a new training strategy with two new loss functions that strengthen the guidance of the corresponding conditions. This approach, combined with our compact parameter optimization space, empowers the model to learn and manage multiple controls efficiently, even within the same category (e.g., handling two distinct segmentation maps and two separate edge maps). Our primary contributions are summarized below:
- We propose FlexEControl, a novel text-to-image generation model for efficient controllable image generation that substantially reduces training memory overhead and model parameters through decomposition of weights shared across different conditions.
- We introduce a new training strategy to improve the flexible controllability of FlexEControl. Compared with previous works, FlexEControl can generate new images conditioned on multiple inputs from diverse compositions of multiple modalities.
- FlexEControl shows on-par performance with Uni-ControlNet[71] on controllable text-to-image generation with 41% fewer trainable parameters and 30% less training memory. Furthermore, FlexEControl exhibits enhanced data efficiency, reaching this performance with only half the amount of training data.
2 Method
An overview of our method is shown in Figure 2. We use a copy of the Stable Diffusion encoder that accepts structural conditional inputs, and train it efficiently by reducing parameters: we first apply Kronecker decomposition[67] and then a low-rank decomposition to the updated weights of the copied Stable Diffusion encoder. To enhance the control from language and different input conditions, we propose a new training strategy with two newly designed loss functions. The details follow.
2.1 Preliminary
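We use Stable Diffusion 1.5[51] in our experiments. This model falls under the category of Latent Diffusion Models (LDM) that encode input images $x$ into a latent representation $z$ via an encoder $\mathcal{E}$, such that $z = \mathcal{E}(x)$, and subsequently carry out the denoising process within the latent space $\mathcal{Z}$. An LDM is trained with a denoising objective as follows:
$$\mathcal{L}_{ldm} = \mathbb{E}_{z, c, \epsilon, t}\left[\left\lVert \epsilon_\theta(z_t, t, c) - \epsilon \right\rVert_2^2\right] \tag{1}$$
where $(z, c)$ constitute data-conditioning pairs (comprising image latents and text embeddings), $\epsilon \sim \mathcal{N}(0, I)$, $t \sim \mathrm{Uniform}(1, T)$, and $\theta$ denotes the model parameters.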
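As a minimal sketch of the denoising objective described above, the NumPy snippet below noises a toy latent and scores a noise prediction with a mean squared error. The linear beta schedule, the tensor shapes, and the helper names (`ddpm_alpha_bar`, `add_noise`, `denoising_loss`) are assumptions made for illustration; the U-Net is replaced by trivial stand-in predictions, so this is not the training code used in the paper.

```python
import numpy as np

def ddpm_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, eps, t, alpha_bar):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and injected noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = ddpm_alpha_bar()
z0 = rng.standard_normal((4, 64, 64))     # toy latent: (channels, height, width)
eps = rng.standard_normal(z0.shape)       # injected Gaussian noise
t = int(rng.integers(0, len(alpha_bar)))  # uniformly sampled timestep
z_t = add_noise(z0, eps, t, alpha_bar)    # noisy latent a real denoiser would receive

# Hypothetical stand-ins for the denoiser output: a perfect and an all-zero prediction.
loss_perfect = denoising_loss(eps, eps)
loss_zero = denoising_loss(np.zeros_like(eps), eps)
```

In a real training loop, `eps_pred` would come from the denoising network evaluated on `z_t`, the timestep, and the text embedding.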
2.2 Efficient Training for Controllable Text-to-Image (T2I) Generation
Our approach is motivated by empirical evidence that Kronecker Decomposition[67] effectively preserves critical weight information. We employ this technique to encapsulate the shared relational structures among different input conditions. Our hypothesis posits that by amalgamating diverse conditions with a common set of weights, data utilization can be optimized and training efficiency can be improved. We focus on decomposing and fine-tuning only the cross-attention weight matrices within the U-Net[52] of the diffusion model, where recent works[33] show their dominance when customizing the diffusion model. As depicted in Figure 2, the copied encoder from Stable Diffusion accepts conditional input from different modalities. During training, we posit that these modalities, being transformations of the same underlying image, share common information. Consequently, we hypothesize that the updated weights of this copied encoder, $\Delta W \in \mathbb{R}^{k \times d}$, can be efficiently adapted within a shared decomposed low-rank subspace. This leads to:
$$\Delta W = \sum_{i=1}^{n} H_i \otimes \left(u_i v_i^{\top}\right) \tag{2}$$
where $n$ is the number of decomposed matrices, $u_i \in \mathbb{R}^{\frac{k}{n} \times r}$ and $v_i \in \mathbb{R}^{\frac{d}{n} \times r}$ are low-rank factors, $r$ is the rank of the matrix, which is a small number, $H_i \in \mathbb{R}^{n \times n}$ are the decomposed learnable matrices shared across different conditions, and $\otimes$ is the Kronecker product operation. The low-rank decomposition ensures a consistent low-rank representation strategy. This approach substantially reduces the number of trainable parameters, allowing efficient fine-tuning over the downstream text-to-image generation tasks.
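The parameter saving from combining a Kronecker product with low-rank factors can be illustrated with a short NumPy sketch. The shapes below (`k`, `d`, `n`, `r`), the array names `H`, `U`, `V`, and the helper `kron_lowrank_delta` are all assumptions chosen for illustration, with the low-rank factors stored so that `U[i] @ V[i]` is one low-rank block; the sketch shows the general technique rather than the exact layout used in FlexEControl.

```python
import numpy as np

def kron_lowrank_delta(H, U, V):
    """Build a weight update as sum_i kron(H[i], U[i] @ V[i]).

    H: (n, n, n)      small matrices, one per term
    U: (n, k // n, r) left low-rank factors
    V: (n, r, d // n) right low-rank factors
    Returns an array of shape (k, d).
    """
    return sum(np.kron(H[i], U[i] @ V[i]) for i in range(H.shape[0]))

k, d, n, r = 768, 320, 4, 2  # toy sizes, not taken from the paper
rng = np.random.default_rng(0)
H = rng.standard_normal((n, n, n))
U = rng.standard_normal((n, k // n, r))
V = rng.standard_normal((n, r, d // n))

delta_W = kron_lowrank_delta(H, U, V)

full_params = k * d                             # dense update
decomposed_params = H.size + U.size + V.size    # decomposed update
print(delta_W.shape, full_params, decomposed_params)
```

With these toy sizes the dense update has 245,760 entries, while the decomposed form stores 2,240 numbers. Keeping the small matrices in `H` common to all conditions is what lets different condition types share one compact set of weights.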
The intuition for why Kronecker decomposition works for fine-tuning is partly rooted in the findings of[67, 40, 20]. These studies highlight how model weights can be broken down into a series of matrix products, thereby saving parameter space. As shown in Figure 2, the original 6×6 weight matrix is decomposed into a series of matrix products. When adapting the decomposition-based training approach to controllable T2I, the key lies in the shared weights, which, while being common across various conditions, retain most semantic information. For instance, the shared “slow” weights[61] of an image, combined with another set of “fast” low-rank weights, can preserve the original image’s distribution without a loss in semantic integrity, as illustrated in Figure 3. This observation implies that updating the slow weights is crucial for adapting to diverse conditions. Following this insight, it becomes logical to learn a set of condition-shared decomposed weights in each layer, ensuring that these weights remain consistent across different scenarios. Data utilization and parameter efficiency are also improved.
![Flexible and Efficient Multimodal Control for Text-to-Image Generation (3) Flexible and Efficient Multimodal Control for Text-to-Image Generation (3)](https://i0.wp.com/arxiv.org/html/2405.04834v2/x3.png)
2.3 Enhanced Training for Conditional Inputs
We then discuss how to improve the control under multiple input conditions of varying modalities with the efficient training approach.
Dataset Augmentation with Text Parsing and Segmentation
To optimize the model for scenarios involving multiple homogeneous (same-type) conditional inputs, we first augment our dataset. We use a large language model (gpt-3.5-turbo) to parse prompts containing multiple object entities. The parsing query is structured as: “Given a sentence, analyze the objects in this sentence, give me the objects if there are multiple.” Following this, we apply CLIPSeg[39] (clipseg-rd64-refined version) to segment the corresponding regions in the images, allowing us to divide structural conditions into separate sub-feature maps tailored to the parsed objects.
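The last step, dividing one structural condition into per-object sub-maps, can be sketched as follows. The snippet assumes that a segmenter has already produced one soft score map per parsed object; the function name `split_condition`, the threshold of 0.5, and the toy arrays are hypothetical and stand in for the CLIPSeg outputs used in the paper.

```python
import numpy as np

def split_condition(condition, object_scores, threshold=0.5):
    """Split one structural condition map into per-object sub-maps.

    condition:     (H, W) array, e.g. an edge or depth map.
    object_scores: dict mapping object name -> (H, W) soft scores in [0, 1].
    threshold:     assumed cut-off that turns soft scores into binary masks.
    """
    sub_maps = {}
    for name, scores in object_scores.items():
        mask = (scores >= threshold).astype(condition.dtype)
        sub_maps[name] = condition * mask  # keep the condition only inside the object
    return sub_maps

# Toy example: a dense condition map and two objects covering its two halves.
condition = np.ones((4, 4), dtype=np.float32)
scores = {
    "dog": np.zeros((4, 4), dtype=np.float32),
    "ball": np.zeros((4, 4), dtype=np.float32),
}
scores["dog"][:, :2] = 0.9
scores["ball"][:, 2:] = 0.8
parts = split_condition(condition, scores)
```

Each entry of `parts` can then be paired with the text of its object when building training examples.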
Cross-Attention Supervision
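For each identified segment, we calculate a unified attention map, $\bar{A}_i$, averaging attention across layers and relevant text tokens:
$$\bar{A}_i = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{|\mathcal{T}_i|} \sum_{j=1}^{N} [\![\, j \in \mathcal{T}_i \,]\!] \, \mathrm{CA}^{l}_{j} \tag{3}$$
where $[\![\cdot]\!]$ is the Iverson bracket, $\mathrm{CA}^{l}_{j}$ is the cross-attention map for token $j$ in layer $l$, $\mathcal{T}_i$ denotes the set of tokens associated with the $i$-th segment, $L$ is the number of layers, and $N$ is the number of text tokens.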
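A minimal NumPy sketch of this averaging is given below. It assumes the cross-attention maps are stacked into one array of shape (layers, tokens, height, width) and that dividing by the number of selected tokens is the intended form of averaging; the function name `unified_attention_map` and the toy data are illustrative.

```python
import numpy as np

def unified_attention_map(cross_attn, segment_tokens):
    """Average cross-attention over layers and over the tokens of one segment.

    cross_attn:     (L, N, H, W) array, one map per layer and text token.
    segment_tokens: iterable of token indices belonging to the segment.
    """
    num_tokens = cross_attn.shape[1]
    indicator = np.zeros(num_tokens)
    indicator[list(segment_tokens)] = 1.0  # Iverson bracket: 1 if the token is in the segment
    per_layer = np.einsum("n,lnhw->lhw", indicator, cross_attn)  # sum over selected tokens
    return per_layer.mean(axis=0) / indicator.sum()              # average over layers, tokens

# Toy data: 2 layers, 3 tokens, 2x2 maps; the map of token j is constant and equal to j.
cross_attn = np.broadcast_to(np.arange(3.0).reshape(1, 3, 1, 1), (2, 3, 2, 2))
attn_map = unified_attention_map(cross_attn, segment_tokens=[1, 2])
```

For this toy input the maps of tokens 1 and 2 are averaged, so every entry of `attn_map` equals 1.5.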
The model is trained to predict noise for image-text pairs concatenated based on the parsed and segmented results. An additional loss term, designed to ensure focused reconstruction in areas relevant to each text-derived concept, is introduced. Inspired by[2], this loss is calculated as the Mean Squared Error (MSE) deviation from predefined masks corresponding to the segmented regions:
$$\mathcal{L}_{ca} = \mathbb{E}_{z, i, t}\left[\left\lVert \mathrm{CA}_\theta(v_i, z_t) - M_i \right\rVert_2^2\right] \tag{4}$$
where $\mathrm{CA}_\theta(v_i, z_t)$ is the cross-attention map between token $v_i$ and noisy latent $z_t$, and $M_i$ represents the mask for the $i$-th segment, which is derived from the segmented regions in our augmented dataset and appropriately resized to match the dimensions of the cross-attention maps.
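The cross-attention supervision described above can be sketched in NumPy as follows. Nearest-neighbour resizing, averaging the per-segment errors, and the helper names `resize_nearest` and `cross_attention_loss` are assumptions made for this illustration; the paper only states that masks are resized to the attention resolution and compared with an MSE.

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D mask (an assumed resizing choice)."""
    rows = np.arange(out_h) * mask.shape[0] // out_h
    cols = np.arange(out_w) * mask.shape[1] // out_w
    return mask[np.ix_(rows, cols)]

def cross_attention_loss(attn_maps, masks):
    """Mean squared error between attention maps and their resized segment masks.

    attn_maps: list of (h, w) attention maps, one per segment.
    masks:     list of (H, W) binary masks, one per segment.
    """
    losses = []
    for attn, mask in zip(attn_maps, masks):
        target = resize_nearest(mask, *attn.shape).astype(attn.dtype)
        losses.append(np.mean((attn - target) ** 2))
    return float(np.mean(losses))

# Toy example: a 4x4 mask covering the left half, compared at 2x2 resolution.
mask = np.zeros((4, 4))
mask[:, :2] = 1.0
aligned = np.array([[1.0, 0.0], [1.0, 0.0]])  # attention concentrated on the mask
diffuse = np.zeros((2, 2))                    # attention ignoring the mask

loss_aligned = cross_attention_loss([aligned], [mask])
loss_diffuse = cross_attention_loss([diffuse], [mask])
```

Attention that already matches the mask incurs no penalty, while attention that ignores the masked region is penalised.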
Masked Noise Prediction
To ensure fidelity to the specified conditions, we apply a condition-selective diffusion loss that concentrates the denoising effort on conceptually significant regions. This focused loss function is applied solely to pixels within the regions delineated by the concept masks, which are derived from the non-zero features of the input structural conditions. Specifically, we set the masks to be binary, where areas with non-zero features are assigned a value of one[21] and areas lacking features are set to zero. Because pose features are sparse, we use an all-ones mask for the pose condition. These masks serve to underscore the regions referenced in the corresponding text prompts:
$$\mathcal{L}_{mask} = \mathbb{E}_{z, c, \epsilon, t}\left[\left\lVert \left(\epsilon_\theta(z_t, t, c) - \epsilon\right) \odot M \right\rVert_2^2\right] \tag{5}$$
where $M$ represents the union of the binary masks obtained from the input conditions, $z_t$ denotes the noisy latent at timestep $t$, $\epsilon$ the injected noise, and $\epsilon_\theta$ the estimated noise from the denoising network (U-Net) given the conditioning $c$.
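A small NumPy sketch of the masked noise prediction described above follows. It assumes the condition feature maps are already at the latent resolution, treats any condition named `"pose"` as the sparse case that receives an all-ones mask, and averages the masked squared error over all elements; the function names and these choices are illustrative, not taken from the paper's code.

```python
import numpy as np

def condition_mask(conditions, all_ones_names=("pose",)):
    """Union of binary masks built from the non-zero features of each condition.

    conditions:     dict mapping condition name -> (H, W) feature map.
    all_ones_names: conditions assumed too sparse to mask, given an all-ones mask.
    """
    masks = []
    for name, feat in conditions.items():
        if name in all_ones_names:
            masks.append(np.ones(feat.shape, dtype=bool))
        else:
            masks.append(feat != 0)
    return np.logical_or.reduce(masks).astype(np.float32)

def masked_noise_loss(eps_pred, eps, mask):
    """Squared noise-prediction error restricted to the masked region."""
    return float(np.mean(((eps_pred - eps) * mask) ** 2))

edge = np.array([[1.0, 0.0], [0.0, 0.0]])  # a single non-zero edge feature
eps = np.zeros((1, 2, 2))                  # injected noise (toy values)
eps_pred = np.ones((1, 2, 2))              # stand-in for the U-Net prediction

loss_edge = masked_noise_loss(eps_pred, eps, condition_mask({"edge": edge}))
loss_pose = masked_noise_loss(
    eps_pred, eps, condition_mask({"edge": edge, "pose": np.zeros((2, 2))})
)
```

With only the edge map, one of four locations contributes to the loss; adding a pose condition switches the union mask to all ones, so every location contributes.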
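The total loss function employed is:
$$\mathcal{L}_{total} = \mathcal{L}_{ldm} + \lambda_{ca} \mathcal{L}_{ca} + \lambda_{mask} \mathcal{L}_{mask} \tag{6}$$
with $\lambda_{ca}$ and $\lambda_{mask}$ set to 0.01. Integrating the cross-attention loss $\mathcal{L}_{ca}$ and the masked noise prediction loss $\mathcal{L}_{mask}$ ensures that the model focuses on reconstructing the conditional region and attends to the guided regions during generation.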
3 Experiments
3.1 Datasets
In pursuit of our objective of achieving controlled Text-to-Image (T2I) generation, we employed the LAION improved_aesthetics_6plus [57] dataset for our model training. Specifically, we meticulously curated a subset comprising 5,082,236 instances, undertaking the elimination of duplicates and applying filters based on criteria such as resolution and NSFW score. Given the targeted nature of our controlled generation tasks, the assembly of training data involved considerations of additional input conditions, specifically edge maps, sketch maps, depth maps, segmentation maps, and pose maps. The extraction of features from these maps adhered to the methodology expounded in[68].
3.2 Evaluation Metrics
We employ a comprehensive benchmark suite of metrics including mIoU[50], SSIM[60], mAP, MSE, FID[25], and CLIP Score[24, 46] (https://github.com/jmhessel/clipscore). The details are given in the Appendix.
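For readers unfamiliar with two of these metrics, the sketch below gives textbook definitions of mIoU and MSE in NumPy. The exact evaluation protocol is described in the Appendix and is not reproduced here; the function names, the handling of absent classes, and the toy label maps are assumptions for illustration.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in either label map."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (an assumed convention)
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

pred = np.array([[0, 0], [1, 1]])    # toy predicted segmentation
target = np.array([[0, 1], [1, 1]])  # toy reference segmentation
miou = mean_iou(pred, target, num_classes=2)
depth_error = mse([[1.0, 2.0]], [[1.0, 4.0]])
```

Here class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so `miou` is 7/12, and `depth_error` is 2.0.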
3.3 Experimental Setup
In accordance with the configuration employed in Uni-ControlNet, we used Stable Diffusion 1.5 (https://huggingface.co/runwayml/stable-diffusion-v1-5) as the foundational model. Our model was trained for a single epoch with the AdamW optimizer[32] with a learning rate set at . Throughout all experiments, we standardized the dimensions of input and conditional images to . Fine-tuning was run on P3 AWS EC2 instances equipped with 64 NVIDIA V100 GPUs.
For quantitative assessment, a subset comprising 10,000 high-quality images from the LAION improved_aesthetics_6.5plus dataset was utilized. The resizing of input conditions to was conducted during the inference process.
3.3.1 Structural Input Condition Extraction
We start with the processing of the various local conditions used in our experiments. To facilitate a comprehensive evaluation, we incorporated a diverse range of structural conditions, each processed using specialized techniques:
- Edge Maps: For generating edge maps, we utilized three techniques:
  - Canny Edge Detector[6]: a widely used method for edge detection in images.
  - HED Boundary Extractor[63]: Holistically-Nested Edge Detection, an advanced technique for identifying object boundaries.
  - MLSD[17]: a lightweight, real-time method designed for detecting line segments in images.
- Sketch Maps: We adopted a sketch extraction technique detailed in[58] to convert images into their sketch representations.
- Pose Information: OpenPose[8] was employed to extract human pose information from images, which provides detailed body joint and keypoint information.
- Depth Maps: For depth estimation, we integrated MiDaS[49], a robust method for predicting depth information from single images.
- Segmentation Maps: Segmentation of images was performed using the method outlined in[62], which focuses on accurately segmenting various objects within an image.
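To make the idea of a structural condition concrete, the sketch below derives a binary edge map from a grayscale image. It deliberately substitutes a thresholded Sobel gradient magnitude for the Canny detector used in the paper, so that the example needs only NumPy; the function name `sobel_edge_map` and the threshold value are assumptions.

```python
import numpy as np

def sobel_edge_map(gray, threshold=0.25):
    """Binary edge map from a thresholded Sobel gradient magnitude.

    A simplified stand-in for Canny: no smoothing, non-maximum suppression,
    or hysteresis is performed.
    """
    gray = np.asarray(gray, dtype=np.float64)
    padded = np.pad(gray, 1, mode="edge")
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = gray.shape
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    for i in range(3):          # slide the 3x3 kernels by accumulating shifted windows
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    magnitude = np.hypot(gx, gy)
    if magnitude.max() > 0:
        magnitude = magnitude / magnitude.max()  # normalise to [0, 1]
    return (magnitude >= threshold).astype(np.uint8)

# Toy image: dark left half, bright right half, giving one vertical boundary.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
edges = sobel_edge_map(image)
```

The resulting map marks the two pixel columns on either side of the boundary; in practice a dedicated detector such as Canny, HED, or MLSD would supply the condition map.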
3.4 Baselines
In our comparative evaluation, we assess T2I-Adapter[42], PHM[67], Uni-ControlNet[71], and LoRA[27].
![Flexible and Efficient Multimodal Control for Text-to-Image Generation (5) Flexible and Efficient Multimodal Control for Text-to-Image Generation (5)](https://i0.wp.com/arxiv.org/html/2405.04834v2/extracted/2405.04834v2/single_condition.png)
3.5 Quantitative Results
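
Table 1: Training efficiency of FlexEControl and baselines.

| Models | Memory Cost | # Params. | Training Time |
|---|---|---|---|
| Uni-ControlNet[71] | 20.47GB | 1271M | 5.69 ± 1.33 s/it |
| LoRA[27] | 17.84GB | 1074M | 3.97 ± 1.27 s/it |
| PHM[67] | 15.08GB | 819M | 3.90 ± 2.01 s/it |
| FlexEControl (ours) | 14.33GB | 750M | 2.15 ± 1.42 s/it |
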

Table 2: Controllability and image quality across input conditions.

| Models | Canny (SSIM) | MLSD (SSIM) | HED (SSIM) | Sketch (SSIM) | Depth (MSE) | Segmentation (mIoU) | Poses (mAP) | FID | CLIP Score |
|---|---|---|---|---|---|---|---|---|---|
| T2IAdapter[42] | 0.4480 | - | - | 0.5241 | 90.01 | 0.6983 | 0.3156 | 27.80 | 0.4957 |
| Uni-Control[45] | 0.4977 | 0.6374 | 0.4885 | 0.5509 | 90.04 | 0.7143 | 0.2083 | 27.80 | 0.4899 |
| Uni-ControlNet[71] | 0.4910 | 0.6083 | 0.4715 | 0.5901 | 90.17 | 0.7084 | 0.2125 | 27.74 | 0.4890 |
| PHM[67] | 0.4365 | 0.5712 | 0.4633 | 0.4878 | 91.38 | 0.5534 | 0.1664 | 27.91 | 0.4961 |
| LoRA[27] | 0.4497 | 0.6381 | 0.5043 | 0.5097 | 89.09 | 0.5480 | 0.1538 | 27.99 | 0.4832 |
| FlexEControl (ours) | 0.4990 | 0.6385 | 0.5041 | 0.5518 | 90.93 | 0.7496 | 0.2093 | 27.55 | 0.4963 |

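
Table 3: Ablation with all models trained on the same 100,000 samples.

| Setting | Models | Canny (SSIM) | MLSD (SSIM) | HED (SSIM) | Sketch (SSIM) | Depth (MSE) | Segmentation (mIoU) | Poses (mAP) | FID | CLIP Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Single Conditioning | Uni-ControlNet | 0.3268 | 0.4097 | 0.3177 | 0.4096 | 98.80 | 0.4075 | 0.1433 | 29.43 | 0.4844 |
| Single Conditioning | FlexEControl (w/o ) | 0.3698 | 0.4905 | 0.3870 | 0.4855 | 94.90 | 0.4449 | 0.1432 | 28.03 | 0.4874 |
| Single Conditioning | FlexEControl (w/o ) | 0.3701 | 0.4894 | 0.3805 | 0.4879 | 94.30 | 0.4418 | 0.1432 | 28.19 | 0.4570 |
| Single Conditioning | FlexEControl | 0.3711 | 0.4920 | 0.3871 | 0.4869 | 94.83 | 0.4479 | 0.1432 | 28.03 | 0.4877 |
| Multiple Conditioning | Uni-ControlNet | 0.3078 | 0.3962 | 0.3054 | 0.3871 | 98.84 | 0.3981 | 0.1393 | 28.75 | 0.4828 |
| Multiple Conditioning | FlexEControl (w/o ) | 0.3642 | 0.4901 | 0.3704 | 0.4815 | 94.95 | 0.4368 | 0.1405 | 28.50 | 0.4870 |
| Multiple Conditioning | FlexEControl (w/o ) | 0.3666 | 0.4834 | 0.3712 | 0.4831 | 94.89 | 0.4400 | 0.1406 | 28.68 | 0.4542 |
| Multiple Conditioning | FlexEControl | 0.3690 | 0.4915 | 0.3784 | 0.4849 | 92.90 | 0.4429 | 0.1411 | 28.24 | 0.4873 |
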
Table 1 highlights FlexEControl’s superior efficiency compared to Uni-ControlNet. It achieves a 30% reduction in memory cost, lowers trainable parameters by 41% (from 1271M to 750M), and significantly reduces training time per iteration from 5.69s to 2.15s.
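The quoted percentages can be checked directly from the reported memory, parameter, and timing figures. The short script below is only a worked arithmetic check; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, ours):
    """Percentage reduction of `ours` relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Figures reported for Uni-ControlNet (baseline) and FlexEControl (ours).
memory_reduction = relative_reduction(20.47, 14.33)  # GB of training memory
param_reduction = relative_reduction(1271, 750)      # millions of trainable parameters
time_reduction = relative_reduction(5.69, 2.15)      # seconds per training iteration

print(f"memory: {memory_reduction:.1f}%")
print(f"params: {param_reduction:.1f}%")
print(f"time:   {time_reduction:.1f}%")
```

The memory and parameter reductions round to the 30% and 41% stated in the text, and the per-iteration training time drops by roughly 62%.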
Table 2 provides a comprehensive comparison of FlexEControl’s performance against Uni-ControlNet and T2IAdapter across diverse input conditions. After training on a dataset of 5M text-image pairs, FlexEControl demonstrates comparable, if not superior, performance metrics compared to Uni-ControlNet and T2IAdapter. Note that Uni-ControlNet is trained on a much larger dataset (10M text-image pairs from the LAION dataset). Although there is a marginal decrease in SSIM scores for sketch maps and mAP scores for poses, FlexEControl excels in other metrics, notably surpassing Uni-ControlNet and T2IAdapter. This underscores our method’s proficiency in enhancing efficiency and elevating overall quality and accuracy in controllable text-to-image generation tasks.
To substantiate the efficacy of FlexEControl in enhancing training efficiency while upholding commendable model performance, and to ensure a fair comparison, we conducted an ablation study by training models on an identical dataset. We trained FlexEControl, along with its variants, and Uni-ControlNet on a subset of 100,000 training samples from LAION improved_aesthetics_6plus. The outcomes are presented in Table 3. When trained on the same data, FlexEControl exhibits substantial improvements over Uni-ControlNet. This underscores the effectiveness of our approach in optimizing data utilization while reducing computational costs and enhancing efficiency in the text-to-image generation process.
To validate FlexEControl’s effectiveness in handling multiple structural conditions, we compared it with Uni-ControlNet through human evaluations. Two scenarios were considered: multiple homogeneous input conditions (300 images, each generated with 2 Canny edge maps) and multiple heterogeneous input conditions (500 images, each generated with 2 randomly selected conditions). Results, summarized in Table 4, reveal that in the homogeneous scenario FlexEControl was preferred by 64.00% of annotators, significantly outperforming Uni-ControlNet (23.67%). This underscores FlexEControl’s proficiency with complex, homogeneous inputs. Additionally, FlexEControl demonstrated superior alignment with input conditions (67.33%) compared to Uni-ControlNet (23.00%). In the scenario with random heterogeneous conditions, most comparisons were ties (87.40% for overall preference and 89.49% for condition alignment); among the remainder, FlexEControl was preferred over Uni-ControlNet for both overall quality and alignment.
In addition to our primary comparisons, we conducted a further quantitative evaluation of FlexEControl and Uni-ControlNet. This evaluation focused on assessing image quality under scenarios involving multiple conditions from both homogeneous and heterogeneous modalities. The findings are summarized in Table 5. FlexEControl consistently outperforms Uni-ControlNet in both categories, demonstrating lower FID scores for better image quality and higher CLIP scores for improved alignment with text prompts.
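
Table 4: Human evaluation of FlexEControl against Uni-ControlNet under multiple conditions.

| Condition Type | Metric | Win | Tie | Lose |
|---|---|---|---|---|
| Homogeneous | Human Preference (%) | 64.00 | 12.33 | 23.67 |
| Homogeneous | Condition Alignment (%) | 67.33 | 9.67 | 23.00 |
| Heterogeneous | Human Preference (%) | 9.80 | 87.40 | 2.80 |
| Heterogeneous | Condition Alignment (%) | 6.60 | 89.49 | 4.00 |

Table 5: Image quality and text alignment under multiple conditions.

| Condition Type | Model | FID | CLIP Score |
|---|---|---|---|
| Heterogeneous | Uni-ControlNet | 27.81 | 0.4869 |
| Heterogeneous | FlexEControl | 27.47 | 0.4981 |
| Homogeneous | Uni-ControlNet | 28.98 | 0.4858 |
| Homogeneous | FlexEControl | 27.65 | 0.4932 |
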
3.6 Qualitative Results
We present qualitative results of FlexEControl under three different settings: single input condition, multiple heterogeneous conditions, and multiple homogeneous conditions, illustrated in Figure 5, Figure 4, and Figure 6, respectively. The results indicate that FlexEControl is comparable to baseline models when a single condition is input. However, with multiple conditions, FlexEControl consistently and noticeably outperforms other models. In particular, under multiple homogeneous conditions, FlexEControl generates images of higher overall quality that align more closely with the input conditions.
4 Related Work
FlexEControl is an instance of efficient training and controllable text-to-image generation. Here, we overview modeling efforts in the subset of efficient training towards reducing parameters and memory cost and controllable T2I.
Efficient Training
Prior work has proposed efficient training methodologies both for pretraining and fine-tuning. These methods have established their efficacy across an array of language and vision tasks. One of these explored strategies is Prompt Tuning[35], where trainable prompt tokens are appended to pretrained models[56, 30, 29, 22]. These tokens can be added exclusively to input embeddings or to all intermediate layers[37], allowing for nuanced model control and performance optimization. Low-Rank Adaptation (LoRA)[27] is another innovative approach that introduces trainable rank decomposition matrices for the parameters of each layer. LoRA has exhibited promising fine-tuning ability on large generative models including diffusion models[19], indicating its potential for broader application. Furthermore, the use of Adapters inserts lightweight adaptation modules into each layer of a pretrained transformer[26, 53]. This method has been successfully extended across various setups[69, 16, 42], demonstrating its adaptability and practicality. Other approaches including post-training model compression[14] facilitate the transition from a fully optimized model to a compressed version – either sparse[15], quantized[36, 18], or both. This methodology was particularly helpful for parameter quantization[13]. Different from these methodologies, our work puts forth a new unified strategy that aims to enhance the efficient training of text-to-image diffusion models through the leverage of low-rank structure. Our proposed method integrates principles from these established techniques to offer a fresh perspective on training efficiency, adding to the rich tapestry of existing solutions in this rapidly evolving field.
Controllable Text-to-Image Generation
Recent developments in text-to-image generation strive for more control over image generation, enabling more targeted, stable, and accurate visual outputs. Several models, such as T2I-Adapter[42] and Composer[28], have emerged to enhance image generation by following the semantic guidance of text prompts together with multiple different structural conditional controls. However, existing methods struggle with multiple conditions from the same modality, especially when these conflict, e.g., following multiple segmentation maps and the guidance of text prompts at the same time. Recent studies also highlight challenges in controllable text-to-image (T2I) generation, such as omission of objects in text prompts and mismatched attributes[34, 3], showing that current models struggle to handle controls from different conditions. To address these issues, the Attend-and-Excite method[10] refines attention regions to ensure distinct attention across separate image regions. ReCo[64], GLIGEN[38], and Layout-Guidance[11] allow for image generation informed by bounding boxes and regional descriptions. Our work improves the model’s controllability by proposing a new training strategy.
5 Conclusion
This work introduces a unified approach that improves both the flexibility and efficiency of diffusion-based text-to-image generation. Our experimental results demonstrate a substantial reduction in memory cost and trainable parameters without compromising inference time or performance. Future work may explore more sophisticated decomposition techniques, furthering the pursuit of an optimal balance between model efficiency, complexity, and expressive power.
References
- [1] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. In SIGGRAPH Asia 2023 Conference Papers, pages 1–12, 2023.
- [2] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. In SIGGRAPH Asia 2023 Conference Papers, pages 1–12, 2023.
- [3] Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, and Mohamed Elhoseiny. HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20041–20053, October 2023.
- [4] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
- [5] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [6] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
- [7] Shidong Cao, Wenhao Chai, Shengyu Hao, Yanting Zhang, Hangyue Chen, and Gaoang Wang. Difffashion: Reference-based fashion design with structure-aware transfer by diffusion models. IEEE Transactions on Multimedia, 2023.
- [8] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
- [9] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
- [10] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
- [11] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance. arXiv preprint arXiv:2304.03373, 2023.
- [12] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
- [13] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
- [14] Gongfan Fang, Xinyin Ma, and Xinchao Wang. Structural pruning for diffusion models. arXiv preprint arXiv:2305.10924, 2023.
- [15] Elias Frantar and Dan Alistarh. Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023.
- [16] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
- [17] Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin. Towards light-weight and real-time line segment detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 726–734, 2022.
- [18] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
- [19] Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, and Xin Eric Wang. Discriminative diffusion models as few-shot vision and language learners. arXiv preprint arXiv:2305.10722, 2023.
- [20] Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, and Xin Eric Wang. Parameter-efficient fine-tuning for vision transformers. arXiv preprint arXiv:2203.16329, 2022.
- [21] Xuehai He and Xin Eric Wang. Multimodal graph transformer for multimodal question answering. arXiv preprint arXiv:2305.00581, 2023.
- [22] Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, and Xin Eric Wang. Cpl: Counterfactual prompt learning for vision and language models. arXiv preprint arXiv:2210.10362, 2022.
- [23] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- [24] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: a reference-free evaluation metric for image captioning. In EMNLP, 2021.
- [25]Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017.
- [26]Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly.Parameter-Efficient Transfer Learning for NLP.arXiv:1902.00751 [cs, stat], June 2019.
- [27]EdwardJ. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.LoRA: Low-Rank Adaptation of Large Language Models.arXiv:2106.09685 [cs], Oct. 2021.
- [28]Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou.Composer: Creative and controllable image synthesis with composable conditions.arXiv preprint arXiv:2302.09778, 2023.
- [29]Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim.Visual prompt tuning.arXiv preprint arXiv:2203.12119, 2022.
- [30]Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie.Prompting visual-language models for efficient video understanding.arXiv preprint arXiv:2112.04478, 2021.
- [31]Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani.Imagic: Text-based real image editing with diffusion models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
- [32]DiederikP Kingma and Jimmy Ba.Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014.
- [33]Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu.Multi-concept customization of text-to-image diffusion.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
- [34]Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and ShixiangShane Gu.Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023.
- [35]Brian Lester, Rami Al-Rfou, and Noah Constant.The power of scale for parameter-efficient prompt tuning.arXiv preprint arXiv:2104.08691, 2021.
- [36]Xiuyu Li, Long Lian, Yijiang Liu, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer.Q-diffusion: Quantizing diffusion models.arXiv preprint arXiv:2302.04304, 2023.
- [37] Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190 [cs], Jan. 2021.
- [38] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [39] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7086–7096, June 2022.
- [40] Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. arXiv:2106.04647 [cs], June 2021.
- [41] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
- [42] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
- [43] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- [44] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018.
- [45] Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild. arXiv preprint arXiv:2305.11147, 2023.
- [46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
- [47] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
- [48] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- [49] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.
- [50] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
- [51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [52] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- [53] Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. AdapterDrop: On the Efficiency of Adapters in Transformers. arXiv:2010.11918 [cs], Oct. 2021.
- [54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.
- [55] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- [56] Timo Schick and Hinrich Schütze. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676, 2020.
- [57] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
- [58] Edgar Simo-Serra, Satoshi Iizuka, Kazuma Sasaki, and Hiroshi Ishikawa. Learning to simplify: fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics (TOG), 35(4):1–11, 2016.
- [59] Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. Unitune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477, 2022.
- [60] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- [61] Yeming Wen, Dustin Tran, and Jimmy Ba. Batchensemble: an alternative approach to efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715, 2020.
- [62] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 418–434, 2018.
- [63] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1395–1403, 2015.
- [64] Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, et al. Reco: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14246–14255, 2023.
- [65] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
- [66] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. arXiv:2106.10199 [cs], June 2021.
- [67] Aston Zhang, Yi Tay, Shuai Zhang, Alvin Chan, Anh Tuan Luu, Siu Cheung Hui, and Jie Fu. Beyond fully-connected layers with quaternions: Parameterization of hypercomplex multiplications with 1/n parameters. arXiv preprint arXiv:2102.08597, 2021.
- [68] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
- [69] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. arXiv:2111.03930 [cs], Nov. 2021.
- [70] Zhongping Zhang, Jian Zheng, Jacob Zhiyuan Fang, and Bryan A Plummer. Text-to-image editing by image information removal. arXiv preprint arXiv:2305.17489, 2023.
- [71] Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. Uni-controlnet: All-in-one control to text-to-image diffusion models. arXiv preprint arXiv:2305.16322, 2023.