arXiv Academic Digest Notes, Nov. 28
Contents
- 1. Autonomous Driving / Object Detection
- CalibFormer: A Transformer-based Automatic LiDAR-Camera Calibration Network
- OpenNet: Incremental Learning for Autonomous Driving Object Detection with Balanced Loss
- 2. Adversarial Attacks / Security and Defense
- Trainwreck: A damaging adversarial attack on image classifiers
- Instruct2Attack: Language-Guided Semantic Adversarial Attacks
- Efficient Rehearsal Free Zero Forgetting Continual Learning using Adaptive Weight Modulation
- Effective Backdoor Mitigation Depends on the Pre-training Objective
- Adversarial Purification of Information Masking
- Confidence Is All You Need for MI Attacks
- 3. AIGC / Image and Video Generation
- FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
- PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images
1. Autonomous Driving / Object Detection
CalibFormer: A Transformer-based Automatic LiDAR-Camera Calibration Network
Link: https://arxiv.org/abs/2311.15241
Authors: Yuxuan Xiao, Yao Li, Chengzhen Meng, Xingchen Li, Yanyong Zhang
Abstract: The fusion of LiDARs and cameras has been increasingly adopted in autonomous driving for perception tasks. The performance of such fusion-based algorithms largely depends on the accuracy of sensor calibration, which is challenging due to the difficulty of identifying common features across different data modalities. Previously, many calibration methods involved specific targets and/or manual intervention, which has proven to be cumbersome and costly. Learning-based online calibration methods have been proposed, but their performance is barely satisfactory in most cases. These methods usually suffer from issues such as sparse feature maps, unreliable cross-modality association, and inaccurate calibration parameter regression. To address these issues, we propose CalibFormer, an end-to-end network for automatic LiDAR-camera calibration. We aggregate multiple layers of camera and LiDAR image features to achieve high-resolution representations. A multi-head correlation module is utilized to identify correlations between features more accurately. Lastly, we employ transformer architectures to estimate accurate calibration parameters from the correlation information. Our method achieved a mean translation error of 0.8751 cm and a mean rotation error of 0.0562° on the KITTI dataset, surpassing existing state-of-the-art methods and demonstrating strong robustness, accuracy, and generalization capabilities.
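The abstract does not publish the module's internals, but here is a minimal sketch of what a multi-head correlation between spatially aligned camera and LiDAR feature maps could look like; the class name, head count, and dot-product formulation are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class MultiHeadCorrelation(nn.Module):
    """Correlate camera and LiDAR feature maps with several independent heads.

    Hypothetical sketch: per-head scaled dot-product correlation between two
    modalities. Real implementations typically restrict the correlation to
    local windows to avoid the full (HW x HW) memory cost shown here.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        self.proj_cam = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_lidar = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, cam_feat: torch.Tensor, lidar_feat: torch.Tensor) -> torch.Tensor:
        # cam_feat, lidar_feat: (B, C, H, W), roughly aligned by the initial extrinsics
        b, c, h, w = cam_feat.shape
        q = self.proj_cam(cam_feat).view(b, self.num_heads, self.head_dim, h * w)
        k = self.proj_lidar(lidar_feat).view(b, self.num_heads, self.head_dim, h * w)
        # Per-head correlation volume: (B, heads, HW, HW), softmax-normalized over LiDAR locations
        corr = torch.einsum('bhdi,bhdj->bhij', q, k) / self.head_dim ** 0.5
        return corr.softmax(dim=-1)

# Usage: corr = MultiHeadCorrelation(64)(cam, lidar) with cam, lidar of shape (B, 64, H, W)
```

Correlation volumes of this kind are the sort of input a downstream transformer decoder could consume to regress the six extrinsic calibration parameters.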
OpenNet: Incremental Learning for Autonomous Driving Object Detection with Balanced Loss
Link: https://arxiv.org/abs/2311.14939
Authors: Zezhou Wang, Guitao Cao, Xidong Xi, Jiangtao Wang
Abstract: Automated driving object detection has always been a challenging task in computer vision due to environmental uncertainties. These uncertainties include significant differences in object sizes and encounters with unseen classes. Traditional object detection models may perform poorly when applied directly to automated driving detection, because they usually presume fixed categories of common traffic participants, such as pedestrians and cars. Worse still, the huge class imbalance between common and novel classes further exacerbates the performance degradation. To address these issues, we propose OpenNet, which moderates the class imbalance with a Balanced Loss based on cross-entropy loss. Besides, we adopt an inductive layer based on gradient reshaping to quickly learn new classes from limited samples during incremental learning. To counter catastrophic forgetting, we employ normalized feature distillation. We further improve multi-scale detection robustness and unknown-class recognition through an FPN and energy-based detection, respectively. Experimental results on the CODA dataset show that the proposed method achieves better performance than existing methods.
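The Balanced Loss itself is only named in the abstract; one plausible instantiation is cross-entropy reweighted by effective class frequency, sketched below. The `beta` reweighting follows Cui et al.'s class-balanced loss and is an assumption, not OpenNet's published formula:

```python
import torch
import torch.nn.functional as F

def balanced_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                           class_counts: torch.Tensor, beta: float = 0.999) -> torch.Tensor:
    """Cross-entropy reweighted by inverse effective class frequency.

    Rare (novel) classes receive larger weights, counteracting the common/novel
    class imbalance. `class_counts` holds per-class sample counts and must live
    on the same device as `logits`.
    """
    effective_num = 1.0 - torch.pow(beta, class_counts.float())
    weights = (1.0 - beta) / effective_num               # rare classes -> larger weights
    weights = weights / weights.sum() * len(class_counts)  # normalize to mean 1
    return F.cross_entropy(logits, targets, weight=weights)
```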
2. Adversarial Attacks / Security and Defense
Trainwreck: A damaging adversarial attack on image classifiers
Link: https://arxiv.org/abs/2311.14772
Author: Jan Zahálka
Abstract: Adversarial attacks are an important security concern for computer vision (CV), as they enable malicious attackers to reliably manipulate CV models. Existing attacks aim to elicit an output desired by the attacker, but keep the model fully intact on clean data. With CV models becoming increasingly valuable assets in applied practice, a new attack vector is emerging: disrupting the models as a form of economic sabotage. This paper opens up the exploration of damaging adversarial attacks (DAAs) that seek to damage the target model and maximize the total cost incurred by the damage. As a pioneer DAA, this paper proposes Trainwreck, a train-time attack that poisons the training data of image classifiers to degrade their performance. Trainwreck conflates the data of similar classes using stealthy (ε ≤ 8/255) class-pair universal perturbations computed using a surrogate model. Trainwreck is a black-box, transferable attack: it requires no knowledge of the target model's architecture, and a single poisoned dataset degrades the performance of any model trained on it. The experimental evaluation on CIFAR-10 and CIFAR-100 demonstrates that Trainwreck is indeed an effective attack across various model architectures, including EfficientNetV2, ResNeXt-101, and a finetuned ViT-L-16. The strength of the attack can be customized via the poison-rate parameter. Finally, data redundancy with file hashing and/or pixel difference is identified as a reliable defense technique against Trainwreck or similar DAAs. The code is available at https://github.com/JanZahalka/trainwreck.
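The recommended defense, data redundancy via file hashing, is straightforward to sketch; the manifest format and helper name below are illustrative:

```python
import hashlib
from pathlib import Path

def detect_modified_files(dataset_dir: str, trusted_hashes: dict[str, str]) -> list[str]:
    """Flag training images whose SHA-256 digest differs from a trusted manifest.

    Sketch of the 'data redundancy with file hashing' defense: keep hashes of
    the clean dataset offline and diff them before every training run, so a
    Trainwreck-style poisoned copy is detected before it is ever trained on.
    """
    modified = []
    for path in Path(dataset_dir).rglob('*'):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if trusted_hashes.get(str(path.relative_to(dataset_dir))) != digest:
            modified.append(str(path))
    return modified
```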
Instruct2Attack: Language-Guided Semantic Adversarial Attacks
Link: https://arxiv.org/abs/2311.15551
Authors: Jiang Liu, Chen Wei, Yuxiang Guo, Heng Yu, Alan Yuille, Soheil Feizi, Chun Pong Lau, Rama Chellappa
Comments: under submission, code coming soon
Abstract: We propose Instruct2Attack (I2A), a language-guided semantic attack that generates semantically meaningful perturbations according to free-form language instructions. We make use of state-of-the-art latent diffusion models, where we adversarially guide the reverse diffusion process to search for an adversarial latent code conditioned on the input image and text instruction. Compared to existing noise-based and semantic attacks, I2A generates more natural and diverse adversarial examples while providing better controllability and interpretability. We further automate the attack process with GPT-4 to generate diverse image-specific text instructions. We show that I2A can successfully break state-of-the-art deep neural networks even under strong adversarial defenses, and demonstrate great transferability among a variety of network architectures.
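As rough intuition for "adversarially guiding the search for an adversarial latent code", the toy sketch below does gradient ascent on a latent through a generic decoder and a frozen classifier; I2A's actual conditioning on images and instructions, and its diffusion-specific guidance, are not reproduced here:

```python
import torch
import torch.nn.functional as F

def attack_latent(decoder, classifier, z_init, y_true, steps: int = 50, lr: float = 0.01):
    """Search an adversarial latent code, loosely in the spirit of I2A.

    Hypothetical sketch: `decoder` maps a latent to an image (e.g. a generative
    decoder conditioned on an instruction); we maximize the classifier's loss
    w.r.t. the latent so the decoded image gets misclassified.
    """
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = decoder(z)
        loss = -F.cross_entropy(classifier(img), y_true)  # minimize -CE = maximize CE
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```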
Efficient Rehearsal Free Zero Forgetting Continual Learning using Adaptive Weight Modulation
Link: https://arxiv.org/abs/2311.15276
Authors: Yonatan Sverdlov, Shimon Ullman
Abstract: Artificial neural networks encounter a notable challenge known as continual learning, which involves acquiring knowledge of multiple tasks over an extended period. This challenge arises because previously learned weights tend to be adjusted to suit the objectives of new tasks, resulting in a phenomenon called catastrophic forgetting. Most approaches to this problem seek a balance between maximizing performance on the new tasks and minimizing the forgetting of previous tasks. In contrast, our approach attempts to maximize the performance of the new task while ensuring zero forgetting. This is accomplished by creating task-specific modulation parameters for each task; only these are learnable when training on consecutive tasks. Through comprehensive experimental evaluations, our model demonstrates superior performance in acquiring and retaining novel tasks that pose difficulties for other multi-task models. This emphasizes the efficacy of our approach in preventing catastrophic forgetting while accommodating the acquisition of new tasks.
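A minimal sketch of task-specific weight modulation, assuming multiplicative per-task scales over a frozen shared layer (the paper's exact modulation scheme is not detailed in the abstract):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedLinear(nn.Module):
    """Frozen shared weights plus per-task multiplicative modulation.

    Hypothetical sketch: each task owns its own elementwise scale on a frozen
    base layer, so training task t never touches the parameters used by
    earlier tasks.
    """
    def __init__(self, in_dim: int, out_dim: int, num_tasks: int):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.requires_grad_(False)  # shared weights are frozen after pre-training
        self.scales = nn.ParameterList(
            [nn.Parameter(torch.ones(out_dim, in_dim)) for _ in range(num_tasks)]
        )

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        w = self.base.weight * self.scales[task_id]  # only this task's scale trains
        return F.linear(x, w, self.base.bias)
```

Because the base weights and earlier tasks' scales are never updated, revisiting an old task reproduces its original outputs exactly, which is what "zero forgetting" requires.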
Effective Backdoor Mitigation Depends on the Pre-training Objective
Link: https://arxiv.org/abs/2311.14948
Authors: Sahil Verma, Gantavya Bhatt, Avi Schwarzschild, Soumye Singhal, Arnav Mohanty Das, Chirag Shah, John P Dickerson, Jeff Bilmes
Comments: Accepted for oral presentation at BUGS workshop @ NeurIPS 2023 (this https URL)
Abstract: Despite the advanced capabilities of contemporary machine learning (ML) models, they remain vulnerable to adversarial and backdoor attacks. This vulnerability is particularly concerning in real-world deployments, where compromised models may exhibit unpredictable behavior in critical scenarios. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for pre-training multimodal models, as these datasets may harbor backdoors. Various techniques have been proposed to mitigate the effects of backdooring in these models, such as CleanCLIP, the current state-of-the-art approach. In this work, we demonstrate that the efficacy of CleanCLIP in mitigating backdoors is highly dependent on the particular objective used during model pre-training. We observe that stronger pre-training objectives correlate with harder-to-remove backdoor behaviors. We show this by training multimodal models on two large datasets consisting of 3 million (CC3M) and 6 million (CC6M) datapoints, under various pre-training objectives, followed by poison removal using CleanCLIP. We find that CleanCLIP is ineffective when stronger pre-training objectives are used, even with extensive hyperparameter tuning. Our findings underscore critical considerations for ML practitioners who pre-train models using large-scale web-curated data and are concerned about potential backdoor threats. Notably, our results suggest that simpler pre-training objectives are more amenable to effective backdoor removal. This insight is pivotal for practitioners seeking to balance the trade-offs between using stronger pre-training objectives and security against backdoor attacks.
Adversarial Purification of Information Masking
Link: https://arxiv.org/abs/2311.15339
Authors: Sitong Liu, Zhichao Lian, Shuangquan Zhang, Liang Xiao
Abstract: Adversarial attacks meticulously generate minuscule, imperceptible perturbations to images to deceive neural networks. Counteracting these, adversarial purification methods seek to transform adversarial input samples into clean output images to defend against adversarial attacks. Nonetheless, extant generative models fail to effectively eliminate adversarial perturbations, yielding less-than-ideal purification results. We emphasize the potential threat of residual adversarial perturbations to target models, quantitatively establishing a relationship between perturbation scale and attack capability. Notably, the residual perturbations on the purified image primarily stem from the same-position patch and similar patches of the adversarial sample. We propose a novel adversarial purification approach named Information Mask Purification (IMPure), which aims to extensively eliminate adversarial perturbations. Given an adversarial sample, we first mask part of the patch information, then reconstruct the patches to resist the adversarial perturbations within them. We reconstruct all patches in parallel to obtain a cohesive image. Then, to protect the purified samples against potential similar regional perturbations, we simulate this risk by randomly mixing the purified samples with the input samples before feeding them into the feature extraction network. Finally, we establish a combined constraint of pixel loss and perceptual loss to augment the model's reconstruction adaptability. Extensive experiments on the ImageNet dataset with three classifier models demonstrate that our approach achieves state-of-the-art results against nine adversarial attack methods. Implementation code and pre-trained weights can be accessed at https://github.com/NoWindButRain/IMPure.
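The information-masking step could look like the patch-wise random mask below; IMPure's reconstruction network, purified/input mixing, and loss terms are omitted, and the patch size and ratio are illustrative:

```python
import torch

def mask_patches(images: torch.Tensor, patch: int = 16, mask_ratio: float = 0.5) -> torch.Tensor:
    """Randomly zero out a fraction of non-overlapping patches.

    Sketch of the masking step only: dropping patch information also discards
    part of the adversarial perturbation, which a reconstruction network then
    inpaints from clean context. Assumes H and W are divisible by `patch`.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    keep = torch.rand(b, 1, gh, gw, device=images.device) > mask_ratio
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return images * mask
```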
Confidence Is All You Need for MI Attacks
Link: https://arxiv.org/abs/2311.15373
Authors: Abhishek Sinha, Himanshi Tibrewal, Mansi Gupta, Nikhar Waghela, Shivank Garg
Comments: 2 pages, 1 figure
Abstract: In this evolving era of machine learning security, membership inference attacks have emerged as a potent threat to the confidentiality of sensitive data. In this attack, adversaries aim to determine whether a particular point was used during the training of a target model. This paper proposes a new method to gauge a data point's membership in a model's training set. Instead of correlating loss with membership, as is traditionally done, we leverage the fact that training examples generally exhibit higher confidence values when classified into their actual class. During training, the model is essentially 'fit' to the training data and may face particular difficulties in generalizing to unseen data. This asymmetry leads the model to achieve higher confidence on the training data as it exploits the specific patterns and noise present in it. Our proposed approach leverages the confidence values generated by the machine learning model. These confidence values provide a probabilistic measure of the model's certainty in its predictions and can further be used to infer the membership of a given data point. Additionally, we introduce another variant of our method that allows us to carry out this attack without knowing the ground truth (true class) of a given data point, thus offering an edge over existing label-dependent attack methods.
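The label-free variant essentially reduces to thresholding the model's top softmax confidence; the threshold below would in practice be calibrated, e.g. on shadow-model data:

```python
import numpy as np

def confidence_mi_attack(probs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Predict membership from the model's top confidence alone.

    Sketch of the label-free variant: samples on which the model is highly
    confident (max softmax probability above `threshold`) are flagged as
    likely training members. `probs` has shape (N, num_classes).
    """
    return probs.max(axis=1) > threshold
```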
3. AIGC / Image and Video Generation
FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax
Link: https://arxiv.org/abs/2311.15813
Authors: Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang
Comments: Project page: this https URL
Abstract: Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions within complex video text into a sequence of coherent visual frames. We present FlowZero, a novel framework that combines Large Language Models (LLMs) with image diffusion models to generate temporally coherent videos. FlowZero uses LLMs to understand complex spatio-temporal dynamics from text, where LLMs can generate a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. These elements in DSS are then used to guide the image diffusion model for video generation with smooth object motions and frame-to-frame coherence. Moreover, FlowZero incorporates an iterative self-refinement process, enhancing the alignment between the spatio-temporal layouts and the textual prompts for the videos. To enhance global coherence, we propose enriching the initial noise of each frame with motion dynamics to control the background movement and camera motion adaptively. By using spatio-temporal syntaxes to guide the diffusion process, FlowZero achieves improvement in zero-shot video synthesis, generating coherent videos with vivid motion.
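For intuition, a dynamic scene syntax might bundle per-frame object layouts and background motion like the hypothetical structure below; the field names are illustrative, not FlowZero's actual schema:

```python
# Hypothetical example of what an LLM-generated dynamic scene syntax (DSS)
# could contain for the prompt "a red car drives left along a coastal road";
# each element matches one of the components named in the abstract.
dss = {
    "scene": "coastal road at sunset",            # scene description
    "frames": 16,
    "objects": [                                   # object layouts over time
        {"name": "red car", "box_centers": [(0.7, 0.6), (0.2, 0.6)]},  # frame 1 -> frame 16
    ],
    "background_motion": {"direction": "right", "speed": "slow"},  # pans opposite the car
}
```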
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Link: https://arxiv.org/abs/2311.15127
Authors: Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach
Abstract: We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models.
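Since the weights are public, image-to-video inference can be run through Hugging Face diffusers; the snippet below assumes diffusers ≥ 0.24, a CUDA GPU, and the released img2vid-xt checkpoint:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the released image-to-video checkpoint in fp16.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# Condition on a single image; decode_chunk_size trades VRAM for speed.
image = load_image("input.jpg").resize((1024, 576))
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "output.mp4", fps=7)
```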
PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images
Link: https://arxiv.org/abs/2311.15556
Authors: Jiquan Yuan, Xinyan Cao, Changjin Li, Fanyi Yang, Jinlong Lin, Xixin Cao
Comments: 18 pages
Abstract: With the development of image generation technology, AI-based image generation has been applied in various fields. However, the development of AIGC image generative models also brings new problems and challenges. A significant challenge is that AI-generated images (AIGIs) may carry unique distortions compared to natural images, and not all generated images meet the requirements of the real world, so it is of great significance to evaluate AI-generated images more comprehensively. Although previous work has established human perception-based AIGC image quality assessment databases for text-generated images, AI image generation spans scenarios such as text-to-image and image-to-image, and assessing only the images generated by text-to-image models is insufficient. To address this issue, we have established a human perception-based image-to-image AIGC image quality assessment database, named PKU-I2IQA, and conducted a comprehensive analysis of it. Furthermore, we introduce two benchmark models: NR-AIGCIQA, based on no-reference image quality assessment, and FR-AIGCIQA, based on full-reference image quality assessment. Finally, leveraging this database, we conducted benchmark experiments and compared the performance of the proposed benchmark models. The PKU-I2IQA database and benchmarks will be released at https://github.com/jiquan123/I2IQA to facilitate future research.
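The generic NR-AIGCIQA recipe, a visual backbone with a regression head trained against human scores, can be sketched as follows; the ResNet-50 backbone and head sizes are assumptions, not necessarily the paper's configuration:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class NRIQARegressor(nn.Module):
    """No-reference IQA: backbone features -> scalar quality score.

    Hypothetical sketch of the NR-AIGCIQA setup: extract pooled features from a
    pretrained backbone and regress a mean opinion score (MOS), trained with an
    MSE loss against human ratings.
    """
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Identity()  # keep the 2048-d pooled features
        self.backbone = backbone
        self.head = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W), ImageNet-normalized; returns (B,) predicted scores
        return self.head(self.backbone(x)).squeeze(-1)
```

An FR-AIGCIQA counterpart would additionally take the reference (source) image and regress quality from features of both inputs.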