MoMA: Momentum contrastive learning with multi-head attention-based knowledge distillation for histopathology image analysis | Literature Digest – Large Vision Models in Medical Imaging
Title
MoMA: Momentum contrastive learning with multi-head attention-based knowledge distillation for histopathology image analysis
01
Introduction
Computational pathology is an emerging discipline that has shown great potential in recent years to improve the accuracy and robustness of conventional pathology, thereby improving the quality of patient care, treatment, and management (Cui and Zhang, 2021). With the development of advanced artificial intelligence (AI) and machine learning (ML) techniques and the availability of high-quality, high-resolution datasets, computational pathology methods have been successfully applied to many aspects of the conventional pathology workflow, such as nuclei detection (Graham et al., 2019), tissue classification (Marini et al., 2021), disease stratification (Chunduru et al., 2022), and survival analysis (Huang et al., 2021; Li et al., 2023). However, recent studies have pointed out that the generalizability of computational pathology tools remains an open problem (Stacke et al., 2020; Aubreville et al., 2023).
Building accurate and reliable computational pathology tools requires not only advanced AI models but also large amounts of high-quality data. In computational pathology, the learning capacity of AI and ML techniques and the volume of pathology datasets keep increasing. However, compared with other disciplines such as natural language processing (NLP) (Ghorbani et al., 2022) and computer vision (Zhai et al., 2022; Dehghani et al., 2023), the number of publicly available pathology datasets remains far smaller. This is partly due to the nature of pathology datasets, such as multi-gigapixel whole-slide images (WSIs), which makes transparent, worldwide sharing difficult; privacy and ethical concerns over patient data further restrict data access and sharing. Moreover, the diversity of pathology datasets is also limited. For example, the Kather19 dataset (Kather et al., 2019a) contains 100,000 image patches of nine different colorectal tissue types, but these patches originate from only 86 whole-slide images. The GLySAC dataset (Doan et al., 2022) contains 30,875 nuclei of three cell types, drawn from only 8 whole-slide images produced by a single digital slide scanner. Such scanner homogeneity within a dataset also hinders the generalizability of AI models in computational pathology (Stacke et al., 2020).
There have been efforts to provide large and diverse pathology datasets; for instance, the PANDA dataset for prostate cancer Gleason grading (Bulten et al., 2022) includes 12,625 WSIs from 6 institutions, acquired with 3 different digital slide scanners. Even so, for a specific computational pathology task it remains challenging to obtain relevant datasets of sufficient size and diversity, and such data collection is in any case time- and labor-intensive. Developing models and tools for specific computational pathology tasks therefore remains an unmet need.
Transfer learning is one of the most widely used learning approaches to overcome insufficient datasets by reusing, or transferring, knowledge obtained from one problem/task to another. Although transfer learning has been widely adopted in pathology image analysis and other domains, most previous studies use weights pre-trained on natural images such as ImageNet or JFT (Hosseinzadeh Taher et al., 2021; Dosovitskiy et al., 2021). While off-the-shelf features learned from natural images have been shown to be useful for computational pathology tasks, the effectiveness of transfer learning depends heavily on the complexity/type of the pathology images (Li and Plataniotis, 2020). As the number of public pathology image datasets grows, weights pre-trained on these datasets may be used for transfer learning; however, it is unclear whether these datasets are large or diverse enough.
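To make the transfer-learning baseline concrete, here is a minimal PyTorch sketch of this kind of fine-tuning. The backbone (ResNet-50), class count, freezing policy, and learning rate are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical setup: an ImageNet-pretrained ResNet-50 adapted to a
# 4-class pathology classification task (class count is an assumption).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 4)  # replace the ImageNet head

# Optionally freeze the backbone so only the new classifier head is trained.
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```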
Knowledge distillation (KD), proposed by Hinton et al. (2015), is another approach that can overcome data insufficiency. Rather than merely using an existing model as pre-trained weights (as in transfer learning), KD lets the target (student) model learn directly from an existing (teacher) model during training; that is, the student model tries to mimic the output of the teacher model. Variants of KD have been successfully applied to tasks such as model compression (Tian et al., 2020), cross-modality knowledge transfer (Yuan et al., 2022; Ahmed et al., 2022; Zhao et al., 2020), and ensemble distillation (Du et al., 2020; Lin et al., 2020; Allen-Zhu and Li, 2023). However, the potential of KD for transferring knowledge between models, especially in pathology image analysis, has not been fully explored.
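The "mimic the teacher's output" idea can be stated compactly. A minimal sketch of the Hinton-style distillation loss, assuming softened logits with a tunable temperature:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Hinton-style KD: KL divergence between softened teacher and
    student distributions; the temperature value is an assumption."""
    log_p_s = F.log_softmax(student_logits / temperature, dim=1)
    p_t = F.softmax(teacher_logits / temperature, dim=1)
    # T^2 rescales gradients back to the magnitude of the hard-label loss.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2
```

In the experiments below, this vanilla logit-matching term is the one weighted by $\gamma$ when the teacher and student solve the same task.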
This work addresses the challenge of limited data and annotations in computational pathology, aiming to develop computational pathology tools that are accurate and robust on unseen data. To this end, we propose an efficient and effective learning framework that exploits an existing model built on a high-quality source dataset and trains a target model on a relatively small dataset. The method, called momentum contrastive learning with multi-head attention-based knowledge distillation (MoMA), follows the KD framework to transfer relevant knowledge from the existing model and employs momentum contrastive learning and an attention mechanism to obtain consistent, reliable, and context-aware feature representations.
We evaluated MoMA on multi-tissue pathology datasets to simulate realistic scenarios in the research and development of computational pathology tools. Compared with other methods, MoMA demonstrated a superior ability to learn a target model for a specific task. Moreover, the experimental results provide guidance on how to better transfer knowledge from pre-trained models to student models on limited target datasets.
Our main contributions are as follows:
We develop MoMA, an efficient and effective learning framework that exploits existing models and high-quality datasets to build accurate and robust computational pathology tools on limited datasets.
We propose knowledge distillation via multi-head attention-based momentum contrastive learning to transfer knowledge from an existing model to a target model in a consistent and reliable manner (a brief sketch follows this list).
We evaluate MoMA on multi-tissue pathology datasets and demonstrate its superior performance in learning task-specific target models compared with related methods.
We analyze MoMA and related methods under different settings and provide guidance for developing computational pathology tools under limited-dataset conditions.
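As a rough illustration of the attention component, the sketch below re-weights each projected feature against the other images in the batch, treated as context (cf. Fig. 1). The module name, feature dimension, and head count are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BatchAttentionHead(nn.Module):
    """Sketch of an attention head applied after the projection head:
    the batch of embeddings is treated as a sequence, so each image
    attends to the other images in the batch."""
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads)  # (seq, batch, dim) layout

    def forward(self, z):                   # z: (batch, dim) projected features
        seq = z.unsqueeze(1)                # (batch, 1, dim): batch acts as the sequence
        out, _ = self.attn(seq, seq, seq)   # self-attention across the batch
        return out.squeeze(1)               # back to (batch, dim)
```

Treating the batch as a sequence is what makes the resulting representation "context-aware": the same image is embedded differently depending on which images accompany it.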
Abstract
There is no doubt that advanced artificial intelligence models and high-quality data are the keys to success in developing computational pathology tools. Although the overall volume of pathology data keeps increasing, a lack of quality data is a common issue when it comes to a specific task due to several reasons including privacy and ethical issues with patient data. In this work, we propose to exploit knowledge distillation, i.e., utilize the existing model to learn a new, target model, to overcome such issues in computational pathology. Specifically, we employ a student–teacher framework to learn a target model from a pre-trained, teacher model without direct access to source data and distill relevant knowledge via momentum contrastive learning with a multi-head attention mechanism, which provides consistent and context-aware feature representations. This enables the target model to assimilate informative representations of the teacher model while seamlessly adapting to the unique nuances of the target data. The proposed method is rigorously evaluated across different scenarios where the teacher model was trained on the same, relevant, and irrelevant classification tasks with the target model. Experimental results demonstrate the accuracy and robustness of our approach in transferring knowledge to different domains and tasks, outperforming other related methods. Moreover, the results provide a guideline on the learning strategy for different types of tasks and scenarios in computational pathology.
Method
The overview of the proposed MoMA is shown in Fig. 1 and Alg. 1 in Appendix A. Let $D_{SC} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N_{SC}}$ be a source/teacher dataset and $D_{TG} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N_{TG}}$ be a target/student dataset, where $\mathbf{x}_i$ and $\mathbf{y}_i$ represent the $i$th pathology image and its ground-truth label, respectively, and $N_{SC}$ and $N_{TG}$ represent the number of source and target samples ($N_{SC} \gg N_{TG}$), respectively. The source/teacher dataset refers to the dataset that is utilized to train a teacher model, and the target/student dataset denotes the dataset that is employed to learn a target/student model. Let $\mathcal{M}^T$ be a teacher model and $\mathcal{M}^S$ be a student model. $\mathcal{M}^T$ consists of a teacher encoder $f^T$ and a teacher classifier $g^T$; $\mathcal{M}^S$ includes a student encoder $f^S$ and a student classifier $g^S$. In addition to $\mathcal{M}^T$ and $\mathcal{M}^S$, MoMA includes a teacher projection head ($p^T$), a teacher attention head ($h^T$), a student projection head ($p^S$), and a student attention head ($h^S$). Given an input image $\mathbf{x}_i$, $f^T$ and $f^S$ extract initial feature representations, each of which is subsequently processed by a series of a projection head and an attention head, i.e., $p^T$ followed by $h^T$ or $p^S$ followed by $h^S$, to improve its representation power. $g^T$ and $g^S$ receive the initial feature representations and conduct image classification. $g^T$ is only utilized during the training of $\mathcal{M}^T$.

Due to the restrictions on sharing medical data, we assume a scenario where $\mathcal{M}^T$ has already been trained on $D_{SC}$ and the pre-trained weights of $\mathcal{M}^T$ are available, but direct access to $D_{SC}$ is limited. Provided with the pre-trained $f^T$, the objective of MoMA is to learn $\mathcal{M}^S$ on $D_{TG}$ in an accurate and robust manner. For optimization, MoMA exploits two learning paradigms: (1) the KD framework and (2) momentum contrastive learning. Combining the two learning methodologies, MoMA permits a robust and dynamic transfer of knowledge from $f^T$, which was pre-trained on a high-quality dataset, i.e., $D_{SC}$, to a target $f^S$, which is trained on a limited dataset, i.e., $D_{TG}$.
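A minimal sketch of the momentum-contrastive part, assuming a MoCo-style setup in which the teacher encoder is EMA-updated from the student and a queue of past teacher keys serves as negatives; function names, the momentum coefficient, the temperature, and the queue mechanism are assumptions consistent with the description above, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(student_encoder, teacher_encoder, m=0.999):
    """EMA update of the momentum teacher from the student (MoCo-style);
    the momentum coefficient m is an assumed value."""
    for p_s, p_t in zip(student_encoder.parameters(), teacher_encoder.parameters()):
        p_t.data.mul_(m).add_(p_s.data, alpha=1.0 - m)

def info_nce(q, k, queue, tau=0.07):
    """InfoNCE: the teacher key of the same image is the positive;
    queued (pre-normalized) teacher keys of shape (K, d) are negatives."""
    q = F.normalize(q, dim=1)                      # student features (B, d)
    k = F.normalize(k, dim=1)                      # teacher features (B, d)
    l_pos = (q * k).sum(dim=1, keepdim=True)       # (B, 1) positive logits
    l_neg = q @ queue.t()                          # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)         # positive is class 0

# Per Fig. 1, the student is then jointly optimized, e.g.
# loss = ce_loss + info_nce(q, k, queue)  (the weighting is an assumption).
```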
Conclusion
Herein, we propose an efficient and effective learning framework called MoMA to build an accurate and robust classification model in pathology images. Exploiting the KD framework, momentum contrastive learning, and self-attention (SA), MoMA was able to transfer knowledge from a source domain to a target domain and to learn a robust classification model for five different tasks. Moreover, the experimental results of MoMA suggest an adequate learning strategy for different distillation tasks and scenarios. We anticipate that this will be a great help in developing computational pathology tools for various tasks. Future studies will entail the further investigation of the efficient KD method and extended validation and application of MoMA to other types of datasets and tasks in computational pathology.
Results
Table 1 and Fig. 3 (and Figs. C.1 and C.2 in Appendix C) show the results of MoMA and its competitors on the two TMA prostate datasets (Prostate USZ and Prostate UBC). On Prostate USZ, the teacher model $\mathrm{TC}_{\mathrm{PANDA}}$, which was trained on PANDA only, achieved 63.4% ACC, 0.526 F1, and 0.531 $\kappa_w$ (weighted kappa), which is substantially lower than the other student models with TL (transfer learning), LD (logit distillation), and FD (feature distillation). Among the student models with TL, the student model with no pre-trained weights ($\mathrm{FT}_{\mathrm{None}}$) was inferior to the other two student models; the student model pre-trained on PANDA ($\mathrm{FT}_{\mathrm{PANDA}}$) outperformed the student model pre-trained on ImageNet ($\mathrm{FT}_{\mathrm{ImageNet}}$). These results indicate the importance of pre-trained weights and of fine-tuning on the target dataset, i.e., Prostate USZ. As for the KD approaches, $\mathrm{MoMA}_{\mathrm{PANDA}}$, pre-trained on PANDA, outperformed all other KD methods, achieving an ACC of 73.6%, which is 0.9% higher than $\mathrm{FT}_{\mathrm{PANDA}}$, and an F1 of 0.687 and $\kappa_w$ of 0.670, which are comparable to those of $\mathrm{FT}_{\mathrm{PANDA}}$.
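For reference, a sketch of how the reported metrics (ACC, F1, and $\kappa_w$) can be computed with scikit-learn; quadratic weighting for $\kappa_w$ and macro-averaged F1 are assumptions, being common choices for ordinal grading labels:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def evaluate(y_true, y_pred):
    """ACC, macro F1, and weighted kappa for class-label predictions."""
    return {
        "acc": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "kappa_w": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
    }
```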
On the independent test set, Prostate UBC, it is remarkable that $\mathrm{TC}_{\mathrm{PANDA}}$ achieved 78.2% ACC and 0.680 $\kappa_w$, which are superior to those of all the student models with TL, likely suggesting that the characteristics of PANDA are more similar to Prostate UBC than to Prostate USZ. The performance of the student models with TL and FD was similar between Prostate USZ and Prostate UBC; for instance, $\mathrm{MoMA}_{\mathrm{PANDA}}$ obtained higher ACC but lower F1 and $\kappa_w$ on Prostate UBC than on Prostate USZ. As MoMA and the other student models with FD adopt vanilla KD by setting $\gamma$ to 1 in the distillation loss, i.e., mimicking the output logits of the teacher model, there was, in general, a substantial increase in the performance on Prostate UBC. $\mathrm{MoMA}_{\mathrm{PANDA}}$, in particular, achieved the highest ACC of 83.3% and $\kappa_w$ of 0.763 over all models under consideration, which are 11.1% and 0.145 higher than those on Prostate USZ in ACC and $\kappa_w$, respectively.

By randomly sampling 25% and 50% of the training set, we repeated the above experiments using MoMA and other competing models to assess the effect of the size of the training set. The results of the same task distillation using 25% and 50% of the training set are available in Appendix B (Tables B.1 and B.2). The experimental results were more or less the same as those using the entire training set. $\mathrm{MoMA}_{\mathrm{PANDA}}$ was comparable to $\mathrm{FT}_{\mathrm{PANDA}}$ on Prostate USZ. $\mathrm{KL+MoMA}_{\mathrm{PANDA}}$ outperformed the competing models on Prostate UBC. These results validate the effectiveness of MoMA on the extremely small target dataset.
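Taken together with Fig. 2, the objective that is actually optimized depends on the scenario. A minimal sketch of that selection logic; $\gamma$ mirrors the text above, while the function shape and the name lambda_fd are illustrative assumptions:

```python
def total_loss(ce, fd=None, kd=None, lambda_fd=1.0, gamma=1.0):
    """ce: supervised cross-entropy, always present (step (1) in Fig. 2).
    fd: feature-distillation/contrastive term, used when a well-trained
        teacher is available (step (2)).
    kd: vanilla logit-distillation term, used only when teacher and
        student solve the same task (step (3)); gamma=1 gives vanilla KD."""
    loss = ce
    if fd is not None:
        loss = loss + lambda_fd * fd
    if kd is not None:
        loss = loss + gamma * kd
    return loss
```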
Figure
Fig. 1. Overview of MoMA: attention-augmented momentum contrast knowledge distillation framework. A batch of input images is encoded by the student encoder ($f^S$) and the momentum teacher ($f^T$), and each feature representation is re-weighted with regard to the other images in the batch as the context. A classifier is added on top of the student encoder. The student model is jointly optimized by contrastive loss and cross-entropy loss.
Fig. 2. Overview of the distillation flow across different tasks and datasets. (1) The supervised task is always conducted, (2) feature distillation is applied if a well-trained teacher model is available, and (3) vanilla $L_{KD}$ is employed if the teacher and student models conduct the same task. SSL stands for self-supervised learning.
Fig. 3. Box plots for same task distillation: All the KD models utilize the pre-trained weights from PANDA.
Fig. 4. Box plots for relevant task distillation. All the KD models utilize the pre-trained weights from PANDA.
Fig. 5. Bar plots for irrelevant task distillation. All the KD models utilize the pre-trained weights from ImageNet.
Fig. 6. The correlation coefficient matrix between feature representations of a teacher network and a student network.
Fig. 7. t-SNE visualization of feature representations with silhouette scores for ImageNet and PANDA teacher models and 5 student datasets: (a) Prostate USZ, (b) Prostate UBC, (c) Prostate AGGC, (d) Colon K16, (e) Breast BRACS.
Fig. C.1. Confusion matrices on Prostate USZ (Test I). Each confusion matrix represents the average across 5 runs.
Fig. C.2. Confusion matrices on Prostate UBC (Test II). Each confusion matrix represents the average across 5 runs.
Fig. C.3. Confusion matrices on Prostate AGGC CV (Test I). Each confusion matrix represents the average across 5-fold cross-validation experiments.
Fig. C.4. Confusion matrices on Prostate AGGC test (Test II). Each confusion matrix represents the average across 5-fold cross-validation experiments.
Fig. C.5. Confusion matrices on Colon K16 SN (Test I). Each confusion matrix represents the average across 5 runs.
Fig. C.6. Confusion matrices on Colon K16 (Test II). Each confusion matrix represents the average across 5 runs.
Fig. C.7. Confusion matrices on irrelevant task distillation: breast carcinoma sub-type classification. Each confusion matrix represents the average across 5 runs.
Fig. C.8. Confusion matrices on irrelevant task distillation: gastric microsatellite instability prediction. Each confusion matrix represents the average across 5-fold cross-validation experiments.
Table
Table 1. Results of same task distillation. KL denotes the use of KL divergence loss.
Table 2. Results of relevant task distillation.
Table 3. Results of irrelevant task distillation: colon tissue type classification.
Table 4. Results of irrelevant task distillation: breast carcinoma sub-type classification.
Table 5. Ablation results of MoMA with and without multi-head attention on the three distillation tasks.
Table 6. Results of three distillation tasks with MoMA and self-supervised learning.
Table 7. Results of irrelevant task distillation: gastric microsatellite instability prediction.
Table B.1. Results of same task distillation on 50% training data. KL denotes the use of KL divergence loss.
Table B.2. Results of same task distillation on 25% training data. KL denotes the use of KL divergence loss.
Table B.3. Results of irrelevant task distillation: prostate cancer classification.