Rethinking Pre-training and Self-training


Quoc Le: We researchers love pre-training. Our new paper shows that pre-training is unhelpful when we have a lot of labeled data. In contrast, self-training works well even when we have a lot of labeled data. SOTA on PASCAL segmentation & COCO detection. Link:

Sayak Paul: Here's a list of my favorite recent papers on transfer learning for vision: - BigTransfer: - VirTex: - SimCLRv2: - Self-training: Would love to see a T5-like paper for vision.

Barret Zoph: Models and checkpoints are now open sourced for my recent work: "Rethinking Pre-training and Self-training". Paper link: Code Link: On COCO we achieve 54.3 AP and on Pascal Segmentation 90.5 mIOU!

Thang Luong: Success of self-training extends to object detection and semantic segmentation! Key to SOTA results in PASCAL semantic segmentation is the use of #NoisyStudent EfficientNet-L2 checkpoints :)

Joan Serrà: Insightful paper comparing pre-trained (transfer learning) with self-trained models: TLDR: self-training >> pre-training (including self-supervised pre-training). Encouraging!

Mingxing Tan: Excited to see self-training obtains SoTA accuracy on COCO detection and Pascal segmentation. What if you also need efficiency? Try out our updated EfficientDet (53.7AP, with 55M params and 122ms latency): Enjoy :)

Hossein Mobahi: I see rapidly growing success from "self-training" and "self-distillation" type methods recently. There is a lot of opportunity there for theoretical understanding and explanations with huge practical impact, as these methods are now at the core of some SOTA models.

Leo Boytsov: If self-supervised and supervised pre-training both have somewhat limited value in CV, is there hope for NLP? Do large self-supervised Transformers work because most NLP tasks are low-data-regime tasks (and NLP might need more data compared to vision)?

Daisuke Okanohara: Pre-training does not improve (and can even hurt) performance when stronger data augmentation and large labeled datasets are available. On the other hand, self-training is always helpful, in both low-data and large-data regimes, with stronger data augmentation.

Connor Shorten: Rethinking Pre-training and Self-training 📚 "Our results suggest that both supervised and self-supervised pre-training methods fail to scale as the labeled dataset size grows, while self-training is still useful."

akira: When object detection is trained on all the labels, or when strong data augmentation is used, ImageNet pre-training can degrade accuracy. But with self-training (Noisy Student), accuracy improves even in those cases.

