Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization

21 Oct 2021 · Devansh Arpit, Huan Wang, Yingbo Zhou, Caiming Xiong

In Domain Generalization (DG) settings, models trained independently on a given set of training domains have notoriously chaotic performance on distribution-shifted test domains, and stochasticity in optimization (e.g., the random seed) plays a big role. This makes deep learning models unreliable in real-world settings. We first show that this chaotic behavior exists even along the training optimization trajectory of a single model, and propose a simple model averaging protocol that both significantly boosts domain generalization and diminishes the impact of stochasticity by improving the rank correlation between the in-domain validation accuracy and out-domain test accuracy, which is crucial for reliable early stopping. Taking advantage of our observation, we show that instead of ensembling unaveraged models (which is typical in practice), ensembling moving average models from independent runs (Ensemble of Averages, EoA) further boosts performance. We theoretically explain the boost in performance of ensembling and model averaging by adapting the well-known Bias-Variance trade-off to the domain generalization setting. On the DomainBed benchmark, when using a pre-trained ResNet-50, this ensemble of averages achieves an average of $68.0\%$, beating vanilla ERM (without averaging or ensembling) by $\sim 4\%$, and when using a pre-trained RegNetY-16GF, it achieves an average of $76.6\%$, beating vanilla ERM by $6\%$. Our code is available at https://github.com/salesforce/ensemble-of-averages.
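The two ingredients named in the abstract, averaging weights along one training run and then ensembling the averaged models from independent runs, can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the abstract does not state the averaging scheme, so a uniform running mean over checkpoints is assumed here, and the function names `running_average` and `ensemble_probs`, the flat-list representation of parameters, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence


def running_average(checkpoints: Sequence[Sequence[float]]) -> List[float]:
    """Uniform average of the parameter vectors saved along ONE training run.

    Assumption: a simple (unweighted) mean over checkpoints; the abstract only
    says "moving average" and does not fix the exact scheme.
    """
    avg = [0.0] * len(checkpoints[0])
    for t, params in enumerate(checkpoints, start=1):
        # Incremental mean update: avg_t = avg_{t-1} + (theta_t - avg_{t-1}) / t
        avg = [a + (p - a) / t for a, p in zip(avg, params)]
    return avg


def ensemble_probs(per_model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class probabilities that several models predict for one input."""
    n_models = len(per_model_probs)
    return [sum(class_col) / n_models for class_col in zip(*per_model_probs)]


# Toy data (hypothetical): two checkpoints of a two-parameter model from one run.
run_a_checkpoints = [[1.0, 2.0], [3.0, 4.0]]
averaged_weights = running_average(run_a_checkpoints)

# Toy data (hypothetical): class probabilities from two averaged models,
# each coming from an independent run, for the same test input.
eoa_prediction = ensemble_probs([[0.8, 0.2], [0.4, 0.6]])
```

Averaging happens in weight space within a run, while ensembling happens in prediction space across runs; the sketch keeps the two steps separate to mirror that distinction.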

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Domain Generalization | DomainNet | Ensemble of Averages (ResNet-50) | Average Accuracy | 47.4 | # 16 |
| Domain Generalization | DomainNet | Ensemble of Averages (RegNetY-16GF) | Average Accuracy | 60.9 | # 5 |
| Domain Generalization | DomainNet | Ensemble of Averages (ResNeXt-50 32x4d) | Average Accuracy | 54.6 | # 10 |
| Domain Generalization | Office-Home | Ensemble of Averages (ResNeXt-50 32x4d) | Average Accuracy | 80.2 | # 12 |
| Domain Generalization | Office-Home | Ensemble of Averages (RegNetY-16GF) | Average Accuracy | 83.9 | # 7 |
| Domain Generalization | Office-Home | Ensemble of Averages (ResNet-50) | Average Accuracy | 72.5 | # 18 |
| Domain Generalization | PACS | Ensemble of Averages (ResNeXt-50 32x4d) | Average Accuracy | 93.2 | # 12 |
| Domain Generalization | PACS | Ensemble of Averages (ResNet-50) | Average Accuracy | 88.6 | # 19 |
| Domain Generalization | PACS | Ensemble of Averages (RegNetY-16GF) | Average Accuracy | 95.8 | # 9 |
| Domain Generalization | TerraIncognita | Ensemble of Averages (ResNeXt-50 32x4d) | Average Accuracy | 55.2 | # 11 |
| Domain Generalization | TerraIncognita | Ensemble of Averages (RegNetY-16GF) | Average Accuracy | 61.1 | # 4 |
| Domain Generalization | TerraIncognita | Ensemble of Averages (ResNet-50) | Average Accuracy | 52.3 | # 13 |
| Domain Generalization | VLCS | Ensemble of Averages (RegNetY-16GF) | Average Accuracy | 81.1 | # 12 |
| Domain Generalization | VLCS | Ensemble of Averages (ResNet-50) | Average Accuracy | 79.1 | # 22 |
| Domain Generalization | VLCS | Ensemble of Averages (ResNeXt-50 32x4d) | Average Accuracy | 80.4 | # 13 |
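As a consistency check, the per-dataset accuracies listed above can be averaged per backbone and compared with the benchmark averages quoted in the abstract. The sketch below assumes the benchmark average is the unweighted mean over the five datasets that appear in these results (DomainNet, Office-Home, PACS, TerraIncognita, VLCS); the dictionary layout and the helper name `benchmark_average` are illustrative only.

```python
# Per-dataset average accuracies (percent), copied from the results listed above.
results = {
    "ResNet-50": {
        "DomainNet": 47.4, "Office-Home": 72.5, "PACS": 88.6,
        "TerraIncognita": 52.3, "VLCS": 79.1,
    },
    "ResNeXt-50 32x4d": {
        "DomainNet": 54.6, "Office-Home": 80.2, "PACS": 93.2,
        "TerraIncognita": 55.2, "VLCS": 80.4,
    },
    "RegNetY-16GF": {
        "DomainNet": 60.9, "Office-Home": 83.9, "PACS": 95.8,
        "TerraIncognita": 61.1, "VLCS": 81.1,
    },
}


def benchmark_average(per_dataset: dict) -> float:
    """Unweighted mean accuracy across datasets (an assumption about the protocol)."""
    return sum(per_dataset.values()) / len(per_dataset)


# Round to one decimal place, matching the precision used in the abstract.
averages = {model: round(benchmark_average(acc), 1) for model, acc in results.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}")
```

Under that assumption the means come out to 68.0 for ResNet-50 and 76.6 for RegNetY-16GF, which agrees with the figures in the abstract, and 72.7 for ResNeXt-50 32x4d, a backbone the abstract does not mention.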

Methods