Learning Rare Category Classifiers on a Tight Labeling Budget

Many real-world ML deployments require learning a rare category model with a small labeling budget. Because one often also has access to large amounts of unlabeled data, it is attractive to formulate the problem as semi-supervised or active learning. However, prior work often makes two assumptions that do not hold in practice: (a) one has access to a modest amount of labeled data to bootstrap learning, and (b) every image belongs to a common category of interest. In this paper, we learn models initialized with as few as five labeled positives, where 99.9% of the unlabeled data does not belong to the category of interest. To do so, we introduce active semi-supervised methods tailored to rare categories and small labeling budgets. We make use of two key insights: (a) we delegate human and machine effort where each is most useful; human labels are used to identify “needle-in-a-haystack” positives, while machine-generated pseudo-labels are used to identify negatives. (b) Because iteratively learning from highly imbalanced and noisy labels is difficult, we leverage simple approaches to knowledge transfer to learn good features and rapidly train models using cached features. We compare our approach with prior active learning and semi-supervised approaches, demonstrating significant improvements in accuracy per unit labeling effort, particularly on a tight labeling budget.
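The division of labor described above can be made concrete with a minimal sketch. The Python code below is illustrative only, not our implementation: the loop structure, the logistic-regression head, and the budget parameters (`query_per_round`, `neg_pseudo_per_round`) are assumptions chosen for exposition. Each round, a lightweight classifier is retrained on cached features, a human verifies the top-scoring candidates (where rare positives are most likely to be found), and the lowest-scoring examples are pseudo-labeled as negatives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_semi_supervised_loop(
    feats,                      # (N, D) cached features from a frozen pretrained backbone
    pos_idx,                    # indices of the ~5 seed positives
    oracle,                     # callable: index -> 0/1 label (the human annotator)
    rounds=10,
    query_per_round=10,         # human labels spent per round (assumed budget split)
    neg_pseudo_per_round=1000,  # machine-labeled negatives added per round
):
    """Illustrative active semi-supervised loop: humans verify likely
    positives, the model pseudo-labels confident negatives, and a cheap
    linear head is retrained on cached features each round."""
    N = len(feats)
    labels = {int(i): 1 for i in pos_idx}  # index -> label (human or pseudo)

    # Bootstrap negatives: random unlabeled points, almost surely negative
    # when positives are ~0.1% of the pool (an assumption, not ground truth).
    rng = np.random.default_rng(0)
    for i in rng.choice(N, size=100, replace=False):
        labels.setdefault(int(i), 0)

    for _ in range(rounds):
        # Rapid retraining: only a linear head is fit, on cached features.
        idx = np.fromiter(labels.keys(), dtype=int)
        y = np.fromiter(labels.values(), dtype=int)
        clf = LogisticRegression(max_iter=1000, class_weight="balanced")
        clf.fit(feats[idx], y)

        # Rank the unlabeled pool by predicted probability of being positive.
        scores = clf.predict_proba(feats)[:, 1]
        unlabeled = np.array([i for i in range(N) if i not in labels])
        order = unlabeled[np.argsort(-scores[unlabeled])]

        # Human effort: verify the highest-scoring candidates, where
        # needle-in-a-haystack positives are most likely to surface.
        for i in order[:query_per_round]:
            labels[int(i)] = oracle(int(i))

        # Machine effort: pseudo-label the lowest-scoring examples as
        # negatives; under heavy imbalance these are rarely wrong.
        for i in order[-neg_pseudo_per_round:]:
            labels.setdefault(int(i), 0)

    return clf, labels
```

The sketch highlights the two insights: human queries are concentrated where they are most informative (likely positives), while the abundant, easy decisions (confident negatives) are handled by pseudo-labels, and because features are cached, each retraining step is cheap enough to run many rounds under a tight budget.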