I am a Ph.D. student in the Computer Science Department at Stanford University, advised by Prof. Kayvon Fatahalian. My research focuses on designing algorithms and systems that enable domain experts to rapidly define, train, and validate computer vision models for their tasks. I've done research internships at Facebook Reality Labs (with Yaser Sheikh, Chenglei Wu, and Shoou-I Yu) and at NVIDIA Research (with Michael Garland and Michael Bauer), and I've worked as a contractor at Facebook to transfer my research into production.
Forager: Rapid Data Exploration and Model Development
It is now possible to create computer vision (CV) models that can reliably perform important visual recognition tasks – object detection, classification, depth estimation, etc. However, the techniques for creating such CV models are ad-hoc, expensive, and must be repeated whenever the task definition changes. For example, creating a model for detecting a new type of object requires repeating the manual process of data collection, data labeling, model selection, model training, model optimization, and validation. The Forager project is a broad effort toward unifying and streamlining the model creation process.
Scanner: Efficient Video Analysis at Scale
Scanner is a system for efficient video analysis at scale and was one of the major thrusts of the Intel Science and Technology Center for Visual Cloud Systems at CMU. Scanner is focused on answering the question: what are the fundamental primitives for expressing large-scale video processing algorithms, and what system architecture can efficiently implement these primitives on heterogeneous clusters featuring both CPUs and GPUs? Scanner is available to the community as an open-source system (https://github.com/scanner-research/scanner). In cooperation with collaborators at Facebook, CMU, Stanford, and UC Berkeley, Scanner has been used to:
- Provide the production compute engine for executing Facebook’s processing pipelines for synthesizing high-quality 360 stereo VR video.
- Mine and annotate 10 years' worth of TV news (200k hours) to analyze trends and biases in the media.
- Accelerate 3D pose estimation from 480 cameras.
Learning Rare Category Classifiers on a Tight Labeling Budget
Many real-world ML deployments require learning a rare category model with a small labeling budget. Because one often also has access to large amounts of unlabeled data, it is attractive to formulate the problem as semi-supervised or active learning. However, prior work often makes two assumptions that do not hold in practice: (a) one has access to a modest amount of labeled data to bootstrap learning, and (b) every image belongs to a common category of interest. In this paper, we learn models initialized with as few as five labeled positives and where 99.9% of the unlabeled data does not belong to the category of interest. To do so, we introduce active semi-supervised methods tailored for rare categories and small labeling budgets. We make use of two key insights: (a) we delegate human and machine effort where each is most useful; human labels are used to identify "needle-in-a-haystack" positives, while machine-generated pseudo-labels are used to identify negatives. (b) Because iteratively learning from highly imbalanced and noisy labels is difficult, we leverage simple approaches to knowledge transfer to learn good features and rapidly train models using cached features. We compare our approach with prior active learning and semi-supervised approaches, demonstrating significant improvements in accuracy per unit labeling effort, particularly on a tight labeling budget.
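One round of this human/machine division of labor can be sketched as follows. This is a toy illustration, not the paper's implementation: the classifier is a trivial nearest-centroid scorer over cached features, and `oracle`, `neg_threshold`, and all other names are hypothetical stand-ins.

```python
def centroid(vectors):
    """Elementwise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def active_round(features, seed_pos, oracle, budget, neg_threshold):
    """One round of the recipe above: human labels are spent on likely
    'needle-in-a-haystack' positives, while clearly dissimilar examples
    are pseudo-labeled as negatives with no human effort. `features`
    maps example ids to cached feature vectors; `oracle` plays the
    human annotator."""
    pos = set(seed_pos)
    pos_c = centroid([features[i] for i in pos])
    # Rank unlabeled examples by proximity to the positive centroid.
    ranked = sorted((i for i in features if i not in pos),
                    key=lambda i: dist2(features[i], pos_c))
    # Human effort: spend the entire labeling budget on the most
    # promising candidates.
    for i in ranked[:budget]:
        if oracle(i):
            pos.add(i)
    # Machine effort: pseudo-label far-away examples as negatives.
    pseudo_neg = {i for i in ranked[budget:]
                  if dist2(features[i], pos_c) > neg_threshold}
    return pos, pseudo_neg
```

In practice the updated positive set would re-train the model and seed the next round, with features kept fixed so each round is cheap.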
Low-shot Validation: Active Importance Sampling for Estimating Classifier Performance on Rare Categories
For machine learning models trained with limited labeled training data, validation stands to become the main bottleneck to reducing overall annotation costs. We propose a statistical validation algorithm that accurately estimates the F-score of binary classifiers for rare categories, where finding relevant examples to evaluate on is particularly challenging. Our key insight is that simultaneous calibration and importance sampling enables accurate estimates even in the low-sample regime (< 300 samples). Critically, we also derive an accurate single-trial estimator of the variance of our method and demonstrate that this estimator is empirically accurate at low sample counts, enabling a practitioner to know how well they can trust a given low-sample estimate. When validating state-of-the-art semi-supervised models on ImageNet and iNaturalist2017, our method achieves the same estimates of model performance with up to 10× fewer labels than competing approaches. In particular, we can estimate a model's F1 score with a variance of 0.005 using as few as 100 labels.
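The core importance-sampling idea can be sketched as below. This toy version omits the paper's calibration step and variance estimator; `proposal`, `oracle`, and `preds` are hypothetical names, and the weights are the standard Horvitz-Thompson correction for sampling with replacement.

```python
import random

def importance_f1(preds, oracle, proposal, n_labels, rng=random):
    """Estimate the F1 score of binary predictions `preds` (id -> bool)
    by importance-sampling `n_labels` items from `proposal` (id ->
    probability, summing to 1) and asking `oracle` for true labels.
    Weighting the proposal toward likely positives concentrates labels
    where rare-category errors live."""
    ids = list(proposal)
    sample = rng.choices(ids, weights=[proposal[i] for i in ids], k=n_labels)
    tp = fp = fn = 0.0
    for i in sample:
        w = 1.0 / (n_labels * proposal[i])  # Horvitz-Thompson weight
        y, yhat = oracle(i), preds[i]
        if y and yhat:
            tp += w
        elif yhat and not y:
            fp += w
        elif y and not yhat:
            fn += w
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0
```

Because the denominator always dominates the numerator, the estimate stays in [0, 1] even at very low sample counts.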
In this paper, we focus on the problem of training deep image classification models for a small number of extremely rare categories. In this common, real-world scenario, almost all images belong to the background category in the dataset. We find that state-of-the-art approaches for training on imbalanced datasets do not produce accurate deep models in this regime. Our solution is to split the large, visually diverse background into many smaller, visually similar categories during training. We implement this idea by extending an image classification model with an additional auxiliary loss that learns to mimic the predictions of a pre-existing classification model on the training set. The auxiliary loss requires no additional human labels and regularizes feature learning in the shared network trunk by forcing the model to discriminate between auxiliary categories for all training set examples, including those belonging to the monolithic background of the main rare category classification task. To evaluate our method, we contribute modified versions of the iNaturalist and Places365 datasets where only a small subset of rare category labels are available during training (all other images are labeled as background). By jointly learning to recognize both the selected rare categories and auxiliary categories, our approach yields models that perform 8.3 mAP points higher than state-of-the-art imbalanced learning baselines when 98.30% of the data is background, and up to 42.3 mAP points higher than fine-tuning baselines when 99.98% of the data is background.
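The per-example objective can be sketched as a weighted sum of two cross-entropy terms. This is a minimal illustration, not the paper's training code: `aux_weight` is a hypothetical balancing hyperparameter, and the auxiliary target would come from a pre-existing classifier's prediction on each image.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under a softmax over `logits`."""
    return -math.log(softmax(logits)[target])

def joint_loss(main_logits, main_target, aux_logits, aux_target,
               aux_weight=1.0):
    """Combined objective sketched above: the main rare-category loss
    (where most images share a single 'background' target) plus an
    auxiliary loss whose target is a pre-existing model's prediction.
    The auxiliary term forces the shared trunk to discriminate among
    background images too, even though the main head lumps them together."""
    return (cross_entropy(main_logits, main_target)
            + aux_weight * cross_entropy(aux_logits, aux_target))
```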
A growing number of visual computing applications depend on the analysis of large video collections. The challenge is that scaling applications to operate on these datasets requires efficient systems for pixel data access and parallel processing across large numbers of machines. Few programmers have the capability to operate efficiently at these scales, limiting the field’s ability to explore new applications that leverage big video data. In response, we have created Scanner, a system for productive and efficient video analysis at scale. Scanner organizes video collections as tables in a data store optimized for sampling frames from compressed video, and executes pixel processing computations, expressed as dataflow graphs, on these frames. Scanner schedules video analysis applications expressed using these abstractions onto heterogeneous throughput computing hardware, such as multi-core CPUs, GPUs, and media processing ASICs, for high-throughput pixel processing. We demonstrate the productivity of Scanner by authoring a variety of video processing applications including the synthesis of stereo VR video streams from multi-camera rigs, markerless 3D human pose reconstruction from video, and data-mining big video datasets such as hundreds of feature-length films or over 70,000 hours of TV news. These applications achieve near-expert performance on a single machine and scale efficiently to hundreds of machines, enabling formerly long-running big video data analysis tasks to be carried out in minutes to hours.
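The execution model described above (tables of frames, samplers, and dataflow graphs of per-frame operations) can be caricatured in a few lines. This is purely illustrative and does not reflect Scanner's actual API; see the repository for the real interface.

```python
def run_graph(table, sampler, ops):
    """Toy model of Scanner-style execution: `table` maps frame indices
    to frames, `sampler` chooses which rows to touch (so most of a
    compressed video is never decoded), and the dataflow graph (here
    simplified to a linear list of per-frame ops) is applied to each
    sampled frame."""
    out = {}
    for idx in sampler(len(table)):
        frame = table[idx]
        for op in ops:
            frame = op(frame)
        out[idx] = frame
    return out

def stride_sampler(stride):
    """Sample every `stride`-th frame, a common pattern for video mining."""
    return lambda n: range(0, n, stride)
```

In the real system, each op in the graph can be scheduled onto CPUs, GPUs, or media ASICs, and the table rows are partitioned across machines.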
We present an approach to accelerate multi-view stereo (MVS) by prioritizing computation on image patches that are likely to produce accurate 3D surface reconstructions. Our key insight is that the accuracy of the surface reconstruction from a given image patch can be predicted significantly faster than performing the actual stereo matching. The intuition is that non-specular, fronto-parallel, in-focus patches are more likely to produce accurate surface reconstructions than highly specular, slanted, blurry patches, and that these properties can be reliably predicted from the image itself. By prioritizing stereo matching on a subset of patches that are highly reconstructable and also cover the 3D surface, we are able to accelerate MVS with minimal reduction in accuracy and completeness. To predict the reconstructability score of an image patch from a single view, we train an image-to-reconstructability neural network: the I2RNet. This reconstructability score enables us to efficiently identify image patches that are likely to provide the most accurate surface estimates before performing stereo matching. We demonstrate that the I2RNet, when trained on the ScanNet dataset, generalizes to the DTU and Tanks & Temples MVS datasets. By using our I2RNet with an existing MVS implementation, we show that our method can achieve more than a 30× speed-up over the baseline with only a minimal loss in completeness.
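The prioritization step, trading off high reconstructability scores against surface coverage, can be sketched as a greedy selection. This is a hypothetical illustration: `score_fn` stands in for the learned I2RNet predictor and `region_fn` for whatever surface-region assignment the pipeline uses.

```python
def select_patches(patches, score_fn, region_fn, budget):
    """Greedily take the highest-scoring patches, skipping any whose
    surface region is already covered, so expensive stereo matching is
    spent only on `budget` patches that are both highly reconstructable
    and collectively cover the 3D surface."""
    covered = set()
    chosen = []
    for p in sorted(patches, key=score_fn, reverse=True):
        region = region_fn(p)
        if region in covered:
            continue  # a better patch already covers this region
        chosen.append(p)
        covered.add(region)
        if len(chosen) == budget:
            break
    return chosen
```

Because scoring a patch is far cheaper than matching it, the selection cost is negligible next to the stereo matching it avoids.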