Evaluating Model Performance Under Worst-case Subpopulations | Synapse