
Evidence-Based Rankings

Tuesday, July 9th, 2019, 9:45 am – 10:30 am


Speaker: Michael Kim (Stanford University)

Many selection procedures involve ordering candidates according to their qualifications. For example, a university might order applicants according to a perceived probability of graduation within four years, and then select the top 1000 applicants. In this work, we address the problem of ranking members of a population according to their "probability" of success, based on a training set of *binary* outcome data (e.g., graduated in four years or not). We show how to obtain rankings that satisfy a number of desirable accuracy and fairness criteria, despite the coarseness of binary outcome data. As the task of ranking is global -- that is, the rank of every individual depends not only on their own qualifications, but also on every other individual's qualifications -- ranking is more subtle (and more vulnerable to manipulation) than standard prediction tasks.
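
As a concrete illustration of this kind of pipeline (a minimal sketch, not the construction from the talk), the code below fits a probability estimator on binary outcome data and ranks candidates by estimated success probability; the synthetic features, the logistic model, and the cutoff of 1000 are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic cohort: 5000 candidates with 3 observed features (illustrative only).
rng = np.random.default_rng(0)
n, d = 5000, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -0.5, 0.25])        # hypothetical "true" weights
p_true = 1.0 / (1.0 + np.exp(-X @ w_true))  # unobserved success probabilities
y = rng.binomial(1, p_true)                 # the observed *binary* outcomes

# Fit a probability estimator using only the binary outcomes ...
model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]       # estimated probability of success

# ... then rank candidates by estimated probability and select the top 1000.
order = np.argsort(-scores)                 # position 0 = highest estimated probability
selected = order[:1000]
print(f"outcome rate among selected: {y[selected].mean():.3f}")
print(f"outcome rate overall:        {y.mean():.3f}")
```

Note that each candidate's rank depends on every other candidate's score, which is exactly what makes ranking a global task rather than a standard per-individual prediction task.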

We develop two hierarchies of definitions of "evidence-based" rankings. The first hierarchy relies on a semantic notion of *domination-compatibility*: if the training data suggest that members of a set A are, on average, more qualified than the members of B, then a ranking that favors B over A (i.e., where B *dominates* A) is blatantly inconsistent with the data, and likely to be discriminatory. The definition asks for domination-compatibility not just for a single pair of sets, but for every pair of sets from a rich collection C of subpopulations. The second hierarchy aims at precluding even more general forms of discrimination; this notion of *evidence-consistency* requires that the ranking be justified on the basis of consistency with the expectations for every set in the collection C. Somewhat surprisingly, while evidence-consistency is a strictly stronger notion than domination-compatibility when the collection C is predefined, the two notions are equivalent when the collection C may depend on the ranking in question, and both are closely related to the notion of *multi-calibration* for predictors.
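
To make the discussion concrete, here is one plausible formalization of the two notions mentioned above; the tolerance parameters and the exact form of "domination" are assumptions for exposition, not the talk's precise definitions (the multi-calibration definition follows Hébert-Johnson, Kim, Reingold, and Rothblum, 2018).

```latex
% A ranking r over a population X could be called
% \varepsilon-domination-compatible with respect to a collection C if,
% for all sets A, B \in C,
\[
  \mathbb{E}[\,y \mid x \in A\,] > \mathbb{E}[\,y \mid x \in B\,] + \varepsilon
  \;\Longrightarrow\; \text{$B$ does not dominate $A$ under $r$.}
\]

% For comparison, a predictor p is (C, \alpha)-multi-calibrated if, for all
% S \in C and all values v in the range of p,
\[
  \bigl|\, \mathbb{E}[\, y - p(x) \mid x \in S,\ p(x) = v \,] \,\bigr| \le \alpha .
\]
```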