Using predictions from arbitrary models to get tighter confidence intervals

This is Jessica. I previously blogged about conformal prediction, an approach to getting prediction sets that are guaranteed on average to achieve at least some user-defined coverage level (e.g., 95%). If it's a classification problem, the prediction sets consist of a discrete set of labels, and if the outcome is continuous (regression) they're intervals. The basic idea can be described as using a labeled held-out data set (the calibration set) to adjust the (often incorrect) heuristic notion of uncertainty you get from a predictive model, like the softmax value, in order to get valid prediction sets.
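For concreteness, here's a minimal sketch of the split conformal recipe for classification. The score function (one minus the softmax of the true label) is just the simplest common choice, and all the names are mine:

```python
import numpy as np

def conformal_prediction_sets(cal_scores, cal_labels, test_scores, alpha=0.05):
    """Split conformal prediction for classification.

    cal_scores:  (n, K) softmax outputs on the labeled calibration set
    cal_labels:  (n,)   true labels for the calibration set
    test_scores: (m, K) softmax outputs on new inputs
    Returns a list of prediction sets (arrays of label indices), each of
    which contains the true label with probability >= 1 - alpha on average.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - softmax probability assigned to the true label.
    nonconformity = 1.0 - cal_scores[np.arange(n), cal_labels]
    # Calibrated threshold, with the finite-sample (n + 1) correction.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(nonconformity, q_level, method="higher")
    # Include every label whose score clears the calibrated threshold.
    return [np.where(1.0 - s <= qhat)[0] for s in test_scores]
```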
Lately I've been thinking a bit about how useful it is in practice, like when predictions are available to someone making a decision. E.g., if the decision maker is presented with a prediction set rather than just the single most likely label, in what ways might this change their decision process? It's also interesting to think about how you get people to understand the differences between a model-agnostic versus a model-dependent prediction set or uncertainty interval, and how their use of them should differ.
But beyond the human-facing side, there are some more direct applications of conformal prediction to improve inference tasks. One uses what is essentially conformal prediction to estimate the transfer performance of an ML model trained on one domain when you apply it to a new domain. It's a useful idea if you're happy with assuming that the domains were drawn i.i.d. from some unknown meta-distribution, which seems hard in practice.
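I haven't implemented their method, but my understanding of the logic is that per-domain performance numbers get treated as exchangeable draws from the meta-distribution, so a conformal quantile over held-out domains bounds performance on a new one. A toy sketch under that assumption (the accuracy numbers are made up):

```python
import numpy as np

# Hypothetical accuracies of a fixed model on k held-out domains, assumed
# drawn i.i.d. from the same (unknown) meta-distribution as the new domain.
domain_accuracies = np.array([0.91, 0.87, 0.93, 0.85, 0.90,
                              0.88, 0.92, 0.86, 0.89])
alpha = 0.2
k = len(domain_accuracies)

# By exchangeability, the new domain's accuracy ranks uniformly among the
# k + 1 values, so the r-th smallest held-out accuracy is a valid
# 1 - alpha lower bound whenever r = floor(alpha * (k + 1)) >= 1.
r = int(np.floor(alpha * (k + 1)))
lower = np.sort(domain_accuracies)[r - 1]
print(f"New-domain accuracy >= {lower:.2f} with probability >= {1 - alpha}")
```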
Another recent idea coming from Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic (the first two of whom have created a bunch of helpful materials explaining conformal prediction) is in the same spirit as conformal, in that the goal is to use labeled data to "fix" predictions from a model in order to improve upon some classical estimate of uncertainty in an inference.
What they call prediction-powered inference is a variation on semi-supervised learning that starts by assuming that you want to estimate some parameter value theta*, and you have some labeled data of size n, a much larger set of unlabeled data of size N >> n, and access to a predictive model that you can apply to the unlabeled data. The predictive model is arbitrary in that it might be fit to other data than the labeled and unlabeled data you want to use to do inference. The idea is then to first construct an estimate of the error in the predictions of theta* from the model on the unlabeled data. This is called a rectifier, since it rectifies the predicted parameter value you'd get if you were to treat the model predictions on the unlabeled data as the true/gold standard values in order to recover theta*. Then, you use the labeled data to construct a confidence set estimating your uncertainty about the rectifier. Finally, you use that confidence set to create a provably valid confidence set for theta* which adjusts for the prediction error.
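To make the recipe concrete in the simplest case (mean estimation, which the next paragraph walks through in words): writing $f$ for the model, $\{(X_i, Y_i)\}_{i=1}^n$ for the labeled data, and $\{\tilde X_j\}_{j=1}^N$ for the unlabeled data, the rectifier and the prediction-powered estimate of $\theta^* = \mathbb{E}[Y]$ are

$$\hat\Delta = \frac{1}{n}\sum_{i=1}^n \big(f(X_i) - Y_i\big), \qquad \hat\theta^{\mathrm{PP}} = \frac{1}{N}\sum_{j=1}^N f(\tilde X_j) \;-\; \hat\Delta,$$

and the confidence set for $\theta^*$ widens $\hat\theta^{\mathrm{PP}}$ to account for uncertainty in both averages.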
You can compare this kind of approach to the case where you just construct your confidence set using only the labeled observations, resulting in a wide interval, or where you do inference on the combination of labeled and unlabeled data by assuming the model-predicted labels for the unlabeled data are correct, which gets you tighter uncertainty intervals but which may not contain the true parameter value. To give intuition for how prediction-powered inference differs, the authors start with an example of mean estimation, where your prediction-powered estimate decomposes into your average prediction for the unlabeled data, minus the average error in predictions on the labeled data. If the model is accurate, the second term is 0, so you end up with an estimate on the unlabeled data which has much lower variance than your classical estimate (since N >> n). Relative to existing work on estimation with a combination of labeled and unlabeled data, prediction-powered inference assumes that most of the data is unlabeled, and considers cases where the model is trained on separate data, which allows for generalizing the approach to any estimator that minimizes some convex objective and avoids making assumptions about the model.
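Here's a minimal sketch of that mean-estimation case using an asymptotic normal interval (the paper also develops finite-sample variants and the general convex-estimation machinery; the function name and arguments are my own):

```python
import numpy as np
from scipy.stats import norm

def ppi_mean_ci(y_labeled, f_labeled, f_unlabeled, alpha=0.05):
    """Prediction-powered CI for a mean, per the decomposition above.

    y_labeled:   (n,) gold-standard labels
    f_labeled:   (n,) model predictions on the labeled data
    f_unlabeled: (N,) model predictions on the unlabeled data (N >> n)
    """
    n, N = len(y_labeled), len(f_unlabeled)
    rectifier = np.mean(f_labeled - y_labeled)   # average prediction error
    theta_pp = np.mean(f_unlabeled) - rectifier  # rectified point estimate
    # Variance combines uncertainty in the rectifier (order 1/n) and in
    # the average prediction on the unlabeled data (order 1/N).
    var = (np.var(f_labeled - y_labeled, ddof=1) / n
           + np.var(f_unlabeled, ddof=1) / N)
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return theta_pp - half, theta_pp + half
```

If the model is accurate, the first variance term is small, so the interval shrinks toward the (tiny) 1/N term rather than being stuck at the 1/n width of the classical labeled-only interval.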
Here's a figure illustrating this process (which is rather beautiful I think, at least by computer science standards):
They apply the approach to a number of examples to create confidence intervals for, e.g., the proportion of people voting for each of two candidates in a San Francisco election (using a computer vision model trained on images of ballots), predicting intrinsically disordered regions of protein structures (using AlphaFold), estimating the effects of age and sex on income from census data, etc.
They also provide an extension to cases where there's distribution shift, in the form of the proportion of classes in the labeled data differing from that in the unlabeled data. I appreciate this, as one of my pet peeves with a lot of the ML uncertainty estimation work happening these days is how comfortably people seem to be using the term "distribution-free," rather than something like non-parametric, even though the default assumption is that the (unknown) distribution doesn't change. Of course the distribution matters; using labels that imply we don't care at all about it feels kind of like implying that there is in fact the possibility of a free lunch.