Inferring the Unseen
In-Person Talk
Consider a finite sample from an unknown distribution over a countable alphabet. Unobserved events are alphabet symbols that do not appear in the sample. Estimating the probabilities of unobserved events is a basic problem in statistics and related fields, and it has been studied extensively in the context of point estimation. In this work we introduce a novel interval estimation scheme for unobserved events. Our framework applies selective inference: we construct confidence intervals (CIs) for the desired set of parameters. We show that the resulting CIs are dimension-free, meaning that they do not grow with the alphabet size. Further, we show that our CIs are (almost) tight, in the sense that they cannot be improved without violating the prescribed coverage rate. We then apply the scheme to large alphabet modeling, introducing a novel inference framework for large alphabet distributions that outperforms currently known methods while maintaining the desired confidence level. The proposed method is robust, easy to apply, and provides favorable performance guarantees.
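To make the setting concrete, the following minimal sketch simulates a sample from a known distribution and computes the total probability of the symbols that never appear in it. It also computes the classical Good-Turing point estimate of that quantity, which is one well-known example of the point estimation literature and is not the interval scheme proposed in this talk. The Zipf-like distribution, the alphabet size, the sample size, and all function names are assumptions chosen purely for illustration.

```python
import random
from collections import Counter


def zipf_distribution(alphabet_size):
    """Hypothetical test distribution: p(i) proportional to 1 / rank."""
    weights = [1.0 / rank for rank in range(1, alphabet_size + 1)]
    total = sum(weights)
    return {symbol: w / total for symbol, w in enumerate(weights)}


def missing_mass(probs, sample):
    """True total probability of symbols that do not appear in the sample."""
    seen = set(sample)
    return sum(p for symbol, p in probs.items() if symbol not in seen)


def good_turing_missing_mass(sample):
    """Classical Good-Turing point estimate: (number of singletons) / n."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    probs = zipf_distribution(1000)  # assumed alphabet size
    symbols = list(probs)
    sample = rng.choices(symbols, weights=[probs[s] for s in symbols], k=200)

    print("true missing mass:   ", missing_mass(probs, sample))
    print("Good-Turing estimate:", good_turing_missing_mass(sample))
```

In practice the true missing mass is unknown, since it depends on the underlying distribution. The sketch reports only a single number as an estimate, whereas the talk concerns intervals with a prescribed coverage rate.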