Title: Dataset Shift with Outlier Scores
Description: Test for no adverse shift in two-sample comparison when we have a training set, the reference distribution, and a test set. The approach is flexible and relies on a robust and powerful test statistic, the weighted AUC. Technical details are in Kamulete, V. M. (2021) <arXiv:1908.04000>. Modern notions of outlyingness such as trust scores and prediction uncertainty can be used as the underlying scores, for example.
Authors: Vathy M. Kamulete [aut, cre]; Royal Bank of Canada (RBC) [cph] (research supported and funded by RBC)
Maintainer: Vathy M. Kamulete <[email protected]>
License: GPL (>= 3)
Version: 0.1.2
Built: 2024-11-10 04:45:41 UTC
Source: https://github.com/vathymut/dsos
Convert P-value to Bayes Factor
as_bf(pvalue)
pvalue: P-value.
Bayes Factor (scalar value).
Marsman, M., & Wagenmakers, E. J. (2017). Three insights from a Bayesian interpretation of the one-sided P value. Educational and Psychological Measurement, 77(3), 529-539.
[as_pvalue()] to convert Bayes factor to p-value.
Other bayesian-test: as_pvalue(), bf_compare(), bf_from_os()
library(dsos)
bf_from_pvalue <- as_bf(pvalue = 0.5)
bf_from_pvalue
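A quick round trip for illustration: since as_pvalue() converts in the other direction, applying it to the result should recover the original p-value (a sketch, not part of the package examples).
library(dsos)
p <- 0.05
bf <- as_bf(pvalue = p)  # p-value to Bayes factor
as_pvalue(bf)            # converting back should recover 0.05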
Convert Bayes Factor to P-value
as_pvalue(bf)
bf: Bayes factor.
p-value (scalar value).
Marsman, M., & Wagenmakers, E. J. (2017). Three insights from a Bayesian interpretation of the one-sided P value. Educational and Psychological Measurement, 77(3), 529-539.
[as_bf()] to convert p-value to Bayes factor.
Other bayesian-test: as_bf(), bf_compare(), bf_from_os()
library(dsos)
pvalue_from_bf <- as_pvalue(bf = 1)
pvalue_from_bf
Test for no adverse shift with outlier scores. Like goodness-of-fit testing, this two-sample comparison takes the training (outlier) scores, os_train, as the reference. The method checks whether the test scores, os_test, are worse off relative to this reference set.
at_from_os(os_train, os_test)
os_train: Outlier scores in training (reference) set.
os_test: Outlier scores in test set.
Li and Fine (2010) derive the asymptotic null distribution for the weighted AUC (WAUC), the test statistic. This approach does not use permutations and can, as a result, be much faster because it sidesteps the need to refit the underlying scoring function. This works well for large samples. The prefix at stands for asymptotic test, to tell it apart from the prefix pt, the permutation test.
A named list of class outlier.test containing:
statistic: observed WAUC statistic
seq_mct: sequential Monte Carlo test, when applicable
p_value: p-value
outlier_scores: outlier scores from training and test set
The outlier scores should all mimic out-of-sample behaviour. Mind that in-sample training scores are biased (overfitted) whereas the test scores are out-of-sample; this mismatch – in-sample versus out-of-sample scores – voids the validity of the test. A simple fix is to compute the training scores on an independent (fresh) validation set. This follows the train/validation/test sample-splitting convention, and the validation set then effectively serves as the reference set or distribution.
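A minimal sketch of this validation-set fix, using a Mahalanobis-distance score purely as a stand-in for whatever outlier detector is actually in use:
library(dsos)
set.seed(12345)
x_train <- matrix(rnorm(200 * 4), ncol = 4)  # used only to fit the scorer
x_valid <- matrix(rnorm(100 * 4), ncol = 4)  # fresh (independent) reference set
x_test <- matrix(rnorm(100 * 4), ncol = 4)   # test set under scrutiny
mu <- colMeans(x_train)
sigma <- cov(x_train)
os_valid <- mahalanobis(x_valid, center = mu, cov = sigma)  # out-of-sample reference scores
os_test <- mahalanobis(x_test, center = mu, cov = sigma)    # out-of-sample test scores
at_from_os(os_valid, os_test)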
Kamulete, V. M. (2022). Test for non-negligible adverse shifts. In The 38th Conference on Uncertainty in Artificial Intelligence. PMLR.
Gandy, A. (2009). Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488), 1504-1511.
[at_oob()] for variant requiring a scoring function. [pt_from_os()] for permutation test with the outlier scores.
Other asymptotic-test: at_oob()
library(dsos)
set.seed(12345)
os_train <- rnorm(n = 100)
os_test <- rnorm(n = 100)
test_result <- at_from_os(os_train, os_test)
test_result
Test for no adverse shift with outlier scores. Like goodness-of-fit testing, this two-sample comparison takes the training set, x_train, as the reference. The method checks whether the test set, x_test, is worse off relative to this reference set. The function scorer assigns an outlier score to each instance/observation in both the training and test set.
at_oob(x_train, x_test, scorer)
x_train: Training (reference/validation) sample.
x_test: Test sample.
scorer: Function which returns a named list with outlier scores from the training and test sample. The first argument to scorer is the training sample and the second, the test sample.
Li and Fine (2010) derive the asymptotic null distribution for the weighted AUC (WAUC), the test statistic. This approach does not use permutations and can, as a result, be much faster because it sidesteps the need to refit the scoring function, scorer. This works well for large samples. The prefix at stands for asymptotic test, to tell it apart from the prefix pt, the permutation test.
A named list of class outlier.test containing:
statistic: observed WAUC statistic
seq_mct: sequential Monte Carlo test, when applicable
p_value: p-value
outlier_scores: outlier scores from training and test set
The scoring function, scorer, predicts out-of-bag scores to mimic out-of-sample behaviour. The suffix oob stands for out-of-bag to highlight this point. This out-of-bag variant avoids refitting the underlying algorithm from scorer at every permutation. It can, as a result, be computationally appealing.
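A minimal sketch of such a scorer, using cross-fitted Mahalanobis distances as a stand-in for genuine out-of-bag scores from a real outlier detector:
library(dsos)
set.seed(12345)
cv_scorer <- function(x_train, x_test, k = 5) {
  x_train <- as.matrix(x_train)
  x_test <- as.matrix(x_test)
  # Assign each training row to one of k folds and score it with a model
  # fitted on the remaining folds, so training scores are not in-sample.
  folds <- sample(rep_len(seq_len(k), nrow(x_train)))
  os_train <- numeric(nrow(x_train))
  for (i in seq_len(k)) {
    held_out <- folds == i
    mu <- colMeans(x_train[!held_out, , drop = FALSE])
    sigma <- cov(x_train[!held_out, , drop = FALSE])
    os_train[held_out] <- mahalanobis(x_train[held_out, , drop = FALSE], mu, sigma)
  }
  # Score the test sample with the model fitted on the full training sample.
  os_test <- mahalanobis(x_test, colMeans(x_train), cov(x_train))
  list(train = os_train, test = os_test)
}
data(iris)
at_oob(iris[1:50, 1:4], iris[51:100, 1:4], scorer = cv_scorer)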
Kamulete, V. M. (2022). Test for non-negligible adverse shifts. In The 38th Conference on Uncertainty in Artificial Intelligence. PMLR.
Gandy, A. (2009). Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488), 1504-1511.
[pt_oob()] for (faster) p-value approximation via out-of-bag predictions. [pt_refit()] for (slower) p-value approximation via refitting.
Other asymptotic-test: at_from_os()
library(dsos)
set.seed(12345)
data(iris)
setosa <- iris[1:50, 1:4]        # Training sample: Species == 'setosa'
versicolor <- iris[51:100, 1:4]  # Test sample: Species == 'versicolor'
# Using a fake scoring function
scorer <- function(tr, te) list(train = runif(nrow(tr)), test = runif(nrow(te)))
oob_test <- at_oob(setosa, versicolor, scorer = scorer)
oob_test
Test for no adverse shift with outlier scores. Like goodness-of-fit testing, this two-sample comparison takes the training (outlier) scores, os_train, as the reference. The method checks whether the test scores, os_test, are worse off relative to the training set.
bf_compare(os_train, os_test, threshold = 1/12, n_pt = 4000)
os_train: Outlier scores in training (reference) set.
os_test: Outlier scores in test set.
threshold: Threshold for adverse shift. Defaults to 1/12, the asymptotic value of the test statistic when the two samples are drawn from the same distribution.
n_pt: The number of permutations.
This compares the Bayesian to the frequentist approach for convenience. The Bayesian test mimics bf_from_os() and the frequentist one, pt_from_os(). The Bayesian test computes Bayes factors based on both the asymptotic threshold (defaults to 1/12) and the exchangeable threshold. The latter calculates the threshold as the median weighted AUC (WAUC) after n_pt permutations, assuming the outlier scores are exchangeable; this is recommended for small samples. The frequentist test converts the one-sided (one-tailed) p-value to a Bayes factor – see the as_bf function.
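The correspondence with the frequentist test can be checked directly: up to Monte Carlo error, converting the permutation p-value with as_bf() should land in the same ballpark as the frequentist entry returned by bf_compare(). A quick sketch:
library(dsos)
set.seed(12345)
os_train <- rnorm(n = 100)
os_test <- rnorm(n = 100)
freq_test <- pt_from_os(os_train, os_test)
as_bf(freq_test$p_value)                   # permutation p-value converted to a Bayes factor
bf_compare(os_train, os_test)$frequentist  # frequentist BF reported by bf_compare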
A list of Bayes factors (BF) for 3 different test specifications:
frequentist: frequentist BF.
bayes_noperm: Bayesian BF with the asymptotic threshold.
bayes_perm: Bayesian BF with the exchangeable threshold.
The outlier scores should all mimic out-of-sample behaviour. Mind that in-sample training scores are biased (overfitted) whereas the test scores are out-of-sample; this mismatch – in-sample versus out-of-sample scores – voids the validity of the test. A simple fix is to compute the training scores on an independent (fresh) validation set. This follows the train/validation/test sample-splitting convention, and the validation set then effectively serves as the reference set or distribution.
[bf_from_os()] for bayes factor, the Bayesian test. [pt_from_os()] for p-value, the frequentist test.
Other bayesian-test: as_bf(), as_pvalue(), bf_from_os()
library(dsos)
set.seed(12345)
os_train <- rnorm(n = 100)
os_test <- rnorm(n = 100)
bayes_test <- bf_compare(os_train, os_test)
bayes_test
# To run in parallel on a local cluster, uncomment the next two lines.
# library(future)
# future::plan(future::multisession)
parallel_test <- bf_compare(os_train, os_test)
parallel_test
Test for no adverse shift with outlier scores. Like goodness-of-fit testing, this two-sample comparison takes the training (outlier) scores, os_train, as the reference. The method checks whether the test scores, os_test, are worse off relative to the training set.
bf_from_os(os_train, os_test, n_pt = 4000, threshold = 1/12)
os_train: Outlier scores in training (reference) set.
os_test: Outlier scores in test set.
n_pt: The number of permutations.
threshold: Threshold for adverse shift. Defaults to 1/12, the asymptotic value of the test statistic when the two samples are drawn from the same distribution.
The posterior distribution of the test statistic is based on n_pt (bootstrap) permutations. The method uses the Bayesian bootstrap as the resampling procedure, as in Gu et al. (2008). Johnson (2005) shows how to turn a test statistic into a Bayes factor. The test statistic is the weighted AUC (WAUC).
A named list of class outlier.bayes containing:
posterior: posterior distribution of the WAUC test statistic
threshold: WAUC threshold for adverse shift
adverse_probability: probability of adverse shift
bayes_factor: Bayes factor
outlier_scores: outlier scores from training and test set
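A short sketch of how these elements might be inspected, assuming posterior is a numeric vector of bootstrap WAUC draws:
library(dsos)
set.seed(12345)
os_train <- rnorm(n = 100)
os_test <- rnorm(n = 100)
bayes_test <- bf_from_os(os_train, os_test)
quantile(bayes_test$posterior, probs = c(0.025, 0.5, 0.975))  # spread of the posterior WAUC
bayes_test$threshold             # WAUC threshold for adverse shift
bayes_test$adverse_probability   # posterior probability of adverse shift
bayes_test$bayes_factor          # Bayes factor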
The outlier scores should all mimic out-of-sample behaviour. Mind that in-sample training scores are biased (overfitted) whereas the test scores are out-of-sample; this mismatch – in-sample versus out-of-sample scores – voids the validity of the test. A simple fix is to compute the training scores on an independent (fresh) validation set. This follows the train/validation/test sample-splitting convention, and the validation set then effectively serves as the reference set or distribution.
Kamulete, V. M. (2023). Are you OK? A Bayesian test for adverse shift. Manuscript in preparation.
Johnson, V. E. (2005). Bayes factors based on test statistics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(5), 689-701.
Gu, J., Ghosal, S., & Roy, A. (2008). Bayesian bootstrap estimation of ROC curve. Statistics in medicine, 27(26), 5407-5420.
Other bayesian-test: as_bf(), as_pvalue(), bf_compare()
library(dsos)
set.seed(12345)
os_train <- rnorm(n = 100)
os_test <- rnorm(n = 100)
bayes_test <- bf_from_os(os_train, os_test)
bayes_test
# To run in parallel on a local cluster, uncomment the next two lines.
# library(future)
# future::plan(future::multisession)
parallel_test <- bf_from_os(os_train, os_test)
parallel_test
Plot Bayesian test for no adverse shift.
## S3 method for class 'outlier.bayes'
plot(x, ...)
x: An outlier.bayes object, as returned by bf_from_os().
...: Placeholder to be compatible with the S3 method.
A ggplot2 plot with outlier scores and p-value.
Other s3-method: plot.outlier.test(), print.outlier.bayes(), print.outlier.test()
set.seed(12345)
os_train <- rnorm(n = 3e2)
os_test <- rnorm(n = 3e2)
test_to_plot <- bf_from_os(os_train, os_test)
plot(test_to_plot)
Plot frequentist test for no adverse shift.
## S3 method for class 'outlier.test'
plot(x, ...)
x: An outlier.test object, as returned by at_from_os() or pt_from_os().
...: Placeholder to be compatible with the S3 method.
A ggplot2 plot with outlier scores and p-value.
Other s3-method: plot.outlier.bayes(), print.outlier.bayes(), print.outlier.test()
set.seed(12345)
os_train <- rnorm(n = 3e2)
os_test <- rnorm(n = 3e2)
test_to_plot <- at_from_os(os_train, os_test)
# Also: pt_from_os(os_train, os_test) for permutation test
plot(test_to_plot)
Print Bayesian test for no adverse shift.
## S3 method for class 'outlier.bayes'
print(x, ...)
x: An outlier.bayes object, as returned by bf_from_os().
...: Placeholder to be compatible with the S3 method.
Print to screen: display Bayes factor and other information.
Other s3-method: plot.outlier.bayes(), plot.outlier.test(), print.outlier.test()
set.seed(12345)
os_train <- rnorm(n = 3e2)
os_test <- rnorm(n = 3e2)
test_to_print <- bf_from_os(os_train, os_test)
test_to_print
Print frequentist test for no adverse shift.
## S3 method for class 'outlier.test'
print(x, ...)
x: An outlier.test object, as returned by at_from_os() or pt_from_os().
...: Placeholder to be compatible with the S3 method.
Print to screen: display p-value and other information.
Other s3-method: plot.outlier.bayes(), plot.outlier.test(), print.outlier.bayes()
set.seed(12345)
os_train <- rnorm(n = 3e2)
os_test <- rnorm(n = 3e2)
test_to_print <- at_from_os(os_train, os_test)
# Also: pt_from_os(os_train, os_test) for permutation test
test_to_print
Test for no adverse shift with outlier scores. Like goodness-of-fit testing, this two-sample comparison takes the training (outlier) scores, os_train, as the reference. The method checks whether the test scores, os_test, are worse off relative to the training set.
pt_from_os(os_train, os_test, n_pt = 2000)
os_train: Outlier scores in training (reference) set.
os_test: Outlier scores in test set.
n_pt: The number of permutations.
The null distribution of the test statistic is based on n_pt permutations. For speed, this is implemented as a sequential Monte Carlo test with the simctest package. See Gandy (2009) for details. The prefix pt refers to permutation test. This approach does not use the asymptotic null distribution for the test statistic. This is the recommended approach for small samples. The test statistic is the weighted AUC (WAUC).
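As a quick sanity check, the permutation p-value can be compared with its asymptotic counterpart on the same scores; with reasonably large samples the two should roughly agree (a sketch, subject to Monte Carlo error):
library(dsos)
set.seed(12345)
os_train <- rnorm(n = 300)
os_test <- rnorm(n = 300)
pt_from_os(os_train, os_test)$p_value  # permutation (sequential Monte Carlo) p-value
at_from_os(os_train, os_test)$p_value  # asymptotic p-value, see at_from_os()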
A named list of class outlier.test containing:
statistic: observed WAUC statistic
seq_mct: sequential Monte Carlo test, when applicable
p_value: p-value
outlier_scores: outlier scores from training and test set
The outlier scores should all mimic out-of-sample behaviour. Mind that in-sample training scores are biased (overfitted) whereas the test scores are out-of-sample; this mismatch – in-sample versus out-of-sample scores – voids the validity of the test. A simple fix is to compute the training scores on an independent (fresh) validation set. This follows the train/validation/test sample-splitting convention, and the validation set then effectively serves as the reference set or distribution.
Kamulete, V. M. (2022). Test for non-negligible adverse shifts. In The 38th Conference on Uncertainty in Artificial Intelligence. PMLR.
Gandy, A. (2009). Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488), 1504-1511.
[pt_oob()] for variant requiring a scoring function. [at_from_os()] for asymptotic test with the outlier scores.
Other permutation-test: pt_oob(), pt_refit()
library(dsos)
set.seed(12345)
os_train <- rnorm(n = 100)
os_test <- rnorm(n = 100)
null_test <- pt_from_os(os_train, os_test)
null_test
Test for no adverse shift with outlier scores. Like goodness-of-fit testing, this two-sample comparison takes the training set, x_train, as the reference. The method checks whether the test set, x_test, is worse off relative to this reference set. The function scorer assigns an outlier score to each instance/observation in both the training and test set.
pt_oob(x_train, x_test, scorer, n_pt = 2000)
x_train: Training (reference/validation) sample.
x_test: Test sample.
scorer: Function which returns a named list with outlier scores from the training and test sample. The first argument to scorer is the training sample and the second, the test sample.
n_pt: The number of permutations.
The null distribution of the test statistic is based on n_pt permutations. For speed, this is implemented as a sequential Monte Carlo test with the simctest package. See Gandy (2009) for details. The prefix pt refers to permutation test. This approach does not use the asymptotic null distribution for the test statistic. This is the recommended approach for small samples. The test statistic is the weighted AUC (WAUC).
A named list of class outlier.test containing:
statistic: observed WAUC statistic
seq_mct: sequential Monte Carlo test, when applicable
p_value: p-value
outlier_scores: outlier scores from training and test set
The scoring function, scorer, predicts out-of-bag scores to mimic out-of-sample behaviour. The suffix oob stands for out-of-bag to highlight this point. This out-of-bag variant avoids refitting the underlying algorithm from scorer at every permutation. It can, as a result, be computationally appealing.
Kamulete, V. M. (2022). Test for non-negligible adverse shifts. In The 38th Conference on Uncertainty in Artificial Intelligence. PMLR.
Gandy, A. (2009). Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488), 1504-1511.
[pt_refit()] for (slower) p-value approximation via refitting. [at_oob()] for p-value approximation from asymptotic null distribution.
Other permutation-test: pt_from_os(), pt_refit()
library(dsos)
set.seed(12345)
data(iris)
idx <- sample(nrow(iris), 2 / 3 * nrow(iris))
iris_train <- iris[idx, ]
iris_test <- iris[-idx, ]
# Use a synthetic (fake) scoring function for illustration
scorer <- function(tr, te) list(train = runif(nrow(tr)), test = runif(nrow(te)))
pt_test <- pt_oob(iris_train, iris_test, scorer = scorer)
pt_test
Test for no adverse shift with outlier scores. Like goodness-of-fit testing, this two-sample comparison takes the training set, x_train, as the reference. The method checks whether the test set, x_test, is worse off relative to this reference set. The function scorer assigns an outlier score to each instance/observation in both the training and test set.
pt_refit(x_train, x_test, scorer, n_pt = 2000)
x_train: Training (reference/validation) sample.
x_test: Test sample.
scorer: Function which returns a named list with outlier scores from the training and test sample. The first argument to scorer is the training sample and the second, the test sample.
n_pt: The number of permutations.
The null distribution of the test statistic is based on n_pt permutations. For speed, this is implemented as a sequential Monte Carlo test with the simctest package. See Gandy (2009) for details. The prefix pt refers to permutation test. This approach does not use the asymptotic null distribution for the test statistic. This is the recommended approach for small samples. The test statistic is the weighted AUC (WAUC).
A named list of class outlier.test containing:
statistic: observed WAUC statistic
seq_mct: sequential Monte Carlo test, when applicable
p_value: p-value
outlier_scores: outlier scores from training and test set
The scoring function, scorer, predicts out-of-sample scores by refitting the underlying algorithm from scorer at every permutation. The suffix refit emphasizes this point. This is in contrast to the out-of-bag variant, pt_oob, which only fits once. This method can be computationally expensive.
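A rough way to see the cost difference, reusing the fake runif scorer from the examples (with a trivial scorer the gap is small; with a real detector, pt_refit pays the refitting cost at every permutation while pt_oob fits only once):
library(dsos)
set.seed(12345)
data(iris)
setosa <- iris[1:50, 1:4]
versicolor <- iris[51:100, 1:4]
scorer <- function(tr, te) list(train = runif(nrow(tr)), test = runif(nrow(te)))
system.time(pt_oob(setosa, versicolor, scorer = scorer))
system.time(pt_refit(setosa, versicolor, scorer = scorer))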
Kamulete, V. M. (2022). Test for non-negligible adverse shifts. In The 38th Conference on Uncertainty in Artificial Intelligence. PMLR.
Gandy, A. (2009). Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488), 1504-1511.
[pt_oob()] for (faster) p-value approximation via out-of-bag predictions. [at_oob()] for p-value approximation from asymptotic null distribution.
Other permutation-test: pt_from_os(), pt_oob()
library(dsos)
set.seed(12345)
data(iris)
setosa <- iris[1:50, 1:4]        # Training sample: Species == 'setosa'
versicolor <- iris[51:100, 1:4]  # Test sample: Species == 'versicolor'
scorer <- function(tr, te) list(train = runif(nrow(tr)), test = runif(nrow(te)))
pt_test <- pt_refit(setosa, versicolor, scorer = scorer)
pt_test
Computes the weighted AUC with the weighting scheme described in Kamulete, V. M. (2021). This assumes that the training set is the reference distribution and specifies a particular functional form to derive weights from threshold scores.
wauc_from_os(os_train, os_test, weight = NULL)
os_train: Outlier scores in training (reference) set.
os_test: Outlier scores in test set.
weight: Optional numeric vector of weights; defaults to NULL.
The weighted AUC (scalar value) given the weighting scheme.
Kamulete, V. M. (2022). Test for non-negligible adverse shifts. In The 38th Conference on Uncertainty in Artificial Intelligence. PMLR.
library(dsos)
set.seed(12345)
os_train <- rnorm(n = 100)
os_test <- rnorm(n = 100)
test_stat <- wauc_from_os(os_train, os_test)
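Under no shift, the WAUC should hover around 1/12, the asymptotic value quoted for the threshold argument of bf_compare() and bf_from_os(). A small simulation sketch:
library(dsos)
set.seed(12345)
wauc_null <- replicate(200, {
  scores <- rnorm(200)
  wauc_from_os(scores[1:100], scores[101:200])
})
mean(wauc_null)  # should be close to 1 / 12 (about 0.083)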