An implementation of Approximate Bayesian Computation (ABC) methods applied to pooled sequencing (Pool-seq) data is available in the R package poolABC. The purpose of this vignette is to provide an in-depth overview of the capabilities of the package, highlighting the usage of its main functions.
The initial sections of this vignette detail how to import pooled sequencing data from files in the _rc format. We also show how to randomly select multiple subsets of the observed data, compute summary statistics for those subsets and use those summary statistics as targets for parameter estimation and model selection with ABC.
This vignette also teaches users how to simulate Pool-seq data under pre-defined models. Note that the simulation of Pool-seq data requires functions included in the poolHelper package. We then exemplify how the imported data and the simulated data can be used to perform parameter inference. The poolABC package allows the use of genome-wide multilocus data for ABC by using multiple subsets of simulated and observed loci.
Briefly, we obtain one set of summary statistics for each set of simulated loci and for each random subset of observed data. Each set of summary statistics computed from a unique subset of observed data is used as an independent target for parameter estimation. Thus, with the poolABC package, users obtain one posterior distribution for each parameter of interest and for each subset of observed loci. Then, our package allows users to combine those multiple posteriors to obtain a single estimate per parameter. This merging is performed with the Epanechnikov kernel, weighting each posterior according to the distance between the mean summary statistics of its subset of loci and the mean across the genome, so that subsets of loci with a mean closer to the overall mean receive more weight. Finally, the poolABC package also includes functions to compute point estimates and produce plots of those merged posterior distributions.
This package uses pooled sequencing data stored in files in the _rc format. These _rc files are created by running the SNP-frequency-diff.pl script of popoolation2. Briefly, this is an example of a typical _rc file with only two populations:
data.frame(chr = c("NC297", "NC297"), pos = c(3530, 5450), rc = c("A", "T"), allele_count = 2,
allele_states = c("A/G", "T/A"), deletion_sum = 0, snp_type = "pop",
major_alleles = c("AA","TT"), minor_alleles = c("GG", "AN"), maa_1 = c("54/55", "51/54"),
maa_2 = c("76/78", "96/96"), mia_1 = c("1/55", "3/54"), mia_2 = c("2/78", "0/96"))
#> chr pos rc allele_count allele_states deletion_sum snp_type major_alleles
#> 1 NC297 3530 A 2 A/G 0 pop AA
#> 2 NC297 5450 T 2 T/A 0 pop TT
#> minor_alleles maa_1 maa_2 mia_1 mia_2
#> 1 GG 54/55 76/78 1/55 2/78
#> 2 AN 51/54 96/96 3/54 0/96
More information about these files can be found at: https://sourceforge.net/p/popoolation2/wiki/Main/
If you have your data in _rc files in a folder on your computer, you can simply use the importContigs function.
# load multiple files and organize information by contig
files <- importContigs(path = "/home/myawesomedata", pops = c(1, 4, 7, 10))
The path input of this function indicates the path to the folder where the _rc files are located. By default, the importContigs function will import all files present in that folder that include the _rc pattern in their name. The index of the populations to import is defined by the pops input argument. For instance, the input pops = c(1, 4, 7, 10) will import the major and minor alleles for the first, fourth, seventh and tenth population in the _rc files.
The importContigs function also includes several optional input arguments. The files input argument allows you to specify the index of the files to import. For instance, by setting files = 1:5, only the first five files listed in the output of list.files(path = path) will be imported. Additionally, specific contigs can be removed from the data by adding their names to the remove input argument.
The min.minor input argument allows you to filter the data by the number of minor-allele reads. For instance, if min.minor = 2, all sites where the total number of minor-allele reads, summed across the populations selected with the pops input argument, is below 2 will be removed from the data. Alternatively, by setting filter = TRUE, you can filter the data by the frequency of the minor allele. When filter = TRUE, the user can define a threshold for the minimum allowed frequency of the minor allele. If no threshold is defined, the importContigs function will assume that at least one minor-allele read per site should exist. Finally, it is possible to include a header when importing the data. This header can be created with the createHeader function.
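The following sketch combines some of these optional arguments; the path is a placeholder and the contig name, read-count and threshold values are purely illustrative:
# import only the first five files, remove one contig and require at least two minor-allele reads in total
files <- importContigs(path = "/home/myawesomedata", pops = c(1, 4, 7, 10), files = 1:5, remove = "NC297", min.minor = 2)
# alternatively, filter by the frequency of the minor allele instead of read counts
files <- importContigs(path = "/home/myawesomedata", pops = c(1, 4, 7, 10), filter = TRUE, threshold = 0.01)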
Random windows of a given size (in base pairs) can be selected from the imported data with the pickWindows function. The data imported with the importContigs function is a list with all the elements required by the pickWindows function. You can assign each of those list elements to an individual object, as sketched below, or use them directly as input arguments of the pickWindows function.
# randomly select blocks of a given size from several contigs
blocks <- pickWindows(freqs, positions, range, rMajor, rMinor, coverage, window = 1000, nLoci = 100)
With this function, users can randomly select a subset of the complete pooled sequencing data at their disposal. More specifically, the pickWindows function allows users to randomly select nLoci blocks of window size (in base pairs) from the data imported in the previous section. In other words, this function will randomly select nLoci contigs and then select one random block with a user-defined size (defined by the window input) per contig.
The next step is the computation of a set of summary statistics from the observed data. To compute summary statistics from the observed data, we can use the statsContig function. This function will compute the same set of summary statistics used in the simulations from the multiple random subsets of loci obtained in the previous step.
# compute a set of observed summary statistics
obs <- statsContig(randomWindows = blocks, nPops = 4, stat.names = stat.names)
Note that we are using the blocks object created with the pickWindows function as the randomWindows input argument. The statsContig function will compute summary statistics from those randomly selected blocks of observed data. Also, the use of names for the summary statistics is strongly recommended. To ensure that the set of observed summary statistics is named, we should obtain the names of the simulated summary statistics and include those in the stat.names input argument of the statsContig function.
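One simple way to obtain those names, assuming you have a matrix of simulated summary statistics with named columns (such as the sumstats object used later in this vignette), is to take its column names:
# use the column names of the simulated summary statistics as stat.names
stat.names <- colnames(sumstats)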
With this package, you also have the ability to simulate pooled sequencing data under three different models by using the poolSim function. The model input argument allows the user to define which model to simulate. At the moment, this package includes three different models: an isolation-with-migration model with two populations, a model representing a single origin of two divergent ecotypes and a third model representing a parallel origin of those ecotypes.
To simulate data using the two populations model, you have to define the mean depth of coverage and the variance of the coverage for those two populations. You also need to create a list with the number of individuals per pool and per population. In the next chunk, you can see how to simulate data using this model:
# set the mean and variance of the coverage
sMean <- c(84.34, 66.76); sVars <- c(1437.22, 912.43)
# create a list containing the information of the pool sizes by population
size <- rep(list(rep(10, 10)), 2)
# run simulation for a two-populations model
sims <- poolSim(model="2pops", nDip=200, nPops=2, nLoci=100, nSites=2000, mutrate=2e-8, size=size, mean=sMean, variance=sVars, minimum=15, maximum=180, min.minor=1, Nref=c(25000, 25000), ratio=c(0.1, 3), pool=c(5, 180), seq=c(0.0001, 0.001), split=c(0, 3), CW=c(1e-13, 1e-3), WC=c(1e-13, 1e-3), bT=c(0, 0.5))
The poolSim function requires several input arguments, which are explained in detail in the help page of the function. However, note that most of those input arguments define the minimum and maximum values for a variety of relevant parameters. To simulate data using a four-populations model:
# set the mean and variance of the coverage
sMean <- c(84.34, 66.76, 65.69, 68.83); sVars <- c(1437.22, 912.43, 848.02, 1028.23)
# create a list containing the information of the pool sizes by population
size <- rep(list(rep(5, 20)), 4)
# run simulation for a four-populations model
sims <- poolSim(model="Single", nDip=400, nPops=4, nLoci=100, nSites=2000, mutrate=2e-8, size=size, mean=sMean, variance=sVars, minimum=15, maximum=180, min.minor=2, Nref=c(25000, 25000), ratio=c(0.1, 3), pool=c(5, 180), seq=c(0.0001, 0.001), split=c(0, 3), CW=c(1e-13, 1e-3), WC=c(1e-13, 1e-3), CC=c(1e-13, 1e-3), WW=c(1e-13, 1e-3), ANC=c(1e-13, 1e-3), bT=c(0, 0.5), bCW=c(0, 0.5), bWC=c(0, 0.5))
The poolSim function can be used to perform a single simulation. However, most of the time, you will want to perform thousands of simulations. One way to accomplish this is to use the replicate function together with the poolSim function. We recommend that you do the following:
# set the mean and variance of the coverage
sMean <- c(84.34, 66.76); sVars <- c(1437.22, 912.43)
# create a list containing the information of the pool sizes by population
size <- rep(list(rep(5, 20)), 2)
# define how many simulations to run
nSims <- 10
# run one batch of simulations
sims <- t(replicate(n = nSims, unlist(poolSim(model="2pops", nDip=200, nPops=2, nLoci=100, nSites=1000, mutrate=2e-8, size=size, mean=sMean, variance=sVars, minimum=20, maximum=185, min.minor=2, Nref=c(25000, 25000), ratio=c(0.1, 3), pool=c(5, 180), seq=c(0.0001, 0.001), split=c(0, 3), CW=c(1e-13, 1e-3), WC=c(1e-13, 1e-3), bT=c(0, 0.5)))))
By using the replicate function, you can perform multiple simulations. By unlisting and then transposing the output of those simulations, you obtain a matrix where each row corresponds to a different simulation and each column is a different parameter or summary statistic.
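If you need more simulations than comfortably fit in a single call, one option is to wrap the call above in a small helper and bind several batches by row. This is only a sketch: runSim is a hypothetical wrapper and its argument values simply mirror the chunk above.
# hypothetical wrapper around a single simulation (same settings as above)
runSim <- function() {
  unlist(poolSim(model="2pops", nDip=200, nPops=2, nLoci=100, nSites=1000, mutrate=2e-8, size=size, mean=sMean, variance=sVars, minimum=20, maximum=185, min.minor=2, Nref=c(25000, 25000), ratio=c(0.1, 3), pool=c(5, 180), seq=c(0.0001, 0.001), split=c(0, 3), CW=c(1e-13, 1e-3), WC=c(1e-13, 1e-3), bT=c(0, 0.5)))
}
# run two batches of nSims simulations each and combine them by row
sims <- rbind(t(replicate(nSims, runSim())), t(replicate(nSims, runSim())))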
The observed summary statistics computed above and the simulations described in the previous section can then be used to perform parameter estimation with Approximate Bayesian Computation (ABC).
We included with this package a small dataset simulated under the two-populations model. This includes one matrix (sumstats) with the summary statistics computed from the simulated data, one matrix (params) with the simulated parameter values and a final matrix (limits) with the minimum and maximum value of the prior distribution for each parameter.
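If these objects are not already in your workspace, they can be loaded with data(); this sketch assumes they are shipped as data sets named sumstats, params and limits, so check the package index if the names differ.
# load the example objects distributed with the package (names assumed)
data(sumstats, params, limits, package = "poolABC")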
The poolABC package aims at streamlining the process of parameter inference with Pool-seq data. One of the key components of that design is the ABC function. By using this function, users can simultaneously perform parameter estimation with ABC for multiple targets. The ABC function requires the data imported with the importContigs function and then uses both the pickWindows and statsContig functions to select multiple random subsets of loci from the observed data and compute a set of observed summary statistics for each of those subsets. Thus, for each subset of loci we obtain a vector of summary statistics and each vector acts as an independent target for parameter estimation. The ABC function can be used by doing:
# parameter estimation with the ABC function
myabc <- ABC(nPops = 2, ntrials = 10, freqs = freqs, positions = positions, range = range, rMajor = rMajor, rMinor = rMinor, coverage = coverage, window = 1000, nLoci = 100, limits = limits, params = params, sumstats = sumstats, tol = 0.01, method = "regression")
The ntrials input argument defines the number of independent targets for parameter estimation. In this example, we are performing parameter inference for 10 different targets. Each of those targets was obtained by computing summary statistics from windows of 1000 base pairs (window = 1000) from 100 (nLoci = 100) randomly selected contigs of the observed data.
Note that you should also define the method and tolerance rate, tol, to use. The tol is defined as the percentage of accepted simulations. You should strive to keep a low tolerance rate, to avoid accepting simulations that are too distant from the observed data, but it is also important to avoid very stringent tolerance rates that may lead to few accepted values. A typical value of tol = 0.01 or tol = 0.05 is recommended, but you should test different tol values in the cross-validation analysis (see the subsequent sections and the sketch below).
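For instance, a quick way to compare tolerance rates, assuming the simulated params, sumstats and limits objects described above, is to run the leave-one-out cross validation (covered in the validation section) once per candidate value:
# compare two tolerance rates with a leave-one-out cross validation
cv_tols <- lapply(c(0.01, 0.05), function(t) simulationABC(params = params, sumstats = sumstats, limits = limits, nval = 100, tol = t, method = "regression"))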
This package implements two ABC algorithms for constructing the posterior distribution from the accepted simulations: a rejection method and a regression-based correction using a local linear regression. When method is “rejection”, simulations are accepted if the Euclidean distance between the set of summary statistics computed from the simulated data and the target is sufficiently small, and these accepted simulations are considered a sample from the posterior distribution. When method is “regression”, an additional step is used to correct for the imperfect match between the summary statistics computed from the simulated data and the summary statistics computed from the observed data. For this reason, we recommend that you select the regression method because it will, more often than not, lead to more precise parameter estimates.
After using the ABC function to perform parameter estimation with Approximate Bayesian Computation for several targets, we need to merge the multiple posteriors obtained (one for each target) into a single posterior per parameter. This can be performed with the mergepost function. One of the required input arguments of this function is the global input. This input should be a vector with the observed summary statistics computed from the entire dataset. We recommend that you use the pickWindows function to select a large number of loci and then use that random selection as the input argument of the statsContig function.
# randomly select a large number of windows from the observed data
blocks <- pickWindows(freqs = freqs, positions = positions, range = range, rMajor = rMajor, rMinor = rMinor, coverage = coverage, window = 1000, nLoci = 800)
# compute a set of summary statistics from the observed data
global <- statsContig(randomWindows = blocks, nPops = 2, stat.names = stat.names)
The global vector can then be used in the mergepost function. The remaining required input arguments are the matrix with the target for the parameter inference and the list containing the posteriors (post) for each target and parameter. It is also possible to include the regression weights in the wtreg option.
# merge posterior distributions
myabc <- mergepost(target = myabc$target, global = global, post = myabc$adjusted, wtreg = myabc$weights)
Briefly, this function will merge the different posteriors into a single one, using different weighting methods for the merging. Details about the various elements of the mergepost output can be found in the help page of the function. Note that the merge, weighted, merge_reg and weighted_reg entries contain, for each parameter, a locfit object, obtained after merging the multiple posteriors using the corresponding method. The merged_stat, weighted_stat, merge_reg_stat and weighted_reg_stat entries are the posterior point estimates for the corresponding merging method.
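As an illustration, the point estimates produced by the weighted regression merging can be inspected directly from the mergepost output stored above in myabc (the element chosen here is just one of the entries listed above):
# point estimates obtained with the weighted regression merging
myabc$weighted_reg_stat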
Users can then plot the resulting merged posterior distribution with the plot_weighted function. You should include your matrix with the simulated parameter values in the prior input argument of this function to also plot the prior distribution of the chosen parameter. This allows for a comparison, in the same plot, of the prior and posterior shape. You should include the output of the mergepost function as the merged_posterior input argument and a matrix with the limits of the prior distribution for each parameter. Then, you just need to define which parameter to plot with the index input argument.
# plot the density estimation of a parameter
plot_weighted(prior = params, merged_posterior = myabc, limits = limits, index = 2)
The plot_weighted function plots the posterior density of the chosen parameter together with its prior distribution in the same plot.
The poolABC package also allows users to perform model selection by estimating the posterior model probabilities, comparing two scenarios of ecotype formation: the single-origin and the parallel-origin scenarios. The modelSelect function can be used to perform model selection with ABC.
One of the required input arguments of the modelSelect function is the index. This is a vector of model indices with the same length as nrow(sumstats), indicating to which model a particular row of sumstats belongs. The remaining input arguments are explained in the help page of the function. As before, you should also define the tolerance rate (tol) and the method to use. A tolerance of 0.01 and the “regression” method are recommended.
# create a vector of model indices
index <- c(rep("model1", nrow(sumstats)/2), rep("model2", nrow(sumstats)/2))
# select a random simulation to act as target
target <- sumstats[10, ]
# perform model selection with ABC
mysel <- modelSelect(target = target, index = index, sumstats = sumstats, tol = 0.1, method = "regression")
# display the structure of the mysel object
str(mysel)
#> List of 6
#> $ method : chr "regression"
#> $ indices: Factor w/ 2 levels "model1","model2": 1 1 1 1 1 1 1 1 1 1 ...
#> $ pred : 'table' num [1:2(1d)] 0.541 0.459
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:2] "model1" "model2"
#> $ ss : num [1:1000, 1:14] 0 0.0106 0.00118 0.01243 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:14] "Sf" "Sx1" "Sx2" "SS" ...
#> $ weights: num [1:1000] 0.2723 1 0.0146 0.0171 0.3713 ...
#> $ nmodels: Named int [1:2] 5000 5000
#> ..- attr(*, "names")= chr [1:2] "model1" "model2"
The output of the modelSelect function is a list with six entries. To quickly view the results of the model selection, you can use the summary_modelSelect function. This function will provide an easy-to-read display of the posterior model probabilities and Bayes factors. The only required input argument of the summary_modelSelect function is the object created with the modelSelect function.
# check results of model selection
msel <- summary_modelSelect(object = mysel)
#> Data:
#> results based on 1000 posterior samples
#>
#> Models a priori:
#> model1, model2
#>
#> Models a posteriori:
#> model1, model2
#>
#> Proportion of accepted simulations (rejection):
#> model1 model2
#> 0.51 0.49
#>
#> Posterior model probabilities (mnlogistic):
#> model1 model2
#> 0.541 0.459
As you can see, by running the summary_modelSelect function we get an output with the proportion of accepted simulations for each model under the rejection method and the posterior model probabilities under the regression method. If we print the object itself
# print results of model selection
msel
#> $rejection
#> $rejection$Prob
#> model1 model2
#> 0.51 0.49
#>
#> $rejection$BayesF
#> model1 model2
#> model1 1.0000000 1.040816
#> model2 0.9607843 1.000000
#>
#>
#> $mnlogistic
#> $mnlogistic$Prob
#> model1 model2
#> 0.5406397 0.4593603
#>
#> $mnlogistic$BayesF
#> model1 model2
#> model1 1.0000000 1.176941
#> model2 0.8496605 1.000000
we can also see the Bayes factors between pairs of models for both the rejection and the regression methods.
A fundamental part of any ABC analysis is the validation of the results obtained in the parameter estimation and model selection steps. The poolABC package includes tools to perform cross-validation for both analyses, computing prediction errors for both parameter inference and model selection.
One important component of this validation process is the calculation of prediction errors for each parameter. This allows us to evaluate how much confidence we can place in the estimates and the effect of different point estimates and/or tolerance rates on that confidence.
To perform a leave-one-out cross validation for ABC, you can use the simulationABC function. This function requires the simulated parameter values, params, the simulated summary statistics, sumstats, and a matrix with the limits of the prior distributions. You should also define the size of the cross-validation sample, nval, the tolerance rate, tol, and the type of ABC algorithm to be applied in the method input.
# perform an Approximate Bayesian Computation simulation study
mycv <- simulationABC(params = params, sumstats = sumstats, limits = limits, nval = 100, tol = 0.01, method = "regression")
# display the structure of the mycv object
str(mycv, max.level = 2)
#> List of 3
#> $ true: num [1:100, 1:8] 0.102 1.611 0.101 0.831 0.258 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ rej :List of 4
#> ..$ mode : num [1:100, 1:8] 0.108 1.972 0.099 0.366 0.308 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ median: num [1:100, 1:8] 0.142 1.92 0.196 0.506 0.418 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ mean : num [1:100, 1:8] 0.159 1.884 0.246 0.664 0.483 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ error : num [1:3, 1:8] 0.474 0.332 0.324 0.344 0.235 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> $ reg :List of 4
#> ..$ mode : num [1:100, 1:8] 0.108 2.187 0.125 0.526 0.279 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ median: num [1:100, 1:8] 0.11 2.254 0.13 0.565 0.334 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ mean : num [1:100, 1:8] 0.111 2.276 0.133 0.632 0.34 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ error : num [1:3, 1:8] 0.18 0.152 0.145 0.178 0.132 ...
#> .. ..- attr(*, "dimnames")=List of 2
The output of the simulationABC function is a list with three elements. Details about each list element are available in the help page of the function. A quick way to visualize the results of the leave-one-out cross validation is to plot the cross-validation results.
The poolABC package includes the plot_errorABC function to allow this visual evaluation of the quality of the estimation. This function requires as input the output of the simulationABC function. Additionally, you need to define the ABC algorithm (either “reg” for regression or “rej” for rejection) and which point estimate (“mode”, “median” or “mean”) to plot. You should also define which parameter to plot, by selecting the corresponding index.
# plot the prediction errors
plot_errorABC(x = mycv, method = "reg", statistic = "median", index = 1)
As you can see, this produces a plot with the true parameter value on the x-axis and the estimated value of the parameter on the y-axis. Thus, the closer the points are to the diagonal line, the higher the quality of the estimation.
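If you want to inspect every parameter at once, one option (assuming each parameter corresponds to one column of the true element of the simulationABC output, as in the str() output above) is to loop over the columns:
# plot the prediction errors for every parameter in turn
for (i in seq_len(ncol(mycv$true))) plot_errorABC(x = mycv, method = "reg", statistic = "median", index = i)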
It is also possible to evaluate how much confidence we should place in the model selection results by performing a leave-one-out cross validation for model selection with ABC via subsequent calls to the modelSelect function. Briefly, several simulations from each model are selected to act as validation simulations, while the remaining simulations are used as training simulations. For each validation simulation, the modelSelect function is called to estimate the posterior model probabilities.
# perform a leave-one-out cross validation for model selection
modelSim <- sim_modelSel(index = index, sumstats = sumstats, nval = 100, tol = 0.1)
# display the structure of the modelSim object
str(modelSim)
#> List of 5
#> $ cvsamples : Named int [1:200] 4192 2651 3721 4237 2109 3662 2407 3400 1684 4372 ...
#> ..- attr(*, "names")= chr [1:200] "model11" "model12" "model13" "model14" ...
#> $ true : chr [1:200] "model1" "model1" "model1" "model1" ...
#> $ estimated : chr [1:200] "model1" "model1" "model1" "model1" ...
#> $ model.probs: num [1:200, 1:2] 0.52 0.511 0.582 0.598 0.592 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:200] "model1" "model1" "model1" "model1" ...
#> .. ..$ : chr [1:2] "model1" "model2"
#> $ models : chr [1:2] "model1" "model2"
The output of this leave-one-out cross validation for model selection is a list with five different elements that can be used in the error_modelSel function to compute the confusion matrix and the mean misclassification probabilities of models.
Users can also define a threshold for the posterior model probabilities. This threshold corresponds to the minimum posterior probability of assignment. Thus, a simulation where the posterior probability of any model is below the threshold will not be assigned to a model and will instead be classified as unclear.
# compute the mean misclassification probabilities
mselError <- error_modelSel(object = modelSim)
#> Confusion matrix based on 100 samples for each model
#>
#> model1 model2
#> model1 54 46
#> model2 54 46
#>
#> Mean model posterior probabilities (mnlogistic)
#>
#> model1 model2
#> model1 0.506 0.494
#> model2 0.506 0.494
#>
#> Posterior probabilities of correctly assigned model1 models
#>
#> model1 incorrect
#> 0.541 0.459
#>
#> Posterior probabilities of correctly assigned model2 models
#>
#> incorrect model2
#> 0.463 0.537
#>
#> Posterior probabilities when model1 is estimated as model2
#>
#> model1 model2
#> 0.465 0.535
#>
#> Posterior probabilities when model2 is estimated as model1
#>
#> model1 model2
#> 0.542 0.458
The error_modelSel function outputs the confusion matrix and the mean model posterior probabilities obtained with the regression method. It will also output other useful information, such as the mean posterior probability of correctly assigned models and the mean posterior probability when each model is not correctly assigned.
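To apply the threshold described above, a minimal sketch (assuming error_modelSel accepts a threshold argument holding the minimum posterior probability of assignment) would be:
# recompute the misclassification summary, classifying low-probability assignments as unclear
mselError <- error_modelSel(object = modelSim, threshold = 0.9)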
For a more visual interpretation of these results, it is also possible to display a barplot of the model misclassification. By using the plot_msel function we can plot the confusion matrix, either in colour (if color = TRUE) or in grey (if color = FALSE).
Using the output of the error_modelSel function as the input of the plot_msel function, we can produce a barplot showing the proportion of simulations classified into each of the models, as in the sketch below.
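A minimal sketch of that call, assuming the first argument of plot_msel is named object:
# barplot of the confusion matrix, in colour
plot_msel(object = mselError, color = TRUE)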