Runs the Expectation-Maximization (EM) algorithm, using the specified approximation method in the E-step.
Certain methods may require additional arguments, which can be passed through ... (see fastei-package for more details).
run_em(
object = NULL,
X = NULL,
W = NULL,
json_path = NULL,
method = "mult",
initial_prob = "group_proportional",
allow_mismatch = TRUE,
maxiter = 1000,
miniter = 0,
maxtime = 3600,
param_threshold = 0.001,
ll_threshold = as.double(-Inf),
compute_ll = TRUE,
seed = NULL,
verbose = FALSE,
group_agg = NULL,
mcmc_samples = 1000,
mcmc_stepsize = 3000,
mvncdf_method = "genz",
mvncdf_error = 1e-05,
mvncdf_samples = 5000,
adjust_prob_cond_method = "project_lp",
adjust_prob_cond_every = FALSE,
...
)
An object of class eim, which can be created using the eim function. This parameter should not be used if either (i) the X and W matrices or (ii) json_path is supplied. See Note.
A (b x c) matrix representing candidate votes per ballot box.
A (b x g) matrix representing group votes per ballot box.
A path to a JSON file containing X and W fields, stored as nested arrays. It may contain additional fields with other attributes, which will be added to the returned object.
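As a rough illustration of the layout described above, the following base-R sketch writes a small JSON file with X and W fields as nested arrays. It assumes one inner array per ballot box (row-major nesting), which the text does not state explicitly. The helper to_nested and the toy matrices are hypothetical and not part of the package.

```r
# Toy data (hypothetical): 2 ballot boxes, 2 candidates, 2 groups.
X <- matrix(c(10, 5, 8, 12), nrow = 2, byrow = TRUE)
W <- matrix(c(9, 6, 11, 9), nrow = 2, byrow = TRUE)

# Hypothetical helper: serialize a matrix as nested arrays.
# Assumption: each inner array holds one row, i.e. one ballot box.
to_nested <- function(m) {
  rows <- apply(m, 1, function(r) paste0("[", paste(r, collapse = ", "), "]"))
  paste0("[", paste(rows, collapse = ", "), "]")
}

# Build the JSON text and write it to a temporary file.
json_text <- sprintf('{"X": %s, "W": %s}', to_nested(X), to_nested(W))
json_file <- tempfile(fileext = ".json")
writeLines(json_text, json_file)
```

The resulting path in json_file could then be supplied as json_path, provided the nesting assumption matches what the package expects.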
An optional string specifying the method used for estimating the E-step. Valid options are:
mult: The default method, using a single sum of Multinomial distributions.
mvn_cdf: Uses the Multivariate Normal CDF to approximate the conditional probability.
mvn_pdf: Uses the Multivariate Normal PDF to approximate the conditional probability.
mcmc: Uses MCMC to sample vote outcomes, which are then used to estimate the conditional probability of the E-step.
exact: Solves the E-step using the Law of Total Probability.
For a detailed description of each method, see fastei-package and References.
An optional string specifying the method used to obtain the initial probability. Accepted values are:
uniform: Assigns equal probability to every candidate within each group.
proportional: Assigns probabilities to each group based on the proportion of candidate votes.
group_proportional: Computes the probability matrix by taking into account both group and candidate proportions. This is the default method.
random: Uses randomized values to fill the probability matrix.
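To make the first two options concrete, here is a minimal base-R sketch of what a uniform and a proportional starting matrix could look like. It is an interpretation of the descriptions above, not the package's implementation; the toy matrices and the variable names are hypothetical, and group_proportional is omitted because its formula is not given here.

```r
# Toy data (hypothetical): 2 ballot boxes, 2 candidates, 2 groups.
X <- matrix(c(10, 5, 8, 12), nrow = 2, byrow = TRUE)  # b x c candidate votes
W <- matrix(c(9, 6, 11, 9), nrow = 2, byrow = TRUE)   # b x g group votes
n_groups <- ncol(W)
n_cand <- ncol(X)

# "uniform": every candidate gets the same probability within each group.
p_uniform <- matrix(1 / n_cand, nrow = n_groups, ncol = n_cand)

# "proportional" (assumed reading): every group starts from the overall
# candidate vote shares.
shares <- colSums(X) / sum(X)
p_proportional <- matrix(shares, nrow = n_groups, ncol = n_cand, byrow = TRUE)
```

In both cases each row of the (g x c) matrix sums to one, which is the property an initial probability matrix needs.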
Boolean. If TRUE, allows a mismatch between the number of voters and the number of votes in each ballot box; this only works if method is "mvn_cdf", "mvn_pdf", "mult", or "mcmc". If FALSE, throws an error if there is a mismatch. The default value is TRUE.
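A mismatch here is read as a ballot box whose row total in X differs from its row total in W. The base-R sketch below, using hypothetical toy matrices, shows one way to locate such ballot boxes before calling the function; it is not how the package performs its own check.

```r
# Toy data (hypothetical): the second ballot box has 20 votes but 21 voters.
X <- matrix(c(10, 5, 8, 12), nrow = 2, byrow = TRUE)  # b x c candidate votes
W <- matrix(c(9, 6, 11, 10), nrow = 2, byrow = TRUE)  # b x g group votes

# Compare total votes and total voters per ballot box.
mismatch <- rowSums(X) != rowSums(W)
mismatched_boxes <- which(mismatch)
```

If mismatched_boxes is non-empty and the chosen method is "exact", the description above suggests setting allow_mismatch = TRUE would not help, so the data would need to be fixed first.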
An optional integer indicating the maximum number of EM iterations. The default value is 1000.
An optional integer indicating the minimum number of EM iterations. The default value is 0.
An optional numeric specifying the maximum running time (in seconds) for the algorithm. This is checked at every iteration of the EM algorithm. The default value is 3600, which corresponds to one hour.
An optional numeric value indicating the threshold on the difference between consecutive probability values below which the algorithm stops iterating. The default value is 0.001. Note that the algorithm will stop if either ll_threshold or param_threshold is met.
An optional numeric value indicating the threshold on the difference between consecutive log-likelihood values below which the algorithm stops iterating. The default value is -Inf, essentially deactivating the threshold. Note that the algorithm will stop if either ll_threshold or param_threshold is met.
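The interaction between the iteration limits and the two thresholds can be sketched as a small stopping rule. This is an assumption-laden illustration in base R: the function should_stop is hypothetical, the use of the maximum absolute difference between probability matrices is an assumed metric, and the order of the checks is a guess rather than the package's documented behavior.

```r
# Hypothetical stopping rule combining miniter, maxiter and both thresholds.
should_stop <- function(iter, prob_old, prob_new, ll_old, ll_new,
                        miniter = 0, maxiter = 1000,
                        param_threshold = 0.001, ll_threshold = -Inf) {
  if (iter >= maxiter) return(TRUE)   # hard cap on iterations
  if (iter < miniter) return(FALSE)   # never stop before miniter
  # Assumed metric: largest absolute change in any probability entry.
  param_met <- max(abs(prob_new - prob_old)) < param_threshold
  # With ll_threshold = -Inf this comparison is always FALSE,
  # so the log-likelihood criterion is effectively switched off.
  ll_met <- abs(ll_new - ll_old) < ll_threshold
  param_met || ll_met
}

p0 <- matrix(0.5, nrow = 2, ncol = 2)
small_change <- should_stop(10, p0, p0 + 1e-4, ll_old = -100, ll_new = -99.5)
large_change <- should_stop(10, p0, p0 + 0.01, ll_old = -100, ll_new = -99.5)
ll_triggered <- should_stop(10, p0, p0 + 0.01, ll_old = -100, ll_new = -99.5,
                            ll_threshold = 1)
```

Here small_change stops on the parameter criterion, large_change keeps iterating under the defaults, and ll_triggered stops only because a finite ll_threshold was supplied.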
An optional boolean indicating whether to compute the log-likelihood at each iteration. The default value is TRUE.
An optional integer indicating the random seed for the randomized algorithms. This argument is only applicable if initial_prob = "random" or method is either "mcmc" or "mvn_cdf".
An optional boolean indicating whether to print informational messages during the EM iterations. The default value is FALSE.
An optional vector of increasing integers from 1 to the number of columns in W, specifying how to aggregate groups in W before running the EM algorithm. Each value represents the highest column index included in each aggregated group. For example, if W has four columns, group_agg = c(2, 4) indicates that columns 1 and 2 should be combined into one group, and columns 3 and 4 into another. Defaults to NULL, in which case no group aggregation is performed.
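The aggregation rule described above can be sketched in base R as follows. The function aggregate_groups is a hypothetical helper written for illustration, not the package's internal routine; it only mirrors the stated rule that each value of group_agg is the highest column index of an aggregated group.

```r
# Hypothetical helper: sum the columns of W within each aggregated group.
aggregate_groups <- function(W, group_agg) {
  # Each group starts right after the previous group's last column.
  starts <- c(1, head(group_agg, -1) + 1)
  cols <- lapply(seq_along(group_agg), function(k) {
    rowSums(W[, starts[k]:group_agg[k], drop = FALSE])
  })
  # One column per aggregated group, one row per ballot box.
  matrix(unlist(cols), nrow = nrow(W))
}

# Toy data (hypothetical): 2 ballot boxes, 4 groups.
W <- matrix(1:8, nrow = 2)
W_agg <- aggregate_groups(W, group_agg = c(2, 4))
```

With this toy W, columns 1 and 2 are summed into the first aggregated group and columns 3 and 4 into the second, giving a 2 x 2 matrix.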
An optional integer indicating the number of samples to generate for the MCMC method. This parameter is only relevant when method = "mcmc". The default value is 1000.
An optional integer specifying the step size for the mcmc algorithm. This parameter is only applicable when method = "mcmc" and will be ignored otherwise. The default value is 3000.
An optional string specifying the Monte Carlo method used to estimate the Multivariate Normal CDF in the mvn_cdf method. Accepted values are genz and genz2, with genz set as the default. This parameter is only applicable when method = "mvn_cdf". See References for more details.
An optional numeric value defining the error threshold for the Monte Carlo simulation when estimating the mvn_cdf method. The default value is 1e-5. This parameter is only relevant when method = "mvn_cdf".
An optional integer specifying the number of Monte Carlo samples for the mvn_cdf method. The default value is 5000. This argument is only applicable when method = "mvn_cdf".
An optional string indicating the method used to adjust the conditional probabilities so that, for each candidate, the sum over groups of voters multiplied by conditional probabilities equals the votes obtained by that candidate. Accepted values are:
"": No adjustment is made.
lp: The adjustment is based on a linear program that penalizes with the zero norm.
project_lp: The adjustment is performed using projection and linear programming. This is the default.
An optional boolean indicating whether to adjust the conditional probabilities on every iteration (if TRUE), or only the conditional probabilities obtained at the end of the EM algorithm (if FALSE, the default). This parameter applies only if adjust_prob_cond_method is lp or project_lp.
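The consistency condition targeted by the adjustment can be checked directly. The base-R sketch below evaluates it for a single ballot box, assuming the condition is applied per ballot box; the toy vectors and matrices are hypothetical, and the sketch only tests the condition, it does not perform the lp or project_lp adjustment.

```r
# Toy ballot box (hypothetical): 2 groups, 2 candidates.
w <- c(9, 6)    # voters per group
x <- c(10, 5)   # votes per candidate

# Two candidate (g x c) conditional probability matrices for this ballot box.
q_inconsistent <- matrix(c(0.8, 0.2,
                           0.5, 0.5), nrow = 2, byrow = TRUE)
q_consistent <- matrix(c(2 / 3, 1 / 3,
                         2 / 3, 1 / 3), nrow = 2, byrow = TRUE)

# Votes implied by the probabilities: sum over groups of voters * probability.
residual_inconsistent <- as.vector(w %*% q_inconsistent) - x
residual_consistent <- as.vector(w %*% q_consistent) - x
```

A residual of zero for every candidate means the conditional probabilities already reproduce the observed votes; a non-zero residual is what an adjustment step would need to remove.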
Added for compatibility.
The function returns an eim object with the function arguments and the following attributes:
The estimated probability matrix (g x c).
A (g x c x b) 3D array with the probability that, at each ballot box, a voter of each group voted for each candidate, given the observed outcome at that ballot box.
A (g x c x b) 3D array with the expected votes cast for each ballot box.
The log-likelihood value from the last iteration.
The total number of iterations performed by the EM algorithm.
The total execution time of the algorithm in seconds.
The final status ID of the algorithm upon completion:
0: Converged.
1: Maximum time reached.
2: Maximum iterations reached.
The finishing status displayed as a message, matching the status ID value.
The method for estimating the conditional probability in the E-step.
Additionally, the returned object includes the mcmc_samples and mcmc_stepsize parameters if method = "mcmc", or mvncdf_method, mvncdf_error, and mvncdf_samples if method = "mvn_cdf".
Also, if the supplied eim object was created with the function simulate_election, the result includes the real probability and the unobserved votes, named real_prob and outcome, respectively. See simulate_election.
If group_agg is not NULL, two additional values are returned: W_agg, a (b x a) matrix with the number of voters of each aggregated group in each ballot box, and group_agg, the same input vector.
This function can be executed using one of three mutually exclusive approaches:
By providing an existing eim object.
By supplying both input matrices (X and W) directly.
By specifying a JSON file (json_path) containing the matrices.
You must provide exactly one of these options. Providing more than one, or none, results in an error.
When called with an eim object, the function updates the object with the computed results.
If an eim object is not provided, the function will create one internally using either the supplied matrices or the data from the JSON file before executing the algorithm.
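The "exactly one input" rule from this note can be sketched as a small validation helper in base R. The function input_mode and its error message are hypothetical and written only to illustrate the rule; the package's own checks and messages may differ.

```r
# Hypothetical helper: report which input approach was used, or fail
# when zero or several approaches are supplied at once.
input_mode <- function(object = NULL, X = NULL, W = NULL, json_path = NULL) {
  modes <- c(
    object = !is.null(object),
    matrices = !is.null(X) && !is.null(W),  # both matrices are required
    json = !is.null(json_path)
  )
  if (sum(modes) != 1) {
    stop("Provide exactly one of: an eim object, X and W, or json_path.")
  }
  names(modes)[modes]
}

# Toy usage (hypothetical values).
mode_json <- input_mode(json_path = "a/json/file.json")
mode_matrices <- input_mode(X = matrix(1, 1, 1), W = matrix(1, 1, 1))
no_input <- try(input_mode(), silent = TRUE)
```

In this sketch, supplying only X without W counts as no valid input, which matches the requirement that both matrices be given together.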
Thraves, C., Ubilla, P. and Hermosilla, D.: "Fast Ecological Inference Algorithm for the RxC Case". Additionally, the MVN CDF is computed by the methods introduced in Genz, A. (2000). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics.
The eim object implementation.
# Example 1: Run the Expectation-Maximization algorithm with default settings
simulations <- simulate_election(
  num_ballots = 300,
  num_candidates = 5,
  num_groups = 3
)
model <- eim(simulations$X, simulations$W)
model <- run_em(model) # Returns the object with updated attributes

# Example 2: Run the Expectation-Maximization algorithm using the mvn_pdf method
model <- run_em(
  object = model,
  method = "mvn_pdf"
)

# Example 3: Run the mvn_cdf method with default settings
model <- run_em(object = model, method = "mvn_cdf")
if (FALSE) { # Not run: requires an existing JSON file
  # Example 4: Perform an exact estimation using user-defined parameters
  run_em(
    json_path = "a/json/file.json",
    method = "exact",
    initial_prob = "uniform",
    maxiter = 10,
    maxtime = 600,
    param_threshold = 1e-3,
    ll_threshold = 1e-5,
    verbose = TRUE
  )
}