Compute the Expectation-Maximization Algorithm

Runs the Expectation-Maximization (EM) algorithm, using the approximation method selected through method for the E-step. Certain methods may require additional arguments, which can be passed through ... (see fastei-package for more details).

Usage

run_em(
  object = NULL,
  X = NULL,
  W = NULL,
  json_path = NULL,
  method = "mult",
  initial_prob = "group_proportional",
  allow_mismatch = TRUE,
  maxiter = 1000,
  maxtime = 3600,
  param_threshold = 0.001,
  ll_threshold = as.double(-Inf),
  seed = NULL,
  verbose = FALSE,
  group_agg = NULL,
  mcmc_samples = 1000,
  mcmc_stepsize = 3000,
  mvncdf_method = "genz",
  mvncdf_error = 0.00001,
  mvncdf_samples = 5000,
  ...
)

Arguments

object

An object of class eim, which can be created using the eim function. Do not use this parameter if either (i) the X and W matrices or (ii) json_path is supplied. See Note.

X

A (b x c) matrix representing candidate votes per ballot box.

W

A (b x g) matrix representing group votes per ballot box.

json_path

A path to a JSON file containing X and W fields, stored as nested arrays. It may contain additional fields with other attributes, which will be added to the returned object.
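As an illustration of the layout described above, the following sketch writes a small JSON file with X and W stored as nested arrays. It assumes that each inner array corresponds to one ballot box (one matrix row); the values and the temporary file are made up for the example.

```r
# Hypothetical data: 2 ballot boxes, 2 candidates (X) and 2 groups (W).
# Assumption: each inner array is one ballot box, i.e. one matrix row.
json_text <- '{
  "X": [[10, 20], [5, 15]],
  "W": [[12, 18], [8, 12]]
}'

# Write the file to a temporary location
json_file <- tempfile(fileext = ".json")
writeLines(json_text, json_file)

# With the package loaded, the file could then be passed as:
# model <- run_em(json_path = json_file)
```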

method

An optional string specifying the method used for estimating the E-step. Valid options are:

mult: The default method, using a single sum of Multinomial distributions.
mvn_cdf: Uses the Multivariate Normal CDF to approximate the conditional probability.
mvn_pdf: Uses the Multivariate Normal PDF to approximate the conditional probability.
mcmc: Uses MCMC to sample vote outcomes. This is used to estimate the conditional probability of the E-step.
exact: Solves the E-step using the Total Probability Law.

For a detailed description of each method, see fastei-package and References.

initial_prob

An optional string specifying the method used to obtain the initial probability. Accepted values are:

uniform: Assigns equal probability to every candidate within each group.
proportional: Assigns probabilities to each group based on the proportion of each candidate's votes.
group_proportional: Computes the probability matrix by taking into account both group and candidate proportions. This is the default method.
random: Use randomized values to fill the probability matrix.
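To make the first two options concrete, here is a minimal base R sketch of what a uniform and a proportional starting matrix could look like. It is an interpretation of the descriptions above, not the package's internal code, and the data are made up; group_proportional and random are left out because their exact formulas are not given here.

```r
# Hypothetical data: 2 ballot boxes, 3 candidates, 2 groups
X <- matrix(c(10, 20, 30,
              20, 10, 10), ncol = 3, byrow = TRUE)
n_groups <- 2
n_candidates <- ncol(X)

# "uniform": every candidate gets the same probability within each group
p_uniform <- matrix(1 / n_candidates, nrow = n_groups, ncol = n_candidates)

# "proportional" (assumed reading): every group starts from the overall
# candidate vote shares
shares <- colSums(X) / sum(X)
p_proportional <- matrix(shares, nrow = n_groups, ncol = n_candidates,
                         byrow = TRUE)

p_uniform
p_proportional
```

In both cases the result is a (g x c) matrix whose rows sum to one, matching the shape of the prob attribute described under Value.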

allow_mismatch

Boolean. If TRUE, allows a mismatch between the number of voters and the number of votes in a ballot box; this is only supported when method is "mvn_cdf", "mvn_pdf", "mult" or "mcmc". If FALSE, an error is thrown when a mismatch is found. The default value is TRUE.
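A mismatch here means that, for some ballot box, the row total of X differs from the row total of W. The following base R check, on made-up matrices, is one way to locate such ballot boxes before calling run_em; it is a sketch and not the package's own validation.

```r
# Hypothetical data: 3 ballot boxes, 2 candidates, 2 groups
X <- matrix(c(10, 20,
               5, 15,
               7,  3), ncol = 2, byrow = TRUE)
W <- matrix(c(12, 18,
               8, 12,
               6,  3), ncol = 2, byrow = TRUE)

# Row i of X and row i of W describe the same ballot box,
# so their totals should agree when there is no mismatch
mismatched <- which(rowSums(X) != rowSums(W))
mismatched
```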

maxiter

An optional integer indicating the maximum number of EM iterations. The default value is 1000.

maxtime

An optional numeric specifying the maximum running time (in seconds) for the algorithm. This is checked at every iteration of the EM algorithm. The default value is 3600, which corresponds to an hour.

param_threshold

An optional numeric value giving the convergence threshold on the parameters: the algorithm stops once the difference between consecutive probability values falls below it. The default value is 0.001. Note that the algorithm stops as soon as either ll_threshold or param_threshold is met.

ll_threshold

An optional numeric value giving the convergence threshold on the log-likelihood: the algorithm stops once the difference between consecutive log-likelihood values falls below it. The default value is -Inf, which effectively disables this criterion. Note that the algorithm stops as soon as either ll_threshold or param_threshold is met.
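The interaction between the two thresholds can be sketched as follows. The helper should_stop is hypothetical and not part of the package; it assumes that the parameter change is measured as the largest absolute entry-wise difference and that comparisons are strict, neither of which is confirmed here.

```r
# Hypothetical helper, not part of the package
should_stop <- function(prob_old, prob_new, ll_old, ll_new,
                        param_threshold = 0.001, ll_threshold = -Inf) {
  # Assumption: change in parameters = largest absolute entry-wise difference
  param_change <- max(abs(prob_new - prob_old))
  ll_change <- abs(ll_new - ll_old)
  # Either criterion is enough to stop; with ll_threshold = -Inf the
  # log-likelihood criterion can never be satisfied
  param_change < param_threshold || ll_change < ll_threshold
}

p_old <- matrix(c(0.50, 0.50,
                  0.30, 0.70), nrow = 2, byrow = TRUE)
p_new <- matrix(c(0.5004, 0.4996,
                  0.3002, 0.6998), nrow = 2, byrow = TRUE)

# Parameters moved by at most 0.0004, so the parameter criterion is met
should_stop(p_old, p_new, ll_old = -120.5, ll_new = -120.4)
```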

seed

An optional integer indicating the random seed for the randomized algorithms. This argument is only applicable if initial_prob = "random" or method is either "mcmc" or "mvn_cdf".

verbose

An optional boolean indicating whether to print informational messages during the EM iterations. The default value is FALSE.

group_agg

An optional vector describing the group aggregation through group indices. For example, c(2, 4) indicates that groups 1 and 2 should be aggregated into a single group and groups 3 and 4 into another. Defaults to NULL.
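The effect of the c(2, 4) example on W can be reproduced in base R as below. The sketch assumes, based on that example, that each entry marks the last column of a block of consecutive groups; the data are made up and the code is not taken from the package.

```r
# Hypothetical data: 2 ballot boxes, 4 groups
W <- matrix(c(1, 2, 3, 4,
              5, 6, 7, 8), ncol = 4, byrow = TRUE)
group_agg <- c(2, 4)

# Assumption: each entry is the last column of a block,
# so the blocks here are columns 1:2 and 3:4
starts <- c(1, head(group_agg, -1) + 1)
W_agg <- sapply(seq_along(group_agg), function(k) {
  rowSums(W[, starts[k]:group_agg[k], drop = FALSE])
})
W_agg
```

The aggregated matrix keeps one row per ballot box and has one column per aggregated group.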

mcmc_samples

An optional integer indicating the number of samples to generate for the MCMC method. This parameter is only relevant when method = "mcmc". The default value is 1000.

mcmc_stepsize

An optional integer specifying the step size for the MCMC algorithm. This parameter is only applicable when method = "mcmc" and will be ignored otherwise. The default value is 3000.

mvncdf_method

An optional string specifying the Monte Carlo method used to evaluate the Multivariate Normal CDF. Accepted values are genz and genz2, with genz set as the default. This parameter is only applicable when method = "mvn_cdf". See References for more details.

mvncdf_error

An optional numeric value defining the error threshold for the Monte Carlo simulation used to evaluate the Multivariate Normal CDF. The default value is 1e-5. This parameter is only relevant when method = "mvn_cdf".

mvncdf_samples

An optional integer specifying the number of Monte Carlo samples for the mvn_cdf method. The default value is 5000. This argument is only applicable when method = "mvn_cdf".

...

Added for compatibility.

Value

The function returns an eim object with the function arguments and the following attributes:

prob

The estimated probability matrix (g x c).

cond_prob

A (b x g x c) 3D array with the probability that, at each ballot box, a voter of each group voted for each candidate, given the observed outcome at that ballot box.
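To illustrate how the dimensions of cond_prob relate to those of prob, the sketch below collapses a made-up (b x g x c) array into a (g x c) matrix by weighting each ballot box by its group counts in W. This weighting is an assumption about a typical M-step for this kind of model and is not confirmed by this page.

```r
# Hypothetical sizes: b = 2 ballot boxes, g = 2 groups, c = 2 candidates
W <- matrix(c(10, 30,
              20, 20), ncol = 2, byrow = TRUE)

# cond_prob[b, g, c]: made-up values; each [b, g, ] slice sums to 1
cond_prob <- array(0, dim = c(2, 2, 2))
cond_prob[1, 1, ] <- c(0.6, 0.4)
cond_prob[1, 2, ] <- c(0.2, 0.8)
cond_prob[2, 1, ] <- c(0.3, 0.7)
cond_prob[2, 2, ] <- c(0.5, 0.5)

# Assumed aggregation: weight each ballot box by its group count,
# then normalise by the total count of each group
prob <- sapply(1:2, function(c_idx) {
  colSums(W * cond_prob[, , c_idx]) / colSums(W)
})
prob
```

Each row of the resulting matrix corresponds to a group and sums to one.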

logLik

The log-likelihood value from the last iteration.

iterations

The total number of iterations performed by the EM algorithm.

time

The total execution time of the algorithm in seconds.

status

The final status ID of the algorithm upon completion:

0: Converged.
1: Maximum time reached.
2: Maximum iterations reached.
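When post-processing many fitted models, it can help to translate the status ID back into text. The helper describe_status below is hypothetical and simply encodes the three codes listed above; reading the ID with model$status assumes the returned object exposes its attributes through $, as simulations$X does in the Examples.

```r
# Hypothetical helper mapping the documented status IDs to their descriptions
describe_status <- function(status) {
  switch(as.character(status),
    "0" = "Converged",
    "1" = "Maximum time reached",
    "2" = "Maximum iterations reached",
    stop("Unknown status ID: ", status)
  )
}

describe_status(2)

# With a fitted model (assumption: attributes are accessible with $):
# describe_status(model$status)
```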

message

The finishing status displayed as a message, matching the status ID value.

method

The method for estimating the conditional probability in the E-step.

Additionally, the object will include the mcmc_samples and mcmc_stepsize parameters if method = "mcmc", or mvncdf_method, mvncdf_error and mvncdf_samples if method = "mvn_cdf".

If the supplied eim object was created with simulate_election, the result also includes the real probability under the name real_prob. See simulate_election.

Note

This function can be executed using one of three mutually exclusive approaches:

By providing an existing eim object.
By supplying both input matrices (X and W) directly.
By specifying a JSON file (json_path) containing the matrices.

Exactly one of these options must be provided; supplying more than one, or none, results in an error.

When called with an eim object, the function updates the object with the computed results. If an eim object is not provided, the function will create one internally using either the supplied matrices or the data from the JSON file before executing the algorithm.
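The input rule can be sketched as a small validation function. check_inputs is a hypothetical helper written for illustration, not the package's own check, and its error messages are made up.

```r
# Hypothetical helper, not part of the package
check_inputs <- function(object = NULL, X = NULL, W = NULL, json_path = NULL) {
  # X and W only make sense as a pair
  if (xor(is.null(X), is.null(W))) {
    stop("X and W must be supplied together")
  }
  # Count how many of the three input approaches were used
  n_modes <- sum(!is.null(object), !is.null(X), !is.null(json_path))
  if (n_modes != 1) {
    stop("Provide exactly one of: object, X and W, or json_path")
  }
  invisible(TRUE)
}

# Valid: only the matrices are supplied
check_inputs(X = matrix(1:4, 2), W = matrix(1:4, 2))

# Invalid: matrices and a JSON path together
try(check_inputs(X = matrix(1:4, 2), W = matrix(1:4, 2), json_path = "file.json"))
```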

References

Thraves, C., Ubilla, P. and Hermosilla, D.: "Fast Ecological Inference Algorithm for the RxC Case". Additionally, the MVN CDF is computed by the methods introduced in Genz, A. (2000). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics.

Examples

# Example 1: Run the Expectation-Maximization algorithm with default settings
simulations <- simulate_election(
    num_ballots = 300,
    num_candidates = 5,
    num_groups = 3
)
model <- eim(simulations$X, simulations$W)
model <- run_em(model) # Returns the object with updated attributes

# Example 2: Run the Expectation-Maximization algorithm using the mvn_pdf method
model <- run_em(
    object = model,
    method = "mvn_pdf"
)

# Example 3: Run the mvn_cdf method with default settings
model <- run_em(object = model, method = "mvn_cdf")
# The next example is not run because the JSON path is a placeholder
if (FALSE) {
# Example 4: Perform an Exact estimation using user-defined parameters

run_em(
    json_path = "a/json/file.json",
    method = "exact",
    initial_prob = "uniform",
    maxiter = 10,
    maxtime = 600,
    param_threshold = 1e-3,
    ll_threshold = 1e-5,
    verbose = TRUE
)
}