Runs the EM algorithm aggregating adjacent groups, maximizing the variability of macro-group allocation in ballot boxes.
Source: R/eim-class.R
This function estimates the voting probabilities (computed using run_em) aggregating adjacent groups so that the estimated probabilities' standard deviation (computed using bootstrap) is below a given threshold. See Details for more information.
Usage
get_agg_proxy(
  object = NULL,
  X = NULL,
  W = NULL,
  json_path = NULL,
  sd_statistic = "maximum",
  sd_threshold = 0.05,
  method = "mult",
  feasible = TRUE,
  nboot = 100,
  allow_mismatch = TRUE,
  seed = NULL,
  ...
)
Arguments
- object
An object of class eim, which can be created using the eim function. This parameter should not be used if either (i) the X and W matrices or (ii) json_path is supplied. See Note in run_em.
- X
A (b x c) matrix representing candidate votes per ballot box.
- W
A (b x g) matrix representing group votes per ballot box.
- json_path
A path to a JSON file containing X and W fields, stored as nested arrays. It may contain additional fields with other attributes, which will be added to the returned object.
- sd_statistic
String indicating the statistic of the (g x c) standard deviation matrix used in the stopping condition, i.e., the algorithm stops when the statistic is below the threshold. It can take the value "maximum", in which case the maximum over the standard deviation matrix is computed, or "average", in which case the average is computed.
- sd_threshold
Numeric value used as the threshold for the statistic (sd_statistic) of the standard deviation of the estimated probabilities. Defaults to 0.05.
- method
An optional string specifying the method used for estimating the E-step. Valid options are:
mult: The default method, using a single sum of Multinomial distributions.
mvn_cdf: Uses a Multivariate Normal CDF to approximate the conditional probability.
mvn_pdf: Uses a Multivariate Normal PDF to approximate the conditional probability.
mcmc: Uses MCMC to sample vote outcomes. This is used to estimate the conditional probability of the E-step.
exact: Solves the E-step using the Total Probability Law.
- feasible
Logical indicating whether the returned matrix must strictly satisfy the sd_threshold. If TRUE, no output is returned if the method does not find a group aggregation whose standard deviation statistic is below the threshold. If FALSE and no such aggregation is found, it returns the group aggregation obtained from the DP with the lowest standard deviation statistic. See Details for more information. Default is TRUE.
- nboot
Integer specifying how many times to run the EM algorithm.
- allow_mismatch
Boolean. If TRUE, allows a mismatch between the voters and votes for each ballot box; this only works if method is "mvn_cdf", "mvn_pdf", "mult", or "mcmc". If FALSE, throws an error if there is a mismatch. Default is TRUE.
- seed
An optional integer indicating the random seed for the randomized algorithms. This argument is only applicable if initial_prob = "random" or method is either "mcmc" or "mvn_cdf". Additionally, it sets the random draws of the ballot boxes.
- ...
Additional arguments passed to the run_em function that will execute the EM algorithm.
Value
It returns an eim object with the same attributes as the output of run_em, plus the attributes:
sd: A (a x c) matrix with the standard deviation of the estimated probabilities computed with bootstrapping. Here a denotes the number of macro-groups of the resulting group aggregation, which is between 1 and g.
nboot: Number of samples used for the bootstrap method.
seed: Random seed used (if specified).
sd_statistic: The statistic used as input.
sd_threshold: The threshold used as input.
is_feasible: Boolean indicating whether the statistic of the standard deviation matrix is below the threshold.
group_agg: Vector with the resulting group aggregation. See Examples for more details.
Additionally, it will create the W_agg attribute with the aggregated groups, along with the attributes corresponding to running run_em with the aggregated groups.
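The relationship between group_agg and W_agg can be illustrated in base R. This is only a minimal sketch: it assumes, as the Examples suggest, that each entry of group_agg is the index of the last original group in a macro-group, and the helper name aggregate_groups is hypothetical (it is not part of the package).

```r
# Hypothetical helper (not part of the package): collapse adjacent columns of W
# according to a group_agg vector of closing indices.
aggregate_groups <- function(W, group_agg) {
  # First original group of each macro-group
  starts <- c(1, head(group_agg, -1) + 1)
  W_agg <- sapply(seq_along(group_agg), function(k) {
    rowSums(W[, starts[k]:group_agg[k], drop = FALSE])
  })
  # Keep a (b x a) matrix even when there is a single ballot box
  matrix(W_agg, nrow = nrow(W))
}

W <- matrix(1:12, nrow = 2)            # 2 ballot boxes, 6 groups
W_agg <- aggregate_groups(W, c(2, 6))  # macro-groups {1, 2} and {3, 4, 5, 6}
```

Each column of the resulting matrix holds the number of voters of one macro-group per ballot box, so row totals are preserved.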
Details
Groups need to have an order relation so that adjacent groups can be merged. Groups with consecutive column indices in the matrix W are considered adjacent. For example, consider the following seven groups defined by voters' age ranges: 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, and 80+. One possible group aggregation consists of three macro-groups with the age ranges 20-39, 40-59, and 60+. Since there are multiple group aggregations, even for a fixed number of macro-groups, a Dynamic Program (DP) is used to find, for a given number of macro-groups, the group aggregation that maximizes the sum of the standard deviations of the macro-group proportions across ballot boxes. If no group aggregation has a standard deviation statistic that meets the threshold condition, NULL is returned.
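The DP over adjacent groups can be sketched in base R as a standard contiguous-partition recursion. This is an illustration under stated assumptions, not the package's implementation: the function name best_aggregation is hypothetical, the objective is taken to be the sum, over macro-groups, of the standard deviation across ballot boxes of each macro-group's share of voters, and W is assumed to have at least two ballot boxes with positive row totals.

```r
# Illustrative DP (not the package's code): choose n_macro contiguous
# macro-groups that maximize the summed variability of their shares.
best_aggregation <- function(W, n_macro) {
  g <- ncol(W)
  shares <- W / rowSums(W)  # per-ballot-box proportion of each group
  # Standard deviation, across ballot boxes, of the share of groups i..j
  score <- function(i, j) sd(rowSums(shares[, i:j, drop = FALSE]))

  # value[k, j]: best total score when groups 1..j form k macro-groups
  value <- matrix(-Inf, nrow = n_macro, ncol = g)
  prev_end <- matrix(NA_integer_, nrow = n_macro, ncol = g)
  for (j in seq_len(g)) value[1, j] <- score(1, j)
  for (k in seq_len(n_macro)[-1]) {
    for (j in k:g) {
      for (i in (k - 1):(j - 1)) {  # i: last group of the previous macro-group
        candidate <- value[k - 1, i] + score(i + 1, j)
        if (candidate > value[k, j]) {
          value[k, j] <- candidate
          prev_end[k, j] <- i
        }
      }
    }
  }

  # Backtrack to recover the closing index of each macro-group
  ends <- integer(n_macro)
  j <- g
  for (k in n_macro:1) {
    ends[k] <- j
    j <- prev_end[k, j]
  }
  ends
}

# Three groups; the share of group 1 varies most across ballot boxes
W <- cbind(c(10, 50, 90), c(45, 25, 5), c(45, 25, 5))
agg <- best_aggregation(W, n_macro = 2)  # closing indices 1 and 3: {1}, {2, 3}
```

The returned vector uses the same closing-index encoding that the Examples show for group_agg. The recursion evaluates on the order of n_macro * g^2 candidate splits.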
To find the best group aggregation, the function runs the DP iteratively, starting with all groups (this case is trivial, since every macro-group matches exactly one original group). If the standard deviation statistic (sd_statistic) is below the threshold (sd_threshold), it stops. Otherwise, it runs the DP with one fewer macro-group and checks the statistic again, stopping if it is below the threshold. This continues until either the algorithm stops or no group aggregation obtained by the DP satisfies the threshold condition. In the former case, the last group aggregation obtained (before stopping) is returned. In the latter case, no output is returned unless the user sets the input parameter feasible = FALSE, in which case the function returns the group aggregation with the lowest standard deviation statistic among those obtained from the DP.
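The stopping logic described above can be sketched as a short loop. The names sd_stat, search_aggregation, and sd_for are hypothetical: sd_for(a) stands in for "run the DP with a macro-groups, then bootstrap the EM estimates and return the standard deviation matrix", and the sketch assumes the search proceeds from g macro-groups down to 1.

```r
# Statistic of a standard deviation matrix, mirroring sd_statistic
sd_stat <- function(sd_matrix, sd_statistic = "maximum") {
  switch(sd_statistic,
    maximum = max(sd_matrix),
    average = mean(sd_matrix),
    stop("sd_statistic must be 'maximum' or 'average'")
  )
}

# Hypothetical driver (not the package's code): try g, g - 1, ..., 1
# macro-groups and stop at the first aggregation below the threshold.
search_aggregation <- function(g, sd_for, sd_statistic = "maximum",
                               sd_threshold = 0.05, feasible = TRUE) {
  best <- NULL
  for (a in g:1) {
    stat <- sd_stat(sd_for(a), sd_statistic)
    if (stat < sd_threshold) {
      return(list(n_macro = a, stat = stat, is_feasible = TRUE))
    }
    # Remember the aggregation with the lowest statistic seen so far
    if (is.null(best) || stat < best$stat) {
      best <- list(n_macro = a, stat = stat, is_feasible = FALSE)
    }
  }
  # Nothing met the threshold: NULL unless the caller accepts an infeasible result
  if (feasible) NULL else best
}

# Toy stand-in: the standard deviation shrinks as groups are merged
toy_sd_for <- function(a) matrix(0.02 * a, nrow = a, ncol = 3)
found <- search_aggregation(6, toy_sd_for)
fallback <- search_aggregation(6, toy_sd_for, sd_threshold = 0.01, feasible = FALSE)
```

With the toy stand-in, the default threshold of 0.05 is first met at two macro-groups, while a threshold of 0.01 is never met, so the fallback is the aggregation with the lowest statistic.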
Examples
# Example 1: Using a simulated instance
simulations <- simulate_election(
  num_ballots = 400,
  num_candidates = 3,
  num_groups = 6,
  group_proportions = c(0.4, 0.1, 0.1, 0.1, 0.2, 0.1),
  lambda = 0.7,
  seed = 42
)
result <- get_agg_proxy(
  X = simulations$X,
  W = simulations$W,
  sd_threshold = 0.015,
  seed = 42
)
result$group_agg # c(2, 6)
#> [1] 2 6
# This means that the resulting group aggregation is made up of
# two macro-groups: one that has the original groups 1 and 2; and
# a second that has the original groups 3, 4, 5, and 6.
# Example 2: Using the Chilean election results
data(chile_election_2021)
niebla_df <- chile_election_2021[chile_election_2021$ELECTORAL.DISTRICT == "NIEBLA", ]
# Create the X matrix with selected columns
X <- as.matrix(niebla_df[, c("C1", "C2", "C3", "C4", "C5", "C6", "C7")])
# Create the W matrix with selected columns
W <- as.matrix(niebla_df[, c(
  "X18.19", "X20.29",
  "X30.39", "X40.49",
  "X50.59", "X60.69",
  "X70.79", "X80."
)])
solution <- get_agg_proxy(
  X = X, W = W,
  allow_mismatch = TRUE, sd_threshold = 0.03,
  sd_statistic = "average", seed = 42
)
solution$group_agg # c(3, 4, 5, 6, 8)
#> [1] 3 4 5 6 8
# This means that the resulting group aggregation consists of
# five macro-groups: one that includes the original groups 1, 2, and 3;
# three singleton groups (4, 5, and 6); and one macro-group that includes groups 7 and 8.