vignettes/covariates.Rmd
In addition to the non-covariate approach presented in the main
vignette (see demonstration.Rmd), fastei also
supports a covariate approach that incorporates electoral-unit–level
covariates. This approach follows the same general strategy of
maximizing the log-likelihood via an EM algorithm, with the M-step
carried out through a multinomial formulation.
The covariate specification requires computing a Hessian matrix,
which can be computationally costly; therefore, an approximation is used
in practice. In addition, parameter updates within each M-step are
obtained via a Newton–Raphson procedure. Empirically, performing a
single Newton step (i.e., maxnewton = 1) is typically
sufficient to achieve convergence.
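For intuition, a generic single Newton-Raphson update on a parameter vector is sketched below; this illustrates only the update rule and does not reproduce the package's internal implementation or its Hessian approximation.

# Generic Newton-Raphson update: one step toward a stationary point of the
# log-likelihood, using its gradient and Hessian at the current parameters.
newton_step <- function(theta, grad_fn, hess_fn) {
  theta - solve(hess_fn(theta), grad_fn(theta))
}

# Toy example: maximizing f(x) = -(x - 2)^2 converges in a single step.
newton_step(0, grad_fn = function(x) -2 * (x - 2), hess_fn = function(x) matrix(-2))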
The main difference from the non-parametric approach is that the probability matrix is defined at the electoral-unit level rather than as a single global probability matrix. This enables the inclusion of electoral-unit covariates, which may better capture and explain variation in voting behavior across units. Consequently, convergence of the EM algorithm is assessed solely through increases in the log-likelihood, rather than through changes in a single probability matrix.
Under this model, probabilities are computed using a multinomial softmax. Specifically, letting $p_{bgc}$ denote the probability associated with electoral unit $b$, group $g$, and candidate $c$, and $v_b$ the covariate vector of unit $b$, we define:

$$p_{bgc} = \frac{\exp\left(\alpha_{gc} + \beta_{gc}^{\top} v_b\right)}{\sum_{c'=1}^{C} \exp\left(\alpha_{gc'} + \beta_{gc'}^{\top} v_b\right)}$$
Thus, the initial probabilities depend on the parameters
alpha and beta, which are set to zero by
default if not provided.
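To make the formula concrete, the sketch below evaluates this softmax for a single electoral unit. The names softmax_unit, alpha (a g x c matrix), beta (a g x c x a array), and v (the unit's covariate vector) are illustrative only and are not part of the package API.

# Illustrative softmax evaluation for one electoral unit (not package code).
# alpha: G x C matrix of intercepts; beta: G x C x A array of covariate
# coefficients; v: covariate vector of length A for the unit.
softmax_unit <- function(alpha, beta, v) {
  G <- nrow(alpha)
  C <- ncol(alpha)
  p <- matrix(NA_real_, G, C)
  for (g in seq_len(G)) {
    # linear predictor alpha_gc + beta_gc' v for every candidate c
    eta <- alpha[g, ] + as.vector(beta[g, , ] %*% v)
    # softmax over candidates
    p[g, ] <- exp(eta) / sum(exp(eta))
  }
  p
}

With alpha and beta at their zero defaults, every entry of the resulting matrix equals 1/C, so the initial probabilities are uniform across candidates.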
An election can be simulated using the function
simulate_election(). When employing the covariate approach,
it is necessary to specify both the number of covariates
(num_covariates) and the number of districts
(num_districts). In practice, a district may represent a
collection of ballot boxes that share the same covariate values. In the
example below, we simulate an election with 100 ballot boxes, three
candidates, two demographic groups, 20 covariates, and 10 districts.
library(fastei)
election <- simulate_election(
num_ballots = 100,
num_candidates = 3,
num_groups = 2,
num_covariates = 20,
num_districts = 10,
seed = 42
)

Note that the covariate setting must be explicitly enabled by setting
num_covariates and num_districts to values greater than zero.
The results of the simulation can be inspected by printing the
resulting object. The simulated election is returned as an
eim object.
election
## eim ecological inference model
## Candidates' vote matrix (X) [b x c]:
## [,1] [,2] [,3]
## [1,] 82 3 15
## [2,] 82 2 16
## [3,] 82 0 18
## [4,] 85 3 12
## [5,] 41 44 15
## .
## .
## .
## Group-level voter matrix (W) [b x g]:
## [,1] [,2]
## [1,] 73 27
## [2,] 70 30
## [3,] 73 27
## [4,] 76 24
## [5,] 68 32
## .
## .
## .
## Attribute matrix (V) [b x a]:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.1233006 0.7853494 0.04989215 0.73170760 0.44456953 0.97933430 0.3937769
## [2,] 0.3110496 0.4533034 0.18735671 0.31526080 0.06038559 0.49690343 0.1419089
## [3,] 0.9457392 0.1357424 0.98265941 0.38645401 0.32750602 0.09307467 0.2798067
## [4,] 0.5000251 0.8852210 0.32827407 0.33244599 0.87842905 0.21177366 0.5648222
## [5,] 0.1352305 0.3367135 0.17099639 0.08977797 0.93060489 0.93050075 0.9351395
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0.002378107 0.09091831 0.5763365 0.6525101 0.3220339 0.4700924 0.5247906
## [2,] 0.160429915 0.79859664 0.0736678 0.2795735 0.7582701 0.3650141 0.6568109
## [3,] 0.399272951 0.35978525 0.1646274 0.9799076 0.1044129 0.2801246 0.2295194
## [4,] 0.675319577 0.04048758 0.7398908 0.6438641 0.7102779 0.5997159 0.7212260
## [5,] 0.480372016 0.04108634 0.4757110 0.5825784 0.9664774 0.8185696 0.4907504
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 0.5132395 0.5815542 0.48275690 0.54551435 0.9213298 0.3968551
## [2,] 0.6307262 0.2401496 0.97917358 0.44745733 0.3626018 0.6969568
## [3,] 0.4187716 0.7218879 0.81151679 0.08388484 0.8551350 0.6593197
## [4,] 0.8792659 0.1459287 0.54291282 0.93013369 0.3009062 0.4073507
## [5,] 0.1079871 0.1528388 0.07236709 0.01644819 0.4656624 0.3069202
## .
## .
## .
The model is estimated using the same function as in the non-covariate case. A minimal example is shown below:
result <- run_em(election)

The quality of the estimated solution can be assessed by computing the
mean absolute error (MAE) between the estimated and true unit-level
probabilities.
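A minimal sketch of this computation, assuming the estimated and true unit-level probability arrays are stored in result$prob and election$real_prob (the fields used in the comparisons further below):

mean(abs(result$prob - election$real_prob))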
## [1] 0.01356109
If a global probability matrix, analogous to that obtained in the
non-parametric approach, is desired, it can be computed as a weighted
average using the group sizes contained in W. Formally, the
global probability for group $g$ ($g = 1, \dots, G$) and candidate $c$
($c = 1, \dots, C$) is defined as

$$p_{gc} = \frac{\sum_{b=1}^{B} w_{bg}\, p_{bgc}}{\sum_{b=1}^{B} w_{bg}},$$

where $w_{bg}$ denotes the number of group-$g$ voters in electoral unit $b$.
The following code snippet implements this aggregation (note that we
could also get the global matrix by using the as.matrix()
function):
global_prob <- function(W, prob) {
  # W: ballot-box by group matrix of voter counts (b x g)
  # prob: group by candidate by ballot-box array of estimated probabilities (g x c x b)
  G <- dim(prob)[1]
  C <- dim(prob)[2]
  P <- matrix(NA_real_, G, C)
  for (g in seq_len(G)) {
    w <- W[, g]
    # weighted average over ballot boxes, weighted by the group's size in each box
    P[g, ] <- colSums(t(prob[g, , ]) * w) / sum(w)
  }
  return(P)
}

Using this function, the estimated global probability matrix can be obtained as follows:
estimated_global_probability <- global_prob(result$W, result$prob)
estimated_global_probability
## [,1] [,2] [,3]
## [1,] 0.6225445 0.16127307 0.2161825
## [2,] 0.5684556 0.06672677 0.3648177
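As noted above, the same aggregate can also be obtained with as.matrix(); assuming that function aggregates exactly as global_prob() does, the call is simply:

as.matrix(result)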
The MAE for the global probabilities is then computed by comparing the estimated matrix with the true global probabilities:
real_global_probability <- global_prob(election$W, election$real_prob)
parametric_global_mae <- mean(abs(estimated_global_probability - real_global_probability))
parametric_global_mae
## [1] 0.007670176
Finally, the contribution of covariates to estimation accuracy can be evaluated by comparing the covariate model with the non-covariate approach. This is achieved by removing the covariate matrix V and re-estimating the model:
non_parametric_election <- election
non_parametric_election$V <- NULL
non_parametric_result <- run_em(non_parametric_election)
non_parametric_mae <- mean(abs(non_parametric_result$prob - real_global_probability))
improvement <- (non_parametric_mae - parametric_global_mae) /
non_parametric_mae * 100
improvement
## [1] 9.298436
The resulting value represents the percentage reduction in MAE achieved by incorporating covariates, relative to the non-parametric baseline.
When working with a large number of covariates, it may be
advantageous to reduce dimensionality using Principal Component Analysis
(PCA). Dimensionality reduction can improve both computational
efficiency and model interpretability. The function PCA
allows the covariate matrix V to be transformed prior to
running the EM algorithm.
Dimensionality can be reduced either by explicitly specifying the
number of principal components to retain via the argument
components, or by providing a threshold for the proportion
of explained variance using sd_threshold. The example below
illustrates how PCA can be applied to the covariate matrix
V in the simulated election:
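A minimal sketch of this transformation, assuming PCA() accepts the eim object together with the components argument described above, and storing the result as PCA_election (the object used in the next chunk):

PCA_election <- PCA(election, components = 2)
dim(PCA_election$V)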
## [1] 100 2
As expected, the resulting covariate matrix contains only two columns, corresponding to the two retained principal components. The EM algorithm can then be run on this modified election object:
pca_result <- run_em(PCA_election)
pca_global_probability <- global_prob(pca_result$W, pca_result$prob)
pca_global_mae <- mean(abs(pca_global_probability - real_global_probability))

In some cases, reducing the number of covariates can lead to improved estimation accuracy. This can be evaluated by comparing the PCA-based model with the non-parametric baseline:
pca_improvement <- (non_parametric_mae - pca_global_mae) /
non_parametric_mae * 100
pca_improvement
## [1] 48.14994
The resulting value represents the percentage reduction in MAE achieved by applying PCA relative to the non-parametric approach.