C++ Program for Kemeny-Young Preference Aggregation

This note, along with linked code and executable files, provides an efficient implementation for calculating the Kemeny-Young ordering of a set of "candidates" who have been partially ranked by a set of "voters". Not all voters need rank all candidates.

The Kemeny-Young method^1-7 is a way to combine the ordered preference lists of multiple individuals into a single overall preference order. It can thus be viewed as a voting scheme that determines not just a single chosen winner, but an entire ordered list. It is often used in elections whose outcome is the selection of a slate of winners, for example for N vacant positions on a council when there are M>N candidates. After applying Kemeny-Young, the declared winners are its top N ranked candidates out of the full number M (whose ranks are all calculated).

Kemeny-Young is a Condorcet method, meaning that its first-ranked candidate would win against all other candidates in individual pairwise contests, whenever there is a candidate with this property. Similarly, Kemeny-Young's second-ranked candidate would win against all other candidates (except the first ranked); and so on.

Mathematically, a Kemeny-Young ordering is defined as one that minimizes the total number of discrepancies among all the voters in their pairwise preferences between all candidates. That is, if the Kemeny-Young ordering ranks candidate X ahead of candidate Y, then any individual voter who ranked Y ahead of X contributes +1 to the total discrepancy. A voter who chose to rank only X, or only Y, or neither, does not contribute to the discrepancy. The discrepancy is summed over all distinct pairs of candidates X and Y.

The Kemeny-Young ordering thus satisfies, loosely speaking, the most possible voters as regards their stated preferences. Note that it does not use any information about how much higher a voter ranks X as compared to Y, but only the fact of the relative ordering. In this regard it is similar to the score in Kendall's tau rank correlation coefficient.

More than one ordering can be tied for the smallest number of discrepancies, so the Kemeny-Young ordering is not necessarily unique. Furthermore, it is known that the problem of finding even a single minimum discrepancy ordering is NP hard.^8,9 So, in real life, any program for finding the K-Y ordering can be only heuristic. However fast heuristics, good enough for practical voting applications, do exist. We implement one such here.

It is known that ordering all the candidates by their mean rank (across all voters) provides, in most cases, a good starting guess. Indeed, under some asymptotic assumptions, mean rank is a good approximation to K-Y. So, we always start with this guess.

As an outer loop, we do Ntry independent greedy minimizations of the K-Y score and record the resulting ranks for each candidate. Since each minimization is partly stochastic (as described below), the results can be different on each try. We report for each candidate the smallest (best) rank seen, the largest (worst), the mean rank over the tries, and the standard deviation of ranks seen. When the number of voters is large it is often the case that only one single ordering is found repeatedly -- putatively (though not easily provably) the unique K-Y answer. When multiple orderings are found, all tied for best score, it is up to the user to decide how to use the output. We recommend increasing Ntry until the order of the mean ranks is stable and defining this order as the outcome. (See example below.)

A single greedy minimization proceeds by indefinitely repeating these steps: Draw a random permutation of the candidates. Then, in the order of this permutation, remove and re-insert each candidate in the order position that minimizes the K-Y score. If there is a tie between such positions, choose one randomly. After all candidates have been so moved, then, starting at the top of the list and moving down, consider in turn each window of size Mperm candidates. Trying all permutations of the candidates within the window, choose the permutation that minimizes the K-Y score. If all the steps indicated fail to decrease the K-Y score at all, then exit the indefinite greedy minimization loop.

Typical values for Ntry are between 4 and 200 (depending on your computer speed). Bigger is always better. Computation time is of course linear in Ntry. A typical value for Mperm is 7. Computational time increases exponentially as Mperm is increased, so bigger, while better, is not very practical.

As supplied, the program takes one command-line argument, an integer value Ntry. The program reads the input data from standard input and expects one line per voter. That line is a series of strings separated by space or tab, the first of which is an identifier for the voter. The voter identifiers are not used by the program, but are only to facilitate human preparation of the input file; they may be randomly assigned, or be all identical. Each input line is taken to be a different voter, regardless of its voter identifier string.

The second and subsequent strings on each line, space or tab separated, are the names or unique identifiers of candidates in the voter's preference order, from most preferred to least preferred. A voter may thus rank any number of candidates. However, a voter who ranks zero or one candidate will not contribute to the outcome.

Ten professors (Brown, Davis, Johnson, Jones, Miller, Moore, Smith, Taylor, Williams, and Wilson) each rank a subset of 13 students (Anderson, Banks, Carter, Garcia, Horton, Kelly, Lee, Meyer, O'Brien, Price, Rice, Schmidt, and Wang) for a departmental prize. The input file, votes.txt, looks like this:

Smith: Anderson Garcia Lee Johnson: Anderson Kelly Rice Garcia Williams: Horton Schmidt Wang Jones: Horton Banks Schmidt Rice Brown: Anderson Garcia Davis: O'Brien Wang Price Miller: Kelly Banks Wang Lee Price Wilson: Carter Price Moore: Carter Meyer Taylor: Horton Meyer

So, for example, Professor Miller's favorite student is Kelly; her least favorite (of those whom she ranked) is Price. Notice that the colons are interpreted by the program as merely part of each voter identifier string. However, they add readability to the file.

Note the use of redirection (< and >) to specify the input and output files. You can omit the output redirection (> results.txt), in which case the output will print to the screen. On a desktop machine, the program takes a few seconds to run, producing this output to the screen (stderr):

*** There are 13 candidates and 10 voters. *** *** Please be patient, while I crunch... *** Score: 35 preferences out of 35 are honored. *** Now sending results to output file.

candidate best worst average in 190 tied orderings Anderson 1 4 1.432 +- 0.601 Horton 1 5 2.153 +- 1.002 Kelly 2 5 2.847 +- 0.690 Banks 4 6 4.563 +- 0.706 O'Brien 1 9 5.132 +- 1.376 Carter 1 11 5.363 +- 1.798 Schmidt 5 8 6.837 +- 0.688 Rice 6 10 8.742 +- 1.062 Wang 8 11 9.163 +- 0.935 Meyer 7 13 9.868 +- 1.849 Garcia 8 11 10.337 +- 0.913 Lee 10 12 11.716 +- 0.536 Price 12 13 12.847 +- 0.360

For these data, we see that there are many possible K-Y orderings that satisfy all 35 of the pairwise preferences expressed by the voters (faculty). Indeed, four students are ranked first in at least one such ordering. However, a consistent ordering of the average ranks of each student within such orderings does emerge. (Not shown here, increasing Ntry to 2000 produces an identical order by average ranks.) We also see above that in 10 of 200 tries greedy minimization terminated on a K-Y score that was not equal to the best found. These 10 failed tries did not contribute to the results.

Here is an example where 330 papers submitted to a meeting (the candidates) were each evaluated by reviewers chosen from a pool of 141 experts (the voters). Each reviewer was assigned papers in his/her area of expertise, and each paper was assigned to at least 4 reviewers. Some reviewers had broad expertise and were assigned a sample of papers across multiple specialties: This is important so that the comparative preferences don't partition into disjoint sets in unrelated specialties.

1557: 5866 6013 5710 5860 5902 6028 1656: 5777 5652 5737 5660 5664 5959 5637 5905 5788 5924 5473 5843 1682: 5538 5449 6084 6010 5462 2217: 5915 5988 5757 5850 5978 5994

Here 1656, 1682, and 2217 identify reviewers, and the other numbers identify papers. Reviewer 1682 liked paper 5538 the best, 5462 the worst, out of the 5 that he was asked to review.

*** There are 330 candidates and 141 voters. *** *** Please be patient, while I crunch... *** Score: 4359 preferences out of 7393 are honored. *** Now sending results to output file.

Like many real-life elections, there is significant disagreement among the voters, so that only 4359/7393 = 0.590 of pairwise voter preferences are satisfied by the best ordering. The output file looks like this:

candidate best worst average in 1 tied orderings 5800 1 1 1.000 +- 0.000 5747 2 2 2.000 +- 0.000 5665 3 3 3.000 +- 0.000 5564 4 4 4.000 +- 0.000 ... (lines omitted) ... 5636 327 327 327.000 +- 0.000 6080 328 328 328.000 +- 0.000 5836 329 329 329.000 +- 0.000 6047 330 330 330.000 +- 0.000

Since the best score was found only once, we here get no information about the uncertainty of individual scores. As an alternative, you can recompile the program with the switch ALLSCORES set; this causes all greedy iterations, not just the best one, to contribute to the result. We don't recommend this, because the purpose of most elections is to reach a decisive outcome. Choosing a value for Ntry in advance and utilizing only the best scoring ordering seen seems fairer and more decisive. In the above example, running with Ntry=100, which takes about 10 minutes on a desktop machine, finds a best ordering that honors 4385/7393 = 0.593 of voter preferences, slightly larger than before. This score is also found only once, indicating that it is still likely not the best possible.

When the number of candidates is small, and the number of voters is large, as for a city council election with 35 candidates competing for 6 vacancies, and with 3000 voters, the program's heuristics should produce a unique best ordering, without ambiguity. After an initial linear-time digestion of the input file, the program should find that K-Y ordering rapidly. This is because it is exhaustive within every window of Mperm candidates, because there are not many such windows, and because the large number of voters makes ties in the greedy algorithm's stochastic steps unlikely.

The file kemenyyoung.zip contains three files: kemenyyoung.cpp is the single C++ source file, kemenyyoung.exe is a compiled binary for Windows XP and 7, and kemenyyoung is a compiled binary for Linux (g++ in Linux 2.6 on Ubuntu 10.04). The first line of the source file is Windows-specific and should be deleted for Linux compilation.

The source file references a few Numerical Recipes routines, available at the Numerical Recipes web site, but is otherwise conforming standard C++. A typical compilation on a Linux (after setting the environment variable CPLUS_INCLUDE_PATH to point to a directory with the Numerical Recipes routines) is

1. John G. Kemeny, "Mathematics without numbers", Daedalus 88 (1959), pp. 577-591

2. John G. Kemeny and James L. Snell, Mathematical Models in the Social Sciences, Ginn, Boston, 1960. 3. H. P. Young and A. Levenglick, "A Consistent Extension of Condorcet's Election Principle", SIAM Journal on Applied Mathematics 35, no. 2 (1978), pp. 285-300

4. H. P. Young, "Condorcet's Theory of Voting", American Political Science Review 82, no. 2 (1988), pp. 1231-1244

5. H. P. Young, "Optimal ranking and choice from pairwise comparisons", in Information pooling and group decision making edited by B. Grofman and G. Owen (1986), JAI Press, pp. 113-122

6. H. P. Young, "Optimal Voting Rules", Journal of Economic Perspectives 9, no.1 (1995), pp. 51-64

7. H. P. Young, "Group choice and individual judgements", Chapter 9 of Perspectives on public choice: a handbook, edited by Dennis Mueller (1997) Cambridge UP., pp.181-200

8. John Bartholdi, III, Craig Tovey, and Michael Trick. "Voting schemes for which it can be difficult to tell who won the election", Social Choice and Welfare, 6, 157-165 (1989).