Let’s make a voting system

Dietrich Epp, last updated 2012 Jul 31 01:20 UTC

The Ludum Dare voting system gets some flak, but let’s figure out what purpose it serves in the community, analyze it from a mathematical perspective, and see if we can improve it.

Ludum Dare is a grueling, 48-hour game development competition. Think of it a triathlon spent in front of a computer screen: each participant has to master game design, programming, and media creation. Almost everything has to be created from scratch during the 48 hours. Failure is common.

However, the contest is an undeniable success. The 23rd Ludum Dare saw over 1,400 successful participants, which creates a logistical headache: after the contest, the entries are ranked in several categories, and winners are identified. Sorting through 1,400 entries and rating them takes even more time than the contest itself!

Voting system

We can identify this as a simple range voting.

In this article, I sometimes draw a distinction between “rating system,” which refers to the system that generates ratings from 1–5, and “ranking system,” which refers to the system that generates rankings #1, #2, #3,…,#1402.

Voting accuracy

Posters on the Ludum Dare website and others have identified the voting system as “unfair” for a number of reasons.

Are these concerns reasonable? Can we create a system that addresses these concerns?

The reason we vote

Before we can address concerns about the voting system, we have to understand why the voting system exists in the first place. For me, there are two reasons.

  1. Voting is a source of quantitative feedback about the games we make.
  2. Voting lets us find the “top” games after the contest ends.

Getting quantitative feedback

For me, the quantitative feedback holds wonderful possibilities. I want to know how people rated my game. For the record, here are the ratings for my game (original data):

CategoryRatingRank
Coolness71%(bronze)
Innovation3.42#61
Graphics3.63#77
Overall3.29#100
Humor2.68#108
Theme3.25#155
Fun2.63#236
Mood2.41#314
Community2.33#404
Audio1.13#557

From this quantitative evaluation, I’ve gained some valuable feedback for my post-mortem analysis (it’s never to late for a post-mortem).

For me, these numbers are a great supplement to the comments. They make me feel confident about which areas I nailed and which areas I need to work on.

As a final note, I'm curious as to who rated my game’s audio as a “2” or higher. My game has no audio. Is a “1” reserved for “it makes my ears bleed”? A couple games I played did hurt my ears…

Finding top games

At the end of the contest, we want to see the best games. Kudos to whoever gets first place, of course. But I love looking at the top five or ten games. I like to download a cool game, play it, and get to say “I was a part of that.” Ranking the top games is not only a reward to the people who made those games, it is a reward to everyone who finds a fun game by looking at the rankings.

Does the system work?

For giving quantitative feedback, I say, “Yes, it does work.” But there are problems with the ranking system when you consider how the top games are selected. We will need to do a little bit of math.

Let’s start by looking at the top three games for LD #23 in the “overall” category.

GameVotesRating
Fracuum1194.43
Memento2164.39
Super Strict Farmer814.37

These numbers are fairly convenient for us. For Fracuum, a rating of 4.43 means that some people gave a rating of 4 and some people gave a rating of 5, and maybe a few people gave 3 or lower. We can actually calculate a lower bound on the variance of the votes, assuming the average is correct: the minimum variance that gives an average of 4.43 occurs when 43% of the voters vote 4, and 57% of the voters vote 5. The minimum variance is then:

σ2min = 0.57 * (4.43 – 4)2 + 0.43 * (4.43 – 5)2 ≈ 0.25.

Note that the variance exists, and it’s bounded (because you can’t vote outside the range 1–5). This lets us employ the Central Limit Theorem to come up with an estimate for the distribution of the final average. Assuming that the votes are independent of each other, then the central limit theorem tells us that the distribution of the average rating X̅ is approximately:

X̅ ~ normal(µ, σ2/n)

  • X̅: the observed average rating
  • µ: the “true” average rating
  • σ2: the variance of each vote
  • n: the number of votes

Central limit theorem visualized

Central limit theorem visualized

When we think about the distribution of the average vote, imagine running the contest over and over again, and plotting the results. Each time you run the contest, different people vote. When we think about the distribution of an individual vote, imagine selecting a person at random and asking them to vote this game.

By putting a lower bound on the variance for each individual vote, we get a lower bound on the variance for the average vote.

σ2/n ≥ σ2min/n = 0.25 / 119 = 0.0021.

We can do this for all the top games.

GameVotesRatingσ2/n
Fracuum1194.430.0021
Memento2164.390.0011
Super Strict Farmer814.370.0029

This lets us use normal distribution tables to answer the question,

What if Memento is really rated higher than Fracuum? What are the odds we’d still get the same results? What about Memento and Super Strict Farmer?

Well, we’re actually going to use a piece of software called R (www.r-project.org) to perform the calculations.

> pnorm((4.43 - 4.39) / sqrt(0.0021 + 0.0011))
[1] 0.76025
> pnorm((4.39 - 4.37) / sqrt(0.0011 + 0.0029))
[1] 0.6240852

Under this statistical model,

The likelihood distributions below show the amount of uncertainty in more vivid detail:

Likelihood distributions of top game ratings

Likelihood distributions of top game ratings

Discussion, and fixes

Based on the high chances of error we obtained by analyzing our model (38% is very high!), we can conclude that the current ranking system is bad at selecting a winner. It is very plausible that if you were to run the voting again, you would get a different answer for first, second, and third place.

The reason this happens in our model is because people are forced to choose between giving a 4 and a 5 as a rating—it is impossible for everyone to agree and give a 4.43. The discrete ratings put a lower bound on the variance, at least in certain cases. Additionally, while I can’t answer whether some people simply vote generous or vote harsh, if this is true, then it increases the variance further and makes the final accuracy worse.

We can think of some possible solutions to the ranking system’s poor accuracy:

  1. Abolish ratings altogether. This was proposed by Folis in his July 30, 2012 blog post. I am against this, because I believe I benefit from having my game rated, even if nobody else gets to see the ratings. I also think this removes some of the encouragement people have to vote on games.
  2. Abolish rankings. The ratings could be private to whoever received them, or the ratings could be public but the actual rankings (#1, #2, #3, etc.) could be removed from the website.
  3. Increase rating accuracy. If we increase the accuracy of the ratings, we also increase the accuracy of the rankings. For example, we could allow people to give ratings in increments of 0.1, which in theory would cut down the error by a factor of 10. The problems with this approach are fairly obvious: people are bad at giving ratings with such fine gradation. Analysis of the data could also allow us to lower error by identifying voting biases, but I believe this is also a lost cause.
  4. Create a new ranking system.

Reasons to vote

In my uninformed opinion, one of the strengths of the current system is that it actually encourages people to play games. We don’t just play games because we want to give feedback to the game developers, we also play games because we are hungry to see which games made it into the top ranks!

If voting on games didn’t have such a draw, if we eliminated voting altogether or even got rid of the rankings, I predict that the number of games played would drop.

Alternative system proposal

I propose a two phase system which keeps the benefits of the current system, and provides a more accurate ranking for the top games in each category. The system works in two phases.

The first phase of the system works exactly as the voting does now. Everyone plays everyone else’s game, and rates the games in each category. This phase is important to our community because the rating system ensures that everybody’s game gets played at least a few times, and you’re likely to get a couple good comments about your game.

The second phase is the “Ludum Dare finals.” The top 10 or so of the games are selected for the finals pool, and voters rank them relative to each other. This would require a completely new voting interface. Voters don’t have to rank all of the games in the final round, and ties are permissible.

The final ranking is determined using the ranked pairs method. For each game that a voter ranks, the vote counts as a vote for that game against every game that the voter ranks lower. The votes are tallied as pairs, so if I rank games A B C, then that counts for 1 vote A over B, 1 vote A over C, and 1 vote B over C. All the pairs are tallied and a final ranking is constructed by repeatedly adding the pair with the most votes to the final ranking, until a cycle is created in the ranking graph.

The advantages of this system:

Disadvantages:

Conclusion

I’ve showed that the existing system is a bit too inaccurate for the large number of games in recent competitions. With so many games and no accurate ranking system, the idea of ranking them is farcical, which is not fair to the participants.

The alternate system requires a bit of work (I can help!), but keeps most of the benefits of the existing system.

Feedback

I don’t have a comment system here, but you can reply to the post on the Ludum Dare blog: (Shall we vote?), or shout at me on Twitter (@DietrichEpp).