#dbmillion

Note: I wrote this article using up-to-date data on the Monday evening, shortly before the competition closed on the Tuesday.

So, Derren Brown has decided to run a competition where you guess a number between one and a million, and the person with the closest number to the number that he has (allegedly...) written down beforehand gets taken to dinner. All you have to do is tweet your entry with the hashtag #dbmillion.

I'm not one to let a nice dataset like this go past without doing something with it - while also increasing my chances of winning, assuming that it is random.

Code

The code took about an hour to write. It is very simple, and comes in two halves; the first half saves the tweets with the hashtag in real time, and the second half works backwards and saves old tweets with the hashtag, until it reaches the first occurrence. Because this is all public timeline stuff, no authentication was required - just liberal use of the max_id search parameter to walk backwards in time.
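
For the curious, the backwards walk can be sketched roughly like this. It is a minimal illustration of the idea rather than the code I actually ran: fetch_page is a hypothetical helper standing in for whatever search call you use, assumed to return tweets with an id no greater than max_id, newest first.

```python
from typing import Callable, Dict, List, Optional

Tweet = Dict[str, object]  # assumed shape: {"id": int, "text": str}


def walk_backwards(fetch_page: Callable[[Optional[int]], List[Tweet]]) -> List[Tweet]:
    """Collect tweets from newest to oldest using a max_id-style cursor.

    fetch_page(max_id) is a hypothetical helper: it should return one page of
    tweets with id <= max_id (or the newest page when max_id is None).
    """
    collected: List[Tweet] = []
    max_id: Optional[int] = None
    while True:
        page = fetch_page(max_id)
        if not page:
            break  # nothing older left: we have reached the first occurrence
        collected.extend(page)
        # Next time round, only ask for tweets strictly older than this page.
        max_id = min(int(t["id"]) for t in page) - 1
    return collected
```

Subtracting one from the smallest id seen stops the same tweet being returned at the start of the next page.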

Once the tweets were saved, some more code was needed to munge the data. Again, fairly straightforward to write, though I had to make some assumptions. As a result, the data below comes with some caveats:

Anyone who posted more than one number in their tweet isn't included in my dataset. This omits some people who guessed in the form of "I'm 18! My guess is 23455", but there aren't many of those.
Anyone who guessed in the form of "Six thousand, one hundred and two" has been omitted. I didn't particularly want to get a natural language parser involved, regardless of how straightforward it would have been.
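
A rough sketch of that filtering, under my own assumptions about what counts as a number (runs of digits, optionally with thousands separators, after links and @mentions are removed). The name extract_guess and the regular expressions are illustrative, not the original code.

```python
import re
from typing import Optional

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
NUMBER = re.compile(r"\d+(?:,\d{3})*")  # e.g. 23455 or 23,455


def extract_guess(text: str) -> Optional[int]:
    """Return the guess if the tweet contains exactly one number, else None."""
    # Links and usernames can contain digits that are not guesses.
    cleaned = URL_OR_MENTION.sub(" ", text)
    numbers = NUMBER.findall(cleaned)
    if len(numbers) != 1:
        return None  # no number at all, or more than one: skip the tweet
    return int(numbers[0].replace(",", ""))


print(extract_guess("My guess is 23455 #dbmillion"))
print(extract_guess("I'm 18! My guess is 23455 #dbmillion"))
```

The first call returns the guess; the second contains two numbers, so it is dropped, exactly as in the first caveat above.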

If anyone wants to repeat the analysis, then CakeMonitor has archived the tweets in question in a .csv which he has hosted on MediaFire.

Digits

So, firstly a basic test of randomness. If Derren's Twitter followers are truly pulling from a uniform distribution, then we expect the digit '1' to occur the same number of times as the digit '9'. Clearly, this is not the case.

Graph showing relative frequency of digits

Looking at the graph, we see that the digit '1' appears nearly twice as often (6059 times) as the digit '9' (3135 times). I have no real explanation for why, in general, higher digits are less popular. The digits '3' and '5' seem less popular than they should be compared to their neighbors, if there is some trend here, but it's certainly not Benford's Law.
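
The digit tally itself is only a few lines. This sketch assumes the guesses are already a list of integers and counts every digit of every guess; the function names are mine.

```python
from collections import Counter
from typing import Dict, Iterable


def digit_counts(guesses: Iterable[int]) -> Counter:
    """Count how often each digit character appears across all guesses."""
    counts: Counter = Counter()
    for guess in guesses:
        counts.update(str(guess))  # adds one count per digit character
    return counts


def relative_frequencies(counts: Counter) -> Dict[str, float]:
    """Turn raw digit counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {digit: n / total for digit, n in sorted(counts.items())}


counts = digit_counts([4, 42, 100112])
print(counts["1"], counts["4"])
```

For a uniform pick the proportions should all be roughly equal, apart from a small deficit of zeros because numbers are not written with leading zeros.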

Popular Numbers

Okay, so it's clear that we're not pulling from a uniform random distribution, but we don't expect that to be the case - humans are notoriously bad at choosing a random number. When people are asked to pick a number between 1 and 100, '37' comes up disproportionately often, we are assured. If anyone knows an actual study where this is shown, however, I'd like to see it - I've never been able to track one down, and it's always passed around as common fact.

We have a million numbers (ignoring the fact that Derren asked for a number 'between' one and a million, as much of the internet seems to have done) that we could be picking from. As a result, we surely expect very few collisions. We have around 8000 entries for the competition, and so using the solution for a Birthday Problem where we have a year that lasts a million days and 8000 guests, for a truly random distribution there is a 99.9999999999% chance that we will have at least one collision, so we are not surprised to see some. In fact, the expected number of collisions is about 32.
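
Those two figures can be checked with a short calculation. This is a sketch that takes "collisions" to mean pairs of entrants who picked the same number, and assumes exactly 8000 independent uniform guesses over a million values.

```python
import math


def expected_colliding_pairs(guests: int, days: int) -> float:
    """Expected number of pairs sharing a value: C(guests, 2) / days."""
    return guests * (guests - 1) / (2 * days)


def prob_at_least_one_collision(guests: int, days: int) -> float:
    """1 - P(all guests distinct), computed in log space for accuracy."""
    # P(all distinct) = product over k = 0..guests-1 of (1 - k / days)
    log_p_distinct = sum(math.log1p(-k / days) for k in range(guests))
    return -math.expm1(log_p_distinct)


print(expected_colliding_pairs(8000, 1_000_000))
print(prob_at_least_one_collision(8000, 1_000_000))
```

Working in log space avoids multiplying thousands of factors close to 1, which matters here because the answer differs from 1 by only about one part in 10^14.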

We see significantly more than this number of collisions. I present the top ten numbers, in reverse order, with explanations.

270271, with 50 guesses. This is Derren Brown's birthday.
42, with 52 guesses. The answer to life, the universe, and everything.
666666, with 60 guesses. It's the devil's number, except twice so it's near the middle of the range of 1 to a million - it's more likely to be close to Derren's number! I assume that's the thought process.
7, with 62 guesses. Oh, trusty, 'random' 7. Glad you're able to join us.
400, with 64 guesses. '4' and some zeros - see below for a brief discussion of why the number 4 is making a strong showing in this competition.
2, with 64 guesses. Strictly, the smallest eligible number. Possible that there is some crosstalk here, as #dbmillion started trending, and getting a lot of spam. As a result, things like "U hav 2 see this! #dbmillion" with a link would be counted; I've not seen any examples of this though, and 2 does seem to be guessed a lot from looking at the timeline.
123456, with 70 guesses. Digits in order, up to a number that is between 1 and a million. 12345, interestingly, has far fewer guesses - only six.
100112, with 100 guesses. The closing date of the competition. I thought I was accidentally catching people tweeting the detail of the competition, but no, lots of people really are guessing this number. Fair play.
1, with 170 guesses. Staggering. Despite the fact that they were asked for a number between 1 and a million, well over a hundred people have guessed '1'. Read (or listen to, in this case) the question, people. I guess a lot of people got excited at the prospect of dinner with Derren. And who can blame them.
4, with 176 guesses. Again, see below for a discussion of '4'.
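
Producing a list like this from the cleaned guesses is a one-liner with a counter. A minimal sketch, with a function name of my own choosing:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_guesses(guesses: Iterable[int], n: int = 10) -> List[Tuple[int, int]]:
    """Return the n most common guesses as (number, count), least popular first."""
    most_common = Counter(guesses).most_common(n)
    return most_common[::-1]  # reverse so the winner comes last


sample = [4, 4, 4, 1, 1, 42]
for number, count in top_guesses(sample, n=3):
    print(f"{number}, with {count} guesses")
```

Numbers with equal counts come out in an arbitrary but stable order, so ties such as 400 and 2 above could appear either way round.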

The number 4

I think a lot of people have been trying to second-guess Derren, which has led them to the number '4'. In the video, he describes that the card is attached to his forehead, before pausing and saying 'or, if you prefer, forr-ed'. I think a lot of people have been taking that as their cue to guess '4'. Later on in the video, he uses the fraction 'four-fifths' which a lot of people have been clutching onto as verification that '4' is related to the answer.

I'd be dismissing the '4' business as a red herring, if it wasn't for the restaurant that he was taking the winner to, which is the Ivy in London. That is to say, the IV. I'll be interested to see if there's a reveal at the end of this. If not, isn't it a shame that Derren can't just take a fan out to dinner without suspicion clouding his every move?

Early on in the proceedings, '4' was second to '1', but then Derren tweeted that lots of people were guessing '4'. Weirdly, this caused '4' to surge into the lead (and yes, my data does strip retweets before doing the analysis).

Other interesting data

Off the top of my head, some other interesting factoids from the dataset.

The lowest unguessed number was 53.
This is a representation of all the guesses. White means no guess, and dark blue indicates one guess. The scale goes all the way up to bright red, which is the number of guesses '4' got. The pixel in the top left of the graph is '0', and the pixel in the bottom right is '999999'. Each row contains 1000 numbers. The line across the top corresponds to people guessing low numbers (under 1000), and the line along the left hand side corresponds to people guessing round numbers (i.e. ending in '000'). It is interesting - but expected, I would argue - that the right hand and bottom edges are considerably less well defined. If anyone has any suggestions for how to present this data in a more intuitive way, I'd be very pleased to hear them.
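
Both of these boil down to simple bookkeeping. The sketch below is an illustration under my own naming, not the plotting code itself: it maps each guess to a (row, column) cell with 1000 numbers per row, and finds the lowest number nobody guessed.

```python
from collections import Counter
from typing import Iterable, List

WIDTH = 1000   # numbers per row, as in the image described above
HEIGHT = 1000  # rows, so the grid covers 0 to 999999


def build_grid(guesses: Iterable[int]) -> List[List[int]]:
    """Return a HEIGHT x WIDTH grid where each cell holds that number's guess count."""
    grid = [[0] * WIDTH for _ in range(HEIGHT)]
    for number, count in Counter(guesses).items():
        if 0 <= number < WIDTH * HEIGHT:  # a guess of exactly 1000000 falls off the grid
            row, col = divmod(number, WIDTH)
            grid[row][col] = count
    return grid


def lowest_unguessed(guesses: Iterable[int], start: int = 1) -> int:
    """Return the smallest number >= start that nobody guessed."""
    seen = set(guesses)
    number = start
    while number in seen:
        number += 1
    return number
```

With this layout every multiple of 1000 lands in column 0, which is why round numbers show up as a line down the left hand side.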

My guess

What did I guess? On Monday evening, the day before the competition closed, I guessed the number in the middle of the biggest unguessed gap, which at the time had a size of 4800. If we assume fair play all around, and that by guessing randomly we would have a 1-in-N chance of winning, where N is the number of entries, then if Derren's number truly is random this will have increased my chances of winning by a factor of around 25. 25 times a small number is still very small, so I don't see myself having dinner with Derren, but I've given it a good shot nonetheless.
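
Finding that gap is a matter of sorting the distinct guesses and comparing neighbours. A minimal sketch, assuming only gaps between two existing guesses are considered (the stretches before the lowest and after the highest guess are ignored); the function name is mine.

```python
from typing import Iterable, Optional, Tuple


def largest_gap(guesses: Iterable[int]) -> Tuple[int, Optional[int]]:
    """Return (gap size, midpoint) for the widest run of unguessed numbers.

    The gap size counts the unguessed numbers strictly between two
    neighbouring guesses. The midpoint is None if there is no gap at all.
    """
    points = sorted(set(guesses))
    best_size, best_mid = 0, None
    for low, high in zip(points, points[1:]):
        size = high - low - 1
        if size > best_size:
            best_size, best_mid = size, (low + high) // 2
    return best_size, best_mid


print(largest_gap([1, 10, 100, 120]))
```

Sitting in the middle of a gap means any number in roughly the central half of it is closer to you than to either neighbour, which is where the improvement over a blind guess comes from.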

The answer

The number turned out to be 758031, from the man himself. I wasn't close! Congratulations to the winner, and I hope they enjoy their dinner with Derren. Even without such a payoff, I'm still pretty happy with what I've achieved here, just out of curiosity's sake.