Tuesday, October 14, 2014

Dark Matter and Machine Learning

If you're a long-time follower of this blog you'll probably have noticed I haven't posted much in the last two or three years. This is because I've been busy with university stuff. This blog post is sort of related to that. The following section is an edited version of the background I wrote for my final year project report (hence why it's so formal). Enjoy.


Background - Dark Matter and DM-ICE

According to current theories, dark matter makes up 23% of the mass-energy density of the universe. However, to date, it hasn't been conclusively observed directly. Instead, its existence is inferred from large-scale gravitational effects, like the rotation curves of galaxies and gravitational lensing. Based on rotation curves, we find that the outer edges of galaxies move as if there is more matter present than we can directly observe. Consequently, it has been theorised that there are halos of dark matter around the outer edges of galaxies, including our Milky Way.


As our Solar System orbits the galactic centre, it moves through this halo, experiencing an effective dark matter wind. This, combined with the tilt of the Earth's orbit around the Sun, means we expect to see an annual modulation in the dark matter flux hitting the Earth. This modulation should have its maximum around June, when the Earth is moving into the wind, and its minimum around December, when it's moving away from the wind, regardless of location on Earth.

The current best candidates for dark matter are so-called Weakly Interacting Massive Particles (WIMPs). In direct detection experiments, we look at interactions between WIMPs and the nuclei in some target, such as a Sodium Iodide (NaI) crystal. The WIMPs scatter elastically off nuclei in the target, and the recoil of the nuclei causes scintillation photons to be emitted, with summed energies in the ~10keV range. Direct detection is hard, however, since WIMP interactions are rare events - we expect << 1 count/day/kg. So identifying WIMP events, especially at low energies, requires significant background and noise suppression.

Various direct detection experiments (DAMA, CoGeNT, CRESST, CDMS, etc.) have already been run, but none has yet conclusively detected dark matter. Of particular interest from these experiments are the results from the DAMA/NaI and DAMA/LIBRA experiments. The DAMA experiments (Gran Sasso, Italy) ran direct detection using NaI(Tl) (Thallium-doped Sodium Iodide) crystals, and observed an annual modulation in the 2-6keV energy region, which they claim is the result of the dark matter wind. However, this interpretation is disputed. It is argued that the modulation could have come from some other, unaccounted-for signal - for example, muon flux also has an annual modulation, with a peak around June (in the Northern Hemisphere).
Source - ArXiv:0804.2741
DM-ICE is a joint venture between the University of Sheffield (UK) and the University of Wisconsin (Madison, USA). The aim - shared with other independent direct detection experiments being run around the globe - is to test the DAMA result by looking for this same modulation, and by ruling out other possible sources of modulation. By running at the South Pole, DM-ICE is in the opposite hemisphere to DAMA, so seasonal effects - such as temperature and muon flux - are reversed, while the dark matter modulation should stay the same.



Context

For my final year project I was looking in particular at the performance stability of the detectors (loss of light yield, etc.) and how to correct the data for those effects. I also did some preliminary modulation analysis - looking for evidence of dark matter in the data.

This blog post isn't really to do with any of the work I did on my project, though. Rather, this is looking at how to remove noise from the data. For my project, this work had already been done, so I didn't have to worry about it. So why am I looking at noise removal now?

When the academic year was over, my project supervisor asked if I wanted to carry on working on the project (funded) for the summer. It made sense, since there was still work to do, and since I already knew the project well. So mostly I was doing more of what I did for the project. But the supervisor also wanted to adapt the DM-ICE project into a simplified version for groups of 3rd year students to do.

Specifically, the students would have to figure out the best cuts to remove noise, then do a modulation analysis to decide if there was evidence of dark matter.

So I set up a simplified version of the data and wrote up a description of the project for the students. The supervisor then asked if I'd work through the project and write up a model answer so that he'd have something to mark against. Hence, it was my turn to try the noise removal stuff.



Pulse Classification

In a DM-ICE detector, we have an 8.5kg NaI(Tl) crystal between paired photo-multipliers (PMTs). These two PMTs - DM0 and DM1 - record scintillation events as voltage pulses (in ADC units). When the raw data is processed, we look for pulses from both of the PMTs which occur within 100 ns of each other. These pairs of pulses are assumed to have come from the same scintillation event.

At low energies, pulses look something like this


where the various spikes represent single photo-electron events (SPE). Unfortunately, at low energies there are also a significant number of noise events - specifically, electromagnetic interference (EMI), which looks something like this


and 'thin peaks', that look something like this


So the task is to identify and remove these EMI and thin peak events.

From looking at plots of the pulses, it's fairly straightforward to tell the different event types apart. But computers can't 'look' at the pulses the way we do. Instead, we have to define some parameters - values that we can calculate that will tell the computer something about the shapes of the pulses.

Take, for example, the EMI pulses. From the plot we can see that EMI pulses oscillate rapidly between positive and negative ADC values, where the SPE and thin peak events don't. So we can define some parameter, which we'll call 'emi', which basically counts how many times the pulse goes from positive to negative (or negative to positive) in the first 40 bins of the pulse.
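As a rough illustration, here's how a parameter like that might be computed. This is a sketch, not the actual DM-ICE analysis code - 'pulse' is assumed to be a numpy array of baseline-subtracted ADC values.

    import numpy as np

    def emi_value(pulse, n_bins=40):
        # Count how many times the pulse flips sign (positive <-> negative)
        # in its first n_bins samples
        window = np.asarray(pulse[:n_bins], dtype=float)
        signs = np.sign(window)
        signs = signs[signs != 0]                       # ignore samples that are exactly zero
        return int(np.count_nonzero(np.diff(signs) != 0))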

So since the emi value is going to be much higher for EMI-type pulses than for SPE or thin peaks, identifying and removing EMI events is relatively easy.

Separating the SPEs from the thin peaks is less straightforward. For this, we calculate various other parameters - for example, since thin peaks will typically only have one peak, while an SPE event will have several, we can look at the peak number. Other things we can look at are how quickly the pulse decays to zero (log-meantime), or how the pulse energy is distributed. The details of how these parameters are calculated aren't important to this blog.



Human Learning

What you can do, then, is look at a bunch of pulses - plot each one, decide what type of event it is, and work out its parameter values. You collect together the results and you ask: what sort of emi values do EMI-type events have? What sort of peak numbers do SPEs and thin peaks have? How can you use that information to tell the peaks apart?

And that's fine. But it's also boring. You have to plot, and calculate, and record. Repeat. Analyse. Boring.

I subscribe to the philosophy of "why do yourself what you can make a computer do?". In practice this usually means making the work more 'complicated' in the short term, but once it's done it makes life a whole lot easier. And while the work is more complicated, it's at least more interesting as well.

So what I did was I wrote a program. It looks something like this
(I used Python with Numpy, Matplotlib for plotting, and PyQt4 for the GUI.)

Basically it plots the pulses and presents you with three buttons - signal (SPE), noise (thin peak), and EMI. On the front-end, all the user has to do is look at the plot and decide what type of event the pulse is. Meanwhile, on the back-end, the program calculates the parameters, and sorts them according to the user's classifications.

Once you've classified a set of pulses, you're presented with histograms of the different parameters, like this (for the emi value parameter)


where the different event types are plotted in different colours. This makes life a lot easier. For example, in the above, it's immediately apparent that EMI-type events can be removed from the data by cutting any event with an emi value above 25.



Machine Learning - Naive Bayes Classifiers

Let's make this more interesting. Sure, we can click through a bunch of peaks telling the program what's what. But wouldn't it be cooler if we could teach the program to tell the events apart itself?

Yes. Yes it would.

This is where machine learning comes in. Now there are various machine learning techniques, but the one I used in the program is called a 'Naive Bayes Classifier'. This is the sort of thing that's used in spam filters - it's a way of calculating, for example, the probability that a particular email is spam given that it mentions, say, penis enlargement or Nigerian princes.


Quick Introduction - Bayes' Theorem

Imagine there is a person. Sight unseen, what is the probability they are female? Well, given that there are roughly equal numbers of males and females in the world, the probability is about 50%. This is called the prior probability.

[Aside: in this example, we're assuming for simplicity that gender is a binary state.]

Now, we're given a new piece of information - this mystery person's height is 5'4 (five feet and four inches). So the question is, what is the probability the mystery person is female, given they are 5'4? Well, females tend to be shorter than males (on average), so we might say that the mystery person is more likely to be female than male. But how do we quantify this new probability?

This is where Bayes' Theorem comes in. It looks like this

    p(H|O)  =  p(O|H) p(H) / p(O)
Basically, this calculates the probability of a hypothesis (H) given some observation/evidence (O). p(H|O) is called the posterior probability.

For our example, we have the hypothesis "female" (F) and the observation "h=5'4". So the terms for this example are:
  • p(F)  -  The probability of a person being female. This is our prior.  [ 50% ]
  • p(h=5'4|F)  -  The probability of a person being 5'4 given they are female. Or to put it more plainly, the probability of a female being 5'4.   [ ~15.1% ]*
  • p(h=5'4)  -  The probability of any person being 5'4. This can be calculated as the (weighted) average of the probabilities of a female being 5'4, and a male being 5'4. In other words, the probability can be expanded as

    p(h=5'4)  =  p(h=5'4|F) p(F) + p(h=5'4|M) p(M)

    where p(h=5'4|M) is the probability of a male being 5'4   [ ~1.8% ]*

Putting it all together, we calculate the (posterior) probability of the mystery person being female, given their height is 5'4, as

    p(F|h=5'4)  =  p(h=5'4|F) p(F) / p(h=5'4)  =  (0.151 x 0.5) / (0.151 x 0.5 + 0.018 x 0.5)  ~  89%

As we expected, this probability is much greater than 50% - the mystery person is more likely to be female than male, given their height.

*Aside: These figures come from average heights for Americans aged 20-80 for 2007-2008.

Obviously the probabilities would vary depending on where the mystery person is from, how old they are, etc. Since all we know about this mystery person is their height, we shouldn't really make assumptions about their background. But for the sake of demonstration (and convenience), these values are sufficient.
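If you want to check the arithmetic, the whole calculation is only a few lines of Python (the numbers are just the illustrative figures quoted above):

    # Worked Bayes example using the height figures above
    p_F = 0.5            # prior, p(F)
    p_h_given_F = 0.151  # p(h=5'4|F)
    p_h_given_M = 0.018  # p(h=5'4|M)

    p_h = p_h_given_F * p_F + p_h_given_M * (1 - p_F)   # p(h=5'4), by total probability
    p_F_given_h = p_h_given_F * p_F / p_h                # Bayes' Theorem
    print(p_F_given_h)                                    # ~0.89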


Gaussian Naive Bayes - When the Training Set is Small

So how does this apply to noise removal?

Let's go back to EMI - we can ask, what is the probability that an event is EMI-type given it has an emi value of 30?

Using Bayes' Theorem, we have

    p(E|emi=30)  =  p(emi=30|E) p(E) / p(emi=30)  =  p(emi=30|E) p(E) / [ p(emi=30|E) p(E) + p(emi=30|!E) p(!E) ]

Now we need to calculate the probabilities. We're going to assume we have a training set of (31) events that have already been classified. From this we can calculate/estimate these probabilities.

The prior, p(E), is pretty straightforward - what fraction of the events in the training set are EMI? This turns out to be around 22.6% (7 of the 31 events).

The probability of an EMI-type event having an emi value of 30 - p(emi=30|E) - is a little trickier to find. If our training set is small we may have no events with an emi value of exactly 30. But there could, at the same time, be several events with emi values slightly above or below 30.

Here then we have to make an assumption - we're going to assume that the distribution of emi values (for each event type) can be approximated by a normal distribution. This approach is called the 'Gaussian Naive Bayes', and is usually a reasonable approximation - if you look at the plot of emi values higher up, you'll see that they are roughly normally distributed.

So to find p(emi|E) we need to calculate the mean and standard deviation of the emi values for the EMI-type events in our training set. Obviously, when the training set is relatively small, the mean and standard deviation are going to have a large margin of error. So it's important to make sure we have a sufficiently large training set.

For p(emi=30|!E) - that is, the probability that an event that isn't EMI-type (signal or thin peak) has an emi value of 30 - we do the same procedure of calculating the mean and standard deviation of emi values for non-EMI events, again assuming a normal distribution.
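To make that concrete, here's a sketch of the Gaussian estimate in code. The training values below are made up for illustration - they're not the real DM-ICE numbers - but the structure is the same.

    import numpy as np

    def gaussian_pdf(x, mean, std):
        # Normal distribution density, used to approximate p(parameter | class)
        return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

    def p_value_given_class(x, training_values):
        # Estimate p(emi = x | class) from that class's training emi values
        values = np.asarray(training_values, dtype=float)
        return gaussian_pdf(x, values.mean(), values.std())

    emi_E    = [28, 31, 33, 27, 30, 35, 29]        # emi values of EMI-type training events (made up)
    emi_notE = [2, 4, 1, 3, 5, 2, 0, 3, 1, 4]      # emi values of signal/thin peak training events (made up)

    p_E = len(emi_E) / float(len(emi_E) + len(emi_notE))      # prior, p(E)

    numerator = p_value_given_class(30, emi_E) * p_E
    denominator = numerator + p_value_given_class(30, emi_notE) * (1 - p_E)
    print(numerator / denominator)                             # posterior, p(E|emi=30)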

For my training set of 31 events, the probability works out at p(E|emi=30)  =  97.96%

In other words, an event with an emi value of 30 is almost certainly an EMI-type event.


Machine Peak Classification

For the full Naive Bayes Classifier, you just repeat the process above for all the parameters, then combine the probabilities. For example, the probability that an event is noise given its log-meantime (lmt) value and peak number (pn) would be given by

    p(noise|lmt, pn)  =  p(lmt|noise) p(pn|noise) p(noise) / [ p(lmt|noise) p(pn|noise) p(noise) + p(lmt|signal) p(pn|signal) p(signal) ]

It's 'naive' because it assumes all the parameters are independent of each other. This isn't strictly true, but it's an acceptable approximation.

In fact, in practice, rather than calculating the probability, we calculate the 'log-likelihood ratio'

    LLR  =  ln[ p(noise) p(P1|noise) p(P2|noise) ... / ( p(signal) p(P1|signal) p(P2|signal) ... ) ]

where the Pi are the various parameters. From that it's pretty straightforward to extract the probability that an event is noise given its parameters. Alternatively, we can just say that if the log-likelihood ratio is greater than 0 - i.e. if the probability of being noise is greater than the probability of being signal - then the pulse is classified as noise.
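Putting all of that together, the classifier ends up looking roughly like the sketch below - a minimal two-class Gaussian Naive Bayes, with the class names and array layout being my own choices rather than anything from the real code.

    import numpy as np

    class GaussianNaiveBayes(object):
        # Minimal two-class Gaussian Naive Bayes, trained on user-classified pulses

        def fit(self, X, y):
            # X: (n_events, n_parameters) array of pulse parameters
            # y: array of labels, e.g. 0 = signal, 1 = noise
            X, y = np.asarray(X, dtype=float), np.asarray(y)
            self.classes = np.unique(y)
            self.means = {c: X[y == c].mean(axis=0) for c in self.classes}
            self.stds = {c: X[y == c].std(axis=0) + 1e-9 for c in self.classes}
            self.priors = {c: float(np.mean(y == c)) for c in self.classes}
            return self

        def log_likelihood(self, x, c):
            # log[ p(c) * prod_i p(x_i|c) ], with each parameter assumed Gaussian
            x = np.asarray(x, dtype=float)
            m, s = self.means[c], self.stds[c]
            log_pdfs = -0.5 * ((x - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))
            return np.log(self.priors[c]) + log_pdfs.sum()

        def classify(self, x):
            # Pick the class with the larger log-likelihood - equivalent to
            # checking whether the log-likelihood ratio is greater than zero
            return max(self.classes, key=lambda c: self.log_likelihood(x, c))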

The point here is that, rather than investigating the differences between the different pulse types ourselves, then hard coding how to tell pulse types apart - e.g. explicitly writing in the code that an event is EMI-type if it has an emi value greater than 25, etc. - we instead show the code some examples of different pulses, and it 'learns' to tell the difference itself. This saves us a lot of work. In fact, we never even have to know what the differences between pulse parameters are.

When you're using the program, you can have it display what the algorithm thinks pulses are (and the associated probabilities). So you can click through a bunch of pulses, telling the algorithm what they are, and the algorithm will tell you what it thinks they are. As you 'teach' it, the algorithm will get better at telling the difference between pulses. And when you're happy with how well its classifications match up with your own, you can press a button and it will classify the rest of the pulses for you. How cool is that?



Finding the Best Cuts

Once the algorithm is adequately trained, we could just set it to work, going through the data, classifying and removing noise events. But for their project, the students aren't looking at machine learning, or statistics, or anything like that. Instead, they're asked to look for cutoff conditions for each of the parameters - for example, a pulse is noise if it has emi > 25.

To get a better idea of the problem we can look at our histograms of the parameters, with the different pulse types plotted in different colours. This is where the pulse classification comes in useful - the more pulses we classify, the clearer the parameter distributions become.

When you're looking at the emi values, as in the plot higher up, the cutoff is pretty easy - there's a clear gap between emi values for EMI-type pulses, and those for signal and thin peaks. For other parameters, the distinction is less clear, and often we have to make some trade off between leaving behind noise events and removing signal events. For example


Here, if you want to keep all the signal, you leave behind 23 (thin peak) noise events (34%). If you want to remove all the noise you end up also removing 6 signal events (54%).

Aside: Since EMI-type events are easily distinguished and removed with the emi value cut, they are not taken into account when we look at cuts for the other parameters, which are meant to separate signal from thin peaks.

So the question is, how can we determine the 'best' cuts?

This is basically an optimisation problem - we want to find the cut for which the maximum amount of noise AND the minimum amount of signal is removed. To do this, we need to come up with some metric/way of scoring how effective a given cut is, then we look for the cut that scores best.

We'll define Ni as the number of noise events and Si the number of signal events removed by some cut Ci. Because there are generally more noise events than signal, we're going to score based on the fractions of noise and signal removed - (Ni/N) and (Si/S) - where N and S are the total number of noise and signal events respectively. So the optimisation problem is finding the cut that gives (Ni/N) closest to one and (Si/S) closest to zero, simultaneously.

One approach to this optimisation is a 'nearest neighbour search'. To make this clearer, we can look at a scatter plot with points representing the fractions of signal and noise removed for different cuts. In this set-up, optimising means looking for the point that is nearest the bottom-right corner of the graph (1,0).


We can find the distance of each point from (1,0) using Pythagoras, making the scoring function

    di  =  sqrt[ (1 - Ni/N)^2 + (Si/S)^2 ]

which we want to minimise.

There is one caveat though - a cut that, for example, removes 10% of the signal and 60% of the noise will have the same distance score (0.41) as a cut that removes 40% of the signal and 90% of the noise. The latter removes more noise (almost all of it) but also removes more signal. In this case we have a choice - do we want to remove more noise, or save more signal?

To differentiate the two cases we can use the angle between the point and the noise-axis, measured at (1,0),

    theta_i  =  arctan[ (Si/S) / (1 - Ni/N) ]

So if, for example, we want to save as much signal as possible then we want to minimise the angle. Or perhaps, for a better balance between removing signal and noise, we should prefer the cut that gives an angle closest to 45deg. For myself, I prefer to save as much signal as possible. In practice, however, having two or more cuts with the same distance score is rare.

When it comes to reporting the (distance) scores, they are re-formatted as

    score_i  =  1 - di^2  =  1 - (1 - Ni/N)^2 - (Si/S)^2

This gives a score of 1 for a cut that removes all noise and no signal, a score of -1 for a cut that removes all signal and no noise, a score of 0 for a cut that removes no signal or noise at all, etc. This score could be used in the optimisation algorithm (where it would need to be maximised), but the results would be the same, and the Pythagorean distance is easier to calculate.
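For what it's worth, the search itself is only a handful of lines. This is a sketch under the same assumptions as before - 'signal_values' and 'noise_values' are hypothetical arrays of one parameter's values for already-classified signal and noise events.

    import numpy as np

    def best_cut(signal_values, noise_values):
        # Scan candidate cut positions for one parameter and return the cut
        # closest to the ideal point (all noise removed, no signal removed)
        signal = np.asarray(signal_values, dtype=float)
        noise = np.asarray(noise_values, dtype=float)

        best = None
        for c in np.unique(np.concatenate([signal, noise])):
            # here a 'cut' removes every event with a parameter value above c
            frac_noise = np.mean(noise > c)     # Ni/N, fraction of noise removed
            frac_signal = np.mean(signal > c)   # Si/S, fraction of signal removed
            distance = np.hypot(1.0 - frac_noise, frac_signal)
            if best is None or distance < best[0]:
                best = (distance, c, frac_noise, frac_signal)
        return best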

Below is an example cut from this algorithm.
Notice that it sacrifices one signal event for 15 noise events, but prefers to keep one more signal event, rather than removing an extra 5 noise events.

Ultimately, this technique of removing noise with parameter cutoffs is less effective than the Naive Bayes, largely because it considers all the parameters individually/independently. However, there are things called 'support vector machines' (SVM) which would probably be ideal for this problem. They look for a dividing 'line' between signal and noise, as above, but they consider all of the parameters together. OpenCV has an implementation of SVM, so I might give that a try at some point.



Last Word

As it started out, I just wanted to throw together a quick GUI to make my life easier, without worrying about error handling or bugs or any of that other fiddly stuff. Then I decided I wanted to try out some machine learning. And the more I worked on the program, the more features I thought of, and added.

So now, I have a program designed specifically for removing noise from DM-ICE data. But since DM-ICE has already had its noise removed (as far as possible), the program is basically of no use to anyone. Well, except maybe to any students working on this project. But that would be cheating - hence why I'm wary of posting the code.

Still, it was a fun little project.


Oatzy.


[Keep an eye out for news on DM-ICE.]

Thursday, September 18, 2014

#IceBucketChallenge - A Viral Campaign

I know this blog is a little late to the game. In fact, I started writing it while it was still relevant... but then I got distracted. Such is life.


Background

You probably already know (or vaguely remember) what the Ice Bucket Challenge is. Basically if you're nominated you have to dump a bucket of ice water over your head, or else you have to donate money to some charity - most commonly ALSA, the Amyotrophic Lateral Sclerosis Association.

You then nominate 3 more people, who have 24 hours to do the same thing. In some variants, you also make a small donation even if you dump the ice water over your head, or else you make a bigger donation if you don't (some specify $100).

Anyway, what I was interested in is how the challenge spread - this was a 'viral' campaign in a very true sense. So why not try to model how the campaign spread as we would model the spread of a virus?


The Viral Model

I've written about this sort of thing several times before. The basic idea is this - the population is divided into three groups:

Susceptible (S) - The population that hasn't been exposed to a disease/virus, but is susceptible to infection
Infectious (I) - The people who have been exposed, and can infect other people
Recovered (R) - The people who have been infected, but have recovered. It's generally assumed that these people can't be re-infected. (But there are variations).

For the ice bucket challenge, we can look at a direct analogy as

S - Those who haven't been nominated
I - Those who have been nominated and are taking the challenge (so can nominate other people)
R - Those who have completed or those who declined the challenge (and can't nominate anyone else)

We can draw the model as a flowchart, showing how people move between the different groups,

Here, we've divided nominees into those who accept the challenge (S->I) and those who don't accept (S->R). So now we can describe the model as a system of differential equations,

    dS/dt  =  -(α + β) S I
    dI/dt  =  α S I - γ I
    dR/dt  =  β S I + γ I

where the factors (α, β, γ) describe what proportions of each group move where.

The most interesting part is the 'S*I' terms in the first two equations - this basically says that the number of newly infected people is proportional to the number of susceptible people AND the number of infectious people.

This makes the system self-limiting, meaning that the number of infected people can't just grow to infinity - since that's not what we observe. Instead, what we see is an increase to some maximum, followed by a steady drop-off. For example,

We don't have hard data on how many people did the challenge over time, but we can get an idea of what happened by looking at the YouTube search numbers for the phrase 'ice bucket challenge' (above, via Google Trends).

I mean, it seems reasonable to assume some loose correlation between the number of people doing the challenge, the number of videos of people doing the challenge, and the number of searches for those videos. Searches peaked on August 21st, in case you were wondering.


A More Discrete Model

The downside to this 'S*I' term is that it makes the equations non-linear, meaning they can't be solved analytically - that is, you can't come up with an 'exact' equation for the number of people doing the challenge on a given day, for example.

But we can do a numerical simulation, instead.

In fact, this is a more reasonable way of looking at the system, since we're interested in a discrete time step of one day - i.e. the 24 hours nominees have to complete the challenge. For the simulation, we're also going to round the numbers of people in each group (after each time step) to whole numbers, since you can't (or shouldn't) divide a person into fractions.

Now, we need to define some parameters. First of all we need to define our start populations. We'll call the initial susceptible population S0 - this could be, for example, the Earth's population (which was around 7.16bn when I ran my simulations). We'll assume that one person is infected as a starting point  - the originator of the challenge (I0 = 1). And we'll assume that no-one else has done the challenge at the start, so R0 = 0.

For the constants (α,β,γ), the easiest one to define is γ - the 'recovery' rate. We assume that after the 24 hour challenge period nominees are no longer 'infectious', therefore γ = 1.

For α and β, we have the total rate of infection/nomination defined as (α+β). We're given that each challenge completer gets to nominate three new people, so we can define (α+β) such that in the first step (when there's only one infectious person) we have (α+β)*S0*I0 = (α+β)*S0 = 3. Therefore (α+β) = 3/S0.

Now, α and β are related to the proportions of nominees that accept and decline the challenge, respectively. So we can redefine the constants as α = a/S0 and β = b/S0, such that (a+b) = 3, or alternatively α = a/S0 and β = (3-a)/S0. So now we can look at 'a' as the average number of nominees who accept the challenge.

So to tie it all together, we have the system of (difference) equations,

    S(n+1)  =  S(n) - (3/S0) S(n) I(n)
    I(n+1)  =  (a/S0) S(n) I(n)
    R(n+1)  =  R(n) + I(n) + ((3-a)/S0) S(n) I(n)

with each group rounded to the nearest whole number after every step. And it's pretty straightforward to write some code that'll run through those equations.
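Something like the following, say - a quick sketch of the simulation, with the exact rounding convention being my own guess at the details.

    def simulate(a, S0=7.16e9, max_days=10000):
        # Run the discrete ice bucket model day by day.
        # Returns a list of I(n) - the number of people doing the challenge each day.
        S, I, R = int(S0), 1, 0
        history = [I]
        for day in range(max_days):
            nominated = int(round(3.0 * S * I / S0))     # (α+β) S I, all new nominees
            accepted = int(round(a * S * I / S0))        # α S I, those who accept
            S = max(S - nominated, 0)
            R = R + I + (nominated - accepted)            # yesterday's challengers + decliners
            I = accepted
            history.append(I)
            if I == 0:                                    # the challenge has died out
                break
        return history

    # e.g. how many days does the challenge last for a = 2.5?
    print(len(simulate(2.5)))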


So we've got our model, what now?

Having a model is all well and good, but why bother? Well, now we can start asking questions. For example, how fast does the campaign spread? How long will it take for the challenge to die out, and how many people will have taken part by that point? And what happens when we change the number of people who accept the challenge?

First of all, if we run the simulation (with a = 2.5) and plot I(n) - the number of people doing the challenge on any given day - we get something like this


[where I(n) has been normalised so that the maximum is 100, as Google Trends does].

This is about what we expected - a rise and a fall. Though it's worth noting the shape is a little different from the YouTube searches above; the simulation drops off quickly, while the search numbers have a longer tail. This could be because, even after the challenges are done, there's a latent interest in (re)watching the videos. Or it could just be that this model is not entirely accurate...


So what happens when we vary 'a'?

We can start by assuming everyone accepts the challenge (a = 3). In this case, eventually everyone in the world does the challenge - specifically within 21 days of the first challenge. But that isn't very realistic.

So let's say that on average 1.5 or 2 of those challenged accept. In these cases, the ice bucket phenomenon ends before everyone can be challenged. For a = 2, the challenge ends after 50 days, with ~13% of the population going unchallenged. On the other hand, for a = 1.5 the challenge takes significantly longer to end (over a year).

In fact it turns out that for any a < 1.6030165.. the challenge will take a significant amount of time to end - the number of 'infectious' people will eventually reach 1, and stay there until S <= S0/(2a) (remember we're rounding each group to the nearest whole number). For a <= 0.5 the challenge ends almost immediately.

Now, we know that for a = 3, the population S goes to zero (everyone is challenged), whereas for a = 2 the challenge stops before S can go to zero. So we have the question - what is the smallest value of 'a' for which S goes to zero? If you do a bit of interpolating, you can figure out that this critical value comes out at around 2.7405349.. At this value of 'a', the challenge ends after 23 days. In other words, if on average 2.74.. (~91%) of the people nominated accept the challenge, then after 23 days everyone in the world will have been challenged.
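That interpolation can be automated with a simple bisection over 'a'. The helper below is in the same spirit as the simulation sketch earlier (so it inherits the same guessed-at rounding details).

    def everyone_challenged(a, S0=7.16e9):
        # True if this value of 'a' drives the susceptible population to zero
        S, I = int(S0), 1
        while I > 0 and S > 0:
            nominated = int(round(3.0 * S * I / S0))
            I = int(round(a * S * I / S0))
            S = S - nominated
        return S <= 0

    def critical_a(lo=2.0, hi=3.0, tol=1e-7):
        # Bisect for the smallest 'a' at which everyone ends up being challenged
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if everyone_challenged(mid):
                hi = mid
            else:
                lo = mid
        return hi

    print(critical_a())   # should land near the ~2.74 quoted above (details depend on rounding)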

From what I've seen myself, it seems like nearer 1 in 3 people accept the challenge on average. So, as viral as the campaign was, it was never going to take over the world.

If you play around a bit, you'll find that the critical values of 'a' are dependent on the initial population (S0). But, as far as I can tell, it's not possible to derive these critical values analytically. (Answers in the comments if you can prove otherwise).


Why wait to be nominated..?

At this point, you're probably thinking this isn't a very realistic model. And you'd be right. So let's make it a bit more complicated.

In particular, let's add spontaneous participation - that is, people who aren't directly nominated, but who see all the other people doing the challenge, and decide they want to take part too.

We'll assume that this participation is proportional to those who have already done the challenge (I and R). So to start with, we need to separate the 'recovered' group into those who actually did the challenge (R), and those who declined (D).

The new model looks something like this,



With difference equations,

    S(n+1)  =  S(n) - (α + β) S(n) I(n) - δ S(n) (I(n) + R(n))
    I(n+1)  =  α S(n) I(n) + δ S(n) (I(n) + R(n))
    R(n+1)  =  R(n) + I(n)
    D(n+1)  =  D(n) + β S(n) I(n)

In the flowchart, we've introduced this new factor 'δ', the 'inspiration' rate. If we re-define it, like we did for the infection rate, as d/S0, then 'd' can be loosely interpreted as the average number of people inspired to take part by each person who's already done the challenge.
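In code, the change to the earlier simulation sketch is small - just one extra 'inspired' term (again, the names and rounding details are my own).

    def simulate_with_inspiration(a, d, S0=7.16e9, max_days=10000):
        # Discrete model with an inspiration term: each person who has done the
        # challenge inspires roughly d spontaneous participants (δ = d/S0)
        S, I, R, D = int(S0), 1, 0, 0
        history = [I]
        for day in range(max_days):
            nominated = int(round(3.0 * S * I / S0))
            accepted = int(round(a * S * I / S0))
            inspired = int(round(d * S * (I + R) / S0))
            S = max(S - nominated - inspired, 0)
            R, D = R + I, D + (nominated - accepted)
            I = accepted + inspired
            history.append(I)
            if I == 0:
                break
        return history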

Now we can look at how this 'd' factor affects the things we looked at before - how long it takes for the challenge to end, etc.

Let's start by assuming that the average number of nominees accepting the challenge, 'a', is 1 (out of 3). What value of 'd' do we need for S to go to zero? Do a little interpolating again and you get d = 0.8668653 - that is, if each person who pours a bucket of water over their head inspires (on average) 0.867.. people to do the same, then everyone in the world will participate within 26 days of the challenge starting. For a = 2, we need d = 0.4046021 for S to go to zero. And so on...

What's a plausible inspiration rate? For a normal person, probably zero, while for a celebrity... I don't know. But on average 'd' is probably very close to zero. I mean, we know for a fact that significantly less than the entire population of Earth has been nominated/taken part in the challenge.

If you plot I(n) for a = 1 and d = 0.1 you get something like this,


In this case, you have that same rise and fall - but this time, you have a longer tail, like we see in the YouTube searches. Is this proof that this iteration of the model is more accurate? Maybe. But as I pointed out before, the YouTube searches don't necessarily accurately represent the number of people taking the challenge over time.


Social Pressure

When a friend does a thing for charity, then publicly calls you out to take part, there's a certain amount of social pressure to comply. I mean, if a friend dumped water over their head (just because), then asked you to do the same thing, probably you'd look at them like they were a crazy person. Anyway, that kind of social pressure is implicitly included in the 'infection' rate - more social pressure, bigger 'a'.

Instead, the sort of social pressure I'm talking about here is the kind that goes: "I should accept the challenge because so many other people have already done it". Or alternatively, "it's okay for me to accept the challenge, since so many other people have already done it".

In other words, the more people accept and complete the challenge, the more likely a nominated person is to accept too. Mathematically, we can introduce this with a term in 'S*I*R'. Or alternatively, we can keep the term as 'a*S*I', but make the factor 'a' a function of R -> a(R).

Anyway, if you're interested, you can try investigating that yourself. Or try adapting the model in some other ways. But beyond a point you can end up complicating a model more than improving it.


A Network Theory Approach to Nominating

So I was eventually nominated for the challenge. But I'm a wimp, so I declined to dump ice water over my head, instead making a donation to the Motor Neuron Disease Association (the UK equivalent of ALSA).

For my nominations, I wanted to try and maximise spread. So I nominated the 3 people in my Twitter network who are the most active and well-connected, and who I thought would be up for accepting the challenge. Plus, as a secondary effect, I figured they might nominate other people in my Twitter network - the network theory equivalent of wishing for more wishes.

In the end, one ignored the nomination, one acknowledged but didn't accept, and one accepted (in the form of a donation) but didn't nominate anyone else. So I guess that theory didn't quite pan out. But I did at least encourage more charitable giving.


So Yeah

If we had real world data we could maybe test the accuracy of these models. But even without, we can get a sense of how the challenge behaves - for example, we find that there's a critical 'challenge acceptance ratio' that determines whether the viral campaign will go 'pandemic'.

In theory, you could apply this sort of model to any viral campaign, or just anything that spreads 'virally'. The nice thing about the Ice Bucket Challenge in particular, though, is that it has well defined rules for how the challenge spreads from one person to the next.

So, yeah..


Oatzy.


[I need an editor, my pronouns are all over the place.]

Monday, August 11, 2014

How Long Will a Prime Number Be?

File this under things I think about when I'm trying to get to sleep.


The Question

The idea is this - pick a random integer between 0 and 9. If it's prime, stop. If it's not prime, pick another random integer and stick it on the end of the first number. If this new number is prime, stop. If it isn't prime, pick another random integer, and so on.

So, for example:

1 -> not prime
2 -> 12 -> not prime
7 -> 127 -> prime -> stop.

The question is this - on average, how long would we expect the number to be when we get a prime?


Sometime Programming Just Won't Cut It

Normally, this is where I'd bust out Python and write a little program to run through the procedure a few times and see what happens. The problem is, testing whether a number is prime tends to be really slooooow.

The 'easiest' algorithm, trial division - literally checking whether the number is divisible by any of the integers up to its square root - has (worst case) speed O(sqrt(n)). What this means is, each time we add a digit to our number, it will take ~sqrt(10) = 3.16 times longer to check if it's prime. And that time soon adds up. There are quicker algorithms, for example AKS, but they're too complicated to bother with.
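For what it's worth, the brute-force version is only a few lines (my own sketch, nothing clever) - the catch, as we'll see below, is that it may simply never stop.

    import random

    def is_prime(n):
        # Trial division - fine for small n, painfully slow once n has many digits
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    def random_append_until_prime():
        # Build a number digit by digit until it happens to be prime
        n = random.randint(0, 9)
        while not is_prime(n):
            n = n * 10 + random.randint(0, 9)
        return n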

Instead, we can figure out the answer with some good old-fashioned maths.


Some Good Old-Fashioned Maths

Let's re-state the problem - what we want to know is the expected length of the prime number. Here we use the standard definition for the expectation value,

    E[N]  =  Σ N p(N)    (summed over all possible lengths N)
So now the problem is to find p(N), which in this case is the probability that a number of length N is prime. Now, as any romantic mathematician will tell you, prime numbers are mysterious things, which are scattered about the number line seemingly at random. However, what we do have is the 'Prime Number Theorem', which estimates the number of prime numbers smaller than n as

    π(n)  ≈  n / ln(n)

where ln(n) is the natural logarithm. Here's a Numberphile video explaining the theorem.

So, for one-digit numbers (numbers less than 10) there are ~10/ln(10) ≈ 4.3 prime numbers (well, technically there are 4, but this is an estimate). So the probability that a one-digit number is prime is estimated as ~0.43. The estimate gets better as the number of digits gets bigger.

Now, for numbers of two or more digits, the number of primes that are N digits long is (approximately) the number of primes less than 10^N minus the number of primes less than 10^(N-1). So then, the probability of an N-digit number being prime is the number of N-digit primes divided by the total number of N-digit numbers (of which there are 9 x 10^(N-1)). It can be shown (homework) that this probability is given by

    p(N)  ≈  (1 / (9 ln(10))) x (10/N - 1/(N-1))
Plotting this function, we see that the probability goes to zero as N goes to infinity


In other words, the longer the number gets the less likely it is to be prime. In this case, you can probably guess what the expected length of our prime will be.

So we have

    E[N]  =  Σ N p(N)  ≈  Σ (1 / (9 ln(10))) x (10 - N/(N-1))
And as N->infinity, N p(N) -> 1/ln(10). What this means is the sum doesn't converge - it goes to infinity. So on average we expect our prime to be infinitely long.
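If you want to see that numerically, a couple of lines of Python (using the estimated p(N) from above) will do it:

    import math

    def p_prime(N):
        # Estimated probability that a random N-digit number is prime
        if N == 1:
            return (10.0 / math.log(10)) / 10.0
        return (10.0 / N - 1.0 / (N - 1)) / (9.0 * math.log(10))

    for N in (10, 100, 1000, 10000):
        print(N, N * p_prime(N))          # tends towards 1/ln(10) ~ 0.434

    # partial sums of N*p(N) grow roughly linearly, so the sum never settles down
    print(sum(N * p_prime(N) for N in range(1, 1001)))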


So Basically

To cut a long story short, the procedure is most likely to stop when the number is only one digit long. Or else, the longer the number gets, the less likely the procedure is to stop. This is another reason why the programming solution was a non-starter.


Oatzy.


[Now stop doing maths and go to sleep.]