Questions Nobody Asked..: December 2010

Thursday, December 30, 2010

Model Trains

Continuing from the previous post on using PageRank to rank stations by 'popularity', and with the help of Gephi - which makes it ridiculously easy to build network graphs and work out PageRank - I built the Virgin Pendolino lines, and they look like this

I then worked out PageRanks and picked a route I'm familiar with - the one that goes between Coventry and Birmingham New Street (the brown line).

Old Model

Going back to the equation I outlined in the first train post, the number of passengers on the train at station i is modelled by

So now, we define

Where P(i) is the PageRank of station i, and a is some constant. We then work out 'passenger' numbers for each stop and get the graph below

Here the blue line is number of people boarding, the orange line is people getting off and the yellow is total on board.

And the interesting thing here is that while the on/off lines are quite squiggly, they seem to cancel each other out quite well leaving a fairly smooth curve for total passengers. Which you would kind of expect, since a popular station would have more people getting on AND getting off.
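As a rough sketch of how the three curves relate to each other, assume a table T where T[i][j] is the number of passengers travelling from stop i to a later stop j. The 'popularity' scores and the product form of T below are made up for illustration; they are an assumption, not the PageRank values or the definition used for the graph.

```python
def passenger_curves(T):
    """Return (boarding, alighting, on_board) per stop for an origin-destination table T.

    T[i][j] is the assumed number of passengers travelling from stop i to stop j;
    only entries with j > i are used, since the train runs in one direction.
    on_board[i] is the count as the train leaves stop i.
    """
    n = len(T)
    boarding = [sum(T[i][j] for j in range(i + 1, n)) for i in range(n)]
    alighting = [sum(T[k][i] for k in range(i)) for i in range(n)]
    on_board = []
    total = 0
    for i in range(n):
        total += boarding[i] - alighting[i]
        on_board.append(total)
    return boarding, alighting, on_board


# Hypothetical 'popularity' scores for a four-stop route (illustrative only).
popularity = [3, 1, 4, 2]
# One possible choice of T: the product of the two stations' scores (an assumption).
T = [[popularity[i] * popularity[j] for j in range(4)] for i in range(4)]

boarding, alighting, on_board = passenger_curves(T)
print(boarding, alighting, on_board)
```

Even in this toy case the boarding and alighting lists jump around more than the running total does.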

So the question then was, how significant is the wiggliness that's left in the yellow curve.

A Simpler Model

For this, we imagine a train with the same number of stops (10) as above, but where T(i,j)=1 for all stations i,j - i.e. there's only one person going between each pair of stops. And the nice thing about this is that the above monstrosity of an equation reduces to

where i is the current station's position on the route and n is the total number of stops.
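As a sanity check on the simplified model, here is a brute-force count under one reading of the setup: one passenger for every pair of stops i < j, counted as the train leaves stop i (numbering stops from 1). Under that reading the count works out to i(n - i). The closed form here is an assumption derived from that reading, not a quote of the equation.

```python
def on_board_uniform(n):
    """Count passengers on board leaving each stop when every pair i < j has one traveller."""
    counts = []
    for i in range(1, n + 1):
        # On board after stop i: boarded at or before i, getting off after i.
        count = sum(1 for a in range(1, i + 1) for b in range(i + 1, n + 1))
        counts.append(count)
    return counts


n = 10
brute_force = on_board_uniform(n)
closed_form = [i * (n - i) for i in range(1, n + 1)]
print(brute_force)
print(brute_force == closed_form)
```

The result is a symmetric hump that peaks at the middle of the route and drops to zero at the final stop.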

And if you multiply that by some constant to get a best fit and plot them together, you get this (orange is the fit)

Which is a pretty damn good fit. Except for the first two stops. And what's weird about it is, you'd expect that the fact that there aren't any people getting off to cancel out the attenuation would mean the ideal curve would under-estimate. But it overestimates. I'm not sure how to explain it. But other than that..

Decent Approximation

And what the close fit suggests is that the 'popularity' of the station isn't as important a factor as how far along the route it is. Which, if true, simplifies the problem a massive amount.

But that first station is still a problem, because that's an error of ~20%. And considering <5% is usually considered the acceptable error limit, it's looking kind of bad.

But you could argue that since it doesn't have a knock-on effect for the later stops, and because the passengers on-board after the first stop is countable on the platform, it's an acceptable 'problem'.
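For the 'multiply by some constant' step, one common choice is the least-squares scale factor c = sum(y*f) / sum(f*f), after which the per-stop relative error is easy to read off. This is a minimal sketch only: the `observed` numbers are hypothetical, the shape i(n - i) is an assumed stand-in for the simplified model, and nothing here says which fitting method was actually used for the plot.

```python
def best_scale(observed, shape):
    """Least-squares constant c minimising sum((observed - c * shape)**2)."""
    numerator = sum(y * f for y, f in zip(observed, shape))
    denominator = sum(f * f for f in shape)
    return numerator / denominator


def relative_errors(observed, fitted):
    """Signed relative error of the fit at each stop (assumes non-zero observed counts)."""
    return [(f - y) / y for y, f in zip(observed, fitted)]


n = 10
shape = [i * (n - i) for i in range(1, n)]        # assumed model shape for stops 1..9
observed = [15, 33, 43, 47, 50, 49, 41, 32, 18]   # hypothetical passenger counts
c = best_scale(observed, shape)
fitted = [c * f for f in shape]
errors = relative_errors(observed, fitted)
print(round(c, 3), [round(e, 3) for e in errors])
```

A positive entry in `errors` means the scaled curve overestimates at that stop, a negative one means it underestimates.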

Realistically, I should try the same for other routes to see if the same happens, or else find some real data to test it against. Because it could turn out I'm catastrophically wrong. Or I could turn out to be right. But for now, I'm content.

Other Factors

And that's that. So the only other thing that would affect passenger numbers on a per-station basis is rush-hour effects. This is significant over longer routes mainly, but it's significant because it affects the passengers getting on at a particular station differently to how it affects those getting off - this creates a disparity that doesn't cancel itself out like the above.

But at the same time, thanks to the Department for Transport, we do have numbers to work from for that. Then we just need to put together that model and multiply by some number - which can be found by counting, say, the number of people on the platform - and we hopefully have a pretty good estimate of whether or not you'll get a seat.

And I'll finally be able to let go of this madness. Fingers crossed!

Oatzy.

Wednesday, December 29, 2010

Rush Hour

So this ultimately turned out to be more an exercise in practicing fitting functions to data. But nonetheless, here it is.

I was browsing the Department for Transport website (as you do) looking for data for the whole trains thing. I didn't find what I was looking for, but I did find this

This is, "Passenger numbers: by time of departure from station", and "represents rail travel in Great Britain as a whole, on an average weekday outside of school holidays". The bottom one is a breakdown by purpose of being on the train; the top one is total numbers.

So the first thing you notice is there are curves for commuters going to and coming back from work - rush-hours. Now I'd thought that there would be a minor peak around midday, but apparently not.

So if you wanted to get an approximate model of this graph, the first place you'd probably start would be to notice that the commuters' curves are fairly 'normal' (in the mathematical sense).

So working on them separately, and using Mathematica's FindFit function you get your curves. You then combine them and adjust them upwards to include business/leisure and you get this

Which is acceptable - it fits the first half better than the second. But not amazing overall. The equation is this, for those interested
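In general terms, a model of this kind is two normal curves (one per rush hour) plus a constant offset for business/leisure travel. Here is a minimal sketch of that shape; every number in it is a placeholder chosen for illustration, not one of the fitted values.

```python
import math


def gaussian(t, amplitude, centre, width):
    """Un-normalised normal curve: amplitude * exp(-(t - centre)^2 / (2 * width^2))."""
    return amplitude * math.exp(-((t - centre) ** 2) / (2 * width ** 2))


def rush_hour_model(t, morning, evening, offset):
    """Two commuter peaks plus a constant baseline; t is hours since midnight."""
    return gaussian(t, *morning) + gaussian(t, *evening) + offset


# Placeholder parameters: (amplitude, centre hour, width in hours).
morning = (100.0, 8.0, 1.0)
evening = (80.0, 17.5, 1.5)
offset = 20.0

for hour in range(5, 23, 3):
    print(hour, round(rush_hour_model(hour, morning, evening, offset), 1))
```

A fitting routine then only has to adjust those seven numbers until the curve sits as close to the data as it can.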

So just for the hell of it, I tried getting a better fit using a Fourier series. This time using the totals, and again, this was using Mathematica. But for a Fourier series you're combining Sines and Cosines.

The accuracy of the fit depends on how many terms you include. I tried various iterations until I found the most acceptable fit with the fewest terms, and that looks like this

Which in all fairness is a pretty good fit. Except that it's got 15 terms and looks like this

Not what you would call elegant.
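For a rough idea of how such a fit is built, a truncated Fourier series over a 24-hour period can be computed directly from evenly spaced samples. This is a sketch using plain discrete sums rather than Mathematica, and the sample data is synthetic. Seven harmonics plus the constant makes 15 terms.

```python
import math


def fourier_coefficients(samples, n_harmonics):
    """Return (a0, [(a_k, b_k), ...]) for evenly spaced samples over one period."""
    n = len(samples)
    a0 = sum(samples) / n
    harmonics = []
    for k in range(1, n_harmonics + 1):
        angles = [2 * math.pi * k * j / n for j in range(n)]
        a_k = 2 / n * sum(y * math.cos(a) for y, a in zip(samples, angles))
        b_k = 2 / n * sum(y * math.sin(a) for y, a in zip(samples, angles))
        harmonics.append((a_k, b_k))
    return a0, harmonics


def evaluate(a0, harmonics, x):
    """Evaluate the truncated series at x, where x in [0, 1) is the fraction of the period."""
    total = a0
    for k, (a_k, b_k) in enumerate(harmonics, start=1):
        total += a_k * math.cos(2 * math.pi * k * x) + b_k * math.sin(2 * math.pi * k * x)
    return total


# Synthetic hourly data with two peaks (not the real passenger figures).
hours = range(24)
samples = [20 + 100 * math.exp(-((h - 8) ** 2) / 2) + 80 * math.exp(-((h - 17) ** 2) / 4.5)
           for h in hours]

a0, harmonics = fourier_coefficients(samples, n_harmonics=7)
worst = max(abs(evaluate(a0, harmonics, h / 24) - samples[h]) for h in hours)
print(round(worst, 2))
```

Changing `n_harmonics` and watching `worst` is the same trade-off between accuracy and number of terms.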

So that's that. I doubt anyone's actually interested, but for all the effort I put into it, I thought it worth 'formally' writing it up. Otherwise it'd be lost forever somewhere on my laptop and in one of my various notebooks.

The other thing is, it could come in handy with the train problem. But that's a whole other story.

Anyway, I'm starting to ramble, so let's just leave it at that.

Oatzy.

Thursday, December 16, 2010

Weeping Angel Christmas Tree Topper

Template here, in case you fancied making one of your own,

Feel free to modify, redistribute, whatever.

For mine, I reinforced them with card, then in true Blue Peter style used a toilet roll tube and double-sided sticky tape to make the shape of it. In case you couldn't tell, they go back to back, so you just have to turn it to see whichever side.

Or you could do it as a hanging decoration like Andrew did.

For maximum effect, put it on the tree without telling anyone ;)

Oatzy.

Tuesday, December 14, 2010

Follow Up: Facebook Friends

Every freaking time! I do something, like graph my Facebook friends, and I'm one-up'd by Facebook going and doing this,

Hardly seems fair, since FB has the data to plot everyone. And geographically. But there you go.

Their explanation, and a hi-res version of the full image here.

Oatzy.

Sunday, December 12, 2010

Facebook Friends

They look something like this

There's 167 of them - a little over Dunbar's number, for what it's worth - shaded by number of connections. And you've got this massive clump at the bottom and the rest more spread out. And you can split them into 3 major groups,

with everyone else just sort of scattered. The groups are only approximate, by the way; there's overlaps, outliers, omitted nodes, etc. I didn't include the name labels for clarity.

Or you can lay it out in a more circular way,

and again, you can see those same, vague groupings.

So that's all pretty and such.

Want Your Own Graph?

1) Go to Facebook, and run the app netvizz
2) Click "Create a gdf file from your personal network" and download
3) Download, install and run Gephi
4) Import your gdf file, play with layout, colour setting, whatever.
5) ???
6) PROFIT!

Fairly straight-forward.
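If you'd rather poke at the file yourself, a GDF file is plain text: a `nodedef>` header line followed by the nodes, then an `edgedef>` header followed by the edges. Here is a minimal sketch that counts connections per person. It assumes the first two edge columns are the two node ids and that no field contains a comma, and the sample data is invented.

```python
from collections import Counter


def degree_counts(gdf_text):
    """Count how many edges touch each node id in a simple GDF file."""
    degrees = Counter()
    in_edges = False
    for line in gdf_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("nodedef>"):
            in_edges = False
        elif line.startswith("edgedef>"):
            in_edges = True
        elif in_edges:
            node1, node2 = line.split(",")[:2]
            degrees[node1] += 1
            degrees[node2] += 1
    return degrees


sample = """nodedef>name VARCHAR,label VARCHAR
1,Alice
2,Bob
3,Carol
edgedef>node1 VARCHAR,node2 VARCHAR
1,2
1,3
"""

degrees = degree_counts(sample)
print(degrees.most_common())
```

Those counts are the sort of number the shading in the graph is based on.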

And as a side note, I wish I'd known about this program sooner. Then, I wouldn't have had to work out PageRanks by hand.

Oatzy.

Wednesday, December 08, 2010

TrainRank

Or StationRank. Either way, it's a misnomer, since PageRank is named after its creator, Larry Page. But I digress.

Anyway, the idea is basically this - going back to the previously mentioned train problem, in a moment of inspiration it seemed worth trying to work out PageRanks for the UK train network; this hopefully correlating somehow to where passengers on a given train are likely to be going.

Stations

First of all, there is a total of about 2,518 train stations in the UK, and damned if I'm going to (or even could) work out a PageRank for the entire network. Even if you only use the Virgin CrossCountry routes, you're still working with just short of 100 stations. So for this I used the major stations on the CC line. There's about 30. Major stations, by the way, are the ones with a big circle on the map below

Obviously the simplification has an effect on the results. I compared to station use numbers for those stations included and got a correlation coefficient so close to 0 as to make no odds. But you have to bear in mind the use numbers include all train lines going in and out of each station (not just the Virgin CrossCountry lines).

Anyway, for demonstrative purposes it's good enough.

Ranks

So the TrainRanks are as follows

And at the very least, it seems to fit alright with my experience - my experience being limited to traveling between Sheffield and Coventry.

Random Trains

So what does it mean? For websites, it's based on this idea of a 'random surfer' clicking random links, resulting in probabilities of the surfer ending up on any given page.

So by analogy, we assume a train that moves randomly around the network; and that includes randomly changing direction and taking routes that wouldn't, in the real world, be valid. We then imagine a passenger on this slightly erratic train - a station's TrainRank is the probability the passenger will get off at that station.

Or alternatively, if there are 100 people on this train then, for example, about 7 of those people will get off at Birmingham New Street.
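The random-train picture can be sketched with a few lines of power iteration. The four-station network and the damping factor of 0.85 below are illustrative assumptions, not the network or settings used for the ranks above.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank; links maps each station to the stations it connects to.

    Assumes every station has at least one neighbour.
    """
    stations = list(links)
    n = len(stations)
    rank = {s: 1 / n for s in stations}
    for _ in range(iterations):
        new_rank = {s: (1 - damping) / n for s in stations}
        for station, neighbours in links.items():
            # The 'random train' leaves for each neighbour with equal probability.
            share = damping * rank[station] / len(neighbours)
            for neighbour in neighbours:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


# Hypothetical network: a hub (B) joined to three other stations, plus one A-C link.
links = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}

ranks = pagerank(links)
for station, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(station, round(100 * value, 1))
```

The printed numbers read as 'people out of 100', in the same sense as the Birmingham New Street example.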

Now obviously, there are some problems with that definition, the most obvious being that that's just not how trains work. Similarly, I don't know if or how this would fit into my previous model. But it's interesting to consider, nonetheless.

Simplifying Routes

It does seem to make sense to limit the ranks to given networks, since passengers have to get off the train of one network to leave the station or get on a train for another network. But at the same time, what other networks call at a given station may have an effect on the probability a passenger will get off the train at that station. Maybe.

But on the other hand, it doesn't make sense to simplify to route level, since routes being straight lines, their TrainRanks would probably end up forming something close to a normal distribution - the middle station having the highest probability, and decreasing towards the ends.

Just some thoughts, anyway.

Oatzy.

Tuesday, December 07, 2010

Normal Icicles

Long story short, I was playing with my dad's fancy camera and took this (amongst others) photo of some icicles

Which I was pretty proud of. So proud in fact that I sat and stared at it for longer than is perhaps sane.

As you'll notice, the individual icicles are clustered into groups, with the longest of the groups in the middle and getting progressively smaller moving outwards.

Normal Distribution

The normal distribution is one of those things that pops up everywhere. For example if you measure the heights of a sufficiently large group of people and plot the results you'll get a graph shaped something similar to this

With the averages approximately in the middle, and the width/height of the curves based on the standard deviation of the sample.

Another nice example, go to Amazon and look at the customer ratings for anything. If a large enough number of people have rated it, you'll probably notice this same sort of shape; usually with the one star rating spiking to not fit the pattern (damn hipsters). Some are more pronounced fits than others.

And as I say, this sort of thing shows up all over the place. This is partly due to the central limit theorem, but that's a whole other kettle of fish.
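A quick way to see the central limit theorem at work is to add up several uniform random numbers and look at how the totals spread. Here is a minimal sketch; the choice of 12 draws per total, the sample size and the seed are all arbitrary.

```python
import random
import statistics

random.seed(2010)  # fixed seed so the run is repeatable

# Each total is the sum of 12 uniform draws on [0, 1): mean 6, standard deviation 1.
totals = [sum(random.random() for _ in range(12)) for _ in range(10_000)]

mean = statistics.mean(totals)
spread = statistics.stdev(totals)
within_one_sd = sum(abs(t - mean) <= spread for t in totals) / len(totals)

# For a normal distribution roughly 68% of values lie within one standard deviation.
print(round(mean, 2), round(spread, 2), round(within_one_sd, 2))
```

None of the individual draws is bell-shaped, but the totals are, and that's the gist of it.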

Equations Everywhere

I was partly inspired by a program I recently watched - The Joy of Stats - and partly by the website "Found Functions", whose creator finds graphs (and their accompanying formulas) in photos of everyday scenes.

I flipped the photo vertically (for clarity) and skewed it slightly to try to account for the fact that it was taken side-on, then put some normal curves on it to hopefully prove I'm not just crazy and imagining it

And another one

They're not perfect fits, partly due to perspective. But you hopefully get the idea. Why do the icicles form like that? Because nature's just fantastic in that way.

Oatzy.

Monday, December 06, 2010

PageRank

In terms of getting some idea of a person's 'influence' on Twitter, working out (a variation on) PageRank is more effective than counting followers and followers of followers, but much less effective than using HP Labs' modified HITS algorithm.

But to count followers of followers I'd need to put in more effort than it's worth, and to do HITS I'd need to scrape an arse load of tweets to count who retweets who and how often.

PageRank

So PageRank basically measures how many people you're connected to, as well as who those people are (connectors and all that). It's most associated with Google, who use it as part of their search result ranking algorithm.

In this case we replace pages with users, and for links we say Alice following Bob is equivalent to page A linking to another page B.

Another Updated Graph

Working out PageRank was easier than the other two for one simple reason - I already had a lot of the data I needed, from previous network graphs. Obviously I had to update it first, and this is the graph as it stands (without me)

Which has become almost inexplicably more complex between the last one and this one. But there you go.

Results

Anyway, I followed the algebraic approach from the wiki article (looking like the easiest for what I had) and the results, in rank order, are as follows:

[When looking at the ranks, it might help to make sense of them more if you refer to the graph above.]

You have to bear in mind this particular PageRank calculation is only for my friend network, and assumes it's isolated from the rest of Twitter for simplicity. To get everyone's proper PageRank you'd have to analyse the whole of Twitter.
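For reference, the algebraic approach amounts to solving the linear system (I - dM)R = ((1 - d)/N)·1, where M is the column-stochastic link matrix and d is the damping factor. Here is a small sketch with plain Gaussian elimination; the three-person follow graph and d = 0.85 are made up for illustration.

```python
def solve(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    rows = [row[:] + [value] for row, value in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(rows[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (rows[r][n] - known) / rows[r][r]
    return x


def pagerank_algebraic(follows, damping=0.85):
    """PageRank via (I - d*M) R = (1 - d)/N; assumes everyone follows at least one person."""
    users = sorted(follows)
    n = len(users)
    index = {user: i for i, user in enumerate(users)}
    # system[i][j] = identity minus d * (probability of stepping from user j to user i)
    system = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for follower, followed in follows.items():
        for target in followed:
            system[index[target]][index[follower]] -= damping / len(followed)
    ranks = solve(system, [(1 - damping) / n] * n)
    return dict(zip(users, ranks))


# Hypothetical follow graph: Alice following Bob counts as a link from Alice to Bob.
follows = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": ["alice"]}
ranks = pagerank_algebraic(follows)
print({user: round(rank, 3) for user, rank in ranks.items()})
```

Because the toy network is treated as closed, the ranks sum to 1, which is the same isolation assumption as above.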

WTF?

The only question now is, what do the results actually mean? I'm not entirely clear on that.

In the case of websites,

PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page

I think in the Twitter case it's more to do with how tweets spread and who in the network is more (or less) likely to see them.

If I ever work it out I'll let you know. Otherwise, feel free to offer your own explanation.

Oatzy.

Saturday, December 04, 2010

Quick Look: Tumblr Reblogs

The nice thing about Tumblr, is it's kinda like Twitter in that you can repost something someone else posts; but unlike Twitter, it's easy to keep track of what's going where. And on top of that, things tend to spread further.

Basically, if you go to the page for a particular post, under 'notes' is a list of every like and reblog that post gets. So all you have to do is copy/paste all that into a text file, and load it into Google Refine (or spreadsheet, if you're so inclined); parse, rearrange, tidy, so you have two columns - reblogger, reblogged - load it into ManyEyes and you can make a network graph for who passed the picture on from whom.
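Once the notes are in two columns, finding who sits at the centre is just a matter of counting how often each name appears as the source of a reblog. Here is a minimal sketch with invented usernames, assuming each row is (reblogger, person they reblogged from).

```python
from collections import Counter

# Hypothetical rows after tidying: (reblogger, person they reblogged from).
rows = [
    ("bea", "me"),
    ("cal", "me"),
    ("dot", "cal"),
    ("eli", "me"),
    ("fay", "cal"),
    ("gus", "me"),
]

# Number of direct reblogs each person's copy of the post received.
reblogged_from = Counter(source for _, source in rows)
centre, direct_reblogs = reblogged_from.most_common(1)[0]
print(centre, direct_reblogs)
```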

Looking at the graph, you can then divide a post into one of two groups:

1) Self-Centred

The example for this is a photoshop I did putting various memes (that were big at the time) into one "Ministry of Silly Walks" picture.

And the graph looks like this, with me at the centre

2) Fan-Centred

For this, the example is this Doctor Who pic - my most reblogged picture - which I actually stole from elsewhere on the web (which is quite common for Tumblr)

And the graph in this case looks like this, with a fan page (note the name) being at the centre of most reblogs and me (red circle) pushed to one side.

Anyway.

None of this is actually important. Vaguely interesting maybe. It's basically what I'd've done a while back with retweets on Twitter, if Twitter would facilitate it. Why do things spread further on Tumblr? Damned if I know. The average retweet apparently only goes about two jumps from the originator.

But again, this is another good example of connectors and all that - how passing something through the right person/people can make it explode in popularity and spread massively.

In the Doctor Who one, "timelordian" only passes it on to one person from me. But if they weren't there, it might not have reached "thetardis", in which case, the bottom half of the graph would disappear. So again, you don't have to pass something on to a lot of people to have a massive effect. You just have to pass it on to the right person.
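One way to put a number on a connector's importance is to count everyone downstream of them in the reblog tree, i.e. how much of the graph would vanish without them. This is a sketch with invented usernames; only the shape (a single hand-off leading to a big hub) mirrors the example above.

```python
from collections import defaultdict, deque


def downstream(rows, start):
    """Return everyone who received the post via `start`, directly or indirectly."""
    children = defaultdict(list)
    for reblogger, source in rows:
        children[source].append(reblogger)
    reached = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in reached:
                reached.add(child)
                queue.append(child)
    return reached


# Hypothetical rows: (reblogger, person they reblogged from).
rows = [
    ("ann", "me"), ("bob", "me"), ("connector", "me"),
    ("hub", "connector"),
    ("fan1", "hub"), ("fan2", "hub"), ("fan3", "hub"), ("fan4", "fan1"),
]

lost = downstream(rows, "connector")
print(len(lost), "of", len(rows), "reblogs are downstream of 'connector'")
```

In this toy tree 'connector' only passes the post to one person, yet most of the reblogs sit below them.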

Oatzy.