Saturday, March 24, 2012

Some Twitter Infographics

I did some stuff like this before. And I figured, while I was updating my network graphs, why not update some of the other graphics?

And it helps that I worked out how to easily extract data from Twitter (see previous blog). The code is here. Again, rate limits apply.

Who Do I Follow?

This is one of the ones I did before - collect together the bios of the people I follow, then make a word cloud (using Wordle)
Basically, I follow a bunch of geeks and writers. Who like 'things'. So really, same as a year and a half ago.

I would point out though that 6 of the people I follow don't have bios, and about 7 just have lyrics.

Data here.

Who Tweets the Most?

These rates are worked out as (total tweets posted)/(total days online). Obviously, the actual post rate will vary over different time scales..
Bubble chart (made with ManyEyes) - bubbles sized by tweet rate (the numbers on some of the bubbles).
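For the curious, the rate calculation above is simple enough to sketch in a few lines of Python. This is a rough illustration rather than the code I actually ran; the function name `tweet_rate` and the example account are made up.

```python
from datetime import date

def tweet_rate(total_tweets, joined, today):
    """Average tweets per day since the account was created."""
    days_online = (today - joined).days
    if days_online <= 0:
        raise ValueError("account must be at least one day old")
    return total_tweets / days_online

# Hypothetical account (not real data): 1,000 tweets over exactly 100 days.
rate = tweet_rate(1000, date(2012, 1, 1), date(2012, 4, 10))
```

Because it's a lifetime average, someone who joined years ago and only recently got chatty will look quieter than they really are.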

The graph below gives a better idea of relative rates, and 'rankings' (click to embiggen)
The blue line is actual values.

The orange is a logarithmic trend-line. It's a pretty good fit (R² = 0.95); and, loosely speaking, it means ~70% of the tweets in my timeline come from ~30% of the people I follow. [cf: Pareto Principle]
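If you want to try a fit like this yourself, here's a minimal sketch in plain Python. It assumes the trend-line has the form rate = a*ln(rank) + b, with people ranked from most to least chatty; that's my guess at what the spreadsheet does, and the names `log_trend` and `top_share` are just for illustration.

```python
import math

def log_trend(rates):
    """Least-squares fit of rate = a*ln(rank) + b; returns (a, b, R^2)."""
    ys = sorted(rates, reverse=True)          # rank 1 = highest tweet rate
    xs = [math.log(rank) for rank in range(1, len(ys) + 1)]
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

def top_share(rates, fraction=0.3):
    """Share of all tweets that comes from the top `fraction` of accounts."""
    ys = sorted(rates, reverse=True)
    k = max(1, round(fraction * len(ys)))
    return sum(ys[:k]) / sum(ys)
```

Calling `top_share(rates, 0.3)` on the list of tweet rates is one way to check a Pareto-style claim directly, instead of reading it off the curve.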

You get similar log-shaped graphs when you split up the genders.

Full data here.

Chattiest Gender?

You can read all the explanation, caveats, etc. in the previous posts (here and here). I'm just going to go straight into the data.

I follow 27 men and 21 women (excluding celebrities, etc.). The stats are as follows:

Men:
Average = 6.21 tweets/day
Standard Deviation = 6.47

Women:
Average = 13.79 tweets/day
Standard Deviation = 14.27

For clarity, here's a boxplot (made in R)
Basically, the women tweet more on average, and their rates are more spread out than for the men. In fact, roughly three quarters of the men tweet less than the median woman. Also, there's one outlier in the female group.

This is similar to what we found last time; although the women's average and spread aren't quite as high (average: 13.79 vs 19.21), and the men's average has increased slightly (6.21 vs 5.29).

If you take the ratio of the averages (13.79/6.21), the women tweet about 2.2 times as much as the men. But maybe I just follow particularly chatty women..

Here's a treemap (ManyEyes), which should give you a better idea of the gender balance (boxes sized by tweet rate)
Specifically, the graphic above is 62.5% purple (female).

Data here.

Where in the World Are My Followers?

The site I used last time doesn't seem to exist anymore. So I'm using MapMyFollowers instead. As the name suggests, these are my followers, rather than just the people I follow. Nonetheless..
Mostly in the UK and the US. As you'd probably expect.

I will point out though, some of the locations are a little suspect. Some people haven't made their location available so aren't included, and others seem to be in countries they couldn't possibly be in. But it's the best we can do.

Here's a zoom in on the UK

What Do I Tweet?

Made with Wordle, with data from TweetStats.

Words are sized by how often I tweet them; and by extension, @usernames are sized by how often I tweet those people.

In fact, here are the people I 'mention' the most (TweetStats)
Couldn't get a good source on who @replies me. That was one of the things Twoolr used to do..

When Do I Tweet?

Twoolr used to be awesome for Twitter statistics. But sadly, when they left beta, they started charging. And their free service went to shit. Luckily, I found TweetStats. Weirdly, it doesn't need you to log in or anything, but somehow it can pull data on (nearly) all your tweets - beyond the 3,200 limit. Strange.

Here's some more graphs
Basically, I tweet most on a Friday and Saturday, and at around 1-2pm.

And I've never tweeted at 5am. But that's probably because I'm always asleep at 5am
Except that one time I got really drunk. (SleepBot)

How Much Do I Tweet?

This is another one I used to go to Twoolr for. And, to be fair, I still could. But that only goes as far back as April '10, and its graphics aren't as clear. Here's TweetStats again
Like I said before, I didn't tweet much in my first year. In fact, I only posted 36 tweets in all of 2009.

Now, the one problem with TweetStats is that 5-month gap in 2010. Why is this significant? Well, I was definitely tweeting during that time. In fact, by my estimates, over those 5 months I posted 5,724 tweets (~37 tweets/day). So those 5 months account for 43% of all my tweets.

See, the thing is, in 2010, I was out of university, single, and unemployed. I posted a total of 8,823 tweets - 24 tweets/day. Since I've been back at university, that number's dropped to 11 tweets/day.

That lull in Summer 2011 was when I was spending all my time on Tumblr and watching classic Doctor Who. Incidentally, I haven't posted on Tumblr since the start of September '11. It's terribly addictive, you see. I wouldn't recommend it; unless you're addicted to Doctor Who and Sherlock, and have lots of time on your hands..

So yeah.


[Self-indulgent statistics, and pretty illustrations.]

Wednesday, March 14, 2012

Friend Network Evolution

Back in February 2009, I created my Twitter account, upon the insistence of my then-girlfriend. I didn't get it. Back then, Facebook was where it was at, and I didn't really see Twitter's appeal. I was pretty much just following the handful of people I knew in real life, and Stephen Fry.

So I didn't use it much. I'd pop up every now and then, post a couple tweets and give up on it again. At one point, I even developed an irrational dislike of it - whenever I saw a site had a "follow us on Twitter" button, it irked me for some reason.

But at some point, towards the end of 2009/early 2010, I gave it yet another try. I don't know why. And even when I started using it, I was resistant; still half-heartedly hating it. But what was different this time, is I started chatting with people, and I was introduced to new people.

People who don't get Twitter think it's just that thing where you can tell people when you're eating a sandwich. It's not. It's the people that make Twitter. (Tweens and arseholes notwithstanding.)

But I'm going off on a tangent.

By August 2010, I was well into Twitter - I was posting around 30 tweets per day, and I had around 30 friends*. And back then, I decided I wanted to see what my friends network looked like. So I broke out Python and the Twitter API, I pulled data, and I made the graph. Here's an updated version of it.
[click to embiggen]

Fairly small, and tidy, and relatively uncomplicated. The bulk on the right is the people I knew in real life (from school, etc.) with a few strands of new acquaintances. Note how tightly packed and interconnected they are. To the left is mostly people I met through Twitter - and in particular, through PkmnTrainerJ.

(In case you hadn't figured it out, the node and label sizes are proportional to number of connections.)

By December 2010, I decided to have a look again.
Again, this is an update of the version I originally posted; and in this case, I've tried to arrange it so that key people stay in approximately the same place.

So you still vaguely have that left-right divide, but now there's much more mixing in the middle. I'd made some new friends, but more interesting is the people who were already in the graph who formed new connections with others in my graph.

I'd also like to draw your attention to shinelikestars_ (formerly shinelikestars6) - take a look at the previous graph, can you spot him? From 2 shared connections to 8 in the space of two months. I don't think there's any sort of point I'm trying to make here. I'm just pointing it out 'cause it's interesting.

And for the next year and a half I didn't do any data collecting. It became too laborious - Twitter changed its API, so that my old code didn't work, and I had to do everything by hand.

So, the latest graph is from March 2012 (technical details below). As you can imagine, a lot can happen in a year and a half.
First of all, the new people added, and the old people removed. But more importantly, look how much tighter, and how much more 'segmented' the graph is.

There are now three major groups, loosely centred on the three most connected of my friends.
On the far right are, again, the people I knew in real life. In particular, note how little that group has changed since the first graph.

In the middle, we have 'Shiney's People' - people I was introduced to by shinelikestars_. And on the left are the people I was introduced to by PkmnTrainerJ.

The smaller groups circled in red are cliques - smaller subgroups that, at least from my point of view, form their own little groupings, where (almost) everyone is interconnected. The bottom left 'clique', for example, is my parents and big sister.

And I suspect, if you were to extend the graph beyond my network, you would find that those cliques are just parts of larger interconnected groups.

In case you were wondering, PkmnTrainerJ and SallyBembridge are most connected, both with degree 14. shinelikestars_ is next most, with degree 10.
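Degree here just means the number of connections a person has within the graph. If you have the graph as a list of name pairs, counting degrees is only a few lines. A quick sketch, where the names in `edges` are placeholders and not real accounts:

```python
from collections import Counter

def degrees(edges):
    """Count how many connections each person has in a list of name pairs."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts

# Placeholder graph (made-up names), one tuple per connection.
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("alice", "dave")]
ranking = degrees(edges).most_common()  # most connected first
```

Gephi will report the same numbers for you, but this is handy as a sanity check on the data before importing it.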

Notable disappearing nodes - Benjidoom, who deleted his account, then created a new, private one (benjirino); and AimlessAmy, who is a long story.

I should also point out that the people I follow who aren't friends with anyone else in my network do not appear in the pictured graphs. Not that they aren't as cool, they just don't join onto the graph.

* I use friend here to mean people who I follow and who follow me back. Though I would probably consider all the people in my current network (including those not pictured) friends to some degree.

Technical stuff

You can read details on how I collected the data before, in the previous blogs. But, as I say, those methods don't work anymore.

For this run, I read up on the API, and found some bits that don't need authentication to grab and manipulate.

First, you can grab a list of a user's friends with this URL<username>

This will give you a list of the friends' ID numbers, so you also need to use this to grab usernames<idnumber>

There is also a URL to check if a user follows another user<idnumber1>&user_id_b=<idnumber2>

Which works through the browser, but I couldn't get to work in my code. So in place, I used the site; partly because it uses the URL scheme

Which is very convenient. Though I do worry all the requests might be putting strain on that site's server.

So, putting all that together with a bit of Python, you get something like this.

A few important points:

1) It will take a while to run. I have ~50 friends, and it took well over an hour to pull all the data. In terms of computational complexity, it's O(n^2), but each of those operations takes a significant amount of time.

2) Twitter has an API limit of 150 requests per hour. The number of API requests the code will make is ~ the number of friends being looked up. I think. Which means, if you have more than 100 or so friends, this code probably won't work. Sorry. There might be a way around it, but I don't know how.

3) Obviously, this doesn't work on protected accounts. So for those people you will have to grab data by hand. Though it's not too bad for a small enough number of people.

If you do want to use the code, I've made it so you just have to change the username at the top, and run it. You will need to install Python though.

For creating the graphs, I previously used ManyEyes. But I moved to using Gephi, because it allows for more customising. The output from the code is a text file with a list of name pairs, which you can import directly into Gephi. It will build the graph for you, and then you're free to play as you like.
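To give a feel for the overall shape of it, here's a stripped-down sketch of the pair-checking and file-writing steps. It is not the actual script: `are_connected` is a hypothetical stand-in for whichever lookup you use to check a pair, and the comma-separated output format is my assumption about what a list of name pairs might look like.

```python
from itertools import combinations

def build_edges(friends, are_connected):
    """Check every pair of friends once: n*(n-1)/2 lookups, hence O(n^2)."""
    return [(a, b) for a, b in combinations(friends, 2) if are_connected(a, b)]

def write_edges(edges, path):
    """Write one 'name,name' pair per line, ready to import as an edge list."""
    with open(path, "w") as out:
        for a, b in edges:
            out.write(f"{a},{b}\n")

# Stand-in lookup backed by a hard-coded set, purely for illustration.
known = {frozenset(("alice", "bob")), frozenset(("bob", "carol"))}
lookup = lambda a, b: frozenset((a, b)) in known
edges = build_edges(["alice", "bob", "carol"], lookup)
```

With 50 friends that's 1,225 pairs to check, which is why the run time adds up so quickly when each check is a web request.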

Aaand... Yeah, I think that's about it.


[shinelikestars6 lost his red circle, on account of he isn't my nemesis anymore.]

Monday, March 05, 2012

So What Was the Best Day To Go Shopping?

Alright, let's be done with this.

Just a quick reminder - what I did was collect Foursquare check-in data for various shopping centres around the UK, in the hope that the data might show something interesting.

Previous blog posts on this data collecting - Best Day to Go Shopping, Panic Saturday, Christmas Eve.

Anyway, I've been collecting data for over 3 months now. And that seems like quite enough.

Here's a graph of (normalised) averaged check-ins on each day of the week for 4 periods:
DecAv (blue) is 21st Nov 2011 to 18th Dec 2011
ChrAv (grey) is 19th Dec 2011 to 1st Jan 2012
JanAv (orange) is 5th Jan 2012 to 2nd Feb 2012
FebAv (green) is 6th Feb 2012 to 4th Mar 2012

Aside from the two weeks either side of Christmas (grey) - when people, apparently, did their shopping more midweek - the pattern is basically the same.

For further clarity, here's the average of those three averages (excluding Christmas)
And here is the order of days, from least to most busy ('relative busyness' in brackets):

1) Wednesday (1.00)
2) Monday (1.01)
3) Tuesday (1.03)
4) Thursday (1.11)
5) Sunday (1.15)
6) Friday (1.24)
7) Saturday (1.78)

Note that the differences between Monday, Tuesday, and Wednesday are not statistically significant - they're essentially the same, and are likely to be as busy as each other/not noticeably different.
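For anyone wanting to reproduce the 'relative busyness' figures from the raw data, this is roughly the calculation, sketched in Python. I'm assuming here that the periods are averaged per day and then divided by the quietest day's value; the function name and the numbers in the usage note are invented for illustration.

```python
def relative_busyness(period_averages):
    """Average each day across periods, then scale so the quietest day is 1.0.

    period_averages: list of dicts mapping day name -> average check-ins.
    Returns a dict ordered from least to most busy.
    """
    days = period_averages[0].keys()
    combined = {
        day: sum(period[day] for period in period_averages) / len(period_averages)
        for day in days
    }
    quietest = min(combined.values())
    ordered = sorted(combined, key=combined.get)
    return {day: combined[day] / quietest for day in ordered}
```

For example, `relative_busyness([{"Wed": 10, "Sat": 18}, {"Wed": 30, "Sat": 54}])` gives Wednesday 1.0 and Saturday 1.8 (made-up inputs, not the real check-in data).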

So, to answer the title question - Monday, Tuesday, and Wednesday are the best days to go shopping. At least, in as much as they're the days shopping centres are likely to be least busy. And, as you'd expect, Saturday is, by far, the worst/most busy.

And the last thing to point out is that these are the averages over 20 shopping centres for a ~3 month period - numbers for specific locations, and at different times (e.g. holidays) are likely to deviate from the averages.

And, basically, that's that.

If you're interested, you can see the raw check-in data here.


[That was definitely worth the effort.]

Saturday, March 03, 2012

Where are the Carriages?


So it's around 9am, and you're waiting for a train. The trains in your area are a bit scummy, but whatever, you can't drive, and you've got to get to work/uni somehow. It's pretty busy, what with it being 9am, and when the train finally pulls up.. it's a single carriage.

Now a single car has around 50 seats, and this train is packed tight, with people standing in the aisles and the doorways, being forced to get intimate. And with everyone on board the conductor can barely get through the door. There are clearly enough people on this train to fill two cars, with people still having to stand.

So what gives? You're pretty pissed about having to stand for half an hour, and you've been inadvertently touched by strangers in ways you're not comfortable with. So you get your complaining hat on, and you turn to the internet to unleash the fury.

The train company's website directs you to their Twitter feed, where a poor public relations person is taking a barrage of vitriol from other angry commuters, while doing their best to remain polite and professional.

Several other people have already made your complaint, but all they're getting in return is "Apologies we try to avoid this where possible", or that the short trains were "due to operational reasons". And that's not really a satisfying response. You're not even sure it means anything.


So Where Are The Carriages?

Disclaimer: I have no idea how the train company actually operates. I just like to write blogs about applied maths.

Here's the setup - you're in charge of logistics for the train company. You have a fixed number of trains/cars, and, for each day, a list of services and expected numbers of passengers.

How do you apportion the carriages between the services?

Obviously, every service needs at least one car, and services with more passengers need more cars.

So if you had, for example, three services with 50, 100, and 150 passengers respectively, and 6 cars, then it's easy to divvy them up - one for the first, two for the second, and three for the third.

But in general, it won't be possible to divide up the cars exactly like that; so what do you do about remainders?

This is actually similar to the problem of apportioning parliamentary seats between states in the US.

Currently there are 435 seats in the House of Representatives, which need to be shared out between the 50 states according to each state's population - the idea being that each seat should represent roughly the same number of voters.

There are various methods for doing this, but I'm only going to go over two of them.

The Hamilton Method

Hamilton's Method is the easiest and most intuitive.

Say you've got only two services - service A has 85 passengers, service B has 115 passengers (200 passengers total). And you happen to have 4 carriages - a total of 200 seats. So seats for everyone! Except service A needs 1.7 cars, and service B needs 2.3 cars. And you can't divide a car into two chunks.

But for starters, you can give one car to service A (50 seats), and two to service B (100 seats). So what about the fourth? Regardless of which service you give it to, some people are going to end up having to stand. So what you want to do is minimise that number.

For service A, 35 people need a seat. For service B, 15 people need a seat. So it makes sense to give the last car to service A. 15 people still have to stand, but - short of building more cars - there's really nothing you can do about that.

This is also known as the largest remainder method.

What if the passenger numbers were 75 and 125? Well in that case, I guess it would have to be a judgement call.

Anyway, that's the basic idea of the Hamilton method - divide up the cars as far as you can, then give the remaining car(s) to the service(s) which 'need them most'.
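Here's the largest remainder idea as a short Python sketch. It uses the standard proportional quota (each service's share of passengers times the number of cars), which matches the example above because 4 cars of 50 seats exactly cover 200 passengers. The function name `hamilton` is mine, and ties are broken by simply favouring the earlier service, which is an arbitrary choice standing in for the judgement call.

```python
def hamilton(passengers, cars):
    """Apportion `cars` between services by Hamilton's (largest remainder) method."""
    total = sum(passengers)
    # Integer arithmetic keeps things exact: quota = p * cars / total.
    shares = [(p * cars) // total for p in passengers]       # whole cars
    remainders = [(p * cars) % total for p in passengers]    # fractional part, scaled
    leftover = cars - sum(shares)
    # Hand spare cars to the largest remainders; ties go to the earlier service.
    order = sorted(range(len(passengers)), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        shares[i] += 1
    return shares

print(hamilton([85, 115], 4))       # the two-service example
print(hamilton([50, 100, 150], 6))  # the exact three-service example
```

Note that this sketch doesn't force every service to get at least one car, so a very quiet service could end up with none; you'd have to handle that separately.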

Things get trickier when you're dealing with larger numbers of passengers and cars, but ultimately it's pretty straightforward.

The Huntington-Hill Method

Where trains differ from apportioning of parliamentary seats, is that a carriage has a fixed capacity, whereas the number of people a seat can represent is free to change.

And this means that you won't encounter some of the 'quirks' that arise from the Hamilton method - such as the Alabama Paradox (where increasing the total number of seats can mean a state losing seats), or the Population Paradox (where increasing a state's population can result in that state losing seats).

[You won't encounter them, but they're worth mentioning 'cause they're pretty cool.]

Huntington-Hill's Method - the one currently used by the House of Representatives - is, generally, a more powerful method than Hamilton's, and isn't susceptible to the various paradoxes. Another benefit is that it can be set up to guarantee that each state will get at least one seat.

It works by assigning seats, one by one, based on each state's 'priority quotient' - itself based on the state's population, and a 'modified divisor' based on the number of seats already allocated to that state.
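In the usual formulation, a state with population P that already holds n seats has priority P / sqrt(n(n+1)), and the next seat goes to whoever has the highest priority. Translated to trains, a minimal sketch might look like this; the function name is mine, and it assumes every service starts with one car.

```python
import heapq
import math

def huntington_hill(passengers, cars):
    """Apportion `cars` by Huntington-Hill, starting every service on one car."""
    n = len(passengers)
    if cars < n:
        raise ValueError("need at least one car per service")
    shares = [1] * n
    # heapq is a min-heap, so store negated priorities to pop the largest first.
    heap = [(-p / math.sqrt(1 * 2), i) for i, p in enumerate(passengers)]
    heapq.heapify(heap)
    for _ in range(cars - n):
        _, i = heapq.heappop(heap)
        shares[i] += 1
        k = shares[i]
        # New priority uses the geometric mean of k and k + 1 as the divisor.
        heapq.heappush(heap, (-passengers[i] / math.sqrt(k * (k + 1)), i))
    return shares

print(huntington_hill([50, 100, 150], 6))
print(huntington_hill([85, 115], 4))
```

On these small examples it gives the same answers as the largest remainder approach; the two methods only start to disagree with more services and more awkward numbers.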

But one advantage Hamilton's Method has over HHM, is that it will always give each service its ideal number of cars, either rounded up or down to the nearest whole number. So if a service's ideal number is 3.67, then Hamilton's method will assign either 3 or 4 cars to that service.

HHM, on the other hand, can 'violate quota' - that is, it can result in a given service being assigned more or fewer cars than its ideal number (e.g. the 3.67 service might end up with only 2 cars). But this problem only occurs when the number of cars is fixed prior to apportioning. And sadly, it usually is.

As an aside, it turns out it's impossible to find a 'perfect' method of apportioning - one with neither paradoxes nor quota violating. Incidentally, the mathematical theory surrounding voting is quite fascinating. You know, if you're into that sort of thing.


Either method would work perfectly well. The problem is, while rounding up or down to the nearest car may seem trivial, the 50 seats that that one car represents can be a significant gain/loss for a service.

The lack of precise information can cause problems as well. You can only estimate how many passengers a given service will have in advance, and if you under-estimate, people are gonna be pretty pissed.

And on top of that, the numbers have to be recalculated from time to time as passenger numbers change..

Basically, getting the number of cars right can be tricky.

Of course, this all assumes that Northern Rail (or whatever company) has enough carriages to adequately satisfy its needs in the first place. Somehow I doubt that's the case.

So ideally, NR needs to work out how many cars it actually needs, and build them. But I can't see that happening. And even if it did happen, it'd probably mean higher fares. And fares are pretty bad as they are.

Really, though, NR could do with a complete makeover just in general, given how shitty it is. Seriously, you should see the difference between them and the London Midland service. Ridiculous.

Incidentally, I also have strong feelings about the buses; don't even get me started. Trams are alright though - one every 10mins, seldom short of seats, sufficient leg-room.. bliss.

tl;dr - Where are the carriages, then? As it turns out, it's probably just that they don't have enough to go around.

This is what we get for privatising the trains...


[My complaining hat is a trilby. I like to look bad-ass while I'm complaining.]