I’ve been working on a nerd ethnography project with the GitHub API. There’s so much fun data to play with there that it’s inevitable that I’ll get a little distracted…
One distraction was the realization that I could use the search API to get a massive list of the top repos ordered by star count. Once I started looking at the results, I realized that star data is an interesting alternative metric for evaluating language popularity. Instead of looking at which languages people are actually using to write new projects, we can see which languages are used for the most popular projects.
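As a rough illustration of the kind of query involved, here is a minimal sketch against GitHub’s repository search endpoint. The `stars:>=1000` threshold, the page size, and the helper names (`build_search_url`, `fetch_page`, `summarize`) are assumptions for illustration, not the exact queries used for this post. The search API also caps how many results a single query returns, so a longer list would need several narrower queries.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.github.com/search/repositories"


def build_search_url(min_stars=1000, page=1, per_page=100):
    """Build a search URL for repos with at least `min_stars`, most-starred first."""
    params = {
        "q": f"stars:>={min_stars}",  # assumed threshold, purely illustrative
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_page(url):
    """Fetch one page of results. This makes a network call, and
    unauthenticated requests are rate limited."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["items"]


def summarize(items):
    """Reduce raw API items to (full_name, language, stars) tuples."""
    return [(r["full_name"], r["language"], r["stargazers_count"]) for r in items]
```

Calling `summarize(fetch_page(build_search_url(page=1)))` would give the first hundred rows in star order; the later sketches in this post assume that same `(full_name, language, stars)` shape.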
What are stars?
In August 2012, GitHub announced a new version of their notification system that allowed users to easily mark a repository as interesting by “starring” it:
Stars are essentially lightweight bookmarks that are publicly visible. Even though they were introduced just over a year ago, all “watches” were converted to stars so there’s plenty of data.
Which are the most starred repos?
Let’s start by looking at the top 20:
Which languages have the top spots?
One approach is to look at the number of repos in different ranges for each language:
|Emacs Lisp||1||20||technomancy/emacs-starter-kit (477)|
|Visual Basic||1||1||bmatzelle/gow (800)|
|Common Lisp||1||google/lisp-koans (2889)|
|Ragel in Ruby Host||1||jgarber/redcloth (4829)|
The table is interesting, but it still doesn’t give us a good sense of how the middle languages (C#, Scala, Clojure, Go) compare. It also reveals that the star distributions differ from language to language. For instance, CSS makes a showing in the top 10 but it has way fewer representatives (174) in the top 5000 than PHP (248), Objective-C (495), or Java (255).
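Counts like the ones in the table can be produced with a few lines of code. This is a minimal sketch, assuming the repos are available as `(full_name, language, stars)` tuples; the function name `count_by_range` and the default cutoffs are hypothetical choices, not necessarily the exact ranges used for the table.

```python
from collections import defaultdict


def count_by_range(repos, cutoffs=(10, 100, 1000, 5000)):
    """Count repos per language within each "top N" cutoff.

    repos: iterable of (full_name, language, stars) tuples.
    Returns (counts, top_repo) where counts[language][cutoff] is the number
    of that language's repos ranked within the cutoff, and top_repo[language]
    is the (full_name, rank) of its best-ranked repo.
    """
    ranked = sorted(repos, key=lambda r: r[2], reverse=True)
    counts = defaultdict(lambda: dict.fromkeys(cutoffs, 0))
    top_repo = {}
    for rank, (name, language, _stars) in enumerate(ranked, start=1):
        # The first repo seen for a language is its highest-ranked one.
        top_repo.setdefault(language, (name, rank))
        for cutoff in cutoffs:
            if rank <= cutoff:
                counts[language][cutoff] += 1
    return dict(counts), top_repo
```

Because the cutoffs are cumulative, a repo ranked 7th is counted in every column, which is why a language’s numbers can only grow from left to right.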
Looking at the top repo for each language also exposes a weakness in the methodology: GitHub’s language identification isn’t perfect and there are a number of polyglot projects. The top Java repo is Storm, which uses enough Clojure (20.1% by GitHub’s measure) to make this identification questionable when you take into account how much more concise Clojure is than Java.
What about star counts?
Looking at the results after ranking obscures the actual distribution of stars. Using a squarified treemap with star count for the size and no hierarchy is a compact way of visualizing the ranking while exposing details about the absolute popularity of each repo. The squarified treemap algorithm roughly maintains the order going from one corner to the other.
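For anyone curious how the layout keeps that rough ordering, here is a minimal sketch of a squarified layout in the style described by Bruls, Huizing, and van Wijk. It is a simplified stand-in, not the code behind the visualization in this post: the names `squarify` and `worst_ratio` are hypothetical, and it assumes the sizes are positive and already sorted from largest to smallest.

```python
def worst_ratio(row, side):
    """Worst aspect ratio among rectangles if `row` (a list of areas)
    is laid out as one strip along an edge of length `side`."""
    total = sum(row)
    return max(
        max(side * side * area / (total * total), total * total / (side * side * area))
        for area in row
    )


def squarify(sizes, x, y, w, h):
    """Lay out `sizes` (positive, sorted descending) inside the rectangle
    (x, y, w, h). Returns a list of (x, y, width, height) in input order."""
    total = sum(sizes)
    areas = [s * w * h / total for s in sizes]  # scale sizes to fill the box
    rects = []
    i = 0
    while i < len(areas):
        side = min(w, h)  # strips are always built along the shorter side
        row = [areas[i]]
        j = i + 1
        # Keep adding to the strip while the worst aspect ratio does not get worse.
        while j < len(areas) and worst_ratio(row + [areas[j]], side) <= worst_ratio(row, side):
            row.append(areas[j])
            j += 1
        thickness = sum(row) / side
        vertical = w >= h  # True: strip runs down the left edge of the free space
        offset = 0.0
        for area in row:
            length = area / thickness
            if vertical:
                rects.append((x, y + offset, thickness, length))
            else:
                rects.append((x + offset, y, length, thickness))
            offset += length
        # Shrink the free space by the strip just placed.
        if vertical:
            x, w = x + thickness, w - thickness
        else:
            y, h = y + thickness, h - thickness
        i = j
    return rects
```

Since each strip is carved off the top-left of the remaining space, the biggest repos end up near one corner and the smallest near the opposite one, which is where the rough ordering comes from.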
Here are the top 1000 repos, using stars for the size and language for the color:
(Language and repository name shown on mouseover, click to visit repository. A bit of a fail on touch devices right now.)
Despite being a little chaotic, the treemap starts to show some of the details of the distributions. It’s still difficult to glean information about the middling languages, though. The comparisons become a little easier if we group the boxes by language, which is pretty easy since that’s really the intended usage of treemaps.
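Grouping mostly comes down to reshaping the flat list into a two-level hierarchy. The sketch below assumes `(full_name, language, stars)` tuples as input; the `name`/`children`/`size` keys mimic the nested JSON commonly fed to d3 hierarchy layouts, but that shape and the function name `group_by_language` are assumptions for illustration.

```python
from collections import defaultdict


def group_by_language(repos):
    """Turn (full_name, language, stars) tuples into a language -> repo tree."""
    groups = defaultdict(list)
    for name, language, stars in repos:
        # Repos without a detected language get their own bucket.
        groups[language or "Unknown"].append({"name": name, "size": stars})

    children = [
        {
            "name": language,
            "children": sorted(items, key=lambda c: c["size"], reverse=True),
        }
        for language, items in groups.items()
    ]
    # Put the languages with the most total stars first.
    children.sort(key=lambda g: sum(c["size"] for c in g["children"]), reverse=True)
    return {"name": "repos", "children": children}
```

With the data in this shape, each language becomes a parent box whose area is the sum of its repos’ stars, so comparing languages is a matter of comparing the sizes of the groups.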
Here are the top 5000 grouped by language:
Honestly, I’m not really in love with this visualization, but it was a fun experiment. I have some ideas for more effective representations, but I need to work on my d3.js-fu. Hopefully it serves as an inspirational starting point for someone else…
Firstly, GitHub’s API is really cool and can give you some insights that aren’t exposed through their UI. Like I said at the start of this post, I have another project that caused me to look at this API in the first place and I’m really excited for the possibilities with this data.
GitHub’s current UI is really focused on using stars to expose what’s trending and doesn’t really make it easy to see the all-time greatest hits. Perhaps the expectation is that everyone already knows these repos, but I certainly didn’t and I’ve discovered or rediscovered a few gems. My previous post came about because of my discovery of Font Awesome through this investigation.
I’ll close out with a couple questions (with no question marks) for the audience:
It’s not obvious to me how to best aggregate ranking data. I’d love to see someone else take this data and expose something more interesting. Even if you’re not going to do anything with the data, any ideas are appreciated.