Counting Stars on GitHub

I’ve been working on a nerd ethnography project with the GitHub API. There’s so much fun data to play with there that it’s inevitable that I’ll get a little distracted…

One distraction was the realization that I could use the search API to get a massive list of the top repos ordered by star count. Once I started looking at the results, I realized that star data is an interesting alternative metric for evaluating language popularity. Instead of looking at which languages people are actually using to write new projects, we can see which languages are used for the most popular projects.

What are stars?

In August 2012, GitHub announced a new version of their notification system that allowed users to easily mark a repository as interesting by “starring” it:

Stars are essentially lightweight bookmarks that are publicly visible. Even though they were introduced just over a year ago, all “watches” were converted to stars so there’s plenty of data.

Which are the most starred repos?

Let’s start by looking at the top 20:

Rank	Repository	Language	Stars
1	twbs/bootstrap	JavaScript	62111
2	jquery/jquery	JavaScript	27082
3	joyent/node	JavaScript	26352
4	h5bp/html5-boilerplate	CSS	23355
5	mbostock/d3	JavaScript	20715
6	rails/rails	Ruby	20284
7	FortAwesome/Font-Awesome	CSS	19506
8	bartaz/impress.js	JavaScript	18637
9	angular/angular.js	JavaScript	17994
10	jashkenas/backbone	JavaScript	16502
11	Homebrew/homebrew	Ruby	15065
12	zurb/foundation	JavaScript	14944
13	blueimp/jQuery-File-Upload	JavaScript	14312
14	harvesthq/chosen	JavaScript	14232
15	mrdoob/three.js	JavaScript	13686
16	vhf/free-programming-books	Unknown	13658
17	adobe/brackets	JavaScript	13557
18	robbyrussell/oh-my-zsh	Shell	13337
19	jekyll/jekyll	Ruby	13283
20	github/gitignore	Unknown	13128

If you want to play with the data yourself, I’ve put a cache of the top 5000 repositories here. I’ve also posted the Clojure code I wrote to collect the data at adereth/counting-stars.
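For anyone who would rather not read Clojure, here is a rough Python sketch of the kind of search API request involved. It is not the code from adereth/counting-stars: the function names, the `stars:>1` query, and the User-Agent string are all assumptions made for illustration. Note that the search API only returns the first 1000 results for any one query, so reaching 5000 repos would mean splitting the query into star ranges.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(page, per_page=100, query="stars:>1"):
    """Build a search URL for repositories sorted by stars, highest first."""
    params = {
        "q": query,            # assumed query: anything with more than one star
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,  # 100 is the largest page size the API allows
        "page": page,
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_page(page, per_page=100, query="stars:>1"):
    """Fetch one page and return (full_name, language, stars) tuples.

    `language` is None when GitHub could not identify one, which is
    what shows up as "Unknown" in the tables in this post.
    """
    request = urllib.request.Request(
        build_search_url(page, per_page, query),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "counting-stars-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [
        (item["full_name"], item["language"], item["stargazers_count"])
        for item in payload["items"]
    ]
```

Calling `fetch_page(1)` would return the first 100 repositories. Unauthenticated search requests are tightly rate limited, so a real collector would want to authenticate and pause between pages.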

Which languages have the top spots?

In Adam Bard’s Top Github Languages for 2013 (so far), he counted repo creation and found that JavaScript and Ruby were pretty close. The top star counts tell a very different story, with JavaScript taking 7 of the top 10 spots. CSS was in 11th place in his analysis, but here it takes 2 of the top 10 spots.

Observing that 7 of the top 10 spots are JavaScript gives a sense of both the volume and the relative ranking of JavaScript in that range of the leaderboard, but just seeing that another language holds 50 of the top 5000 spots doesn’t give nearly as much color.

One approach is to look at the number of repos in different ranges for each language:

Language	1-10	1-100	1-1000	1-5000	Top Repository
JavaScript	7	54	385	1605	twbs/bootstrap (1)
CSS	2	8	41	174	h5bp/html5-boilerplate (4)
Ruby	1	9	153	786	rails/rails (6)
Python		5	64	420	django/django (44)
Unknown		5	30	138	vhf/free-programming-books (15)
C++		4	22	108	textmate/textmate (35)
PHP		3	38	248	symfony/symfony (58)
Shell		3	19	89	robbyrussell/oh-my-zsh (18)
Objective-C		2	89	495	AFNetworking/AFNetworking (30)
C		2	31	185	torvalds/linux (25)
Go		2	13	61	dotcloud/docker (45)
Java		1	32	255	nathanmarz/storm (56)
VimL		1	23	66	mathiasbynens/dotfiles (57)
CoffeeScript		1	22	80	jashkenas/coffee-script (43)
Scala			13	46	playframework/playframework (178)
C#			8	65	SignalR/SignalR (205)
Clojure			2	37	technomancy/leiningen (361)
Perl			2	26	sitaramc/gitolite (138)
ActionScript			2	10	mozilla/shumway (606)
Emacs Lisp			1	20	technomancy/emacs-starter-kit (477)
Erlang			1	15	erlang/otp (568)
Haskell			1	12	jgm/pandoc (740)
TypeScript			1	4	bitcoin/bitcoin (161)
Assembly			1	3	jmechner/Prince-of-Persia-Apple-II (269)
Elixir			1	2	elixir-lang/elixir (666)
Objective-J			1	2	cappuccino/cappuccino (667)
Rust			1	1	mozilla/rust (225)
Vala			1	1	p-e-w/finalterm (282)
Julia			1	1	JuliaLang/julia (356)
Visual Basic			1	1	bmatzelle/gow (800)
TeX				6	ieure/sicp (2441)
R				5	johnmyleswhite/ML_for_Hackers (2125)
Lua				4	leafo/moonscript (3351)
PowerShell				3	chocolatey/chocolatey (1580)
Prolog				3	onyxfish/csvkit (3498)
XSLT				2	wakaleo/game-of-life (1093)
Matlab				2	zk00006/OpenTLD (1292)
OCaml				2	MLstate/opalang (1380)
Dart				2	dart-lang/spark (1463)
Groovy				2	Netflix/asgard (1489)
Lasso				1	symfony/symfony-docs (2047)
LiveScript				1	gkz/LiveScript (2226)
Scheme				1	eholk/harlan (2648)
Common Lisp				1	google/lisp-koans (2889)
XML				1	kswedberg/jquery-tmbundle (2972)
Mirah				1	mirah/mirah (2985)
Arc				1	arclanguage/anarki (3389)
DOT				1	cplusplus/draft (3583)
Racket				1	plt/racket (3761)
F#				1	fsharp/fsharp (4518)
D				1	D-Programming-Language/phobos (4719)
Ragel in Ruby Host				1	jgarber/redcloth (4829)
Puppet				1	ansible/ansible-examples (4979)
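The counting behind a table like this is simple enough to sketch in a few lines of Python. This is an illustration rather than the code used for the post: it uses only the top 20 rows listed earlier, so the cutoffs are 10 and 20 instead of 10, 100, 1000, and 5000, and the function name is made up.

```python
from collections import defaultdict

# (rank, repository, language) for the top 20 rows listed earlier in the post.
TOP_REPOS = [
    (1, "twbs/bootstrap", "JavaScript"),
    (2, "jquery/jquery", "JavaScript"),
    (3, "joyent/node", "JavaScript"),
    (4, "h5bp/html5-boilerplate", "CSS"),
    (5, "mbostock/d3", "JavaScript"),
    (6, "rails/rails", "Ruby"),
    (7, "FortAwesome/Font-Awesome", "CSS"),
    (8, "bartaz/impress.js", "JavaScript"),
    (9, "angular/angular.js", "JavaScript"),
    (10, "jashkenas/backbone", "JavaScript"),
    (11, "Homebrew/homebrew", "Ruby"),
    (12, "zurb/foundation", "JavaScript"),
    (13, "blueimp/jQuery-File-Upload", "JavaScript"),
    (14, "harvesthq/chosen", "JavaScript"),
    (15, "mrdoob/three.js", "JavaScript"),
    (16, "vhf/free-programming-books", "Unknown"),
    (17, "adobe/brackets", "JavaScript"),
    (18, "robbyrussell/oh-my-zsh", "Shell"),
    (19, "jekyll/jekyll", "Ruby"),
    (20, "github/gitignore", "Unknown"),
]


def count_by_range(repos, cutoffs):
    """Count repos per language with rank <= each cutoff, and note each
    language's best-ranked repo."""
    counts = defaultdict(lambda: [0] * len(cutoffs))
    top_repo = {}
    for rank, name, language in sorted(repos):
        for i, cutoff in enumerate(cutoffs):
            if rank <= cutoff:
                counts[language][i] += 1
        # Repos are visited in rank order, so the first one seen is the top one.
        top_repo.setdefault(language, (name, rank))
    return dict(counts), top_repo


counts, top_repo = count_by_range(TOP_REPOS, cutoffs=(10, 20))
for language, per_cutoff in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    name, rank = top_repo[language]
    print(language, per_cutoff, f"{name} ({rank})")
```

Swapping in the cached top 5000 and the cutoffs `(10, 100, 1000, 5000)` should give the same shape as the full table.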

The table is interesting, but it still doesn’t give us a good sense of how the middle languages (C#, Scala, Clojure, Go) compare. It also reveals that there are different star distributions within the languages. For instance, CSS makes a showing in the top 10 but it has way fewer representatives (174) in the top 5000 than PHP (248), Objective-C (495), or Java (255).

Looking at the top repo for each language also exposes a weakness in the methodology: GitHub’s language identification isn’t perfect and there are a number of polyglot projects. The top Java repo is Storm, which uses enough Clojure (20.1% by GitHub’s measure) to make this identification questionable when you take into account Clojure’s conciseness over Java’s.
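To make the polyglot problem concrete, here is a small Python sketch of how a single label falls out of a per-language breakdown. GitHub’s API exposes byte counts per language for a repository; the numbers below are invented round figures, not Storm’s real breakdown, and the rule that the label is simply the language with the most bytes is an assumption for illustration.

```python
def language_shares(byte_counts):
    """Convert bytes-per-language into percentage shares."""
    total = sum(byte_counts.values())
    return {lang: 100.0 * n / total for lang, n in byte_counts.items()}


def primary_language(byte_counts):
    """Pick the language with the most bytes (assumed labeling rule)."""
    return max(byte_counts, key=byte_counts.get)


# Invented byte counts for a hypothetical polyglot repo.
example = {"Java": 750, "Clojure": 200, "Python": 50}

shares = language_shares(example)
print(primary_language(example), shares)
```

A byte-based measure like this favors verbose languages, which is exactly why a 20% share for a concise language can understate how much of the project is written in it.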

What about star counts?

Looking at the results after ranking obscures the actual distribution of stars. Using a squarified treemap with star count for the size and no hierarchy is a compact way of visualizing the ranking while exposing details about the absolute popularity of each repo. The squarified treemap algorithm roughly maintains the order going from one corner to the other.
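For the curious, here is a compact Python sketch of a greedy squarified layout in the style of Bruls, Huizing, and van Wijk. It is not the code behind the charts in this post and it isn’t d3’s implementation; the function names and the 100 by 60 canvas are assumptions. It takes positive values sorted largest first and keeps adding boxes to a row along the shorter edge until doing so would make the worst aspect ratio in that row worse.

```python
def worst_ratio(row, side):
    """Worst aspect ratio in a row of areas laid along an edge of length `side`."""
    total = sum(row)
    return max(
        (side * side * max(row)) / (total * total),
        (total * total) / (side * side * min(row)),
    )


def squarify(values, x, y, width, height):
    """Lay out positive values (largest first) as rectangles (x, y, w, h)."""
    scale = (width * height) / float(sum(values))
    areas = [v * scale for v in values]
    rects = []
    start = 0
    while start < len(areas):
        side = min(width, height)  # rows always run along the shorter edge
        end = start + 1
        # Keep adding boxes while the worst aspect ratio does not get worse.
        while end < len(areas) and (
            worst_ratio(areas[start:end + 1], side)
            <= worst_ratio(areas[start:end], side)
        ):
            end += 1
        row = areas[start:end]
        thickness = sum(row) / side
        if width >= height:
            # Vertical strip on the left; boxes stacked top to bottom.
            offset = y
            for area in row:
                rects.append((x, offset, thickness, area / thickness))
                offset += area / thickness
            x += thickness
            width -= thickness
        else:
            # Horizontal strip on top; boxes placed left to right.
            offset = x
            for area in row:
                rects.append((offset, y, area / thickness, thickness))
                offset += area / thickness
            y += thickness
            height -= thickness
        start = end
    return rects


# Star counts of the top 10 repos from the table earlier in the post.
STARS = [62111, 27082, 26352, 23355, 20715, 20284, 19506, 18637, 17994, 16502]
rects = squarify(STARS, 0, 0, 100, 60)
for stars, (rx, ry, rw, rh) in zip(STARS, rects):
    print(stars, round(rx, 1), round(ry, 1), round(rw, 1), round(rh, 1))
```

Because each new strip is carved off the remaining space starting from the same corner, the biggest repos end up near the origin and the smallest in the opposite corner, which is where the rough ordering comes from.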

Here are the top 1000 repos, using stars for the size and language for the color:

(Language and repository name shown on mouseover, click to visit repository. A bit of a fail on touch devices right now.)

Despite being a little chaotic, we can start to see some of the details of the distributions. It is still difficult to glean information about the middling languages, though. The comparisons become a little easier if we group the boxes by language. That’s pretty easy, since that’s really the intended usage of treemaps.
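Grouping amounts to adding one level of hierarchy, where each language’s box is sized by the total stars of its repos. A minimal Python sketch of that step, using only the top 10 rows from the table earlier in the post (the function name is made up, and this is not the code behind the charts):

```python
from collections import defaultdict

# (repository, language, stars) for the top 10 rows listed earlier in the post.
TOP_10 = [
    ("twbs/bootstrap", "JavaScript", 62111),
    ("jquery/jquery", "JavaScript", 27082),
    ("joyent/node", "JavaScript", 26352),
    ("h5bp/html5-boilerplate", "CSS", 23355),
    ("mbostock/d3", "JavaScript", 20715),
    ("rails/rails", "Ruby", 20284),
    ("FortAwesome/Font-Awesome", "CSS", 19506),
    ("bartaz/impress.js", "JavaScript", 18637),
    ("angular/angular.js", "JavaScript", 17994),
    ("jashkenas/backbone", "JavaScript", 16502),
]


def group_by_language(repos):
    """Return (language, total_stars, members) tuples, biggest group first."""
    groups = defaultdict(list)
    for name, language, stars in repos:
        groups[language].append((name, stars))
    return sorted(
        (
            (language, sum(stars for _, stars in members), members)
            for language, members in groups.items()
        ),
        key=lambda group: group[1],
        reverse=True,
    )


grouped = group_by_language(TOP_10)
for language, total_stars, members in grouped:
    print(language, total_stars, len(members))
```

Each group’s total becomes the size of the language’s outer box, and the members are then laid out inside it by their individual star counts.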

Here are the top 5000 grouped by language:

Honestly, I’m not really in love with this visualization, but it was a fun experiment. I have some ideas for more effective representations, but I need to work on my d3.js-fu. Hopefully it serves as an inspirational starting point for someone else…

Conclusion

Firstly, GitHub’s API is really cool and can give you some insights that aren’t exposed through their UI. Like I said at the start of this post, I have another project that caused me to look at this API in the first place and I’m really excited for the possibilities with this data.

GitHub’s current UI is really focused on using stars to expose what’s trending and doesn’t really make it easy to see the all-time greatest hits. Perhaps the expectation is that everyone already knows these repos, but I certainly didn’t and I’ve discovered or rediscovered a few gems. My previous post came about because of my discovery of Font Awesome through this investigation.

I’ll close out with a couple questions (with no question marks) for the audience:

Through this lens, JavaScript is way more popular than other metrics seem to indicate. One hypothesis is that we all end up exposing things through the browser, so you end up doing something in JavaScript no matter what your language of choice is. I’m interested in other ideas and would also appreciate thoughts on how to validate them.
It’s not obvious to me how to best aggregate ranking data. I’d love to see someone else take this data and expose something more interesting. Even if you’re not going to do anything with the data, any ideas are appreciated.
