Halite is a new AI programming competition that was recently released by Two Sigma and Cornell Tech. It was designed and implemented by two interns at Two Sigma and was run as the annual internal summer programming competition.
While the rules are relatively simple, it proved to be a surprisingly deep challenge. It’s played on a 2D grid and a typical game looks like this:
Each turn, all players simultaneously issue movement commands to each of their pieces:
When two players’ pieces are adjacent to each other, they automatically fight. A much more detailed description is available on the Halite Game Rules page.
Bots are run as subprocesses that communicate with the game environment through STDIN and STDOUT, so it’s very simple to create bots in the language of your choice. While Python, Java, and C++ bot kits were all provided by the game developers, the community quickly produced kits for C#, Rust, Scala, Ruby, Go, PHP, Node.js, OCaml, C, and Clojure. All the starter packages are available on the Halite Downloads page.
The flow of all bots is the same:
The Clojure kit represents the game map as a 2D vector of Site records:
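The original listing wasn’t preserved here; something along these lines conveys the idea, with the field names being my assumptions rather than the kit’s exact definition:

```clojure
(defrecord Site [x y owner strength production])

;; The map itself is just a 2D vector of these records:
(def game-map
  [[(->Site 0 0 0 15 2) (->Site 1 0 1 20 3)]
   [(->Site 0 1 0 10 1) (->Site 1 1 2 25 2)]])
```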
And movement instructions are simple keywords:
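Per the game rules there are five possibilities: stay put or move in one of the four cardinal directions. In the Clojure kit these are plain keywords (the exact names here are my assumption):

```clojure
[:still :north :east :south :west]
```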
A simple bot that finds all the sites you control and issues random moves would look like this:
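A sketch of such a bot. The I/O helpers (get-init!, send-init!, get-frame!, send-moves!) are hypothetical stand-ins for whatever the starter kit actually provides:

```clojure
(defn my-sites
  "All sites on the map owned by my-id."
  [my-id game-map]
  (for [row game-map, site row
        :when (= my-id (:owner site))]
    site))

(defn random-moves [sites]
  (for [site sites]
    {:loc site
     :direction (rand-nth [:still :north :east :south :west])}))

;; Main loop: read a frame, answer with random moves, repeat.
(defn -main []
  (let [{:keys [my-id game-map]} (get-init!)]
    (send-init! "RandomClojureBot")
    (loop [game-map game-map]
      (send-moves! (random-moves (my-sites my-id game-map)))
      (recur (get-frame!)))))
```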
There are currently almost 900 bots competing on the site, but only a handful are written in Clojure! I’m sure the Clojure community could do some interesting things here, so head over to halite.io, sign up using your GitHub account, and download the Clojure starter kit.
I became a little obsessed with his work last year and I wanted a better way to experience his albums, so I created an annotated player using Clojure and d3 that shows relevant data and links about every track as it plays:
I have versions of this player for his two most recent albums:
Unfortunately, they only really work on the desktop right now.
I’ve released all the code that I used to collect the data and to generate the visualizations, but in this post I’m just going to talk about the first stage of the process: getting the details of which tracks were sampled at each time.
There’s an excellent (totally legal!) crowdsourced wiki called Illegal Tracklist that has information about most of the samples displayed like this:
At first, I used Enlive to suck down the HTML versions of the wiki pages, but I realized it might be cleaner to operate off the raw wiki markup.
I wrote a few specialized functions to pull the details out of the strings and into a data structure, but it quickly became unwieldy and unreadable. I then saw that this was a perfect opportunity to use Instaparse, a library that makes it easy to build parsers in Clojure by writing context-free grammars.
Here’s the Instaparse grammar that I used to parse the Illegal Tracklist’s markup format:
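The original grammar wasn’t preserved here; the following is a minimal sketch of the shape of the approach, with rule names and exact line formats invented for illustration. Angle brackets hide the literal separators from the output, and the terminals are regular expressions:

```clojure
(require '[instaparse.core :as insta])

(def track-parser
  (insta/parser
   "line        = title-line | sample-line | blank-line
    title-line  = number <'. '> track-name
    sample-line = <'* '> artist <' - '> track-name
    blank-line  = #'\\s*'
    number      = #'[0-9]+'
    artist      = #'[^-]+'
    track-name  = #'.+'"))
```

Calling `(track-parser "* Some Artist - Some Song")` would then yield a tagged tree with the artist and track name as separate nodes.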
The high-level structure is practically self-documenting: each line in the wiki source is either a title track line, a sample track line, or a blank line, and each type of line is pretty clearly broken down into named components that are separated by string literals to be ignored in the output. It does, however, become a bit nasty when you get to the terminal rules that are defined as regular expressions. Instaparse truly delivers on its tagline:
What if context-free grammars were as easy to use as regular expressions?
The only problem is that regular expressions aren’t always easy to use, especially when you have to start worrying about not greedily matching the text that is going to be used by Instaparse.
Some people, when confronted with a problem, think “I know, I'll use Instaparse.” Now they have three problems. #clojure
— Matt Adereth (@adereth) October 24, 2014
Despite some of the pain of regular expressions and grammar debugging, Instaparse was awesome for this part of the project and I would definitely use it again. I love the organization that it brought to the code, and the structure I got out was very usable.
Here’s a complete list of the references from the talk, in order of appearance:
Let us assume that $n = b^k$, where $b$ and $k$ are integers (the case where $n$ is not of this form will be treated in Sec. 7). The remedian with base $b$ proceeds by computing medians of groups of $b$ observations, yielding $b^{k-1}$ estimates on which this procedure is iterated, and so on, until only a single estimate remains. When implemented properly, this method merely needs $k$ arrays of size $b$ that are continuously reused.
The implementation of this part in Clojure is so nice that I just had to share.
First, we need a vanilla implementation of the median function. We’re always going to be computing the median of sets of size $b$, where $b$ is relatively small, so there’s no need to get fancy with a linear time algorithm.
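The original listing wasn’t preserved here, so this is a minimal sketch of a vanilla median rather than the post’s exact code:

```clojure
;; Sort and take the middle element, or the mean of the two middle
;; elements for even-sized collections.
(defn median [coll]
  (let [sorted (vec (sort coll))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (nth sorted mid)
      (/ (+ (nth sorted (dec mid)) (nth sorted mid)) 2))))
```

For example, `(median [3 1 2])` returns 2.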
Now we can implement the actual algorithm. We group, compute the median of each group, and recur, with the base case being when we’re left with a single element in the collection:
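A reconstruction of that recursion, assuming a median function as described above (this is my sketch, not the original listing):

```clojure
;; Group into b-sized chunks, take each group's median, and recur
;; until a single estimate remains.
(defn remedian [b coll]
  (if (next coll)
    (recur b (map median (partition-all b coll)))
    (first coll)))
```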
Because partition-all and map both operate on and return lazy sequences, we maintain the property of only using $O(b \log_{b}{n})$ memory at any point in time.
While this implementation is simple and elegant, it only works if the size of the collection is a power of $b$. If we don’t have $n = b^k$ where $b$ and $k$ are integers, we’ll overweight the observations that get grouped into the last groups of size $< b$.
Section 7 of the original paper describes the weighting scheme you should use to compute the median if you’re left with incomplete groupings:
How should we proceed when the sample size $n$ is less than $b^k$? The remedian algorithm then ends up with $n_1$ numbers in the first array, $n_2$ numbers in the second array, and $n_k$ numbers in the last array, such that $n = n_1 + n_2 b + \dots + n_k b^{k-1}$. For our final estimate we then compute a weighted median in which the $n_1$ numbers in the first array have weight 1, the $n_2$ numbers in the second array have weight $b$, and the $n_k$ numbers in the last array have weight $b^{k-1}$. This final computation does not need much storage because there are fewer than $bk$ numbers and they only have to be ranked in increasing order, after which their weights must be added until the sum is at least $n/2$.
It’s a bit difficult to directly translate this to the recursive solution I gave above because in the final step we’re going to do a computation on a mixture of values from the different recursive sequences. Let’s give it a shot.
We need some way of bubbling up the incomplete groups for the final weighted median computation. Instead of having each recursive sequence always compute the median of each group, we can add a check to see if the group is smaller than $b$ and, if so, just return the incomplete group:
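A sketch of that modification (the name remedian' and the details are my reconstruction): a group is only reduced to its median when it has exactly $b$ plain numbers; anything incomplete or already-nested bubbles up unchanged:

```clojure
(defn remedian' [b coll]
  (if (next coll)
    (recur b
           (map (fn [group]
                  (if (and (= b (count group))
                           (every? number? group))
                    (median group)
                    group))            ; bubble up incomplete groups
                (partition-all b coll)))
    coll))
```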
For example, if we were using the mutable array implementation proposed in the original paper to compute the remedian of (range 26) with $b = 3$, the final state of the arrays would be:
| Array | $i_0$ | $i_1$ | $i_2$ |
|-------|-------|-------|-------|
| 0     | 24    | 25    | empty |
| 1     | 19    | 22    | empty |
| 2     | 4     | 13    | empty |
In our sequence-based solution, the final sequence will be ((4 13 (19 22 (24 25)))).
Now, we need to convert these nested sequences into [value weight] pairs that could be fed into a weighted median function:
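A sketch of that conversion (the function name is hypothetical): numbers at the current level keep the current weight, and the weight is divided by $b$ each time we descend into a nested sequence:

```clojure
(defn with-weights [b coll]
  (mapcat (fn [x]
            (if (number? x)
              [[x 1]]
              ;; descend: everything in a nested seq weighs 1/b as much
              (map (fn [[v w]] [v (/ w b)])
                   (with-weights b x))))
          coll))
```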
Instead of weighting the values in array $j$ with weight $b^{j-1}$, we’re weighting it at $\frac{b^{j-1}}{b^{k}}$. Dividing all the weights by a constant will give us the same result, and this is slightly easier to compute recursively, as we can just start at 1 and divide by $b$ as we descend into each nested sequence.
If we run this on the (range 26) example with $b = 3$, we get the corresponding sequence of [value weight] pairs.
Finally, we’re going to need a weighted median function. This operates on a collection of [value weight] pairs:
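A sketch that follows the paper’s description: rank the values in increasing order and accumulate weights until the running sum reaches half the total (my reconstruction, not the original listing):

```clojure
(defn weighted-median [pairs]
  (let [sorted (sort-by first pairs)
        total  (reduce + (map second sorted))]
    (loop [[[v w] & more] sorted
           acc 0]
      (let [acc (+ acc w)]
        (if (>= (* 2 acc) total)
          v
          (recur more acc))))))
```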
We can put it all together and redefine the remedian function to deal with the case where $n$ isn’t a power of $b$:
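The pieces might compose like this, assuming the helpers described in the prose above are named remedian' (the grouping step that bubbles up incomplete groups), with-weights (nested sequences to [value weight] pairs), and weighted-median; all of those names are my own, not necessarily the original’s:

```clojure
(defn remedian [b coll]
  (->> coll
       (remedian' b)
       (with-weights b)
       weighted-median))
```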
The remedian is fun, but in practice I prefer to use the approximate quantile methods that were invented a few years later and presented in Approximate Medians and other Quantiles in One Pass and with Limited Memory by Manku, Rajagopalan, and Lindsay (1998). There’s a high-quality implementation you can use in Clojure via Java interop in Parallel Colt’s DoubleQuantileFinderFactory.
In CIDER, you can press C-x C-e to view the value of the last expression in the minibuffer. Being able to move my cursor to a subexpression and see the value of that expression immediately feels like a superpower. I love this ability and it’s one of the things that keeps me locked into Clojure+Emacs as my preferred environment.
This power can be taken to the next level by making custom evaluation commands that run whatever you want on the expression at your cursor.
Let’s start by looking at the Elisp that defines cider-eval-last-sexp, which is what gets invoked when we press C-x C-e:
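The definition is roughly the following, simplified from CIDER of that era; the exact body varies between versions:

```elisp
(defun cider-eval-last-sexp (&optional prefix)
  "Evaluate the expression preceding point.
With a PREFIX argument, insert the result into the current buffer."
  (interactive "P")
  ;; Grab the sexp before point as a string and hand it to the REPL.
  (cider-interactive-eval (cider-last-sexp)))
```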
The important part is that we can use cider-last-sexp to get the expression before the cursor as a string, and we can evaluate a string by passing it to cider-interactive-eval. We’ll write some basic Elisp to make a new function that modifies the string before evaluation and then we’ll bind this function to a new key sequence.
The essential pattern we’ll use is:
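In outline it looks like this; the wrapper form, function name, and key choice below are all placeholders:

```elisp
(defun my-eval-wrapped-last-sexp ()
  "Wrap the sexp before point in some form, then evaluate it."
  (interactive)
  (cider-interactive-eval
   (format "(some-wrapper %s)" (cider-last-sexp))))

(define-key cider-mode-map (kbd "C-c w") 'my-eval-wrapped-last-sexp)
```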
If you happen to still be using Swank or nrepl.el, you can use the analogous swank-interactive-eval and swank-last-sexp, or nrepl-interactive-eval and nrepl-last-sexp.
Let’s look at some of the useful things we can do with this…
I frequently deal with collections that are too big to display nicely in the minibuffer. It’s nice to be able to explore them with a couple keystrokes. Here’s a simple application of the pattern that gives us the size of the collection by just hitting C-c c:
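A reconstruction of that binding (the function name is my own):

```elisp
(defun cider-eval-count-last-sexp ()
  "Show (count ...) of the expression before point."
  (interactive)
  (cider-interactive-eval
   (format "(count %s)" (cider-last-sexp))))

(define-key cider-mode-map (kbd "C-c c") 'cider-eval-count-last-sexp)
```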
Another useful one is to just show the nth value. This one is a little more interesting because it requires a parameter:
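A reconstruction that takes n from the numeric prefix argument, which is how Emacs gives it a default of 1:

```elisp
(defun cider-eval-nth-last-sexp (n)
  "Show the Nth element of the expression before point."
  (interactive "p")  ; numeric prefix argument, defaults to 1
  (cider-interactive-eval
   (format "(nth %s %d)" (cider-last-sexp) n)))

(define-key cider-mode-map (kbd "C-c n") 'cider-eval-nth-last-sexp)
```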
If you just use C-c n, Emacs defaults the parameter value to 1. You can pass an argument using Emacs’ universal argument functionality. For example, to get the 0th element, you could either use C-u 0 C-c n or M-0 C-c n.
Sometimes the best way to view a value is to look at it in an external program. I’ve used this pattern when working on Clojure code that generates SVG, HTML, and 3D models. Here’s what I use for 3D modeling:
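A sketch of the OpenSCAD version. The fixed relative path stands in for however the original located the project root, and it assumes scad-clj is on the project’s classpath:

```elisp
(defun cider-eval-scad-last-sexp ()
  "Write the shape before point to eval.scad for OpenSCAD to pick up."
  (interactive)
  (cider-interactive-eval
   (format "(spit \"eval.scad\" (scad-clj.scad/write-scad %s))"
           (cider-last-sexp))))

(define-key cider-mode-map (kbd "C-c 3") 'cider-eval-scad-last-sexp)
```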
This writes the eval.scad file to the root directory of the project. It’s great because OpenSCAD watches open files and automatically refreshes when they change. You can run this on an expression that defines a shape and immediately see the shape in another window. I used this technique in my recent presentation on 3D printing at the Clojure NYC meetup and I got feedback that this made the live demos easier to follow.
Here’s what it looks like when you press C-c 3 on a nested expression that defines a cube:
If you have to use Swing, your pain can be slightly mitigated by having a quick way to view components. This will give you a shortcut to pop up a new frame that contains what your expression evaluates to:
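A sketch of the Swing version using only javax.swing interop, so it works whether or not Seesaw is on the classpath:

```elisp
(defun cider-eval-frame-last-sexp ()
  "Pop up a JFrame containing the component the sexp evaluates to."
  (interactive)
  (cider-interactive-eval
   (format "(doto (javax.swing.JFrame. \"eval\")
              (.add %s)
              (.pack)
              (.setVisible true))"
           (cider-last-sexp))))

(define-key cider-mode-map (kbd "C-c f") 'cider-eval-frame-last-sexp)
```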
This plays nicely with Seesaw, but doesn’t presume that you have it on your classpath. Here’s what it looks like when you press C-c f at the end of an expression that defines a Swing component:
In A Few Interesting Clojure Microbenchmarks, I mentioned Hugo Duncan’s Criterium library. It’s a reliable way of measuring the performance of an expression. We can bring it closer to our fingertips by making a function for benchmarking an expression instead of just evaluating it:
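A sketch using Criterium’s quick-bench; it assumes Criterium is on the classpath, and the key choice is mine:

```elisp
(defun cider-bench-last-sexp ()
  "Benchmark the expression before point with Criterium."
  (interactive)
  (cider-interactive-eval
   (format "(criterium.core/quick-bench %s)" (cider-last-sexp))))

(define-key cider-mode-map (kbd "C-c b") 'cider-bench-last-sexp)
```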
I find this simple pattern to be quite handy. Also, when I’m showing off the power of nrepl to the uninitiated, being able to invoke arbitrary functions on whatever is at my cursor looks like pure magic.
I hope you find this useful and if you invent any useful bindings or alternative ways of implementing this pattern, please share!
When I first started trying to make models a month ago, I tried Blender. It’s an amazing beast, but after a few hours of tutorials it was clear that it would take a while to get proficient with it. Also, it is really designed for interactive modeling and I need something that I can programmatically tweak.
A couple of friends suggested OpenSCAD, which is touted as “the programmers’ solid 3D CAD modeler.” It provides a powerful set of primitive shapes and operations, but the language itself leaves a bit to be desired. This isn’t a beat-up-on-OpenSCAD post, but a few of the things that irked me were:
Fortunately, Matt Farrell has written scad-clj, an OpenSCAD DSL in Clojure. It addresses every issue I had with OpenSCAD and lends itself to a really nice workflow with the Clojure REPL.
To get started using it, add the dependency [scad-clj "0.1.0"] to your project.clj and fire up your REPL.
All of the functions for creating 3D models live in the scad-clj.model namespace. There’s no documentation yet, so in the beginning you’ll have to look at the source for model.clj and occasionally the OpenSCAD documentation. Fortunately, there really isn’t much to learn and it’s quite a revelation to discover that almost everything you’ll want to do can be done with a handful of functions.
Here’s a simple model that showcases each of the primitive shapes:
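Something along these lines (not the original listing; treat the exact arities as assumptions about scad-clj 0.1.0):

```clojure
(require '[scad-clj.model :refer [cube sphere cylinder union translate]])

(def shapes
  (union (cube 10 10 10)
         (translate [20 0 0] (sphere 6))
         (translate [40 0 0] (cylinder 4 12))))  ; radius 4, height 12
```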
Evaluating this gives us a data structure that can be converted into an .scad file by using scad-clj.scad/write-scad to generate a string and spit to write it out.
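Something like the following, with the file name being my own choice:

```clojure
(require '[scad-clj.model :refer [cube]]
         '[scad-clj.scad :refer [write-scad]])

;; write-scad turns the model data structure into OpenSCAD source;
;; spit writes it to a file that OpenSCAD can watch.
(spit "demo.scad" (write-scad (cube 10 10 10)))
```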
We’re going to use OpenSCAD to view the results. One feature of OpenSCAD that is super useful for this workflow is that it watches opened files and automatically refreshes the rendering when the file is updated. This means that we can just reevaluate our Clojure code and see the results immediately in another window:
scad-clj makes all new primitive shapes centered at the origin. We can use the shape operator functions to move them around and deform them:
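A sketch of the operators in action (again my reconstruction, assuming scad-clj’s rotate takes an angle in radians and an axis):

```clojure
(require '[scad-clj.model :refer [cube sphere union translate rotate scale]])

(union
 (translate [0 0 15] (cube 10 10 10))           ; move up the z-axis
 (rotate (/ Math/PI 4) [0 0 1] (cube 10 10 10)) ; 45 degrees about z
 (scale [1 2 1] (sphere 5)))                    ; stretch along y
```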
I snuck union into those examples. Shapes can also be combined using intersection, difference, and hull. It’s pretty incredible how much can be done with just these. For example, here’s the latest iteration of my keyboard design built using scad-clj:
Once your design is complete, you can use OpenSCAD to export it as an STL file which can then be imported to software like ReplicatorG or Makerware for processing into an .x3g file that can be printed:
Almost two years ago, a coworker showed me some gorgeous code that used Clojure’s thrush macro and I fell in love. I found myself jonesing for C-x C-e whenever I tried going back to Java. I devoured Programming Clojure, then The Joy of Clojure. In search of a purer hit, I turned to the source: McCarthy’s original paper on LISP. After reading it, I realized what someone could have told me that would have convinced me to invest the time 12 years earlier.
There’s a lot of interesting stuff in that paper, but what really struck me was that it felt like it fit into a theoretical framework that I thought I already knew reasonably well. This post isn’t about the power of LISP, which has been covered by others better than I could. Rather, it’s about where LISP fits in the world of computation.
None of what I’m about to say is novel or rigorous. I’m pretty sure that all the novel and rigorous stuff around this topic is 50 – 75 years old, but I just wasn’t exposed to it as directly as I’m going to try and lay out.
One of my favorite classes in school was 15-453: Formal Languages, Automata, and Computation, which used Sipser’s Introduction to the Theory of Computation:
One aspect that I really enjoyed was that there was a narrative; we started with Finite State Automata (FSA), analyzed the additional power of Pushdown Automata (PDA), and saw it culminate in Turing Machines (TM). Each of these models look very similar and have a natural connection: they are each just state machines with different types of external memory.
The tape in the Turing Machine can be viewed as two stacks, with one stack representing everything to the left of the current position and the other stack as the current position and everything to the right. With this model, we can view the computational hierarchy (FSA –> PDA –> TM) as just state machines with 0, 1, or 2 stacks. I think it’s quite an elegant representation and it makes the progression seem quite natural.
A key insight along the journey is that these machines are equivalent in power to other useful systems. A sizable section in the chapter on Finite State Automata is dedicated to their equivalence with Regular Expressions (RegEx). Context Free Grammars (CFG) are actually introduced before Pushdown Automata. But when we get to Turing Machines, there’s nothing but a couple paragraphs in a section called “Equivalence with Other Models”, which says:
Many [languages], such as Pascal and LISP, look quite different from one another in style and structure. Can some algorithm be programmed in one of them and not the others? Of course not — we can compile LISP into Pascal and Pascal into LISP, which means that the two languages describe exactly the same class of algorithms. So do all other reasonable programming languages.
The book and class leave it at that and proceed onto the limits of computability, which is the real point of the material. But there’s a natural question that isn’t presented in the book and which I never thought to ask:
While we know that there are many models that equal Turing Machines, we could also construct other models that equal FSAs or PDAs. Why are RegExs and CFGs used as the parallel models of computation? With the machine model, we were able to just add a stack to move up at each level – is there a natural connection between RegExs and CFGs that we extrapolate to find their next level that is Turing equivalent?
It turns out that the answers to these questions were well covered in the 1950’s by the Chomsky-Schützenberger Hierarchy of Formal Grammars.
The left-hand side of the relations above are the automaton-based models and the right-hand side are the language-based models. The language models are all implemented as production rules, where some symbols are converted to other symbols. The different levels of computation just have different restrictions on what kind of replacement rules are allowed.
For instance, RegExs are all rules of the form $A \to a$ and $A \to aB$, where the uppercase letters are nonterminal symbols and the lowercase are terminal. In CFGs, some of the restrictions on the right-hand side are lifted. Allowing terminals to appear on the left-hand side lets us make rules that are conditional on what has already been replaced, which appropriately gets called “Context Sensitive Grammars.” Finally, when all the restrictions are lifted, we get Recursively Enumerable languages, which are Turing equivalent. The Wikipedia page for the hierarchy and the respective levels is a good source for learning more.
When you look at the definition of LISP in McCarthy’s paper, it’s much closer to being an applied version of Chomsky’s style than Turing’s. This isn’t surprising, given that they were contemporaries at MIT. In McCarthy’s History of Lisp, he explicitly states that making a usable version of this other side was his goal:
These simplifications made LISP into a way of describing computable functions much neater than the Turing machines or the general recursive definitions used in recursive function theory. The fact that Turing machines constitute an awkward programming language doesn’t much bother recursive function theorists, because they almost never have any reason to write particular recursive definitions, since the theory concerns recursive functions in general. They often have reason to prove that recursive functions with specific properties exist, but this can be done by an informal argument without having to write them down explicitly. In the early days of computing, some people developed programming languages based on Turing machines; perhaps it seemed more scientific. Anyway, I decided to write a paper describing LISP both as a programming language and as a formalism for doing recursive function theory.
Here we have it straight from the source. McCarthy was trying to capture the power of recursive definitions in a usable form. Just like the automata theorists, once the language theorists hit Turing completeness, they focused on the limits instead of the usage.
Theoreticians are more interested in the equality of the systems than the usability, but as practitioners we know that it matters that some problems are more readily solvable in different representations. Sometimes it’s more appropriate to use a RegEx and sometimes an FSA is better suited, even though you could apply either. While nobody is busting out the Turing Machine to tackle real-world problems, some of our languages are more influenced by one side or the other.
If you track the imperative/functional divide back to Turing Machines and Chomsky’s forms, some of the roots are showing. Turing Machines are conducive to a couple things that are considered harmful in larger systems: GOTO-based^1 and mutation-centric^2 thinking. In a lot of cases, we’re finding that the languages influenced by the language side are better suited for our problems. Paul Graham argues that the popular languages have been steadily evolving towards the LISPy side.
Anyway, this is a connection that I wish I had been shown at the peak of my interest in automata theory because it would have gotten me a lot more excited about LISP sooner. I think it’s interesting to look at LISP as something that has the same theoretical underpinnings as these other tools (RegEx and CFG) that we already acknowledged as vital.
Thanks to Jason Liszka and my colleagues at Two Sigma for help with this post!
The most annoying part was dealing with GitHub’s rate limits, but after waiting a few hours I had them all on local disk and was able to play around. I haven’t gotten to dig into the data for the actual project I’m doing, but there were a couple simple queries that I thought were worth sharing.
I was able to download 10770 project.clj files. Here are the 50 most frequently included packages listed in their :dependencies:
| Dependency | Count |
|------------|-------|
| org.clojure/clojure-contrib | 1524 |
| compojure | 1348 |
| hiccup | 743 |
| clj-http | 738 |
| ring/ring-jetty-adapter | 607 |
| cheshire | 558 |
| org.clojure/data.json | 552 |
| clj-time | 526 |
| org.clojure/tools.logging | 490 |
| enlive | 444 |
| noir | 388 |
| ring/ring-core | 375 |
| ring | 361 |
| org.clojure/tools.cli | 348 |
| org.clojure/java.jdbc | 344 |
| org.clojure/clojurescript | 339 |
| org.clojure/core.async | 235 |
| midje | 227 |
| org.clojure/math.numeric-tower | 219 |
| korma | 206 |
| incanter | 202 |
| seesaw | 195 |
| overtone | 172 |
| slingshot | 160 |
| quil | 158 |
| com.taoensso/timbre | 150 |
| http-kit | 149 |
| ring/ring-devel | 145 |
| org.clojure/math.combinatorics | 145 |
| org.clojure/core.logic | 138 |
| environ | 132 |
| aleph | 132 |
| log4j | 131 |
| ch.qos.logback/logback-classic | 125 |
| org.clojure/tools.nrepl | 124 |
| congomongo | 124 |
| com.datomic/datomic-free | 123 |
| com.novemberain/monger | 123 |
| lib-noir | 121 |
| org.clojure/core.match | 118 |
| ring/ring-json | 111 |
| clojure | 110 |
| org.clojure/data.xml | 110 |
| log4j/log4j | 109 |
| mysql/mysql-connector-java | 109 |
| postgresql/postgresql | 107 |
| org.clojure/data.csv | 101 |
| org.clojure/tools.trace | 98 |
| org.clojure/tools.namespace | 92 |
| ring-server | 92 |
I think it makes a nice hit list of projects to check out!
A couple interesting things jumped out at me:
Just over half of the project.clj’s don’t contain a :license. Here are the most popular:
| License | Count |
|---------|-------|
| EPL     | 4430  |
| MIT     | 336   |
| Apache  | 106   |
| BSD     | 92    |
| GPL     | 90    |
| LGPL    | 25    |
| CC      | 21    |
| WTFPL   | 18    |
| AGPL    | 11    |
| Mozilla | 11    |
The EPL’s dominance doesn’t come as a surprise, given Clojure’s use of it for the core libraries.
23 projects have “WTF” or “fuck” in their license string:
| License | Count |
|---------|-------|
| WTFPL | 18 |
| Do What The Fuck You Want To Public License | 3 |
| DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE Version 2 | 1 |
| All Rights Reserved Muthafucka | 1 |
I’d like to share a mirror of just the project.clj files wrapped up in a single download, but I want to be mindful of the variety of licenses. I’ll clean up the code for pulling and summarizing all this data soon so others can play with it. In the meantime, feel free to suggest other analyses that could be done on these…
]]>Last year I was working through 4Clojure and I had to reacquaint myself with how to implement one for Problem #125: Gus’s Quinundrum.
A few months later, I saw this tweet from Gary Trakhman:
So simple!
(defn send-tweet
  [tweet]
  (api/statuses-update :oauth-creds my-creds
                       :params {:status tweet}))
— Gary Trakhman (@gtrakGT) November 20, 2013
Seeing him tweet source code that tweets got me thinking about code that tweets its own source code. Could a Quine Tweet be written? I took a stab at adapting my Clojure code for Gus’s Quinundrum, but I just couldn’t make it fit in 140 characters.
The next day, this came across my dash:
Hello world! – tweeting with #wolframlang on @Raspberry_Pi using Send["Twitter","Hello world!" …]
— Stephen Wolfram (@stephen_wolfram) November 21, 2013
Maybe this will enable my impossible dream of a Quine Tweet…
I finally got a Raspberry Pi running with the Wolfram Language and I made it happen:
{o, c} = FromCharacterCode[{{92, 40}, {92, 41}}]; SendMessage["Twitter", StringReplace[InString[$Line], {o -> "", c -> ""}]]
— Matt Adereth (@adereth) January 8, 2014
If you paste it into a notebook and evaluate, you’ll get prompted for authorization and it’ll post itself. Here’s a brief explanation of what it does:
- $Line is the count of input expressions that have been evaluated.
- InString is a function that gets the input for the i-th input expression. It returns a string that has some extra escaped parentheses.
- 92 is the character code for \, and 40 and 41 are the codes for ( and ). FromCharacterCode can take a list of lists of character codes and return a list of strings. The list is destructured into the variables o (open) and c (close).
- StringReplace is then used to clean up the extra parentheses.
- SendMessage is the new function in the Wolfram Language that does all the hard work of posting.

I don’t think this is really in the true spirit of a quine, as having something like InString makes it a bit trivial, but you do what you must when you only have 140 characters!
So, can it be done in any other languages? Here’s what I think are fair restrictions:
Bonus points if you manage to make the tweet and source include #quine!
I’ve been working on a nerd ethnography project with the GitHub API. There’s so much fun data to play with there that it’s inevitable that I’ll get a little distracted…
One distraction was the realization that I could use the search API to get a massive list of the top repos ordered by star count. Once I started looking at the results, I realized that star data is an interesting alternative metric for evaluating language popularity. Instead of looking at which languages people are actually writing new projects using, we can see which languages are used for the most popular projects.
In August 2012, GitHub announced a new version of their notification system that allowed users to easily mark a repository as interesting by “starring” it:
Stars are essentially lightweight bookmarks that are publicly visible. Even though they were introduced just over a year ago, all “watches” were converted to stars so there’s plenty of data.
Let’s start by looking at the top 20:
| Rank | Repository | Language | Stars |
|------|------------|----------|-------|
| 1 | twbs/bootstrap | JavaScript | 62111 |
| 2 | jquery/jquery | JavaScript | 27082 |
| 3 | joyent/node | JavaScript | 26352 |
| 4 | h5bp/html5-boilerplate | CSS | 23355 |
| 5 | mbostock/d3 | JavaScript | 20715 |
| 6 | rails/rails | Ruby | 20284 |
| 7 | FortAwesome/Font-Awesome | CSS | 19506 |
| 8 | bartaz/impress.js | JavaScript | 18637 |
| 9 | angular/angular.js | JavaScript | 17994 |
| 10 | jashkenas/backbone | JavaScript | 16502 |
| 11 | Homebrew/homebrew | Ruby | 15065 |
| 12 | zurb/foundation | JavaScript | 14944 |
| 13 | blueimp/jQuery-File-Upload | JavaScript | 14312 |
| 14 | harvesthq/chosen | JavaScript | 14232 |
| 15 | mrdoob/three.js | JavaScript | 13686 |
| 16 | vhf/free-programming-books | Unknown | 13658 |
| 17 | adobe/brackets | JavaScript | 13557 |
| 18 | robbyrussell/oh-my-zsh | Shell | 13337 |
| 19 | jekyll/jekyll | Ruby | 13283 |
| 20 | github/gitignore | Unknown | 13128 |
If you want to play with the data yourself, I’ve put a cache of the top 5000 repositories here. I’ve also posted the Clojure code I wrote to collect the data at adereth/counting-stars.
In Adam Bard’s Top Github Languages for 2013 (so far), he counted repo creation and found that JavaScript and Ruby were pretty close. The top star counts tell a very different story, with JavaScript dominating 7 of the top 10 spots. CSS was in 11th place in his analysis, but it’s 2 of the top 10 spots.
Observing that 7 of the top 10 spots are JavaScript gives a sense for both the volume and the relative ranking of JavaScript in that range of the leaderboard, but just seeing that another language is 50 of the top 5000 spots doesn’t give nearly as much color.
One approach is to look at the number of repos in different ranges for each language:
| Language | 1–10 | 1–100 | 1–1000 | 1–5000 | Top Repository |
|----------|------|-------|--------|--------|----------------|
| JavaScript | 7 | 54 | 385 | 1605 | twbs/bootstrap (1) |
| CSS | 2 | 8 | 41 | 174 | h5bp/html5-boilerplate (4) |
| Ruby | 1 | 9 | 153 | 786 | rails/rails (6) |
| Python | | 5 | 64 | 420 | django/django (44) |
| Unknown | | 5 | 30 | 138 | vhf/free-programming-books (15) |
| C++ | | 4 | 22 | 108 | textmate/textmate (35) |
| PHP | | 3 | 38 | 248 | symfony/symfony (58) |
| Shell | | 3 | 19 | 89 | robbyrussell/oh-my-zsh (18) |
| Objective-C | | 2 | 89 | 495 | AFNetworking/AFNetworking (30) |
| C | | 2 | 31 | 185 | torvalds/linux (25) |
| Go | | 2 | 13 | 61 | dotcloud/docker (45) |
| Java | | 1 | 32 | 255 | nathanmarz/storm (56) |
| VimL | | 1 | 23 | 66 | mathiasbynens/dotfiles (57) |
| CoffeeScript | | 1 | 22 | 80 | jashkenas/coffeescript (43) |
| Scala | | | 13 | 46 | playframework/playframework (178) |
| C# | | | 8 | 65 | SignalR/SignalR (205) |
| Clojure | | | 2 | 37 | technomancy/leiningen (361) |
| Perl | | | 2 | 26 | sitaramc/gitolite (138) |
| ActionScript | | | 2 | 10 | mozilla/shumway (606) |
| Emacs Lisp | | | 1 | 20 | technomancy/emacs-starter-kit (477) |
| Erlang | | | 1 | 15 | erlang/otp (568) |
| Haskell | | | 1 | 12 | jgm/pandoc (740) |
| TypeScript | | | 1 | 4 | bitcoin/bitcoin (161) |
| Assembly | | | 1 | 3 | jmechner/Prince-of-Persia-Apple-II (269) |
| Elixir | | | 1 | 2 | elixir-lang/elixir (666) |
| Objective-J | | | 1 | 2 | cappuccino/cappuccino (667) |
| Rust | | | 1 | 1 | mozilla/rust (225) |
| Vala | | | 1 | 1 | p-e-w/finalterm (282) |
| Julia | | | 1 | 1 | JuliaLang/julia (356) |
| Visual Basic | | | 1 | 1 | bmatzelle/gow (800) |
| TeX | | | | 6 | ieure/sicp (2441) |
| R | | | | 5 | johnmyleswhite/ML_for_Hackers (2125) |
| Lua | | | | 4 | leafo/moonscript (3351) |
| PowerShell | | | | 3 | chocolatey/chocolatey (1580) |
| Prolog | | | | 3 | onyxfish/csvkit (3498) |
| XSLT | | | | 2 | wakaleo/game-of-life (1093) |
| Matlab | | | | 2 | zk00006/OpenTLD (1292) |
| OCaml | | | | 2 | MLstate/opalang (1380) |
| Dart | | | | 2 | dart-lang/spark (1463) |
| Groovy | | | | 2 | Netflix/asgard (1489) |
| Lasso | | | | 1 | symfony/symfony-docs (2047) |
| LiveScript | | | | 1 | gkz/LiveScript (2226) |
| Scheme | | | | 1 | eholk/harlan (2648) |
| Common Lisp | | | | 1 | google/lisp-koans (2889) |
| XML | | | | 1 | kswedberg/jquery-tmbundle (2972) |
| Mirah | | | | 1 | mirah/mirah (2985) |
| Arc | | | | 1 | arclanguage/anarki (3389) |
| DOT | | | | 1 | cplusplus/draft (3583) |
| Racket | | | | 1 | plt/racket (3761) |
| F# | | | | 1 | fsharp/fsharp (4518) |
| D | | | | 1 | D-Programming-Language/phobos (4719) |
| Ragel in Ruby Host | | | | 1 | jgarber/redcloth (4829) |
| Puppet | | | | 1 | ansible/ansible-examples (4979) |
The table is interesting, but it still doesn’t give us a good sense for how the middle languages (C#, Scala, Clojure, Go) compare. It also reveals that there are different star distributions within the languages. For instance, CSS makes a showing in the top 10 but it has way fewer representatives (174) in the top 5000 than PHP (248), Objective-C (495), or Java (255).
Looking at the top repo for each language also exposes a weakness in the methodology: GitHub’s language identification isn’t perfect and there are a number of polyglot projects. The top Java repo is Storm, which uses enough Clojure (20.1% by GitHub’s measure) to make this identification questionable when you take into account Clojure’s conciseness relative to Java’s.
Looking at the results after ranking obscures the actual distribution of stars. Using a squarified treemap with star count for the size and no hierarchy is a compact way of visualizing the ranking while exposing details about the absolute popularity of each repo. The squarified treemap algorithm roughly maintains the order going from one corner to the other.
Here are the top 1000 repos, using stars for the size and language for the color:
(Language and repository name shown on mouseover, click to visit repository. A bit of a fail on touch devices right now.)
Despite being a little chaotic, we can start to see some of the details of the distributions. It still suffers from being difficult to glean information about the middling languages. The comparisons become a little easier if we group the boxes by language. That’s pretty easy, since that’s really the intended usage of treemaps.
Here are the top 5000 grouped by language:
Honestly, I’m not really in love with this visualization, but it was a fun experiment. I have some ideas for more effective representations, but I need to work on my d3.js-fu. Hopefully it serves as an inspirational starting point for someone else…
Firstly, GitHub’s API is really cool and can give you some insights that aren’t exposed through their UI. Like I said at the start of this post, I have another project that caused me to look at this API in the first place and I’m really excited for the possibilities with this data.
GitHub’s current UI is really focused on using stars to expose what’s trending and doesn’t really make it easy to see the all-time greatest hits. Perhaps the expectation is that everyone already knows these repos, but I certainly didn’t and I’ve discovered or rediscovered a few gems. My previous post came about because of my discovery of Font Awesome through this investigation.
I’ll close out with a couple questions (with no question marks) for the audience:
Through this lens, JavaScript is way more popular than other metrics seem to indicate. One hypothesis is that we all end up exposing things through the browser, so you end up doing something in JavaScript no matter what your language of choice is. I’m interested in other ideas and would also appreciate thoughts on how to validate them.
It’s not obvious to me how to best aggregate ranking data. I’d love to see someone else take this data and expose something more interesting. Even if you’re not going to do anything with the data, any ideas are appreciated.
]]>
Zach Tellman delivered a really informative and practical unsession at Clojure Conj 2013 entitled “Predictably Fast Clojure.” It was described as:
An exploration of some of the underlying mechanisms in Clojure, and how to build an intuition for how fast your code should run. Time permitting, we’ll also explore how to work around these mechanisms, and exploit the full power of the JVM.
I’d like to share a few interesting things that I learned from this talk and that I subsequently verified and explored.
It turns out that benchmarking is hard and benchmarking on the JVM is even harder. Fortunately, the folks at the Elliptic Group have thought long and hard about how to do it right and have written a couple of great articles on the matter. Hugo Duncan’s Criterium library makes it super easy to use these robust techniques.
All the benchmarks in this post were run on my dual-core 2.6 GHz Intel Core i5 laptop. The JVM was started with lein with-profile production repl, which enables more aggressive JIT action at the cost of slower start times. If you try to use Criterium without this, you’ll get warnings spewed for every benchmark you run.
The first thing that he discussed was the relatively poor performance of first on vectors.
For the tests, I made some simple collections:
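A minimal sketch of the kinds of collections described; the names `v` and `l` are my own, and the Criterium calls are shown in comments since they only make sense at a REPL:

```clojure
;; A vector and a list with the same contents. Lists implement ISeq;
;; vectors don't, which is what the timings below are probing.
(def v (vec (range 1000000)))
(def l (apply list (range 1000000)))

;; With Criterium loaded, the timings would be taken like:
;; (require '[criterium.core :refer [quick-bench]])
;; (quick-bench (first v))
;; (quick-bench (nth v 0))
;; (quick-bench (first l))
(first v)
(nth v 0)
(first l)
```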
And then I timed them each with first and (nth coll 0):
The documentation says that first “calls seq on its argument.” This is effectively true, but if you look at the source you’ll see that if the collection implements ISeq, seq doesn’t need to be called. As a result, the performance of first on lists, which do implement ISeq, is much better than on vectors, which don’t. Zach took advantage of this observation in his clj-tuple library and made sure that tuples implement ISeq.
What’s really interesting is that you can use (nth coll 0) to get the first element of a vector faster than you can with first. Unfortunately, this only does well with vectors. The performance is abysmal when applied to lists, so you should stick to first if you don’t know which data structure you are operating on.
The apparent slowness of seq on a vector made me wonder about the empty? function, which uses seq under the hood:
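In clojure.core, empty? is defined as (not (seq coll)), so it pays the full cost of seq. A count-based alternative (my own sketch, not the original listing) looks like:

```clojure
;; clojure.core/empty? is just (not (seq coll)).
;; A hypothetical count-based variant:
(defn count-empty? [coll]
  (zero? (count coll)))

(count-empty? [])       ;=> true
(count-empty? [1 2 3])  ;=> false
```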
If using seq is so slow, perhaps we can get better performance by just getting the count of elements and testing whether it’s zero:
Of course, this is a bad idea for lazy sequences and should probably be avoided, as we’ll incur a cost that is linear in the size of the sequence just to get the count.
I don’t think this will affect my day to day code, but it certainly is interesting and surfaced a bit more about how things actually work in Clojure.
This was a surprising one that also peeled back a layer on Clojure’s implementation. In Fogus’s Why Clojure might not need invokedynamic, but it might be nice, he explained:
Clojure’s protocols are polymorphic on the type of the first argument. The protocol functions are call-site cached (with no per-call lookup cost if the target class remains stable). In other words, the implementation of Clojure’s protocols are built on polymorphic inline caches.
The consequence of this is that we will see worse performance if the type of the first argument to a protocol’s method keeps changing. I made a simple test to see how significant this is:
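A sketch of such a test; the protocol and its implementations are my own invention, while f and g match the names used below:

```clojure
;; A protocol function f extended to two types, and a function g that
;; calls f on both of its arguments.
(defprotocol TypeTest
  (f [this]))

(extend-protocol TypeTest
  String (f [this] :string)
  Long   (f [this] :long))

(defn g [a b]
  [(f a) (f b)])

;; Benchmarking (g "x" "y") keeps f dispatching on a single type,
;; while (g "x" 1) makes f alternate between String and Long.
```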
g calls f on both its arguments, and we expect f to perform best when it’s consistently called on a single type:
The expectation was correct. There was some subsequent talk about whether the penalty of this cache miss was predictable. Theoretically, the cost could be unbounded if you extend the protocol with enough types and have horrible luck with the hash codes of those types colliding, but my understanding of the caching logic is that it will usually be the small constant that we observed here.
You can see why by taking a look at how the cache works in MethodImplCache.java. The hash code of the class is shifted and masked by values that form a simple perfect hash, which is determined by the maybe-min-hash function. The use of a perfect hash means that we should see consistent lookup times for even moderately large caches.
In the rare case that a perfect hash can’t be found by maybe-min-hash, the cache falls back to using a PersistentArrayMap, which can have slightly worse performance. In any case, I don’t think there’s much to worry about here.
One neat thing I discovered while testing all of this is that you don’t suffer this cache-miss penalty if you declare that you support a protocol in your deftype or if you reify, but you do if you use extend-protocol:
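A sketch of the three ways of implementing a protocol; all of the names here are hypothetical:

```clojure
(defprotocol Greet
  (greet [this]))

;; Implemented directly in deftype: the generated class implements
;; the protocol's backing interface.
(deftype InlineGreeter []
  Greet
  (greet [this] :inline))

;; Same for reify:
(def reified-greeter
  (reify Greet
    (greet [this] :reified)))

;; Implemented after the fact with extend-protocol: dispatch has to
;; go through the MethodImplCache.
(deftype ExternalGreeter [])
(extend-protocol Greet
  ExternalGreeter
  (greet [this] :external))
```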
My understanding is that the declaration of a protocol results in the creation of function objects and a corresponding interface. When the function is called, the first thing it does when trying to dispatch is check whether the first argument implements the interface of the protocol that declared the function in the first place. If it does, the corresponding method on the object is called. If it doesn’t implement the interface, the function falls back to the MethodImplCache and has the potential to suffer from the cache miss. What’s great is that if the object does implement the interface, the most recent entry in the cache is unaffected.
We can verify that the reified object and the instance of the type that was deftyped with the protocol both implement the interface and the other one doesn’t:
Often when we want to squeeze every last bit of performance, we use type hints to avoid reflection and to force the use of primitives. Zach demonstrated how to use Gary Trakhman’s no.disassemble to inspect the byte code of a function directly from the REPL.
I haven’t gotten to play with it yet, but the ability to quickly compare the byte code between two implementations in the REPL looked amazing.
Thanks to Zach Tellman for the informative presentation that motivated this and to David Greenberg for help investigating the protocol performance issues.
If there’s anything I got wrong, please let me know in the comments… thanks!
]]>Apache Commons Math is a Java library of mathematics and statistics components. It’s loaded with useful things including:
I highly recommend at least skimming the User Guide. It’s useful to know what’s already available and you may even discover a branch of mathematics that you find interesting.
As with most Java libraries, it’s generally pleasant to use from Clojure via interop. Of course, there are a few places where there’s unnecessary object construction just to get at methods that could easily be static, and there are a few others where mutation rears its ugly head. For the non-static cases, it’s trivial enough to create a fn that creates the object and calls the method you need.
Many of the methods in the library either accept or return matrices and vectors, using the RealMatrix and RealVector interfaces. While we could use interop to create and use these, it’s nice to be able to use them in idiomatic Clojure and even nicer to be able to seamlessly use them with pure Clojure data structures.
core.matrix is a library and API that aims to make matrix and array programming idiomatic, elegant and fast in Clojure. It features pluggable support for different underlying matrix library implementations.
For all my examples, I’ve included core.matrix as m:
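That alias amounts to something like the following (assuming core.matrix is on the classpath):

```clojure
(require '[clojure.core.matrix :as m])
```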
After implementing a few protocols, I was able to get full support for Apache Commons Math’s matrices and vectors into the core.matrix API, which I’ve released as adereth/apache-commons-matrix.
Once you’ve loaded apache-commons-matrix.core, you can begin using the core.matrix functions on any combination of Apache Commons Math matrices and vectors and any other implementation of matrices and vectors, including Clojure’s built-in persistent vectors.
Without this, you have to write some pretty cumbersome array manipulation code to get the interop to work. For instance:
…versus:
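As an illustration of the difference (not the original listing), assuming Apache Commons Math 3 and the apache-commons-matrix implementation are on the classpath, with m aliased to clojure.core.matrix:

```clojure
(import '[org.apache.commons.math3.linear Array2DRowRealMatrix])

(def M (Array2DRowRealMatrix.
        (into-array [(double-array [1.0 2.0])
                     (double-array [3.0 4.0])])))

;; Raw interop: wrap the vector in a double[], then unwrap the result.
(vec (.operate M (double-array [1.0 1.0])))  ; [3.0 7.0]

;; ...versus core.matrix, which accepts a plain Clojure vector:
(m/mmul M [1.0 1.0])
```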
If you’re working from the REPL or otherwise don’t care about indirectly changing the behavior of your code, you could even avoid with-implementation and just make :apache-commons the default by evaluating:
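A sketch, assuming the implementation registers itself under the :apache-commons keyword (my guess at the key):

```clojure
(m/set-current-implementation :apache-commons)
```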
Things become really convenient when you start combining Apache Commons Math data structures with Clojure’s. For example, we can multiply a RealMatrix and a vector:
Note that the type of the result depends on the implementation of the first parameter:
It was really easy to follow the Implementation Guide for core.matrix that Mike Anderson wrote. There were just a handful of protocols that I needed to implement and I magically got all the functionality of core.matrix. The test framework is incredibly thorough and it immediately revealed a number of subtle bugs in my initial implementation. Overall, it was a great experience and I wish that all interfaces provided such nice documentation and testing.
If you’re doing any math on the JVM, you should at least check out what Apache Commons Math has to offer. If you’re using it in Clojure, I recommend using core.matrix instead of interop whenever possible. If you do try this out, please let me know if there’s anything missing or just send me a pull request!
]]>$$\rho_{X,Y}={E[(X-\mu_X)(Y-\mu_Y)] \over \sigma_X\sigma_Y}$$
The Pearson coefficient is $1$ if the datasets have a perfectly positive linear relationship and $-1$ if they have a perfectly negative linear relationship. But what if our data has a clear positive relationship, but it’s not linear? Or what if our data isn’t even numeric and doesn’t have a meaningful way of computing the average, $\mu$, or standard deviation, $\sigma$?
In these cases, Kendall’s Tau is a useful way of measuring the correlation since it only requires that we have a total ordering for each of our datasets. For each pair of observations, $(x_1, y_1)$ and $(x_2, y_2)$, we call the pair concordant if: $$x_1 < x_2 \text{ and } y_1 < y_2$$ $$\text{or}$$ $$x_1 > x_2 \text{ and } y_1 > y_2$$ …and we call the pair discordant if: $$x_1 < x_2 \text{ and } y_1 > y_2$$ $$\text{or}$$ $$x_1 > x_2 \text{ and } y_1 < y_2$$ If $x_1 = x_2 \text{ or } y_1 = y_2$, the pair is neither concordant nor discordant.
Kendall’s Tau is then defined as: $$\tau = \frac{n_c-n_d}{\frac{1}{2} n (n-1) }$$ Where $n_c$ is the number of concordant pairs and $n_d$ is the number of discordant pairs. Since $n (n-1) / 2$ is the total number of pairs, this value ranges from $-1$ to $1$.
Unfortunately, this approach doesn’t deal well with tied values. Consider the following set of $(x,y)$ observations: $$(1,1), (1,1), (2,2), (3,3)$$ There’s a perfectly positive linear relationship between X and Y, but only 5 of the 6 pairs are concordant. For this case we want to use the $\tau_B$ modified version:
$$\tau_B = \frac{n_c-n_d}{\sqrt{(n_0-n_1)(n_0-n_2)}}$$
…where:
$$n_0 = n(n-1)/2$$ $$n_1 = \text{Number of pairs with tied values in } X$$ $$n_2 = \text{Number of pairs with tied values in } Y$$
We can compute $\tau_B$ in $O(n^{2})$ by looking at every pair of observations and tallying the number of concordant, discordant, and tied pairs. Once we have the tallies, we’ll apply the formula:
For a given pair of observations, we’ll construct a map describing which tallies it will contribute to:
Now we need a way of generating every pair:
Finally, we put it all together by computing the relations tally for each pair and combining them using merge-with:
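Putting those pieces together, here is one self-contained sketch of the $O(n^2)$ approach as described; all of the names are mine rather than the original ones:

```clojure
;; Classify a single pair of observations.
(defn relations [[x1 y1] [x2 y2]]
  (cond
    (and (= x1 x2) (= y1 y2)) {:x-tied 1 :y-tied 1}
    (= x1 x2)                 {:x-tied 1}
    (= y1 y2)                 {:y-tied 1}
    (= (compare x1 x2)
       (compare y1 y2))       {:concordant 1}
    :else                     {:discordant 1}))

;; Every unordered pair of observations.
(defn pairs [coll]
  (let [v (vec coll)
        n (count v)]
    (for [i (range n), j (range (inc i) n)]
      [(v i) (v j)])))

;; Tally all pairs with merge-with and apply the tau-b formula.
(defn kendalls-tau-b [xs ys]
  (let [obs (map vector xs ys)
        {:keys [concordant discordant x-tied y-tied]
         :or   {concordant 0 discordant 0 x-tied 0 y-tied 0}}
        (apply merge-with + (map #(apply relations %) (pairs obs)))
        n  (count obs)
        n0 (/ (* n (dec n)) 2)]
    (/ (- concordant discordant)
       (Math/sqrt (* (- n0 x-tied) (- n0 y-tied))))))
```

On the tied example from above, (kendalls-tau-b [1 1 2 3] [1 1 2 3]) yields 1.0, as desired.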
In 1966, William R. Knight was a visiting statistician at the Fisheries Research Board of Canada. He wrote:
The problem of calculating Kendall’s tau arose while attempting to evaluate species associations in catches by the Canadian east coast offshore fishery. Sample sizes ranging up to 400 were common, making manual calculations out of the question; indeed, an initial program using an asymptotically inefficient method proved expensively slow.
Necessity is the mother of invention, so he came up with a clever algorithm for computing Kendall’s Tau in $O(n \log{n})$ which he published in his paper entitled “A Computer Method for Calculating Kendall’s Tau with Ungrouped Data”.
First, sort the observations by their $x$ values using your favorite $O(n \log{n})$ algorithm. Next, sort that sorted list by the $y$ values using a slightly modified merge sort that keeps track of the size of the swaps it had to perform.
Recall that merge sort works as follows:
(description and animation from Wikipedia)
The trick is performed when merging sublists. The list was originally sorted by $x$ values, so whenever an element from the second sublist is smaller than the next element from the first sublist we know that the corresponding observation is discordant with however many elements remain in the first sublist.
We can implement this modified merge sort by first handling the case of merging two sorted sequences:
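A sketch of that counting merge (names are my own): the two runs are each sorted, and every time the head of the right-hand run wins, it forms a discordant pair with everything remaining in the left-hand run:

```clojure
(defn merge-counting
  "Merge two sorted seqs, returning [merged discordant-count].
  Using (count l) here is O(n); a tuned version would track the
  remaining length instead."
  [left right]
  (loop [l (seq left), r (seq right), out [], disc 0]
    (cond
      (nil? l) [(into out r) disc]
      (nil? r) [(into out l) disc]
      (<= (first l) (first r))
      (recur (next l) r (conj out (first l)) disc)
      :else
      (recur l (next r) (conj out (first r)) (+ disc (count l))))))
```

For example, merging [1 3 5] and [2 4] counts the three discordant pairs (3,2), (5,2), and (5,4).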
Now, we can do the full merge sort by applying that function to piece sizes that double until the whole collection is covered by a single sorted piece:
The only thing we are missing now is the tallies of tied pairs. We could use clojure.core/frequencies, but Knight’s original paper alludes to a different way, which takes advantage of the fact that at different stages of the algorithm we have the list sorted by $X$ and then by $Y$. Most implementations do something like:
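Counting tied pairs in a sequence that is already sorted reduces to looking at runs of equal values; a sketch:

```clojure
(defn tied-pairs
  "Number of tied pairs in a sorted sequence: each run of k equal
  values contributes k*(k-1)/2 tied pairs."
  [sorted]
  (reduce + (for [run (partition-by identity sorted)
                  :let [k (count run)]]
              (/ (* k (dec k)) 2))))
```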
Now we have all the pieces, so we just have to put them together:
There are certainly many things I would write differently above if I was really trying for performance. The goal here was to clearly illustrate the algorithm and maintain the asymptotic runtime characteristics.
Also, I recently submitted a patch to the Apache Commons Math library that contains an implementation of this in pure Java if that’s your thing.
I think this algorithm is a clever little gem and I really enjoyed learning it. Deconstructing a familiar algorithm like merge sort and utilizing its internal operations for some other purpose is a neat approach that I’ll definitely keep in my algorithmic toolbox.
]]>use it, you can write things like:
Binet’s Fibonacci Number Formula:
Inclusion-Exclusion Principle:
Instructions for use are on the project’s GitHub page. The full list of implemented symbols is in src/unicode_math/core.clj.
I’m talking about the Kahan Summation algorithm. Maybe it gets ignored because it’s covered halfway through the paper. Despite being buried, you can tell it’s important because the author uses uncharacteristically strong language at the end of the section on the algorithm:
Since these bounds hold for almost all commercial hardware, it would be foolish for numerical programmers to ignore such algorithms, and it would be irresponsible for compiler writers to destroy these algorithms by pretending that floatingpoint variables have real number semantics.
Whoa. Let’s not be foolish!
We’re going to be computing a partial sum of the Harmonic Series:
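That is, the first $n$ terms of:

$$H_n = \sum_{k=1}^{n} \frac{1}{k} = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$$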
It’s another nice example because it contains terms that can’t be represented precisely in floating point and the true sum diverges.
Let’s start by computing the sum with infinite precision. Clojure’s Ratio class represents values internally using BigInteger to separately store the numerator and denominator. The summation happens using the grade-school style of making the denominators match and summing the numerators, so we have the exact running sum throughout. At the very end, we convert the number to a floating point double:
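A sketch of that computation, with harmonic-ratios (the name referred to later in the post) built as a lazy seq of exact ratios:

```clojure
;; Lazy seq of exact rationals 1, 1/2, 1/3, ...
(def harmonic-ratios
  (map / (repeat 1) (iterate inc 1)))

;; Exact rational sum of the first 10,000 terms, converted to a
;; double only at the very end:
(double (reduce + (take 10000 harmonic-ratios)))
```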
For the first 10,000 elements, we’ll see numerical differences starting at the 14th decimal place, so just focus on the last two digits in the results.
As expected, we see a slightly different result if we compute the sum of doubles:
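The plain double version, a sketch with a harmonic-ratios definition included so the snippet stands alone:

```clojure
(def harmonic-ratios
  (map / (repeat 1) (iterate inc 1)))

;; Convert each term to a double first, then sum left to right:
(reduce + (map double (take 10000 harmonic-ratios)))
```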
One approach that will get more accurate results is to use an arbitrary precision representation of the numbers, like BigDecimal. If we naively try to convert harmonic-ratios to BigDecimal, we get an ArithmeticException as soon as we hit 1/3:
We need to explicitly set the precision that we want using a MathContext. Let’s use 32 decimal places for good measure:
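A sketch using with-precision, which binds the MathContext that BigDecimal arithmetic (including the ratio-to-BigDecimal division) will use:

```clojure
(def harmonic-ratios
  (map / (repeat 1) (iterate inc 1)))

;; Inside with-precision, dividing a ratio's numerator by its
;; denominator rounds to 32 significant digits instead of throwing:
(with-precision 32
  (double (reduce + (map bigdec (take 10000 harmonic-ratios)))))
```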
Now, let’s see how the Kahan Summation algorithm performs on doubles:
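The compensated loop itself, as a sketch:

```clojure
(defn kahan-sum
  "Kahan's compensated summation over doubles: `c` carries the
  low-order bits lost by each addition so they can be re-applied
  on the next step."
  [xs]
  (loop [xs (seq xs), sum 0.0, c 0.0]
    (if xs
      (let [y (- (double (first xs)) c)
            t (+ sum y)]
        (recur (next xs) t (- (- t sum) y)))
      sum)))

(kahan-sum (map double (take 10000 (map / (repeat 1) (iterate inc 1)))))
```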
Everything but vanilla summation of doubles has given us the same answer!
To be fair to doubles, we are summing them in what is intuitively a poor order. The smallest values are being added to the largest intermediate sums, preventing their low-order bits from accumulating. We can try to remedy this by reversing the order:
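The reversed summation, a sketch with the definition repeated so it stands alone:

```clojure
(def harmonic-ratios
  (map / (repeat 1) (iterate inc 1)))

;; Sum smallest-to-largest so the small terms can accumulate before
;; they meet the large intermediate sums:
(reduce + (reverse (map double (take 10000 harmonic-ratios))))
```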
Well, that’s different. This is the first time we’re seeing the floating point noise lead to something larger than the infinite precision answer.
For just a couple additional floating point operations per element, we get a result that competes with the more expensive arbitrary precision solutions. It also does better than the naive approach of presorting, which is both more expensive and eliminates the ability to deal with the data in a streaming fashion.
In a subsequent post, I plan on covering how Kahan Summation can be used effectively in a map-reduce framework.
]]>