Distributed Black-Box Optimization Talk at QCon

Himmelblau’s Function, a popular test function for black-box optimization

I gave a talk on Distributed Black-Box Optimization at QCon NY 2017. The video and slides are available here.

For reference, here are all the papers that are mentioned in the talk in the order they are covered:

O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, M. Zhang (2017) CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. NSDI ’17.

J. A. Nelder, R. Mead (1965) A Simplex Method for Function Minimization. The Computer Journal, Vol. 7, Issue 4.

J. E. Dennis Jr., V. Torczon (1991) Direct Search Methods on Parallel Machines. SIAM Journal on Optimization, Vol. 1.

J. Mockus (1989) Bayesian Approach to Global Optimization.

J. Wang, S. C. Clark, E. Liu, P. I. Frazier (2016) Parallel Bayesian Global Optimization of Expensive Functions. arXiv:1602.05149.

K. Swersky, J. Snoek, R. P. Adams (2014) Freeze-Thaw Bayesian Optimization. arXiv:1406.3896.

V. Dalibard, M. Schaarschmidt, E. Yoneki (2017) BOAT: Building Auto-Tuners with Structured Bayesian Optimization. Proceedings of the 26th International Conference on World Wide Web.

Playing With Wolfram Playing Cards

A few weeks ago, I knocked my favorite mug off the counter and it shattered. RIP:

I Heart Mathematica Mug

Like a good consumer, I immediately went to the Wolfram store to buy a replacement, but they no longer make it. They did have a deck of playing cards which promised a unique illustration and generating code on each card. How could I pass it up?

The cards arrived a few days ago and they were way cooler than I imagined. As I flipped through the deck, I realized a lot of the cards would be really interesting as starting points for animations. I’ve been going to town with them and posting to Twitter in an epic thread.

I’ll lay them out here grouped by concept… enjoy!

Just Rotating Squares

Imaginary Numbers

Trigonometry

Miscellaneous

Grand Finale

Thanks for making it this far! Your reward is the K♣:

King of Clubs

Writing a Halite Bot in Clojure

Halite: May the Best Bot Win

Halite is a new AI programming competition that was recently released by Two Sigma and Cornell Tech. It was designed and implemented by two interns at Two Sigma and was run as the annual internal summer programming competition.

While the rules are relatively simple, it proved to be a surprisingly deep challenge. It’s played on a 2D grid and a typical game looks like this:

Each turn, all players simultaneously issue one of the following commands to each of their pieces:

  1. Move to an adjacent location and capture it if you are stronger than what’s currently there.
  2. Stay put and build strength based on the production value of your current location.

When two players’ pieces are adjacent to each other, they automatically fight. A much more detailed description is available on the Halite Game Rules page.

Bots are run as subprocesses that communicate with the game environment through STDIN and STDOUT, so it’s very simple to create bots in the language of your choice. While Python, Java, and C++ bot kits were all provided by the game developers, the community quickly produced kits for C#, Rust, Scala, Ruby, Go, PHP, Node.js, OCaml, C, and Clojure. All the starter packages are available on the Halite Downloads page.

Clojure Bot Basics

The flow of all bots is the same:

  1. Read the initial state (your player ID and the starting game map conditions)
  2. Write your bot’s name
  3. Read the current game map
  4. Write your moves
  5. GOTO #3

The Clojure kit represents the game map as a 2D vector of Site records:

(defrecord Site
    [^int x
     ^int y
     ^int production
     ^int strength
     ^int owner])

And movement instructions are simple keywords:

(def directions [:still :north :east :south :west])

A simple bot that finds all the sites you control and issues random moves would look like this:

(ns MyBot
  (:require [game] [io])
  (:gen-class))

(defn random-moves
  "Takes your bot's ID and a 2D vector of Sites and returns a map from site to direction"
  [my-id game-map]
  (let [my-sites (->> game-map
                      flatten
                      (filter #(= (:owner %) my-id)))]
    (zipmap my-sites (repeatedly #(rand-nth game/directions)))))

(defn -main []
  (let [{:keys [my-id productions width height game-map]} (io/get-init!)]
    (println "MyFirstClojureBot")
    (doseq [turn (range)]
      (let [game-map (io/create-game-map width height productions (io/read-ints!))]
        (io/send-moves! (random-moves my-id game-map))))))

Next Steps

There are currently almost 900 bots competing on the site, but only a handful are written in Clojure! I’m sure the Clojure community could do some interesting things here, so head over to halite.io, sign up using your GitHub account, and download the Clojure starter kit.

Bag of Little Bootstraps Presentation at PWL SF

I recently gave a talk on the Bag of Little Bootstraps algorithm at Papers We Love Too in San Francisco:

My part starts around 31:27, but you should watch the excellent mini talk at the beginning too! For reference, here are all the papers that are mentioned in the talk:

Quenouille, M. H. (1956) Notes on Bias in Estimation. Biometrika, Vol. 43, No. 3 / 4.

Jaeckel, L. (1972) The infinitesimal jackknife. Bell Laboratories Memorandum, #MM 72-1215-11.

Miller, R. G. (1974) The jackknife: a review. Biometrika, Vol. 61, 1-15.

Efron, B. (1979) Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1-16.

Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. (2012) A scalable bootstrap for massive data.

Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. (2012) The big data bootstrap.

Also, in case there’s any confusion, I don’t train Arabian horses.

Visualizing Girl Talk: Parsing With Clojure’s Instaparse

Greg Gillis, also known as Girl Talk, is an impressive DJ who creates mega-mashups consisting of hundreds of samples. His sample selections span 7 decades and dozens of genres. Listening to his albums is a bit like having a concentrated dump of music history injected right into your brain.

Neo: "I know pop culture"

I became a little obsessed with his work last year and I wanted a better way to experience his albums, so I created an annotated player using Clojure and d3 that shows relevant data and links about every track as it plays:

Image of player

I have versions of this player for his two most recent albums:

Unfortunately, they only really work on the desktop right now.

Parsing Track Data

I’ve released all the code that I used to collect the data and to generate the visualizations, but in this post I’m just going to talk about the first stage of the process: getting the details of which tracks were sampled at each time.

There’s an excellent (totally legal!) crowd-sourced wiki called Illegal Tracklist that has information about most of the samples displayed like this:

  1. “Oh No” - 5:39
  2. “Let It Out” - 6:29

At first, I used Enlive to suck down the HTML versions of the wiki pages, but I realized it might be cleaner to operate off the raw wiki markup which looks like this:

!! 1. "Oh No" - 5:39
* 0:03 - 2:08 [[http://en.wikipedia.org/wiki/Black%20Sabbath | Black Sabbath]] - "[[http://en.wikipedia.org/wiki/War%20Pigs | War Pigs]]"
* 0:13 - 0:15 [[http://en.wikipedia.org/wiki/Tupac_Shakur | 2Pac]] featuring [[http://en.wikipedia.org/wiki/K-Ci%20&%20JoJo | K-Ci & JoJo]] - "[[http://en.wikipedia.org/wiki/How%20Do%20U%20Want%20It | How Do U Want It]]"
* 0:15 - 0:15 [[http://en.wikipedia.org/wiki/Jay-Z | Jay-Z]] - "[[http://en.wikipedia.org/wiki/99%20Problems | 99 Problems]]"
* 0:20 - 2:02 [[http://en.wikipedia.org/wiki/Ludacris | Ludacris]] featuring [[http://en.wikipedia.org/wiki/Mystikal | Mystikal]] and [[http://en.wikipedia.org/wiki/I-20%20%28rapper%29 | I-20]] - "[[http://en.wikipedia.org/wiki/Move%20Bitch | Move Bitch]]"
* 0:20 - 0:54 JC featuring [[http://en.wikipedia.org/wiki/Yung%20Joc | Yung Joc]] - "Vote 4 Me"

I wrote a few specialized functions to pull the details out of the strings and into a data structure, but it quickly became unwieldy and unreadable. I then saw that this was a perfect opportunity to use Instaparse. Instaparse is a library that makes it easy to build parsers in Clojure by writing context-free grammars.

Here’s the Instaparse grammar I used to parse the Illegal Tracklist’s markup format:

wiki-line = title-track-line | sample-track-line | <''>

title-track-line = <'!! '> track-number <'. '> track-name <' - '> track-time
track-number = #'\\d+'
track-name = #'[^-]+(?= - )'
track-time = time

sample-track-line = <'* '> start-time <' - '> end-time <' '> artist-name <' - '> sample-name
artist-name = (link | artist-plain-text)*
artist-plain-text = #'[^\\[]+(?= - )' | #'[^\\[]+(?=\\[)'

sample-name = (link | sample-plain-text)*
sample-plain-text = #'[^\\[]*'

link = <'[['> url <' | '> text <']]'>
url = #'[^|]+(?= | )'
text = #'[^]]*'

start-time = time
end-time = time
<time> = #'\\d+:\\d+'

The high-level structure is practically self-documenting: each line in the wiki source is either a title track line, a sample track line, or a blank line, and each type of line is broken down into named components separated by string literals that are hidden from the output (that’s what the angle brackets do). It does, however, get a bit nasty when you reach the terminal rules, which are defined as regular expressions. Instaparse truly delivers on its tagline:

What if context-free grammars were as easy to use as regular expressions?

The only problem is that regular expressions aren’t always easy to use, especially once you start worrying about not greedily matching text that the surrounding rules still need.
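
For reference, wiring the grammar up looks roughly like this. This is a sketch with my own names: the grammar above is assumed to live in a file called tracklist.ebnf, and Instaparse’s :enlive output format is what produces the {:tag ... :content ...} maps shown below.

(ns tracklist.parse
  (:require [instaparse.core :as insta]
            [clojure.string :as str]))

;; Build a parser from the grammar above (assumed here to live in a file
;; named "tracklist.ebnf"). The :enlive output format produces the
;; {:tag ... :content ...} maps shown below.
(def parse-wiki-line
  (insta/parser (slurp "tracklist.ebnf") :output-format :enlive))

;; Parse the raw wiki markup one line at a time.
(defn parse-tracklist [wiki-text]
  (map parse-wiki-line (str/split-lines wiki-text)))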

Despite some of the pain of regular expressions and grammar debugging, Instaparse was awesome for this part of the project and I would definitely use it again. I love the organization that it brought to the code and the structure I got out was very usable:

({:tag :wiki-line,
  :content
  ({:tag :title-track-line,
    :content
    ({:tag :track-number, :content ("1")}
     {:tag :track-name, :content ("\"Oh No\" ")}
     {:tag :track-time, :content ("5:39")})})}
 {:tag :wiki-line,
  :content
  ({:tag :sample-track-line,
    :content
    ({:tag :start-time, :content ("0:03")}
     {:tag :end-time, :content ("2:08")}
     {:tag :artist-name,
      :content
      ({:tag :link,
        :content
        ({:tag :url,
          :content ("http://en.wikipedia.org/wiki/Black%20Sabbath")}
         {:tag :text, :content ("Black Sabbath")})})}
     {:tag :sample-name,
      :content
      ({:tag :sample-plain-text, :content ("\"")}
       {:tag :link,
        :content
        ({:tag :url,
          :content ("http://en.wikipedia.org/wiki/War%20Pigs")}
         {:tag :text, :content ("War Pigs")})}
       {:tag :sample-plain-text, :content ("\"")})})})}
 {:tag :wiki-line,
  :content
  ({:tag :sample-track-line,
    :content
    ({:tag :start-time, :content ("0:13")}
     {:tag :end-time, :content ("0:15")}
     {:tag :artist-name,
      :content
      ({:tag :link,
        :content
        ({:tag :url,
          :content ("http://en.wikipedia.org/wiki/Tupac_Shakur")}
         {:tag :text, :content ("2Pac")})}
       {:tag :artist-plain-text, :content (" featuring ")}
       {:tag :link,
        :content
        ({:tag :url,
          :content ("http://en.wikipedia.org/wiki/K-Ci%20&%20JoJo")}
         {:tag :text, :content ("K-Ci & JoJo")})})}
     {:tag :sample-name,
      :content
      ({:tag :sample-plain-text, :content ("\"")}
       {:tag :link,
        :content
        ({:tag :url,
          :content
          ("http://en.wikipedia.org/wiki/How%20Do%20U%20Want%20It")}
         {:tag :text, :content ("How Do U Want It")})}
       {:tag :sample-plain-text, :content ("\"")})})})}
       ...

Clojure/conj Talk on 3D Printing Keyboards

I gave a talk on building an ergonomic keyboard using Clojure at Clojure/conj 2015:

Here’s a complete list of the references from the talk, in order of appearance:

Presentation on the Mode Tree at Papers We Love Too

I recently gave a mini talk on The Mode Tree: A Tool for Visualization of Nonparametric Density Features at Papers We Love Too in San Francisco. The talk is just the first 10 minutes:

I did the entire presentation as one huge sequence of animations using D3.js. The YouTube video doesn’t capture the glory that is SVG, so I’ve posted the slides.

I also finally got to apply the technique that I wrote about in my Colorful Equations with MathJax post from over a year ago, only instead of matching colored explanatory text, the equation’s colors match the accompanying chart:

$$ \definecolor{kernel}{RGB}{217,86,16} \definecolor{kde}{RGB}{203,23,206} \definecolor{point}{RGB}{18,110,213} \definecolor{width}{RGB}{45,177,93} \color{kde}\hat{f}_{\color{width}h}\color{black}(x) \color{black} = \frac{1}{n\color{width}h}\color{black}\sum\limits_{i=1}^n \color{kernel}K\color{black}(\frac{x-\color{point}X_i}{\color{width}h}) $$

Any questions or feedback on the presentation are welcome… thanks!

SA Profile Keys on a Kinesis Advantage

Despite investing significant time into the Ergodox, I still prefer the Kinesis Advantage:

Kinesis Advantage

I’ve been (slowly) working on building my own dream keyboard, but in the meantime I make the occasional ridiculous tweak to my Kinesis. This weekend I got 3 lbs. of reject keys from Signature Plastics, and the haul happened to include enough of the right keycaps to try something I’ve been wanting to test for a while.

Kinesis Advantage with DSA keycaps

Almost a year ago, I saw Jesse Hallett’s excellent post detailing his Kinesis Advantage with DSA keycaps. His results were impressive:

Kinesis Advantage with DSA Keycaps

Source: Jesse Hallett’s Blog

DSA keycaps are low and uniform, with spherical tops:

DSA Keycap Profile

Source: Signature Plastics

They are particularly well suited for non-standard keyboard layouts, like the ErgoDox or Kinesis Advantage, because most other keycaps have variable heights and angles that were designed for traditional staggered layouts on a plane.

SA keycaps

Personally, I’m a big fan of the SA profile:

SA Keycap Profile

They’re super tall and are usually manufactured with thick ABS plastic and given a smooth, high-gloss finish. I’m a bit of a retro-fetishist and these hit all the right spots. It’s no surprise that projects and group buys that use SA profile often have a retro theme:

For this project I decided to go with all Row 3 profile SA keys in the two main key wells. Row 3 is the home row on traditional layouts and the keycaps are completely symmetric, just like the DSA profile keys. I was concerned about two main things:

  1. The curvature of the key well combined with the height might mean that the keys hit each other towards the top.
  2. Jesse ran into one spot in the corner of the keywell closest to the thumb clusters where the keycap hit the plastic. This would almost certainly be a problem with the much larger SA keys.

The easiest way to find out was to pop off all the keycaps:

Kinesis Advantage with Keycaps Removed

…and to see what fit:

Kinesis Advantage with Keycaps Removed

Fortunately, the height was not an issue. The angles are all perfect and the tops of the keys form a relatively tight surface. The smaller gaps between the tops, combined with the extreme smoothness of the SA keycaps, result in a nice effect where I’m able to just glide my fingers across them without having to lift, similar to the feel of modern Apple keyboards.

Unfortunately, 3 keys at the bottom of each keywell wouldn’t fit. I busted out the Dremel and carved out a little from the bottom, making room for those beautiful fat caps:

Kinesis Advantage with Keycaps Removed

Kinesis Advantage with Keycaps Removed

I used Row 1 keycaps for the 1x keys in the thumb clusters. The extra height makes them not quite as difficult to reach:

Kinesis Advantage with Keycaps Removed

Conclusion

Overall, I’m pretty happy with how it turned out:

Kinesis Advantage with Keycaps Removed

It’s a pleasure to type on and I’m convinced that SA profile keys are going to be what I use when I eventually reach my final form.

If you’re interested in doing this yourself, you should know that it’ll be difficult to source the 1.25x keys that are used on the sides of the Kinesis Advantage. A lot of the keysets being sold for split layouts target the Ergodox, which uses 1.5x keys on the side. It does look like the Symbiosis group buy includes enough 1.25x’s in the main deal, but you’d also need to buy the pricey Ergodox kit just to get those 4 2x’s. If you don’t mind waiting an indefinite length of time, the Round 5a group buy on Deskthority is probably your best bet for getting exactly what you want.

Poisonous Shapes in Algebra and Graph Theory

I recently started learning about Algebraic Lattices because I was interested in how they relate to Automata Theory. It turns out they’re fundamental to a bunch of other things that I’ve always wanted to learn more about, like CRDTs and Type Theory.

I stumbled upon an interesting property about lattices that reminded me of something from Graph Theory…


Distributive Lattices

First, let’s start with the mathy definition of a lattice given by Wikipedia:

If $(L, \leq)$ is a partially ordered set, and $S \subseteq L$ is an arbitrary subset, then an element $u \in L$ is said to be an upper bound of $S$ if $s \leq u$ for each $s \in S$. A set may have many upper bounds, or none at all. An upper bound $u$ of $S$ is said to be its least upper bound, or join, or supremum, if $u \leq x$ for each upper bound $x$ of $S$. A set need not have a least upper bound, but it cannot have more than one. Dually, $l \in L$ is said to be a lower bound of $S$ if $l \leq s$ for each $s \in S$. A lower bound $l$ of $S$ is said to be its greatest lower bound, or meet, or infimum, if $x \leq l$ for each lower bound $x$ of $S$. A set may have many lower bounds, or none at all, but can have at most one greatest lower bound.

A partially ordered set $(L, \leq)$ is called a join-semilattice and a meet-semilattice if each two-element subset $\{a,b\} \subseteq L$ has a join (i.e. least upper bound) and a meet (i.e. greatest lower bound), denoted by $a \vee b$ and $a \wedge b$, respectively. $(L, \leq)$ is called a lattice if it is both a join- and a meet-semilattice.

That’s a lot to take in, but it’s not so bad once you start looking at lattices using Hasse diagrams. A Hasse diagram is a visualization of a partially ordered set, where each element is drawn below the other elements that it is less than. Edges are drawn between two elements $x$ and $y$ if $x < y$ and there is no $z$ such that $x < z < y$.

For example, if we define the $\leq$ operator to be “is a subset of”, we get a partial ordering for any powerset. Here’s the Hasse diagram for the lattice $(\mathcal{P}(\{1,2,3\}), \subseteq)$:

Hasse diagram of $(\mathcal{P}(\{1,2,3\}), \subseteq)$: $\{\}$ at the bottom; the singletons $\{1\}, \{2\}, \{3\}$ above it; then $\{1,2\}, \{1,3\}, \{2,3\}$; and $\{1,2,3\}$ on top

In this case, the $\subseteq$ operator results in $\vee$ (the “join”) being the union of the sets and $\wedge$ (the “meet”) being the intersection of the sets. It’s a tragic failure of notation that $\vee$ means you should find the common connected element that is above the pair in the diagram and $\wedge$ means you should look below.
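
Computing these covering relations mechanically is straightforward. Here’s a minimal Clojure sketch for the powerset example above; the namespace and function names are mine, and it assumes clojure.math.combinatorics is available:

(ns lattice.hasse
  (:require [clojure.math.combinatorics :as combo]
            [clojure.set :as set]))

(defn proper-subset? [a b]
  (and (not= a b) (set/subset? a b)))

;; y covers x when x < y and nothing sits strictly between them;
;; the covering pairs are exactly the edges of the Hasse diagram.
(defn covers? [universe x y]
  (and (proper-subset? x y)
       (not-any? #(and (proper-subset? x %) (proper-subset? % y)) universe)))

(def elements (map set (combo/subsets [1 2 3])))

;; The twelve edges of the diagram above, e.g. [#{} #{1}] and [#{2 3} #{1 2 3}].
(def hasse-edges
  (for [x elements, y elements :when (covers? elements x y)]
    [x y]))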

Another example is the set of divisors of a number, where the $\leq$ operator means “divides into”. Here’s the Hasse diagram for all the divisors of 30:

Hasse diagram of the divisors of 30: 1 at the bottom; 2, 3, 5 above it; then 6, 10, 15; and 30 on top

In this case, $\vee$ ends up meaning Least Common Multiple and $\wedge$ means Greatest Common Divisor.

Yet another example is the set of binary strings of a fixed length $n$, with $S \leq T$ meaning $S_i \leq T_i$ for $1 \leq i \leq n$:

Hasse diagram of the length-3 binary strings: 000 at the bottom; 001, 010, 100 above it; then 011, 101, 110; and 111 on top

In this case, $\vee$ is the binary OR operator and $\wedge$ is the binary AND operator.

These examples are interesting because they’re all the same structure! If we can prove a fact about the structure, we get information about all three systems. These happen to be Boolean lattices, which have the distributive property:

$$a \vee (b \wedge c) = (a \vee b) \wedge (a \vee c)$$ $$\forall a, b, c \in L$$
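
As a quick concrete check, take $a=2$, $b=3$, $c=5$ in the divisor lattice above, where $\vee$ is lcm and $\wedge$ is gcd:

$$2 \vee (3 \wedge 5) = \operatorname{lcm}(2, \gcd(3,5)) = \operatorname{lcm}(2,1) = 2$$ $$(2 \vee 3) \wedge (2 \vee 5) = \gcd(\operatorname{lcm}(2,3), \operatorname{lcm}(2,5)) = \gcd(6,10) = 2$$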

Now here’s the neat part. There are fundamentally only two lattices that aren’t distributive:

Hasse diagram: 1 on top, 0 at the bottom, with a, b, and c side by side in between
The Diamond Lattice

$$a \vee (b \wedge c) \stackrel{?}{=} (a \vee b) \wedge (a \vee c)$$ $$a \vee 0 \stackrel{?}{=} 1 \wedge 1$$ $$a \neq 1$$ …and:

Hasse diagram: 1 on top and 0 at the bottom, with a sitting below c on one side and b alone on the other
The Pentagon Lattice

$$a \vee (b \wedge c) \stackrel{?}{=} (a \vee b) \wedge (a \vee c)$$ $$a \vee 0 \stackrel{?}{=} 1 \wedge c$$ $$a \neq c$$

Any lattice that has a sub-lattice isomorphic to either the Diamond or the Pentagon, referred to as $M_3$ and $N_5$ respectively, is not distributive. It’s not too hard to see: just verify that the corresponding elements that violate distributivity above will still fail no matter what else is going on in the lattice.

The surprising part is that every non-distributive lattice must contain either $M_3$ or $N_5$! In a sense these are the two poisonous minimal shapes: their presence kills distributivity.

This alone seems like a cool little fact, but what struck me was that this isn’t the first time that I’ve seen something like this…

Planar Graphs

In Graph Theory, there’s a property of graphs called planarity. If you can draw the graph without any of the edges crossing, the graph is planar. The Wikipedia page on planar graphs is filled with awesome facts. For example, a planar graph can’t have an average degree greater than 6 and any graph that can be drawn planarly with curvy edges can be drawn planarly with straight edges, but the most famous result concerning planar graphs is Kuratowski’s Theorem:

A finite graph is planar if and only if it does not contain a subgraph that is a subdivision of $K_5$ (the complete graph on five vertices) or of $K_{3,3}$ (the complete bipartite graph on six vertices)

$K_5$, the complete graph on five vertices

$K_{3,3}$, the complete bipartite graph on six vertices

It turns out there are fundamentally only two graphs that aren’t planar! Any graph that “contains” either of these two graphs is not planar and any graph that isn’t planar must “contain” one of them. The meaning of “contains” is a little different here. It’s not enough to just find vertices that form $K_5$ or $K_{3,3}$ with the edges that connect them. What we need to check is if any subgraph can be “smoothed” into one of these two graphs.

For example, this graph isn’t planar because the red vertex can be smoothed away, leaving $K_{3,3}$:


And?

Unfortunately, I’m just learning about lattice theory and I have no idea if there’s a formal connection between these similar concepts. I have found that there’s a branch of lattice theory concerned with the planarity of Hasse diagrams, so it’s not as if algebraists haven’t thought about this before. If anyone knows of a formal connection, or of other cases where a pair of structures has this kind of “poisonous” property, I’d love to hear about it.

Addendum

Nerkbot over at /r/puremathematics was kind enough to share:

You might be interested in the Robertson-Seymour theorem which says that for any graph property that is closed under taking minors there is a finite list of “forbidden minors” or as you call them “poisonous” graphs. Kuratowski’s theorem is one particular example of this.

This relates also to the concept of well-quasi-orders. A quasi-ordered set $(X, \leq)$ is a well-quasi-order if every infinite sequence ($x_1, x_2, …$) with elements in $X$ has some $i < j$ with $x_i \leq x_j$. An equivalent statement of the Robertson-Seymour theorem is that the set of all finite graphs ordered by the minor relation is well-quasi-ordered.

Another example where “forbidden minors” appear is matroids. Matroids also have minors but unlike graphs they are not well-quasi-ordered under the minor relation. However certain minor closed properties do have a finite list of forbidden minors. For example Rota’s conjecture, which was announced to be proven very recently, says that for any finite field the set of matroids that are representable over that field are characterized by a finite list of forbidden minors.

Silverman’s Mode Estimation Method Explained

I started digging into the history of mode detection after watching Aysylu Greenberg’s Strange Loop talk on benchmarking. She pointed out that the usual benchmarking statistics fail to capture that our timings may actually be samples from multiple distributions, commonly caused by the fact that our systems are comprised of hierarchical caches.

I thought it would be useful to add the detection of this to my favorite benchmarking tool, Hugo Duncan’s Criterium. Not surprisingly, Hugo had already considered this and there’s a note under the TODO section:

Multimodal distribution detection.
Use kernel density estimators?

I hadn’t heard of using kernel density estimation for multimodal distribution detection so I found the original paper, Using Kernel Density Estimates to Investigate Multimodality (Silverman, 1981). The original paper is a dense 3 pages and my goal with this post is to restate Silverman’s method in a more accessible way. Please excuse anything that seems overly obvious or pedantic and feel encouraged to suggest any modifications that would make it clearer.

What is a mode?

The mode of a distribution is the value that has the highest probability of being observed. Many of us were first exposed to the concept of a mode in a discrete setting. We have a bunch of observations and the mode is just the observation value that occurs most frequently. It’s an elementary exercise in counting. Unfortunately, this method of counting doesn’t transfer well to observations sampled from a continuous distribution because we don’t expect to ever observe the exact same value twice.

What we’re really doing when we count the observations in the discrete case is estimating the probability density function (PDF) of the underlying distribution. The value that has the highest probability of being observed is the one that is the global maximum of the PDF. Looking at it this way, we can see that a necessary step for determining the mode in the continuous case is to first estimate the PDF of the underlying distribution. We’ll come back to how Silverman does this with a technique called kernel density estimation later.

What does it mean to be multimodal?

In the discrete case, there can be multiple, undeniable modes because the counts for two elements might be the same. For instance, if we observe:

$$1,2,2,2,3,4,4,4,5$$

Both 2 and 4 occur thrice, so we have no choice but to say they are both modes. But perhaps we observe something like this:

$$1,1,1,2,2,2,2,3,3,3,4,9,10,10$$

The value 2 occurs more than anything else, so it’s the mode. But let’s look at the histogram:

That pair of 10’s are out there looking awfully interesting. If these were benchmark timings, we might suspect there’s a significant fraction of calls that go down some different execution path or fall back to a slower level of the cache hierarchy. Counting alone isn’t going to reveal the 10’s because there are even more 1’s and 3’s. Since they’re nestled up right next to the 2’s, we’ll probably assume that they are just part of the expected variance in performance of the same path that caused all those 2’s. What we’re really interested in are the local maxima of the PDF, because they are the ones that indicate that our underlying distribution may actually be a mixture of several distributions.

Kernel density estimation

Imagine that we make 20 observations and see that they are distributed like this:

We can estimate the underlying PDF by using what is called a kernel density estimate. We replace each observation with some distribution, called the “kernel,” centered at the point. Here’s what it would look like using a normal distribution with standard deviation 1:

If we sum up all these overlapping distributions, we get a reasonable estimate for the underlying continuous PDF:

Note that we made two interesting assumptions here:

  1. We replaced each point with a normal distribution. Silverman’s approach actually relies on some of the nice mathematical properties of the normal distribution, so that’s what we use.

  2. We used a standard deviation of 1. Each normal distribution is wholly specified by a mean and a standard deviation. The mean is the observation we are replacing, but we had to pick some arbitrary standard deviation which defined the width of the kernel.

In the case of the normal distribution, we could just vary the standard deviation to adjust the width, but there is a more general way of stretching the kernel for arbitrary distributions. The kernel density estimate for observations $X_1,X_2,…,X_n$ using a kernel function $K$ is:

$$\hat{f}(x)=\frac{1}{n}\sum\limits_{i=1}^n K(x-X_i)$$

In our case above, $K$ is the PDF for the normal distribution with standard deviation 1. We can stretch the kernel by a factor of $h$ like this:

$$\hat{f}(x, h)=\frac{1}{nh}\sum\limits_{i=1}^n K(\frac{x-X_i}{h})$$

Note that changing $h$ has the exact same effect as changing the standard deviation: it makes the kernel wider and shorter while maintaining an area of 1 under the curve.
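
To make the formula concrete, here’s a small Clojure sketch of the estimator (the function names are mine):

(defn gaussian-kernel
  "Standard normal PDF evaluated at u."
  [u]
  (/ (Math/exp (- (/ (* u u) 2.0)))
     (Math/sqrt (* 2.0 Math/PI))))

(defn kde
  "Kernel density estimate at x for observations xs, using width h."
  [xs h x]
  (/ (reduce + (map #(gaussian-kernel (/ (- x %) h)) xs))
     (* (count xs) h)))

For the observations in the example above, (kde [1 1 1 2 2 2 2 3 3 3 4 9 10 10] 1.0 2.0) returns the estimated density at $x=2$ with a width of 1.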

Different kernel widths result in different mode counts

The width of the kernel is effectively a smoothing factor. If we choose too large of a width, we just end up with one giant mound that is almost a perfect normal distribution. Here’s what it looks like if we use $h=5$:

Clearly, this has a single maximum.

If we choose too small of a width, we get a very spiky and over-fit estimate of the PDF. Here’s what it looks like with $h = 0.1$:

This PDF has a bunch of local maxima. If we shrink the width small enough, we’ll get $n$ maxima, where $n$ is the number of observations:

The neat thing about using the normal distribution as our kernel is that it has the property that shrinking the width can only introduce new local maxima, never remove them. Silverman gives a proof of this at the end of Section 2 in the original paper. This means that for every integer $k$, where $1 \leq k \leq n$, we can find the minimum width $h_k$ such that the kernel density estimate has at most $k$ maxima. Silverman calls these $h_k$ values “critical widths.”

Finding the critical widths

To actually find the critical widths, we need to look at the formula for the kernel density estimate. The PDF for a plain old normal distribution with mean $\mu$ and standard deviation $\sigma$ is:

$$f(x)=\frac{1}{\sigma\sqrt{2\pi}}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The kernel density estimate with standard deviation $\sigma=1$ for observations $X_1,X_2,…,X_n$ and width $h$ is:

$$\hat{f}(x,h)=\frac{1}{nh}\sum\limits_{i=1}^n \frac{1}{\sqrt{2\pi}}\mathrm{e}^{-\frac{(x-X_i)^2}{2h^2}}$$

For a given $h$, you can find all the local maxima of $\hat{f}$ using your favorite numerical methods. Now we need to find the $h_k$ where new local maxima are introduced. Because of a result that Silverman proved at the end of section 2 in the paper, we know we can use a binary search over a range of $h$ values to find the critical widths at which new maxima show up.
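
Here’s a rough Clojure sketch of that search, reusing the kde function sketched earlier. Counting maxima on a fixed evaluation grid and the particular bisection bounds and tolerance are simplifications on my part:

(defn count-modes
  "Number of local maxima of the width-h KDE, evaluated on a grid of x values."
  [xs h grid]
  (let [ys (map #(kde xs h %) grid)]
    (count (filter (fn [[a b c]] (and (> b a) (> b c)))
                   (partition 3 1 ys)))))

(defn critical-width
  "Approximate h-k: the smallest width whose KDE has at most k modes.
  Assumes more than k modes at lo and at most k modes at hi."
  [xs k grid lo hi tol]
  (if (< (- hi lo) tol)
    hi
    (let [mid (/ (+ lo hi) 2.0)]
      (if (<= (count-modes xs mid grid) k)
        (recur xs k grid lo mid tol)
        (recur xs k grid mid hi tol)))))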

Picking which kernel width to use

This is the part of the original paper that I found to be the least clear. It’s pretty dense and makes a number of vague references to the application of techniques from other papers.

We now have a kernel density estimate of the PDF for each number of modes between $1$ and $n$. For each estimate, we’re going to use a statistical test to assess its significance. We want to be parsimonious in our claims that there are additional modes, so we pick the smallest $k$ for which the test doesn’t find significant evidence of more than $k$ modes.

Bootstrapping is used to evaluate the accuracy of a statistical measure by computing that statistic on observations that are resampled from the original set of observations.

Silverman used a smoothed bootstrap procedure to evaluate the significance. Smoothed bootstrapping is bootstrapping with some noise added to the resampled observations. First, we sample from the original set of observations, with replacement, to get $X_{I(i)}$. Then we add noise to get our smoothed $y_i$ values:

$$y_i=\frac{1}{\sqrt{1+h_k^2/\sigma^2}}(X_{I(i)}+h_k \epsilon_i)$$

Where $\sigma$ is the standard deviation of $X_1,X_2,…,X_n$, $h_k$ is the critical width we are testing, and $\epsilon_i$ is a random value sampled from a normal distribution with mean 0 and standard deviation 1.
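
A small Clojure sketch of one smoothed resample, following the formula above (the names are mine; java.util.Random supplies the normal draws):

(def ^:private rng (java.util.Random.))

(defn- standard-normal []
  (.nextGaussian rng))

(defn smoothed-resample
  "One smoothed bootstrap resample of xs, using critical width h-k and
  the sample standard deviation sigma, per the formula above."
  [xs h-k sigma]
  (let [scale (/ 1.0 (Math/sqrt (+ 1.0 (/ (* h-k h-k) (* sigma sigma)))))]
    (repeatedly (count xs)
                #(* scale (+ (rand-nth xs)
                             (* h-k (standard-normal)))))))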

Once we have these smoothed values, we compute their kernel density estimate using $h_k$ and count the modes. If this bootstrap estimate has more than $k$ modes, it tells us that a genuinely $k$-modal distribution can also require a width of $h_k$ to smooth down to $k$ modes, so our observed critical width isn’t remarkable. We repeat this many times and use the fraction of simulations in which we found more than $k$ modes as the p-value: if it’s large, we accept $k$ modes; if it’s small, we reject $k$ and move on to $k+1$. In the paper, Silverman does 100 rounds of simulation.

Conclusion

Silverman’s technique was a really important early step in multimodality detection and it has been thoroughly investigated and improved upon since 1981. Google Scholar lists about 670 citations of this paper. If you’re interested in learning more, one paper I found particularly helpful was On the Calibration of Silverman’s Test for Multimodality (Hall & York, 2001).

One of the biggest weaknesses in Silverman’s technique is that the critical width is a global parameter, so it may run into trouble if our underlying distribution is a mixture of low and high variance component distributions. For an actual implementation of mode detection in a benchmarking package, I’d consider using something that doesn’t have this issue, like the technique described in Nonparametric Testing of the Existence of Modes (Minnotte, 1997).

I hope this is correct and helpful. If I misinterpreted anything in the original paper, please let me know. Thanks!