Fmty Dmpty*

A recent post on Chris Siebenmann's wiki, "My theory on why Go's gofmt has wound up being accepted", attracted some conversation in the r/golang¹ community. One commenter had this to say:

gofmt didnt take off. people complain constantly about it. the community hated that the code had to be formatted a certain way for the compiler to even accept the code. the go team forced it until the community just gave up.

My experience of gofmt and its uptake in the community has been completely the opposite, as was Chris'. Instead of taking this comment or my own biases at face value, I decided to download Russ Cox's recent Go corpus dataset, which is designed to capture a cross section of popular Go projects for use in guiding future tooling and language discussions.

At first, I was going to just run a shell pipeline on the corpus to see how many names gofmt -l output compared to the number of Go files. However, my own vanity led me first to search the corpus for code I wrote. There was some in there, but only because it had been vendored by another project.

This led me to realize that there may be significant duplication in the corpus: files may be vendored more than once, and there may be different revisions of the "same" file. For fairness, I attempted to unravel some of that in the statistics. I decided to measure some things about vendoring as well. Since I wanted to get some more sophisticated numbers, I wrote a Go program to examine the corpus.
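The two core checks can be sketched roughly like this. It is a minimal illustration, not the actual program: it assumes that go/format's Source is an acceptable stand-in for running gofmt -l, and that a "duplicate" means byte-identical content. The names isFormatted and uniqueContents are made up for the example.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"go/format"
)

// isFormatted reports whether src is unchanged by go/format.
// It returns an error if src does not parse as Go source.
// Assumption: format.Source is close enough to gofmt for this purpose.
func isFormatted(src []byte) (bool, error) {
	out, err := format.Source(src)
	if err != nil {
		return false, err
	}
	return bytes.Equal(src, out), nil
}

// uniqueContents drops byte-identical duplicates, keeping the first
// occurrence of each distinct file body.
func uniqueContents(files [][]byte) [][]byte {
	seen := make(map[[sha256.Size]byte]bool)
	var out [][]byte
	for _, f := range files {
		sum := sha256.Sum256(f)
		if seen[sum] {
			continue
		}
		seen[sum] = true
		out = append(out, f)
	}
	return out
}

func main() {
	// Hypothetical in-memory "corpus": one formatted file, one
	// unformatted file, and a duplicate of the first.
	files := [][]byte{
		[]byte("package p\n\nfunc f() {}\n"),
		[]byte("package p\nfunc  f( ) { }\n"),
		[]byte("package p\n\nfunc f() {}\n"),
	}

	unique := uniqueContents(files)
	fmted := 0
	for _, src := range unique {
		ok, err := isFormatted(src)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		if ok {
			fmted++
		}
	}
	fmt.Printf("files: %d, unique: %d, fmted: %d\n", len(files), len(unique), fmted)
}
```

Hashing file contents rather than comparing paths is what separates "duplicated files" from "duplicated paths": two copies of the same path at different revisions hash differently and both count as unique.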

For a rough idea of the size of the data, the corpus was ~1.1GB of Go source + other files. The program I wrote is far from the most efficient possible way to do this, but it took around 3 minutes to run on my computer. Here is a snippet of its output; I've also uploaded the full output, which contains the paths of all the files that failed a gofmt test:

stats:
directories:        20185
projects:           1127
projects (top lvl)  724
projects w/ vend    60
go files (*.go):    62783
vendored go files:  33314
duplicated paths:   23452
duplicated files:   14784
unique files:       47999

fmt stats:
fmted:                 47427
unfmted:               572
% fmted:               98.81%
proj w/ unfmted:       101
proj w/ unfmted vend:  38
proj w/ unfmted files: 129
vend unfmted files:    179
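A few ratios follow directly from these counts. The sketch below just redoes the division; which total each ratio is measured against is an assumption, spelled out in the comments.

```go
package main

import "fmt"

// pct returns part as a percentage of whole.
func pct(part, whole int) float64 {
	return 100 * float64(part) / float64(whole)
}

func main() {
	// All counts are copied from the stats output above. The choice of
	// denominator for each line is an assumption.
	fmt.Printf("fmted / unique files:              %.2f%%\n", pct(47427, 47999))
	fmt.Printf("vendored / all go files:           %.2f%%\n", pct(33314, 62783))
	fmt.Printf("projects w/ vend / top lvl:        %.2f%%\n", pct(60, 724))
	fmt.Printf("duplicated paths / all go files:   %.2f%%\n", pct(23452, 62783))
	fmt.Printf("duplicated files / all go files:   %.2f%%\n", pct(14784, 62783))
	fmt.Printf("vend unfmted / unfmted:            %.2f%%\n", pct(179, 572))
	fmt.Printf("proj w/ unfmted / projects:        %.2f%%\n", pct(101, 1127))
}
```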

I defined a project to be its base import path; for most things, that means site.com/user/project. I searched for 3 different vendor styles when determining whether something was vendored:

Godeps style: Godeps/_workspace/src: 489 files
Official style: /vendor/: 32429 files
A style I don't recognize: /_vendor/: 1394 files
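As a rough sketch of those two definitions: the function names and example paths below are hypothetical, and matching on substrings is an assumption about how the classification could be done, not a description of the actual program.

```go
package main

import (
	"fmt"
	"strings"
)

// baseImportPath returns the first three elements of an import path,
// e.g. site.com/user/project. Shorter paths are returned unchanged.
func baseImportPath(p string) string {
	parts := strings.Split(p, "/")
	if len(parts) > 3 {
		parts = parts[:3]
	}
	return strings.Join(parts, "/")
}

// vendorStyle returns which of the three vendor layouts a file path
// matches, or "" if the path does not look vendored.
func vendorStyle(path string) string {
	switch {
	case strings.Contains(path, "Godeps/_workspace/src/"):
		return "godeps"
	case strings.Contains(path, "/_vendor/"):
		return "_vendor"
	case strings.Contains(path, "/vendor/"):
		return "vendor"
	}
	return ""
}

func main() {
	// Hypothetical paths, for illustration only.
	paths := []string{
		"site.com/user/project/cmd/tool/main.go",
		"site.com/user/project/vendor/site.com/other/lib/lib.go",
		"site.com/user/project/Godeps/_workspace/src/site.com/other/lib/lib.go",
		"site.com/user/project/_vendor/site.com/other/lib/lib.go",
	}
	for _, p := range paths {
		fmt.Printf("%-8q %s\n", vendorStyle(p), baseImportPath(p))
	}
}
```

Note that "/_vendor/" does not contain the substring "/vendor/", so the order of the last two cases does not affect the result.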

Assuming that the program more or less does the right thing, I've come to the following conclusions:

gofmt is very widely adopted: 98.81% of the unique files in the corpus were properly formatted.
~~Vendoring is widely used in the corpus~~; Edit: I ran some different numbers related to vendoring. It has a large impact on the corpus. Over half of the Go source files are actually vendored dependencies. However, only 8% of the projects included in the corpus that are not themselves being vendored are using vendoring.
There is a huge amount of repetition in the corpus; nearly 40% of paths appear twice, and 23% of the files are a duplicate of another file (some files appear many times).
Vendored files accounted for nearly 1/3rd of all unfmted files.
Nearly 9% of projects had at least one unvendored, unfmted file, higher than I'd expected.
Large projects like k8s are not immune. Neither are Google and golang.org/x/ projects.
"Being fmted" is a moving target, which might actually inflate the amount of unfmted code in the corpus. One possible example: import reordering was added shortly before version 1, but some old code formatted without it may have made its way into the corpus:

--- /tmp/gofmt765000816 2017-03-26 16:44:09.922414345 -0400
+++ /tmp/gofmt517673743 2017-03-26 16:44:09.922414345 -0400
@@ -8,8 +8,8 @@
    "crypto"
    "crypto/ecdsa"
    "crypto/elliptic"
-   "crypto/subtle"
    "crypto/rand"
+   "crypto/subtle"
    "errors"
    "io"
    "math/big"

There are a lot of other things we could do with this corpus, though first and foremost I think we should improve it. There are some clear issues with using it as a cross section of community Go code at large.

It also brings up some concerns about vendoring as a de-facto way of dealing with dependencies. Over 15% of files in the corpus are different revisions of the "same" file elsewhere in the corpus, vendored across a bunch of different projects. It would take some more sophisticated analysis to determine how many of these are applications vendoring things, which is generally ok, and how many are libraries vendoring things, which is generally not ok.

Beyond that, it would be interesting to see how many errors vet can find, and to classify those to see what the most common mistakes Go programmers make are. Maybe measurements like these should be built into the go reportcard or into gddo itself, and inform classification of these errors for people who do training and outreach.

1. I'm not linking this thread or any user comments because I find some of the behaviour in it distasteful.

A note on the title: most of the Go core team and many people outside it seem to pronounce fmt as "fumpt". I steadfastly refuse to adopt this and pronounce fmt "format" for the same reason I pronounce pkg "package" and /usr "user". If there was a spoken English version of gofmt, maybe we could all just agree.

Mar 26 2017

jmoiron plays the blues
