31 为被测对象建立性能基准 - HelloWorld开发者社区

为被测对象建立性能基准

著名计算机科学家、《计算机程序设计艺术》的作者Donald.E.Knuth(唐纳德.E.克努特，中文名高德纳) 曾说过：“过早优化是万恶之源(premature optimization is the root of all evil)”，这一名言长久以来被很多开发者奉为圭臬。而关于这句名言的解读也像“编程语言战争”一样成为了程序员界的常设话题。

笔者认为之所以对这句话的解读出现“见仁见智”的情况，是因为这句话本身缺少上下文：

被优化的对象是什么类型的程序？
优化什么？设计、性能、资源占用还是…？
优化的指标是什么？

不同开发者看问题的视角不同，所处的上下文不同，得出的解读自然也不会相同。Android 界开源大神杰克·沃顿(Jake Wharton)就曾提出过这样一个观点：过早的引用“过早优化是万恶之源”是一切龟速软件之源（Premature quoting of “premature optimization is the root of all evil” is the root of all slow software）。

是否优化、何时优化实质上是一个决策问题，但决策不能靠直觉，要靠数据说话。借用上面名言中的句型：“没有数据支撑的过早决策是万恶之源”。

Go 语言最初被其设计者们定位为“系统级编程语言”，这说明高性能一直是 Go 核心团队的目标之一。很多来自动态类型语言开发者转到 Go 语言显然也是为了性能（相对于动态类型语言），Gopher 们对 Go GC 延迟的敏感性也都是关注性能的表现。性能优化也是优化的一种，作为一名 Go 开发者，在 Go 中我们该如何做出是否对代码进行性能优化的决策呢？我们可以通过为被测对象建立性能基准的方式去获得决策是否优化的支撑数据，同时我们也可以判断出对代码所做的任何更改是否对代码性能有所影响。

1. 性能基准测试在 Go 语言中是“一等公民”

在前面的章节中，我们已经接触过许多的性能基准测试（benchmark test）。和上一节所讲的模糊测试的境遇不同，性能基准测试在 Go 语言中是和普通的单元测试一样被原生支持的，得到的是 “一等公民” 的待遇。我们可以像普通单元测试那样在*_test.go文件中创建被测对象的性能基准测试，每个以Benchmark前缀开头的函数都会被当作一个独立的性能基准测试：

func BenchmarkXxx(b *testing.B) {
    //... ...
}

下面是一个对多种字符串连接方法的性能基准测试（改编自前面“了解 string 实现原理和高效使用”一节）：

// benchmark_intro_test.go 
package main

import (
    "fmt"
    "strings"
    "testing"
)

var sl = []string{
    "Rob Pike ",
    "Robert Griesemer ",
    "Ken Thompson ",
}

func concatStringByOperator(sl []string) string {
    var s string
    for _, v := range sl {
        s += v
    }
    return s
}

func concatStringBySprintf(sl []string) string {
    var s string
    for _, v := range sl {
        s = fmt.Sprintf("%s%s", s, v)
    }
    return s
}

func concatStringByJoin(sl []string) string {
    return strings.Join(sl, "")
}

func BenchmarkConcatStringByOperator(b *testing.B) {
    for n := 0; n < b.N; n++ {
        concatStringByOperator(sl)
    }
}

func BenchmarkConcatStringBySprintf(b *testing.B) {
    for n := 0; n < b.N; n++ {
        concatStringBySprintf(sl)
    }
}

func BenchmarkConcatStringByJoin(b *testing.B) {
    for n := 0; n < b.N; n++ {
        concatStringByJoin(sl)
    }
}

上面的源文件中定义了三个性能基准测试：BenchmarkConcatStringByOperator、BenchmarkConcatStringBySprintf和BenchmarkConcatStringByJoin，我们可以一起运行这三个基准测试：

$go test -bench . benchmark_intro_test.go 
goos: darwin
goarch: amd64
BenchmarkConcatStringByOperator-8       12810092            88.5 ns/op
BenchmarkConcatStringBySprintf-8         2777902           432 ns/op
BenchmarkConcatStringByJoin-8           23994218            49.7 ns/op
PASS
ok      command-line-arguments    4.117s

也可以通过正则匹配选择其中一个或几个运行：

$go test -bench=ByJoin ./benchmark_intro_test.go
goos: darwin
goarch: amd64
BenchmarkConcatStringByJoin-8       23429586            49.1 ns/op
PASS
ok      command-line-arguments    1.209s

我们关注的是 go test 输出结果中的第三列的那个值。以BenchmarkConcatStringByJoin为例，其第三列的值为49.1 ns/op，该值表示BenchmarkConcatStringByJoin这个基准测试中 for 循环的每次循环平均执行时间为49.1 ns，即op就代表每次循环操作。这里 for 循环调用的是concatStringByJoin，即执行一次concatStringByJoin的平均时长为49.1 ns。

性能基准测试还可以通过传入-benchmem命令行参数输出内存分配信息(与基准测试代码中显式调用b.ReportAllocs的效果是等价的)：

$go test -bench=Join ./benchmark_intro_test.go -benchmem
goos: darwin
goarch: amd64
BenchmarkConcatStringByJoin-8      23004709       48.8 ns/op      48 B/op     1 allocs/op
PASS
ok      command-line-arguments    1.183s

这里输出的内存分配信息告诉我们：每执行一次concatStringByJoin平均进行一次内存分配，每次平均分配 48 字节的数据。

2. 顺序执行和并行执行的性能基准测试

根据是否并行执行，Go 的性能基准测试可以分为两类：一类是顺序执行的性能基准测试，其代码写法如下：

func BenchmarkXxx(b *testing.B) {
    //... ...
    for i := 0; i < b.N; i++ {
        //被测对象的执行代码
    }
}

前面的对多种字符串连接方法的性能基准测试就归属于这类。关于顺序执行的性能基准测试的执行过程原理，我们可以通过下面例子来说明：

// benchmark-impl/sequential_test.go 
package bench

import (
    "fmt"
    "sync"
    "sync/atomic"
    "testing"

    tls "github.com/huandu/go-tls"
)

var (
    m     map[int64]struct{} = make(map[int64]struct{}, 10)
    mu    sync.Mutex
    round int64 = 1
)

func BenchmarkSequential(b *testing.B) {
    fmt.Printf("\ngoroutine[%d] enter BenchmarkSequential: round[%d], b.N[%d]\n",
        tls.ID(), atomic.LoadInt64(&round), b.N)
    defer func() {
        atomic.AddInt64(&round, 1)
    }()

    for i := 0; i < b.N; i++ {
        mu.Lock()
        _, ok := m[round]
        if !ok {
            m[round] = struct{}{}
            fmt.Printf("goroutine[%d] enter loop in BenchmarkSequential: round[%d], b.N[%d]\n",
                tls.ID(), atomic.LoadInt64(&round), b.N)
        }
        mu.Unlock()
    }
    fmt.Printf("goroutine[%d] exit BenchmarkSequential: round[%d], b.N[%d]\n",
        tls.ID(), atomic.LoadInt64(&round), b.N)
}

运行这个例子：

$go test -bench . sequential_test.go 

goroutine[1] enter BenchmarkSequential: round[1], b.N[1]
goroutine[1] enter loop in BenchmarkSequential: round[1], b.N[1]
goroutine[1] exit BenchmarkSequential: round[1], b.N[1]
goos: darwin
goarch: amd64
BenchmarkSequential-8       
goroutine[2] enter BenchmarkSequential: round[2], b.N[100]
goroutine[2] enter loop in BenchmarkSequential: round[2], b.N[100]
goroutine[2] exit BenchmarkSequential: round[2], b.N[100]

goroutine[2] enter BenchmarkSequential: round[3], b.N[10000]
goroutine[2] enter loop in BenchmarkSequential: round[3], b.N[10000]
goroutine[2] exit BenchmarkSequential: round[3], b.N[10000]

goroutine[2] enter BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] enter loop in BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] exit BenchmarkSequential: round[4], b.N[1000000]

goroutine[2] enter BenchmarkSequential: round[5], b.N[65666582]
goroutine[2] enter loop in BenchmarkSequential: round[5], b.N[65666582]
goroutine[2] exit BenchmarkSequential: round[5], b.N[65666582]
65666582            20.6 ns/op
PASS
ok      command-line-arguments    1.381s

我们看到：

BenchmarkSequential 被执行了多轮（见输出结果中的round值）；
每一轮执行，for 循环的b.N值均不相同，依次为 1、100、10000、1000000 和 65666582；
除 b.N 为 1 的首轮，其余各轮均在一个 goroutine（goroutine[2]）中顺序执行。

默认情况下，每个性能基准测试函数（比如：BenchmarkSequential）的执行时间为 1 秒。如果执行一轮所消耗的时间不足 1 秒，那么 go test 会按近似顺序增加 b.N 的值：1、2、3、5、10、20、30、50、100 等。如果当 b.N 较小时，基准测试执行可以很快完成，那么 go test 基准测试框架将跳过中间的一些值，选择较大些的值，比如就像这里b.N从 1 直接跳到 100。选定新的 b.N 之后，go test 基准测试框架会启动新一轮性能基准测试函数的执行，直到某一轮执行所消耗的时间超出 1 秒。上面例子中最后一轮的b.N值为 65666582，这个值应该是 go test 根据上一轮执行后得到的每次循环平均执行时间计算出来的。go test 发现：如果将上一轮每次循环平均执行时间与再扩大 100 倍的 N 值相乘，那下一轮的执行时间会超出 1 秒很多，于是 go test 用 1 秒与上一轮每次循环平均执行时间一起估算了一个循环次数，即上面的65666582。

如果基准测试仅运行 1 秒，并且在这 1 秒内仅运行 10 轮迭代，那么这些基准测试运行所得的平均值可能会有较高的标准偏差。如果基准测试运行了数百万或数十亿次迭代，那么其所得平均值可能更趋于准确。要增加迭代次数，可以使用-benchtime命令行选项来增加基准测试执行的时间。

下面的例子中，我们通过go test的命令行参数-benchtime将 1 秒这个默认性能基准测试函数执行时间改为 2 秒：

$go test -bench . sequential_test.go -benchtime 2s
... ...

goroutine[2] enter BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] enter loop in BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] exit BenchmarkSequential: round[4], b.N[1000000]

goroutine[2] enter BenchmarkSequential: round[5], b.N[100000000]
goroutine[2] enter loop in BenchmarkSequential: round[5], b.N[100000000]
goroutine[2] exit BenchmarkSequential: round[5], b.N[100000000]
100000000            20.5 ns/op
PASS
ok      command-line-arguments    2.075s

我们看到性能基准测试函数执行时间改为 2 秒后，最终轮的b.N的值可以增大到100000000。

我们也可以通过-benchtime手动指定b.N的值，这样 go test 就会以你指定的 N 值作为最终轮的循环次数：

$go test -v -benchtime 5x -bench . sequential_test.go
goos: darwin
goarch: amd64
BenchmarkSequential

goroutine[1] enter BenchmarkSequential: round[1], b.N[1]
goroutine[1] enter loop in BenchmarkSequential: round[1], b.N[1]
goroutine[1] exit BenchmarkSequential: round[1], b.N[1]

goroutine[2] enter BenchmarkSequential: round[2], b.N[5]
goroutine[2] enter loop in BenchmarkSequential: round[2], b.N[5]
goroutine[2] exit BenchmarkSequential: round[2], b.N[5]
BenchmarkSequential-8              5          5470 ns/op
PASS
ok      command-line-arguments    0.006s

上面的每个性能基准测试函数(比如：BenchmarkSequential)虽然实际执行了多轮，但也仅算一次执行。有时候考虑到性能基准测试单次执行的数据不具代表性，我们可能会显式要求 go test 多次执行以收集多次数据，并将这些数据经过统计学方法处理后的结果作为最终结果。通过-count命令行选项可以显式指定每个性能基准测试函数执行次数：

$go test -v -count 2 -bench . benchmark_intro_test.go
goos: darwin
goarch: amd64
BenchmarkConcatStringByOperator
BenchmarkConcatStringByOperator-8       12665250            89.8 ns/op
BenchmarkConcatStringByOperator-8       13099075            89.7 ns/op
BenchmarkConcatStringBySprintf
BenchmarkConcatStringBySprintf-8         2781075           433 ns/op
BenchmarkConcatStringBySprintf-8         2662507           433 ns/op
BenchmarkConcatStringByJoin
BenchmarkConcatStringByJoin-8           23679480            49.1 ns/op
BenchmarkConcatStringByJoin-8           24135014            49.6 ns/op
PASS
ok      command-line-arguments    8.225s

我们看到上面例子每个性能基准测试函数都被执行了两次（当然每次执行实质上都会运行多轮（b.N 不同）），输出了两个结果。

另外一类性能基准测试则是并行执行的，其代码写法如下：

func BenchmarkXxx(b *testing.B) {
    //... ...
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            // 被测对象的执行代码
        }
    }
}

并行执行的基准测试主要用于为包含多 goroutine 同步设施(比如：互斥锁(mutex)、读写锁(rwlock)、原子操作等)的被测代码建立性能基准。相比于顺序执行的基准测试，并行执行的基准测试更能真实反映出多 goroutine 情况下，被测代码在 goroutine 同步上的真实消耗。比如下面这个例子：

//benchmark_paralell_demo_test.go 
package paralelldemo

import (
    "sync"
    "sync/atomic"
    "testing"
)

var n1 int64

func addSyncByAtomic(delta int64) int64 {
    return atomic.AddInt64(&n1, delta)
}

func readSyncByAtomic() int64 {
    return atomic.LoadInt64(&n1)
}

var n2 int64
var rwmu sync.RWMutex

func addSyncByMutex(delta int64) {
    rwmu.Lock()
    n2 += delta
    rwmu.Unlock()
}

func readSyncByMutex() int64 {
    var n int64
    rwmu.RLock()
    n = n2
    rwmu.RUnlock()
    return n
}

func BenchmarkAddSyncByAtomic(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            addSyncByAtomic(1)
        }
    })
}
func BenchmarkReadSyncByAtomic(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            readSyncByAtomic()
        }
    })
}

func BenchmarkAddSyncByMutex(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            addSyncByMutex(1)
        }
    })
}
func BenchmarkReadSyncByMutex(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            readSyncByMutex()
        }
    })
}

运行该性能基准测试：

$go test -v -bench . benchmark_paralell_demo_test.go -cpu 2,4,8
goos: darwin
goarch: amd64
BenchmarkAddSyncByAtomic
BenchmarkAddSyncByAtomic-2        75208119            15.3 ns/op
BenchmarkAddSyncByAtomic-4        70117809            17.0 ns/op
BenchmarkAddSyncByAtomic-8        68664270            15.9 ns/op
BenchmarkReadSyncByAtomic
BenchmarkReadSyncByAtomic-2       1000000000             0.744 ns/op
BenchmarkReadSyncByAtomic-4       1000000000             0.384 ns/op
BenchmarkReadSyncByAtomic-8       1000000000             0.240 ns/op
BenchmarkAddSyncByMutex
BenchmarkAddSyncByMutex-2         37533390            31.4 ns/op
BenchmarkAddSyncByMutex-4         21660948            57.5 ns/op
BenchmarkAddSyncByMutex-8         16808721            72.6 ns/op
BenchmarkReadSyncByMutex
BenchmarkReadSyncByMutex-2        35535615            32.3 ns/op
BenchmarkReadSyncByMutex-4        29839219            39.6 ns/op
BenchmarkReadSyncByMutex-8        29936805            39.8 ns/op
PASS
ok      command-line-arguments    12.454s

我们看到上面例子中通过-cpu 2,4,8命令行选项告知 go test 将每个性能基准测试函数分别在GOMAXPROCS等于 2、4、8 的情况下各运行一次。从测试的输出结果，我们可以很容易地看出不同被测函数的性能随着GOMAXPROCS增大之后的性能变化情况。

和顺序执行的基准测试不同，并行执行的基准测试会启动多个 goroutine 并行执行基准测试函数中的循环，我们也用一个例子来说明一下其执行流程：

//benchmark-impl/paralell_test.go 
package bench

import (
    "fmt"
    "sync"
    "sync/atomic"
    "testing"

    tls "github.com/huandu/go-tls"
)

var (
    m     map[int64]int = make(map[int64]int, 20)
    mu    sync.Mutex
    round int64 = 1
)

func BenchmarkParalell(b *testing.B) {
    fmt.Printf("\ngoroutine[%d] enter BenchmarkParalell: round[%d], b.N[%d]\n",
        tls.ID(), atomic.LoadInt64(&round), b.N)
    defer func() {
        atomic.AddInt64(&round, 1)
    }()

    b.RunParallel(func(pb *testing.PB) {
        id := tls.ID()
        fmt.Printf("goroutine[%d] enter loop func in BenchmarkParalell: round[%d], b.N[%d]\n", tls.ID(), atomic.LoadInt64(&round), b.N)
        for pb.Next() {
            mu.Lock()
            _, ok := m[id]
            if !ok {
                m[id] = 1
            } else {
                m[id] = m[id] + 1
            }
            mu.Unlock()
        }

        mu.Lock()
        count := m[id]
        mu.Unlock()

        fmt.Printf("goroutine[%d] exit loop func in BenchmarkParalell: round[%d], loop[%d]\n", tls.ID(), atomic.LoadInt64(&round), count)
    })

    fmt.Printf("goroutine[%d] exit BenchmarkParalell: round[%d], b.N[%d]\n",
        tls.ID(), atomic.LoadInt64(&round), b.N)
}

我们以-cpu=2运行该例子：

$go test -v  -bench . paralell_test.go -cpu=2
goos: darwin
goarch: amd64
BenchmarkParalell

goroutine[1] enter BenchmarkParalell: round[1], b.N[1]
goroutine[2] enter loop func in BenchmarkParalell: round[1], b.N[1]
goroutine[2] exit loop func in BenchmarkParalell: round[1], loop[1]
goroutine[3] enter loop func in BenchmarkParalell: round[1], b.N[1]
goroutine[3] exit loop func in BenchmarkParalell: round[1], loop[0]
goroutine[1] exit BenchmarkParalell: round[1], b.N[1]

goroutine[4] enter BenchmarkParalell: round[2], b.N[100]
goroutine[5] enter loop func in BenchmarkParalell: round[2], b.N[100]
goroutine[5] exit loop func in BenchmarkParalell: round[2], loop[100]
goroutine[6] enter loop func in BenchmarkParalell: round[2], b.N[100]
goroutine[6] exit loop func in BenchmarkParalell: round[2], loop[0]
goroutine[4] exit BenchmarkParalell: round[2], b.N[100]

goroutine[4] enter BenchmarkParalell: round[3], b.N[10000]
goroutine[7] enter loop func in BenchmarkParalell: round[3], b.N[10000]
goroutine[8] enter loop func in BenchmarkParalell: round[3], b.N[10000]
goroutine[8] exit loop func in BenchmarkParalell: round[3], loop[4576]
goroutine[7] exit loop func in BenchmarkParalell: round[3], loop[5424]
goroutine[4] exit BenchmarkParalell: round[3], b.N[10000]

goroutine[4] enter BenchmarkParalell: round[4], b.N[1000000]
goroutine[9] enter loop func in BenchmarkParalell: round[4], b.N[1000000]
goroutine[10] enter loop func in BenchmarkParalell: round[4], b.N[1000000]
goroutine[9] exit loop func in BenchmarkParalell: round[4], loop[478750]
goroutine[10] exit loop func in BenchmarkParalell: round[4], loop[521250]
goroutine[4] exit BenchmarkParalell: round[4], b.N[1000000]

goroutine[4] enter BenchmarkParalell: round[5], b.N[25717561]
goroutine[11] enter loop func in BenchmarkParalell: round[5], b.N[25717561]
goroutine[12] enter loop func in BenchmarkParalell: round[5], b.N[25717561]
goroutine[12] exit loop func in BenchmarkParalell: round[5], loop[11651491]
goroutine[11] exit loop func in BenchmarkParalell: round[5], loop[14066070]
goroutine[4] exit BenchmarkParalell: round[5], b.N[25717561]
BenchmarkParalell-2       25717561            43.6 ns/op
PASS
ok      command-line-arguments    1.176s

我们看到，针对BenchmarkParalell基准测试的每一轮执行，go test都会启动GOMAXPROCS数量的新 goroutine，这些 goroutine 共同执行b.N次循环，每个 goroutine 会尽量相对均衡地分担循环次数。

3. 使用性能基准比较工具

现在我们已经可以通过 go 原生提供的性能基准测试为被测对象建立性能基准了。但被测代码更新前后的性能基准比较依然要靠人工计算和肉眼比对，十分不方便。为此，Go 核心团队先后开发了两款性能基准比较工具：benchcmp和benchstat。

benchcmp 上手快，简单易用，输出的比较结果无需参考文档帮助即可自行解读。下面我们看一个使用 benchcmp 进行性能基准比较的例子。

// benchmark-compare/strcat_test.go 
package main

import (
    "strings"
    "testing"
)

var sl = []string{
    "Rob Pike ",
    "Robert Griesemer ",
    "Ken Thompson ",
}

func Strcat(sl []string) string {
    return concatStringByOperator(sl)
}

func concatStringByOperator(sl []string) string {
    var s string
    for _, v := range sl {
        s += v
    }
    return s
}

func concatStringByJoin(sl []string) string {
    return strings.Join(sl, "")
}

func BenchmarkStrcat(b *testing.B) {
    for n := 0; n < b.N; n++ {
        Strcat(sl)
    }
}

上面例子中的被测目标为Strcat，最初Strcat使用通过 Go 原生的操作符("+")连接的方式实现了字符串的连接，我们采集一下它的性能基准数据：

$go test -run=NONE -bench . strcat_test.go > old.txt

然后，我们升级Strcat的实现，采用strings.Join函数来实现多个字符串的连接：

func Strcat(sl []string) string {
    return concatStringByJoin(sl)
}

我们再采集优化后的性能基准数据：

$go test -run=NONE -bench . strcat_test.go > new.txt

接下来就轮到benchcmp登场了：

$benchcmp old.txt new.txt 
benchmark             old ns/op     new ns/op     delta
BenchmarkStrcat-8     92.4          49.6          -46.32%

我们看到：benchcmp接受被测代码更新前后的两次性能基准测试结果文件：old.txt和new.txt，并将两个文件中的相同基准测试(比如这里的BenchmarkStrcat)的输出结果进行比较。

如果我们使用-count对BenchmarkStrcat执行多次，那么benchcmp给出的结果如下：

$go test -run=NONE -count 5 -bench . strcat_test.go > old.txt 
$go test -run=NONE -count 5 -bench . strcat_test.go > new.txt 

$benchcmp old.txt new.txt                                    
benchmark             old ns/op     new ns/op     delta
BenchmarkStrcat-8     92.8          51.4          -44.61%
BenchmarkStrcat-8     91.9          55.3          -39.83%
BenchmarkStrcat-8     96.1          52.6          -45.27%
BenchmarkStrcat-8     89.4          50.2          -43.85%
BenchmarkStrcat-8     91.2          51.5          -43.53%

如果我们给benchcmp传入-best命令行选项，benchcmp将分别从 old.txt 和 new.txt 中挑选性能最好的一条数据，然后进行比对：

$benchcmp -best old.txt new.txt                              
benchmark             old ns/op     new ns/op     delta
BenchmarkStrcat-8     89.4          50.2          -43.85%

benchcmp还可以按性能基准数据前后变化的大小对输出结果进行排序(通过-mag命令行选项)：

$benchcmp -mag old.txt new.txt
benchmark             old ns/op     new ns/op     delta
BenchmarkStrcat-8     96.1          52.6          -45.27%
BenchmarkStrcat-8     92.8          51.4          -44.61%
BenchmarkStrcat-8     89.4          50.2          -43.85%
BenchmarkStrcat-8     91.2          51.5          -43.53%
BenchmarkStrcat-8     91.9          55.3          -39.83%

不过性能基准测试的输出结果受到很多因素的影响，比如：同一测试的运行次数；性能基准测试与其他正在运行的程序共享一台机器；运行测试的系统本身就在虚拟机上，与其他虚拟机共享硬件；现代机器的一些节能和功率缩放（比如 CPU 的自动降频和睿频）等。这些因素都会造成即便是对同一个基准测试进行多次运行，输出的结果可能也有较大偏差。但benchcmp工具并不关心这些结果数据是否在统计学层面是有效的，只是对结果做简单对比。

为了提高对性能基准数据比对的科学性，Go 核心团队又开发了benchstat这款工具以替代benchcmp。下面我们用benchstat比较一下上面例子中的性能基准数据：

$benchstat old.txt new.txt 
name      old time/op  new time/op  delta
Strcat-8  92.3ns ± 4%  52.2ns ± 6%  -43.43%  (p=0.008 n=5+5)

我们看到即便我们的 old.txt 和 new.txt 中各自有 5 次运行的数据，但benchstat不会像benchcmp那样输出 5 行比较结果，而是输出一行经过统计学方法处理后的比较结果。以第二列数据92.3ns ± 4%为例，这是benchcmp对old.txt中的数据进行处理后的结果：± 4%是样本数据中最大值和最小值距样本平均值的最大偏差百分比。如果这个偏差百分比数值大于 5%，则说明样本数据质量不佳，有些样本数据是不可信的，由此可以看出我们这里 new.txt 中的样本数据就是质量不佳的。

benchstat 输出结果的最后一列(delta)为两次基准测试对比的变化量，我们看到采用strings.Join方法连接字符串的平均耗时要比采用原生操作符连接字符串的性能减少 43%，这个指标后面括号中的p=0.008是一个用于检验两个样本集合的均值是否有显著差异的指标。benchstat 支持两种检验算法，一种是 UTest(Mann Whitney UTest，曼-惠特尼 U 检验)，UTest 也是默认检验算法；另外一种是 Welch T 检验(TTest)。一般 p 值小于 0.05 的结果是可接受的。

上述两款工具也都支持对内存分配数据情况的前后比较，这里以 benchstat 为例：

$go test -run=NONE -count 5 -bench . strcat_test.go -benchmem > old_with_mem.txt
$go test -run=NONE -count 5 -bench . strcat_test.go -benchmem > new_with_mem.txt

$benchstat old_with_mem.txt new_with_mem.txt 
name      old time/op    new time/op    delta
Strcat-8    90.5ns ± 1%    50.6ns ± 2%  -44.14%  (p=0.008 n=5+5)

name      old alloc/op   new alloc/op   delta
Strcat-8     80.0B ± 0%     48.0B ± 0%  -40.00%  (p=0.008 n=5+5)

name      old allocs/op  new allocs/op  delta
Strcat-8      2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)

关于内存分配情况对比的输出独立于执行时间的输出，但结构上是一致的(输出列含义相同)，这里就不再赘述了。

Go 核心团队已经将benchcmp工具打上了“不建议使用(deprecation)”的标签，因此这里也建议大家以后使用benchstat来进行性能基准数据的比较。

4. 排除额外干扰，让基准测试更精确

从前面对顺序执行和并行执行的基准测试原理的介绍，我们知道每个基准测试都可能会运行多轮，每个BenchmarkXxx函数可能都会被重入执行多次。有些复杂的基准测试，在真正执行For循环之前或者在每个循环中除了执行真正的被测代码之外，可能还需要做一些测试准备工作，比如建立基准测试所需的测试上下文环境等。如果不做特殊处理，这些测试准备工作所消耗的时间也会被算入最终结果中，这就会导致最终基准测试的数据受到干扰而不足够精确。为此，testing.B中提供了多种灵活操控基准测试计时器的方法，通过这些方法可以排除掉额外干扰，让基准测试结果更能反映被测代码的真实性能情况。我们来看一个例子：

// benchmark_with_expensive_context_setup_test.go 
package benchmark

import (
    "strings"
    "testing"
    "time"
)

var sl = []string{
    "Rob Pike ",
    "Robert Griesemer ",
    "Ken Thompson ",
}

func concatStringByJoin(sl []string) string {
    return strings.Join(sl, "")
}

func expensiveTestContextSetup() {
    time.Sleep(200 * time.Millisecond)
}

func BenchmarkStrcatWithTestContextSetup(b *testing.B) {
    expensiveTestContextSetup()
    for n := 0; n < b.N; n++ {
        concatStringByJoin(sl)
    }
}

func BenchmarkStrcatWithTestContextSetupAndResetTimer(b *testing.B) {
    expensiveTestContextSetup()
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        concatStringByJoin(sl)
    }
}

func BenchmarkStrcatWithTestContextSetupAndRestartTimer(b *testing.B) {
    b.StopTimer()
    expensiveTestContextSetup()
    b.StartTimer()
    for n := 0; n < b.N; n++ {
        concatStringByJoin(sl)
    }
}

func BenchmarkStrcat(b *testing.B) {
    for n := 0; n < b.N; n++ {
        concatStringByJoin(sl)
    }
}

在这个例子中，我们来对比一下无需建立测试上下文、建立测试上下文以及在对计时器控制下建立测试上下文等几种情况的基准测试数据：

$go test -bench . benchmark_with_expensive_context_setup_test.go
goos: darwin
goarch: amd64
BenchmarkStrcatWithTestContextSetup-8                      16943037            65.9 ns/op
BenchmarkStrcatWithTestContextSetupAndResetTimer-8         21700249            52.7 ns/op
BenchmarkStrcatWithTestContextSetupAndRestartTimer-8       21628669            50.5 ns/op
BenchmarkStrcat-8                                          22915291            50.7 ns/op
PASS
ok      command-line-arguments    9.838s

我们看到：如果不通过 testing.B 提供的计数器控制接口对测试上下文带来的消耗进行隔离，最终基准测试得到的数据（如例子中的BenchmarkStrcatWithTestContextSetup)将偏离准确数据(例子中的BenchmarkStrcat）好远。而通过 testing.B 提供的计数器控制接口对测试上下文带来的消耗进行隔离后，得到的基准测试数据（如上面例子中的BenchmarkStrcatWithTestContextSetupAndResetTimer和BenchmarkStrcatWithTestContextSetupAndRestartTimer）则是很接近于真实数据。

虽然上面例子中，ResetTimer和StopTimer/StartTimer组合都能实现相同的对测试上下文带来的消耗进行隔离的目的，但二者还是有差别的。ResetTimer并不停掉计时器（无论计时器是否在工作），而是将已消耗的时间、内存分配计数器等全部清零，这样即便计数器依然在工作，它仍然需要从 0 开始重新记；而StopTimer只是简单的停掉一次基准测试运行的计时器，当调用StartTimer后，计时器恢复正常工作。

但这样一来，将ResetTimer或StopTimer用在每个基准测试的 For 循环中都是有副作用的。前面提到默认情况下，每个性能基准测试函数的执行时间为 1 秒。如果执行一轮所消耗的时间不足 1 秒，那么会修改b.N值并启动新的一轮执行。这样一旦在 For 循环中使用StopTimer，那么想要真正运行 1 秒就要等待很长时间；而如果在 For 循环中使用了ResetTimer，由于其每次执行都会将计数器数据清零，因此这轮基准测试将一直执行下去，无法退出。综上，**尽量不要在基准测试的 For 循环中使用ResetTimer！**但可以在限定条件的前提下在 For 循环中使用StopTimer/StartTimer，就像下面 Go 标准库中这样：

// $GOROOT/src/runtime/map_test.go
func benchmarkMapDeleteInt32(b *testing.B, n int) {
        a := make(map[int32]int, n)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
                if len(a) == 0 {
                        b.StopTimer()
                        for j := i; j < i+n; j++ {
                                a[int32(j)] = j
                        }
                        b.StartTimer()
                }
                delete(a, int32(i))
        }
}

我们看到上面标准库的测试代码中虽然在基准测试的 For 循环中使用了StopTimer，但其使用是在if len(a) == 0这个限定条件下的，即StopTimer方法不会在每次循环中都会被调用。

5. 小结

无论你是否认为性能很重要，请都为你的被测代码（尤其是位于系统关键业务路径上的代码）建立性能基准。如果你编写的是供其他人使用的软件包，则更应如此。只有这样，我们才能至少保证后续对代码做的修改不会带来性能的回退。并且已经建立的性能基准可以为后续进行是否进行进一步优化的决策提供数据支撑，而不是靠程序员的直觉。

本章要点：

性能基准测试在 Go 语言中是“一等公民”，在 Go 中我们可以很容易为被测代码建立性能基准；
了解 Go 的两种类型的性能基准测试的执行原理；
使用性能比较工具协助解读测试结果数据，优先使用benchstat工具；
使用testing.B提供的定时器操作方法排除额外干扰，让基准测试更精确。