bench

This is a standard benchmarking system for Axon. It runs 5-layer fully connected networks of various sizes, with the number of events and epochs adjusted so that each size takes roughly the same amount of time overall.

First, build the executable:

$ go build
  • run_bench.sh is a script that runs the standard configurations; additional args such as threads=2 can be passed to test different threading levels (see the example after this list).

  • bench_results.md has the algorithmic / implementational history for different versions of the code, on the same platform (MacBook Pro).

  • run_hardware.sh is a script specifically for hardware testing: it runs the standard 1, 2, and 4 threads for each network size, and reports only the final result, in the form shown in bench_hardware.md.

  • bench_hardware.md has standard results for different hardware.
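
For example, to build and then run the standard configurations with two threads (using the threads argument described above):

$ go build
$ ./run_bench.sh threads=2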

Benchmark Results

For the code, see bench_test.go.

BenchmarkNeuronFun

Goal: See how well the Neuron function scales to more threads

How: We construct a network just like in bench.go, but without setting up the Synapses to make construction faster. In the inner loop of the benchmark, we just call NeuronFun.
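
A minimal sketch of that setup as a Go benchmark; buildNetNoSynapses and the NeuronFun call here are hypothetical placeholders for illustration, not the actual code in bench_test.go:

package bench

import "testing"

// Sketch only: construct the network without synapses (as described above),
// then make the neuron-state update the only work in the timed loop.
func BenchmarkNeuronFun(b *testing.B) {
	net, ctx := buildNetNoSynapses() // hypothetical helper: bench.go setup minus Synapses
	b.ResetTimer()                   // exclude construction cost from the timing
	for i := 0; i < b.N; i++ {
		net.NeuronFun(ctx) // call signature assumed for illustration
	}
}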

Notes:

  • Because we call NeuronFun without calling SendSpike in between, the Neuron[] slice will remain cached if it's small enough. However, the timings of NeuronFun in this benchmark match the timings of NeuronFun during full runs of the benchmark net. This probably means we're compute-bound, so caching vs. not caching doesn't matter.

Observed outcome:

  • NeuronFun scales linearly with the number of threads, as long as there is enough work to divide among them.

Napkin math

Back-of-the-envelope memory demand calculations for the major parts of Axon.

Network:

  • configuration from bench.go
  • 2048 neurons per layer
  • 5 layers
  • 10 epochs
  • 1 thread
NeuronFun

This function updates the neuron state. It's called once for every cycle.

Total time: 21s out of 70s total

  • Size of the Neuron struct: ~80 parameters * 4B (32-bit) = 320B
  • 5 layers * 2048 neuronsPerLayer * 320B = 3.2MB for the complete Neuron[] slice
  • 10 epochs * 10 patterns * 4 quarters * 50 cycles = 20K calls to NeuronFun
  • Each call to NeuronFun iterates over the whole Neuron[] slice, updating the parameters of each neuron
  • Effective Bandwidth (just Neuron[] slice!): 20K * 3.2MB = 64GB of memory access, 64GB/21s = 3GB/s
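
The same arithmetic as a runnable check (all constants taken from the bullets above):

package main

import "fmt"

func main() {
	const (
		neuronBytes = 80 * 4           // ~80 params * 4B (32-bit) = 320B per neuron
		neurons     = 5 * 2048         // layers * neurons per layer
		calls       = 10 * 10 * 4 * 50 // epochs * patterns * quarters * cycles
		seconds     = 21.0             // time spent in NeuronFun
	)
	sliceMB := float64(neurons*neuronBytes) / 1e6 // ~3.3 MB Neuron[] slice
	totalGB := float64(calls) * sliceMB / 1e3     // ~66 GB of memory access
	fmt.Printf("%.1f MB slice, %.0f GB total, %.1f GB/s\n", sliceMB, totalGB, totalGB/seconds)
}
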
Conclusions / Thoughts (for NeuronFun)
  • The Neuron slice is very small, and the effective bandwidth we're hitting is far from what is possible -> not bandwidth bound.
  • The Neuron slice is not being cached, as it gets evicted from the cache by the SynapseFun that comes after.
  • Memory for the Neuron slice only grows linearly. Memory access is predictable, and can be made sequential.
  • This function is easily parallelizable, and very amenable to converting to GPU.
    • The current parallelization effort seems to work well, giving us linear scaling as long as the problem size is big enough.
  • We're far from saturating the memory bandwidth. Why is it still sort-of slow?
    • Could be that we're doing a lot of computation.
    • Our memory access is somewhat random (most of the struct members are fairly spread out) -> from the benchmarks, this doesn't look like it explains more than 2x.
SendSpikeFun

This function sends a spike to receiving pathways if the neuron has spiked. It's called once for every cycle, just like NeuronFun.

Total time: 26s out of 70s total, almost all of it spent in Axon.SendSpike

  • Projection GBuf (conductance buffer) size is #RecvNeurons * (maxDelay+1). It's a ring buffer, with this layout: nrn0time0 | nrn0time1 | nrn0time2 | nrn1time0 | nrn1time1 | ... (see the first sketch after this list).

  • How many of these entries do we access during each round? Just delay=0 and delay=maxDelay?

  • Could also be represented using maxDelay-many pointers, each to a #RecvNeurons-sized array, exchanging the pointers during the timestep update (similar to multiple buffering).

  • SendSpike performs a matrix product for each pathway: y = Wx, where x is a binary vector indicating whether each Neuron has spiked or not (see the second sketch after this list). How sparse is this vector?

  • FLOPs: 2 * #outputs * #inputs, where W is a #outputs x #inputs matrix

  • Memory demand: ideally, loading only W + x, and storing y.

  • So -> ~2n^2 FLOPs and ~n^2 memory loads -> constant operational intensity (about 2 FLOPs per load).

  • Synapses: ~11 parameters * 4B (32-bit) = 44B

  • Projection: #inNrn * #outNrn * 44B (the actual Synapse structs) + 3 * 4B * #inNrn * #outNrn (indexing arrays, based on the connection Pattern)

  • For bench.go, we have 7 projections, each with (totals worked out after this list):

    • Syn[] struct: 2025*2025 Synapses * 44B
    • GBuf: 3 (max delay) * 2025 (neurons) * 4B (so small compared to the others that it doesn't really matter)
    • RecvConIndex: 2025*2025 * 4B
    • RecvSynIndex: 2025*2025 * 4B
    • SendConIndex: 2025*2025 * 4B
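
A sketch of the GBuf ring-buffer layout described above: neuron-major, maxDelay+1 slots per receiving neuron, with one ring index advanced per cycle. This is index math for illustration, not the actual axon implementation:

// GBuf holds one conductance value per (receiving neuron, delay slot),
// laid out as nrn0time0 | nrn0time1 | ... as described above.
type GBuf struct {
	buf      []float32
	maxDelay int // slots per neuron = maxDelay + 1
	cur      int // ring index of the "now" slot; advances every cycle
}

func NewGBuf(recvNeurons, maxDelay int) *GBuf {
	return &GBuf{buf: make([]float32, recvNeurons*(maxDelay+1)), maxDelay: maxDelay}
}

// slot maps (neuron, delay-from-now) to a flat index in the ring.
func (g *GBuf) slot(nrn, delay int) int {
	n := g.maxDelay + 1
	return nrn*n + (g.cur+delay)%n
}

// Send deposits conductance that will arrive delay cycles from now.
func (g *GBuf) Send(nrn, delay int, gval float32) {
	g.buf[g.slot(nrn, delay)] += gval
}

// Step reads and clears the current slot for each neuron, then advances the
// ring. The pointer-swap alternative mentioned above would instead rotate
// maxDelay+1 separate per-neuron arrays here (multiple buffering).
func (g *GBuf) Step(apply func(nrn int, gval float32)) {
	n := g.maxDelay + 1
	for nrn := 0; nrn < len(g.buf)/n; nrn++ {
		idx := nrn*n + g.cur
		apply(nrn, g.buf[idx])
		g.buf[idx] = 0
	}
	g.cur = (g.cur + 1) % n
}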
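
And a sketch of the y = Wx product exploiting the binary spike vector: only columns of W whose sender actually spiked are touched, so the cost scales with the number of spikes rather than the dense 2 * #outputs * #inputs FLOPs (illustrative only, not the axon code):

// sendSpikes accumulates y += W*x for a binary spike vector x.
// W is row-major, #outputs x #inputs, matching the FLOP count above.
func sendSpikes(y []float32, w []float32, spiked []bool) {
	nOut, nIn := len(y), len(spiked)
	for si := 0; si < nIn; si++ {
		if !spiked[si] {
			continue // sparse x: silent senders cost nothing
		}
		for ri := 0; ri < nOut; ri++ {
			y[ri] += w[ri*nIn+si] // add column si of W to y
		}
	}
}

Note that the column access w[ri*nIn+si] is strided; storing W column-major (or ordering synapses by sender, as the Send*Index arrays suggest) makes each spike's memory access sequential.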
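
Summing the per-projection numbers above (a quick derived total):

package main

import "fmt"

func main() {
	const (
		n        = 2025 // per-layer size used in the numbers above
		synBytes = 44   // ~11 parameters * 4B
		nProj    = 7
	)
	syn := n * n * synBytes // Syn[] structs: ~180 MB
	idx := 3 * n * n * 4    // three 4B index arrays: ~49 MB
	gbuf := 3 * n * 4       // ~24 KB, negligible
	perProj := syn + idx + gbuf
	fmt.Printf("per projection: %.0f MB, total: %.1f GB\n",
		float64(perProj)/1e6, float64(nProj*perProj)/1e9)
}

So the synapse-level state (~1.6 GB across 7 projections) dwarfs the ~3.2 MB Neuron[] slice, consistent with the Neuron slice getting evicted between NeuronFun calls.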

Documentation

Overview

bench runs a benchmark model with 5 layers (3 hidden, Input, Output), all of the same size, for benchmarking networks of different sizes. These are not particularly realistic models for actual applications (e.g., large models tend to have much more topographic connectivity patterns and larger layers with fewer connections), but they are easy to run.

Constants

This section is empty.

Variables

var ParamSets = params.Sets{
	"Base": {Desc: "these are the best params", Sheets: params.Sheets{
		"Network": &params.Sheet{
			{Sel: "Path", Desc: "",
				Params: params.Params{
					"Path.Learn.LRate.Base": "0.1",
					"Path.SWts.Adapt.LRate": "0.1",
					"Path.SWts.Init.SPct":   "0.5",
				}},
			{Sel: "Layer", Desc: "",
				Params: params.Params{
					"Layer.Inhib.ActAvg.Nominal": "0.08",
					"Layer.Inhib.Layer.Gi":       "1.05",
					"Layer.Acts.Gbar.L":          "0.2",
				}},
			{Sel: "#Input", Desc: "",
				Params: params.Params{
					"Layer.Inhib.Layer.Gi": "0.9",
					"Layer.Acts.Clamp.Ge":  "1.5",
				}},
			{Sel: "#Output", Desc: "",
				Params: params.Params{
					"Layer.Inhib.Layer.Gi": "0.70",
					"Layer.Acts.Clamp.Ge":  "0.8",
				}},
			{Sel: ".BackPath", Desc: "top-down back-pathways MUST have lower relative weight scale, otherwise network hallucinates",
				Params: params.Params{
					"Path.PathScale.Rel": "0.2",
				}},
		},
	}},
}

Functions

func ConfigEpcLog

func ConfigEpcLog(dt *table.Table)

func ConfigNet

func ConfigNet(net *axon.Network, ctx *axon.Context, threads, units int, verbose bool)

func ConfigPats

func ConfigPats(dt *table.Table, pats, units int)

func TrainNet

func TrainNet(net *axon.Network, ctx *axon.Context, pats, epcLog *table.Table, epcs int, verbose, gpu bool)
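
A sketch of how these functions plausibly compose, inferred from the signatures above; the constructor names and import paths are assumptions, not part of this package's documented API:

package main

import (
	"cogentcore.org/core/tensor/table" // import path assumed
	"github.com/emer/axon/v2/axon"     // import path assumed
)

func main() {
	net := axon.NewNetwork("Bench") // constructor name assumed
	ctx := axon.NewContext()        // constructor name assumed
	ConfigNet(net, ctx, 1, 2048, true) // threads=1, units=2048, verbose

	pats := &table.Table{} // zero value; real code may use a constructor
	ConfigPats(pats, 10, 2048) // 10 patterns of 2048 units

	epcLog := &table.Table{}
	ConfigEpcLog(epcLog)

	TrainNet(net, ctx, pats, epcLog, 10, true, false) // 10 epochs, verbose, no GPU
}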

Types

This section is empty.
