Documentation
Overview
Bigpi is an example bigmachine program that estimates the digits of Pi using the Monte Carlo method. It distributes the work by instantiating multiple machines and calling them to draw samples, then reports the total number of samples that fell inside the unit circle.
We can run it locally with a small number of samples to test it:
    % bigpi -n 1000000
    2018/03/16 15:21:05 waiting for machines to come online
    2018/03/16 15:21:08 machine http://localhost:63880/ RUNNING
    2018/03/16 15:21:08 machine http://localhost:63878/ RUNNING
    2018/03/16 15:21:08 machine http://localhost:63879/ RUNNING
    2018/03/16 15:21:08 machine http://localhost:63881/ RUNNING
    2018/03/16 15:21:08 machine http://localhost:63877/ RUNNING
    2018/03/16 15:21:08 all machines are ready
    2018/03/16 15:21:08 distributing work among 5 cores
    http://localhost:63878/: 2018/03/16 15:21:08 0/200000
    http://localhost:63880/: 2018/03/16 15:21:08 0/200000
    http://localhost:63879/: 2018/03/16 15:21:08 0/200000
    http://localhost:63881/: 2018/03/16 15:21:08 0/200000
    2018/03/16 15:21:08 total=784425 nsamples=1000000
    π = 3.1377
By using large EC2 instances, we can trivially distribute the work over hundreds of cores:
    % bigpi -bigsystem ec2 -bigec2type c5.18xlarge -n 1000000000000
    2018/03/20 21:00:05 waiting for machines to come online
    2018/03/20 21:01:09 machine https://ec2-54-213-185-145.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 machine https://ec2-35-164-137-2.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 machine https://ec2-34-208-105-231.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 machine https://ec2-34-211-149-59.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 machine https://ec2-34-223-251-92.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 all machines are ready
    2018/03/20 21:01:09 distributing work among 360 cores
    https://ec2-34-208-105-231.us-west-2.compute.amazonaws.com:2000/: 2018/03/20 21:01:09 0/2777777777
    https://ec2-34-223-251-92.us-west-2.compute.amazonaws.com:2000/: 2018/03/20 21:01:09 0/2777777777
    ...
    2018/03/20 21:13:27 total=785397678380 nsamples=1000000000000
    π = 3.141590713520
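The per-core sample counts in the two logs (200000 and 2777777777) are consistent with dividing the requested sample count evenly by the number of cores using integer division. The sketch below only reproduces that arithmetic. The function name perWorker is hypothetical, and how bigpi handles any leftover samples is not shown in the logs.

```go
package main

import "fmt"

// perWorker returns the share each worker receives under integer division,
// along with the leftover samples that do not divide evenly.
// (Hypothetical helper; an assumption about the split, not bigpi's code.)
func perWorker(nsamples, workers uint64) (each, leftover uint64) {
	return nsamples / workers, nsamples % workers
}

func main() {
	each, leftover := perWorker(1000000, 5)
	fmt.Println(each, leftover) // prints "200000 0"

	each, leftover = perWorker(1000000000000, 360)
	fmt.Println(each, leftover) // prints "2777777777 280"
}
```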
Once a bigmachine program is running, we can profile it using the standard Go pprof tooling. The returned profile is sampled from the whole cluster and merged. In the first iteration of this program, this helped find a bug: we were using the global rand.Float64, which requires a lock. The CPU profile made the lock contention easy to spot:
    % go tool pprof localhost:3333/debug/bigmachine/pprof/profile
    Fetching profile over HTTP from http://localhost:3333/debug/bigmachine/pprof/profile
    Saved profile in /Users/marius/pprof/pprof.045821636.samples.cpu.001.pb.gz
    File: 045821636
    Type: cpu
    Time: Mar 16, 2018 at 3:17pm (PDT)
    Duration: 2.51mins, Total samples = 16.80mins (669.32%)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) top
    Showing nodes accounting for 779.47s, 77.31% of 1008.18s total
    Dropped 51 nodes (cum <= 5.04s)
    Showing top 10 nodes out of 58
          flat  flat%   sum%        cum   cum%
       333.11s 33.04% 33.04%    333.11s 33.04%  runtime.procyield
       116.71s 11.58% 44.62%    469.55s 46.57%  runtime.lock
        76.35s  7.57% 52.19%    347.21s 34.44%  sync.(*Mutex).Lock
        65.79s  6.53% 58.72%     65.79s  6.53%  runtime.futex
        41.48s  4.11% 62.83%    202.05s 20.04%  sync.(*Mutex).Unlock
        34.10s  3.38% 66.21%    364.36s 36.14%  runtime.findrunnable
           33s  3.27% 69.49%        33s  3.27%  runtime.cansemacquire
        32.72s  3.25% 72.73%     51.01s  5.06%  runtime.runqgrab
        24.88s  2.47% 75.20%     57.72s  5.73%  runtime.unlock
        21.33s  2.12% 77.31%     21.33s  2.12%  math/rand.(*rngSource).Uint64
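One common way to avoid this kind of contention is to give each goroutine its own generator rather than sharing the package-level one. The following is a sketch under that assumption and is not taken from bigpi's source. The function sampleParallel, the worker count, and the seeding scheme are all hypothetical. Whether the global functions take a lock also depends on the Go release, so treat the motivation as specific to the version profiled above.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// sampleParallel draws n points across the given number of goroutines and
// returns how many fell inside the unit circle. Each goroutine owns a
// private *rand.Rand, so sampling never touches a shared generator.
// (Hypothetical helper; not bigpi's implementation.)
func sampleParallel(n uint64, workers int, seed int64) uint64 {
	counts := make([]uint64, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// Per-goroutine generator; distinct seeds are an arbitrary choice.
			rng := rand.New(rand.NewSource(seed + int64(w)))
			share := n / uint64(workers)
			if w == 0 {
				share += n % uint64(workers) // worker 0 takes the leftover
			}
			var inside uint64
			for i := uint64(0); i < share; i++ {
				x, y := rng.Float64(), rng.Float64()
				if x*x+y*y <= 1 {
					inside++
				}
			}
			counts[w] = inside // each goroutine writes only its own slot
		}(w)
	}
	wg.Wait()

	var total uint64
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	const n = 1000000
	total := sampleParallel(n, 4, 1)
	fmt.Printf("total=%d nsamples=%d\n", total, n)
	fmt.Printf("π = %v\n", 4*float64(total)/float64(n))
}
```

Because every goroutine writes to its own slot in counts and the slots are summed only after wg.Wait returns, the hot loop needs no mutex at all.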