Go: multithreading and parallelism

I love Go, I love to praise it (though, admittedly, I sometimes stray to other languages), and I love articles about it. I read the article "Go: Two years in production", then the comments. One thing became clear: Habr readers are optimists! They want to believe in the best.

By default, Go runs on a single OS thread, multiplexing goroutines with its own scheduler and asynchronous calls. (The programmer still gets the feeling of multithreading and parallelism.) In this mode, channels are very fast. But if you tell Go to use 2 or more threads, it starts taking locks, and channel performance can drop. I don't want to limit my use of channels, and most third-party libraries use channels at every opportunity. Therefore it is often more effective to run Go on a single thread, as it does by default.
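A side note: for programs that do not call runtime.GOMAXPROCS themselves, the number of threads can also be set from outside through the GOMAXPROCS environment variable, for example:

GOMAXPROCS=4 go run channel01.go

In the listings below the value is set explicitly in the code, so the environment variable plays no role there.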

channel01.go
package main

import "fmt"
import "time"
import "runtime"

func main() {
	numcpu := runtime.NumCPU()
	fmt.Println("NumCPU", numcpu)
	//runtime.GOMAXPROCS(numcpu)
	runtime.GOMAXPROCS(1)

	ch1 := make(chan int)
	ch2 := make(chan float64)

	// Producer: sends a million ints, then a -1 sentinel.
	go func() {
		for i := 0; i < 1000000; i++ {
			ch1 <- i
		}
		ch1 <- -1
		ch2 <- 0.0
	}()

	// Consumer: times each batch of 100000 receives and reports
	// the accumulated total when it sees the sentinel.
	go func() {
		total := 0.0
		for {
			t1 := time.Now().UnixNano()
			for i := 0; i < 100000; i++ {
				m := <-ch1
				if m == -1 {
					ch2 <- total
				}
			}
			t2 := time.Now().UnixNano()
			dt := float64(t2-t1) / 1000000.0
			total += dt
			fmt.Println(dt)
		}
	}()

	fmt.Println("Total:", <-ch2, <-ch2)
}



users-iMac:channel user$ go run channel01.go 
NumCPU 4
23.901
24.189
23.957
24.072
24.001
23.807
24.039
23.854
23.798
24.1
Total: 239.718 0


Now let's enable all the cores by swapping which of the two lines is commented out:

    runtime.GOMAXPROCS(numcpu)
    //runtime.GOMAXPROCS(1)


users-iMac:channel user$ go run channel01.go 
NumCPU 4
543.092
534.985
535.799
533.039
538.806
533.315
536.501
533.261
537.73
532.585
Total: 5359.113 0


More than 20 times slower! What's the catch? The channel is unbuffered: make(chan int) creates a channel with no buffer, so every send blocks until a receiver is ready.

	ch1 := make(chan int)
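
A quick way to see what make(chan int) gives you is to ask for the channel's capacity. This is a standalone sketch, unrelated to the benchmark code:

package main

import "fmt"

func main() {
	unbuffered := make(chan int)    // no buffer: every send waits for a receiver
	buffered := make(chan int, 100) // buffer of 100: sends queue up until the buffer is full

	fmt.Println(cap(unbuffered), cap(buffered)) // prints: 0 100
}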


Let's set the buffer to 100.

	ch1 := make(chan int, 100)


Result with 1 thread
users-iMac:channel user$ go run channel01.go 
NumCPU 4
9.704
9.618
9.178
9.84
9.869
9.461
9.802
9.743
9.877
9.756
Total: 0 96.848


Result with 4 threads
users-iMac:channel user$ go run channel01.go 
NumCPU 4
17.046
17.046
16.71
16.315
16.542
16.643
17.69
16.387
17.162
15.232
Total: 0 166.77300000000002


Now only about twice as slow, but a buffered channel is not always an option.

An example with a channel of channels (channel03.go)


package main

import "fmt"
import "time"
import "runtime"

func main() {
	numcpu := runtime.NumCPU()
	fmt.Println("NumCPU", numcpu)
	//runtime.GOMAXPROCS(numcpu)
	runtime.GOMAXPROCS(1)

	ch1 := make(chan chan int, 100)
	ch2 := make(chan float64, 1)

	// Requester: for each iteration creates a reply channel,
	// sends it through ch1 and waits for the answer.
	go func() {
		t1 := time.Now().UnixNano()
		for i := 0; i < 1000000; i++ {
			ch := make(chan int, 100)
			ch1 <- ch
			<-ch
		}
		t2 := time.Now().UnixNano()
		dt := float64(t2-t1) / 1000000.0
		fmt.Println(dt)
		ch2 <- 0.0
	}()

	// Responder: answers every reply channel it receives.
	go func() {
		for i := 0; i < 1000000; i++ {
			ch := <-ch1
			ch <- i
		}
		ch2 <- 0.0
	}()

	<-ch2
	<-ch2
}


Result with 1 thread
users-iMac:channel user$ go run channel03.go 
NumCPU 4
1041.489

Result with 4 threads
users-iMac:channel user$ go run channel03.go 
NumCPU 4
11170.616

So if you have 8 cores and are writing a server in Go, you should not rely on Go alone to parallelize the program. It may be better to start 8 single-threaded processes with a balancer in front of them, which can also be written in Go. We had a server in production that, after moving from a single-core machine to a 4-core one, started handling 10% fewer requests.
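For illustration, a minimal sketch of such a front balancer, assuming HTTP backends; the addresses and ports (8001-8004 behind :8000) and the round-robin choice are my own assumptions, not the production setup mentioned above:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	// Hypothetical single-threaded backend processes, one per core.
	backends := []string{
		"http://127.0.0.1:8001",
		"http://127.0.0.1:8002",
		"http://127.0.0.1:8003",
		"http://127.0.0.1:8004",
	}

	proxies := make([]*httputil.ReverseProxy, len(backends))
	for i, b := range backends {
		u, err := url.Parse(b)
		if err != nil {
			log.Fatal(err)
		}
		proxies[i] = httputil.NewSingleHostReverseProxy(u)
	}

	var next uint64
	handler := func(w http.ResponseWriter, r *http.Request) {
		// Round-robin over the backends.
		i := atomic.AddUint64(&next, 1) % uint64(len(proxies))
		proxies[i].ServeHTTP(w, r)
	}

	log.Fatal(http.ListenAndServe(":8000", http.HandlerFunc(handler)))
}

Each backend would itself be a normal Go server started with GOMAXPROCS=1.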

What do these numbers mean? Our task was to process 3000 requests per second within a single context (for example, handing each request consecutive numbers: 1, 2, 3, 4, 5... perhaps a little more complicated than that), and that rate of 3000 requests per second is limited primarily by the channels. Adding threads and cores does not make performance grow as eagerly as one would like. On modern hardware, 3000 requests per second of this kind is something of a limit for Go.
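For a concrete picture of that "single context", here is a rough sketch (the names and counts are illustrative, not our production code): one goroutine owns the counter and hands out consecutive numbers to concurrent requests through a channel of reply channels, in the same spirit as channel03.go.

package main

import "fmt"

func main() {
	// Each request carries its own reply channel.
	requests := make(chan chan int, 100)

	// The single owner of the shared counter.
	go func() {
		n := 0
		for reply := range requests {
			n++
			reply <- n
		}
	}()

	// A handful of concurrent "requests".
	done := make(chan bool)
	for i := 0; i < 5; i++ {
		go func() {
			reply := make(chan int, 1)
			requests <- reply
			fmt.Println("got", <-reply)
			done <- true
		}()
	}
	for i := 0; i < 5; i++ {
		<-done
	}
}

Every request passes through the requests channel, so throughput is bounded by exactly the channel costs measured above.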

Night Update: How to Optimize



The comments on the article "Go: Two Years in Production" prompted me to write this one, and the comments here have already surpassed those on the first.

Habr user cybergrind proposed the following optimization, which 8 other Habr users have already upvoted. I don't know whether they actually read the code or just went with their intuition, but I will explain it, so the article becomes more complete and informative.
Here is the code:

package main

import "fmt"
import "time"
import "runtime"

func main() {
	numcpu := runtime.NumCPU()
	fmt.Println("NumCPU", numcpu)
	//runtime.GOMAXPROCS(numcpu)
	runtime.GOMAXPROCS(1)

	ch3 := make(chan int)
	ch1 := make(chan int, 1000000)
	ch2 := make(chan float64)

	// Producer: fills ch1 completely, then unblocks the consumer via ch3.
	go func() {
		for i := 0; i < 1000000; i++ {
			ch1 <- i
		}
		ch3 <- 1
		ch1 <- -1
		ch2 <- 0.0
	}()

	// Consumer: blocks on ch3, so it starts only after the producer has
	// filled the buffer.
	go func() {
		fmt.Println("TT", <-ch3)
		total := 0.0
		for {
			t1 := time.Now().UnixNano()
			for i := 0; i < 100000; i++ {
				m := <-ch1
				if m == -1 {
					ch2 <- total
				}
			}
			t2 := time.Now().UnixNano()
			dt := float64(t2-t1) / 1000000.0
			total += dt
			fmt.Println(dt)
		}
	}()

	fmt.Println("Total:", <-ch2, <-ch2)
}


What is the essence of this optimization?

1. Channel ch3 was added. It blocks the second goroutine until the first goroutine has finished.
2. Since the second goroutine does not read from ch1 while it is being filled, a small buffer would block the first goroutine during the fill. Therefore the buffer of ch1 is enlarged to the full 1,000,000.
That is, the code is no longer parallel: it runs sequentially, and the channel is used as an array. Naturally, such code cannot make use of a second core, so in its context one cannot speak of an "ideal N-times speedup".

The main thing is that such code only works with an amount of data known in advance; it cannot run continuously and process information indefinitely, the way a live server must.
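To make that point concrete, here is a rough sketch of my own (not part of cybergrind's proposal) of what the pattern reduces to once the producer and consumer never overlap: fill a plain slice, then walk it.

package main

import (
	"fmt"
	"time"
)

func main() {
	// "Producer": fill the buffer completely before anyone reads it.
	buf := make([]int, 0, 1000000)
	for i := 0; i < 1000000; i++ {
		buf = append(buf, i)
	}

	// "Consumer": only starts after the producer is done.
	t1 := time.Now()
	var sum int64
	for _, v := range buf {
		sum += int64(v)
	}
	fmt.Println("consumed in", time.Since(t1), "sum", sum)
}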

Update 2: Tests on Go 1.1.2



Test number one with a buffer of 1 (channel01.go)

	ch1 := make(chan int, 1)


1 thread
go run channel01.go
NumCPU 4
66.0038
66.0038
67.0038
66.0038
67.0038
66.0038
65.0037
67.0038
67.0039
76.0043
Total: 0 673.0385000000001


4 threads
go run channel01.go
NumCPU 4
116.0066
186.0106
112.0064
117.0067
175.01
115.0066
114.0065
148.0084
133.0076
153.0088
Total: 0 1369.0782

Conclusion: much better. It is hard to see why anyone would set a buffer of 1, but perhaps there is a use for it.

Test number one with a buffer of 100 (channel01.go)

	ch1 := make(chan int, 100)


1 thread
go run channel01.go
NumCPU 4
16.0009
17.001
16.0009
16.0009
16.0009
16.0009
17.001
16.0009
17.001
16.0009
Total: 0 163.00930000000002


4 threads
go run channel01.go
NumCPU 4
66.0038
66.0038
67.0038
66.0038
67.0038
66.0038
65.0037
67.0038
67.0039
76.0043
Total: 0 673.0385000000001

Conclusion: about twice as slow as on Go 1.0.2.

Test number two (channel03.go)
1 thread
go run channel03.go
NumCPU 4
1568.0897


4 threads
go run channel03.go
NumCPU 4
12119.6932


Roughly the same as on Go 1.0.2, even slightly better: a ratio of 1:8 versus 1:10.