Exploring Go’s Profile-Guided Optimization: A Dummy Example
Introduction
Go 1.20 introduced Profile-Guided Optimization (PGO) as a preview feature (it became generally available in Go 1.21), a compiler optimization technique that uses runtime profile information to make more intelligent optimization decisions. In this blog post, I’ll share my experience applying PGO to a simple Go web service and analyzing the results.
What is Profile-Guided Optimization?
Profile-Guided Optimization works by:
- Collecting runtime performance data while your application executes a representative workload (Go uses sampled pprof CPU profiles rather than an instrumented build)
- Using this real-world usage profile to guide optimization choices
- Rebuilding your application with optimizations targeting the most frequently executed code paths
The idea is that the compiler can make better optimization decisions when it knows which parts of your code are executed most frequently.
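In Go’s case, the profile is a standard pprof CPU profile. For a long-running service you would normally pull one from a net/http/pprof endpoint, but as a minimal sketch, here is how one can be captured directly with runtime/pprof (the cpu.pprof file name is my choice, nothing the toolchain requires):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write the sampled CPU profile to a file the compiler can consume later.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... run a representative workload here ...
}

Feeding the profile back in is then a single flag: go build -pgo=cpu.pprof. Alternatively, name the file default.pgo in the main package directory, which go build picks up automatically since Go 1.21.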
The GoPGO Demo Project
To explore PGO’s impact, I created a small demo web service with two main endpoints:
- /status: a simple status check endpoint
- /compute/:complexity: a CPU-intensive endpoint that performs multiple iterations of SHA-256 hashing
The implementation also includes a “cold path” function that is rarely called, allowing us to see how PGO prioritizes different code sections.
import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// computeIntensiveHash performs a CPU-intensive operation.
// The complexity parameter determines how many iterations of hashing are performed.
func computeIntensiveHash(data string, complexity int) string {
	result := []byte(data)
	// This is our "hot path" that will benefit most from PGO.
	for i := 0; i < complexity; i++ {
		h := sha256.New()
		h.Write(result)
		result = h.Sum(nil)
	}
	return hex.EncodeToString(result)
}

// coldPath is a function that is rarely called in our workload.
// PGO should deprioritize optimizing this.
func coldPath(input string) string {
	parts := strings.Split(input, "-")
	var result string
	// Rarely executed logic: upper-case the parts in reverse order,
	// joined with underscores.
	for i := len(parts) - 1; i >= 0; i-- {
		result += strings.ToUpper(parts[i])
		if i > 0 {
			result += "_"
		}
	}
	return result
}
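For completeness, here is a sketch of how the two routes might be wired up. The port, the handler bodies, and the blank net/http/pprof import are my assumptions rather than the demo’s exact code, but that pprof import is what makes profile collection possible later:

package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"strconv"
	"strings"
)

func main() {
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	http.HandleFunc("/compute/", func(w http.ResponseWriter, r *http.Request) {
		// Parse the :complexity path segment, e.g. /compute/1000.
		complexity, err := strconv.Atoi(strings.TrimPrefix(r.URL.Path, "/compute/"))
		if err != nil || complexity < 1 {
			http.Error(w, "invalid complexity", http.StatusBadRequest)
			return
		}
		fmt.Fprintln(w, computeIntensiveHash("seed", complexity))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}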
PGO Workflow Setup
I set up a Justfile to streamline the PGO workflow:
- Build the application normally
- Start the server and collect CPU profile data
- Generate representative load using benchmarks
- Rebuild with the collected profile data
- Compare performance before and after PGO
The workflow command (just pgo-workflow) automates this entire process, making it easy to experiment.
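I won’t reproduce the exact Justfile here, but a minimal sketch of the key recipes might look like this (the recipe names, the :8080 port, and the 30-second profile window are my assumptions):

# Baseline build with PGO explicitly disabled
build:
    go build -pgo=off -o server-nopgo .

# Pull a 30-second CPU profile from the running server
# (requires the net/http/pprof import shown earlier)
profile:
    curl -o default.pgo "http://localhost:8080/debug/pprof/profile?seconds=30"

# Rebuild with PGO; -pgo=auto picks up default.pgo in the main package
build-pgo:
    go build -pgo=auto -o server-pgo .

# Compare benchmark runs from before and after the PGO rebuild
compare:
    benchstat benchmark-before.txt benchmark-after.txt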
Benchmark Results
After running the PGO workflow, I compared the performance before and after applying PGO:
goos: linux
goarch: amd64
pkg: gocoon.dev/goPGO
cpu: 13th Gen Intel(R) Core(TM) i5-1340P
                              │ benchmark-before.txt │  benchmark-after.txt  │
                              │        sec/op        │  sec/op    vs base    │
StatusEndpointLive-16                    98.73µ ± 3%   94.12µ ± 4%    -4.68% (p=0.002 n=6)
ComputeEndpointLive1000-16               296.3µ ± 2%   375.5µ ± 3%   +26.70% (p=0.002 n=6)
ComputeEndpointLive10000-16              298.7µ ± 5%   384.4µ ± 1%   +28.66% (p=0.002 n=6)
ComputeEndpointLive100000-16             298.6µ ± 3%   376.3µ ± 4%   +26.03% (p=0.002 n=6)
ColdPathLive-16                          109.7µ ± 4%   110.7µ ± 5%         ~ (p=0.699 n=6)
geomean                                  195.6µ        224.1µ        +14.59%
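For context, the ...Live benchmarks drive the running server over real HTTP, so every number above includes client and network-stack time, not just the handler. A sketch of what one of them might look like (the URL and port are assumptions):

package main

import (
	"io"
	"net/http"
	"testing"
)

func BenchmarkComputeEndpointLive1000(b *testing.B) {
	for i := 0; i < b.N; i++ {
		resp, err := http.Get("http://localhost:8080/compute/1000")
		if err != nil {
			b.Fatal(err)
		}
		// Drain and close the body so connections are reused across iterations.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}

The comparison table itself comes from benchstat, which aggregates six runs of each benchmark (the n=6 in the table) and reports whether the difference is statistically significant.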
Surprising Results and Analysis
The results were quite surprising:
Status Endpoint: The simple status endpoint showed a modest performance improvement of 4.68%, a pleasant surprise given how little computational work it does.
Compute Endpoint: Unexpectedly, all three complexity levels of the compute endpoint showed a significant performance degradation of about 26-29%. This was contrary to what we might expect from PGO.
Cold Path: The cold path showed no statistically significant change, which is expected since it was rarely executed during profiling.
Memory Usage and Allocations: There were no significant changes in bytes allocated per operation (B/op) or allocations per operation (allocs/op), confirming that the performance differences were primarily in execution time.
What Happened?
Several factors might explain these unexpected results:
Profile Quality: The profile might not have captured a truly representative workload for our application.
Optimization Tradeoffs: The compiler might have made tradeoffs that benefited other parts of the code at the expense of our compute function.
Inlining Decisions: PGO might have made different function inlining decisions based on the profile data; one way to inspect this is sketched after this list.
Benchmark Methodology: Our benchmark methodology might not align perfectly with real-world usage patterns.
Server and Network Overhead: The live server benchmarks include HTTP round-trip overhead, which might mask or amplify the impact of code optimizations. It is telling that all three complexity levels clocked in at roughly 300µs before PGO even though their iteration counts differ by two orders of magnitude, which suggests the measurements are dominated by something other than the hashing loop.
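The inlining hypothesis, at least, is checkable: the compiler reports its inlining decisions under -gcflags=-m, so you can diff a plain build against a PGO build. A rough sketch (the output file names are mine; the diagnostics land on stderr):

go build -pgo=off -gcflags=-m ./... 2> inline-nopgo.txt
go build -pgo=default.pgo -gcflags=-m ./... 2> inline-pgo.txt
diff inline-nopgo.txt inline-pgo.txt

Lines that appear only in the PGO output (for example, extra "inlining call to" messages) show where the profile changed the compiler’s mind.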
Lessons Learned
This experiment highlighted several important lessons about PGO:
PGO isn’t magic: It doesn’t automatically improve performance in all cases and can sometimes lead to performance regressions.
Profile representativeness matters: The profile must accurately reflect your application’s real-world usage patterns.
Benchmark thoroughly: Test multiple scenarios and workloads to understand the full impact of PGO.
Focus on hot paths: PGO works best when there are clear hot paths in your application.
Analyze results carefully: Don’t assume all performance changes are due to PGO; consider other factors like system load and benchmark variability.
Conclusion
Profile-Guided Optimization is a powerful technique, but it requires careful application and thorough testing. In our experiment, we saw improvements in some areas and regressions in others, highlighting the complex nature of compiler optimizations.
The Go team continues to improve PGO with each release, so its effectiveness will likely increase over time. For now, it’s best to experiment with PGO in your own applications, measure carefully, and make data-driven decisions about whether to adopt it in production.
Remember that performance optimization is a journey, not a destination. PGO is just one tool in your performance toolkit, alongside profiling, benchmarking, and good algorithm design.