Unhandled Error in HTTP Client Retry Logic
Summary
Good improvement — the new code properly distinguishes HTTP error codes (it retries 5xx and 429, passes through 4xx) and closes response bodies on retry to avoid resource leaks. However, there's a correctness gap: when a non-retryable HTTP response is received on the last retry attempt, the function returns the response and the caller processes it — but the caller may not expect a non-200 response after DoWithRetry, because the function name implies it succeeded. More importantly, lastErr is only set in the network error path — if the final attempt gets a 502, lastErr retains the value from the previous attempt (or is nil if the last attempt got an HTTP response), and return nil, lastErr returns nil for the response but the error may be stale or wrong.
Findings
Trace through the exact failure case: request sent 3 times.
- Attempt 1: network error →
lastErr = err1 - Attempt 2: 502 → retries (no
lastErrupdate here, since it's an HTTP response not a network error) - Attempt 3: 502 → retries, body closed, loop exits
After the loop, lastErr still holds err1 — the network error from attempt 1. The function returns (nil, err1). The caller receives a stale network error that has nothing to do with why the final request failed. If the network recovered between attempt 1 and 3, the real reason the request failed was a 502 from the upstream — and the caller has no visibility into that.
The fix: track the HTTP status code from each retry attempt, and if the final attempt returned a non-retryable HTTP response, return that response with a wrapped error (e.g., fmt.Errorf("final attempt: HTTP %d", statusCode)). If all attempts failed with network errors, return the last network error. If you can't distinguish, at minimum return a clear err that says "all retries exhausted" — not a stale error from an earlier attempt.
If the upstream returns a 404 or 403 on any attempt (including the first), the current code returns (resp, nil). The caller gets a response with status 404 and no error. If this is an idempotent GET, that's fine. But if the caller is a POST (e.g., sending a notification), and the 404 means "recipient not found", the caller will treat this as success because err == nil.
Check whether the callers of DoWithRetry are GET or POST. If any are POST or state-changing, add a config flag to treat specific 4xx codes as errors. For a notification worker specifically, a 404 on a send-notification call is not a success — it should surface as an error.
maxRetries is hardcoded to 3. For a worker sending notifications, 3 retries with exponential backoff (1s, 2s, 3s) means a failed notification can take 6+ seconds before returning an error to the caller. Consider making this configurable via an environment variable or the HTTPClient struct options. Also consider adding a jitter to the backoff to avoid thundering herd when multiple workers retry simultaneously.
Recommendation
The HTTP status code filtering is a solid improvement. Merge Finding #2 and #3 as quick follow-up fixes, but treat Finding #1 as a correctness bug that should block the release — the stale error behavior will make debugging failed requests significantly harder. When a 502 fails after 3 retries, the error message should say "502 after 3 retries" not "network timeout from attempt 1".
Suggested Fix
func (c *HTTPClient) DoWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) { retryableCodes := map[int]bool{429: true, 500: true, 502: true, 503: true, 504: true} maxRetries := c.maxRetries // configure via HTTPClient option var lastHTTPStatus int for i := 0; i < maxRetries; i++ { resp, netErr := c.client.Do(req) if netErr != nil { lastHTTPStatus = 0 lastErr := netErr if i < maxRetries-1 { time.Sleep(time.Duration(i+1) * time.Second) continue } return nil, fmt.Errorf("retry exhausted after %d attempts: %w", maxRetries, lastErr) } lastHTTPStatus = resp.StatusCode if resp.StatusCode < 500 && resp.StatusCode != 429 { // Non-retryable: pass through. Caller is responsible for handling 4xx. return resp, nil } resp.Body.Close() if i < maxRetries-1 { time.Sleep(time.Duration(i+1) * time.Second) } } // All retries exhausted with HTTP errors return nil, fmt.Errorf("retry exhausted: final status HTTP %d", lastHTTPStatus) }
Why Codio flagged this
Go's HTTP client retry logic is a pattern Codio recognizes and analyzes structurally. It tracks how errors flow through retry loops and flags when an error variable from an earlier iteration can "leak" into the return value after a later iteration has changed the actual failure mode. The 4xx passthrough issue is a common API consumer mismatch — the function is named "DoWithRetry" and returns nil error for 404, which makes callers assume success. Codio identifies these semantic mismatches between what a function promises and what it actually delivers.
Get Codio reviewing your PRs
Ship reviews like this on every pull request. No reminders. No missed PRs.