← All examples / Go / unhandled-error-http-retry
Go Medium

Unhandled Error in HTTP Client Retry Logic

Retry loop swallows errors on non-idempotent requests, silently dropping failed calls.

Repository notification-worker
PR Title Add exponential backoff to external API calls
Author @jgarcia
Files changed 1
Diff size +33 / -9 lines
worker/http_client.go
func (c *HTTPClient) DoWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) {
maxRetries := 3
var lastErr error
- for i := 0; i < maxRetries; i++ {
- resp, err := c.client.Do(req)
- if err == nil {
- return resp, nil
- }
- lastErr = err
- time.Sleep(time.Duration(i+1) * time.Second)
- }
- return nil, lastErr
+ retryableCodes := map[int]bool{
+ 429: true, // Too Many Requests
+ 500: true, // Internal Server Error
+ 502: true, // Bad Gateway
+ 503: true, // Service Unavailable
+ 504: true, // Gateway Timeout
+ }
+
+ for i := 0; i < maxRetries; i++ {
+ resp, err := c.client.Do(req)
+ if err == nil {
+ if resp.StatusCode < 500 && resp.StatusCode != 429 {
+ return resp, nil
+ }
+ if !retryableCodes[resp.StatusCode] {
+ return resp, nil
+ }
+ resp.Body.Close()
+ } else {
+ lastErr = err
+ }
+ time.Sleep(time.Duration(i+1) * time.Second)
+ }
+ return nil, lastErr
}

Summary

Good improvement — the new code properly distinguishes HTTP error codes (it retries 5xx and 429, passes through 4xx) and closes response bodies on retry to avoid resource leaks. However, there's a correctness gap: when a non-retryable HTTP response is received on the last retry attempt, the function returns the response and the caller processes it — but the caller may not expect a non-200 response after DoWithRetry, because the function name implies it succeeded. More importantly, lastErr is only set in the network error path — if the final attempt gets a 502, lastErr retains the value from the previous attempt (or is nil if the last attempt got an HTTP response), and return nil, lastErr returns nil for the response but the error may be stale or wrong.

Findings

Finding #1 — Stale Error Returned on Last Retry Failure Medium

Trace through the exact failure case: request sent 3 times.

  • Attempt 1: network error → lastErr = err1
  • Attempt 2: 502 → retries (no lastErr update here, since it's an HTTP response not a network error)
  • Attempt 3: 502 → retries, body closed, loop exits

After the loop, lastErr still holds err1 — the network error from attempt 1. The function returns (nil, err1). The caller receives a stale network error that has nothing to do with why the final request failed. If the network recovered between attempt 1 and 3, the real reason the request failed was a 502 from the upstream — and the caller has no visibility into that.

The fix: track the HTTP status code from each retry attempt, and if the final attempt returned a non-retryable HTTP response, return that response with a wrapped error (e.g., fmt.Errorf("final attempt: HTTP %d", statusCode)). If all attempts failed with network errors, return the last network error. If you can't distinguish, at minimum return a clear err that says "all retries exhausted" — not a stale error from an earlier attempt.

Finding #2 — Non-Retryable 4xx Returned Without Error Medium

If the upstream returns a 404 or 403 on any attempt (including the first), the current code returns (resp, nil). The caller gets a response with status 404 and no error. If this is an idempotent GET, that's fine. But if the caller is a POST (e.g., sending a notification), and the 404 means "recipient not found", the caller will treat this as success because err == nil.

Check whether the callers of DoWithRetry are GET or POST. If any are POST or state-changing, add a config flag to treat specific 4xx codes as errors. For a notification worker specifically, a 404 on a send-notification call is not a success — it should surface as an error.

Finding #3 — Retry Count Not Configurable Low

maxRetries is hardcoded to 3. For a worker sending notifications, 3 retries with exponential backoff (1s, 2s, 3s) means a failed notification can take 6+ seconds before returning an error to the caller. Consider making this configurable via an environment variable or the HTTPClient struct options. Also consider adding a jitter to the backoff to avoid thundering herd when multiple workers retry simultaneously.

Recommendation

The HTTP status code filtering is a solid improvement. Merge Finding #2 and #3 as quick follow-up fixes, but treat Finding #1 as a correctness bug that should block the release — the stale error behavior will make debugging failed requests significantly harder. When a 502 fails after 3 retries, the error message should say "502 after 3 retries" not "network timeout from attempt 1".

Suggested Fix

worker/http_client.go
func (c *HTTPClient) DoWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) {
    retryableCodes := map[int]bool{429: true, 500: true, 502: true, 503: true, 504: true}
    maxRetries := c.maxRetries // configure via HTTPClient option
    var lastHTTPStatus int

    for i := 0; i < maxRetries; i++ {
        resp, netErr := c.client.Do(req)
        if netErr != nil {
            lastHTTPStatus = 0
            lastErr := netErr
            if i < maxRetries-1 {
                time.Sleep(time.Duration(i+1) * time.Second)
                continue
            }
            return nil, fmt.Errorf("retry exhausted after %d attempts: %w", maxRetries, lastErr)
        }

        lastHTTPStatus = resp.StatusCode

        if resp.StatusCode < 500 && resp.StatusCode != 429 {
            // Non-retryable: pass through. Caller is responsible for handling 4xx.
            return resp, nil
        }

        resp.Body.Close()
        if i < maxRetries-1 {
            time.Sleep(time.Duration(i+1) * time.Second)
        }
    }

    // All retries exhausted with HTTP errors
    return nil, fmt.Errorf("retry exhausted: final status HTTP %d", lastHTTPStatus)
}

Why Codio flagged this

Go's HTTP client retry logic is a pattern Codio recognizes and analyzes structurally. It tracks how errors flow through retry loops and flags when an error variable from an earlier iteration can "leak" into the return value after a later iteration has changed the actual failure mode. The 4xx passthrough issue is a common API consumer mismatch — the function is named "DoWithRetry" and returns nil error for 404, which makes callers assume success. Codio identifies these semantic mismatches between what a function promises and what it actually delivers.

Get Codio reviewing your PRs

Ship reviews like this on every pull request. No reminders. No missed PRs.

Join the waitlist →
← Previous Race Condition in Async Cache Refresh Next → N+1 Query in ORM Lookup Loop