A Small Note on Backoffs

Unfortunately, exchanging data with external services doesn’t always go smoothly. A mundane example is that the external service could simply be not reachable at the moment you request data from it: It could be offline for maintenance or it could be under heavy load, unable to respond to your request in time. To make your own applications depending on external services more robust, they will need to deal with these temporary downtimes.

One way to do so is using retries with backoffs. To account for the external service being under heavy load, you back off and give it some time to recover until you retry. Naturally, this is an improvement over simply hammering the server with requests in case it’s already under heavy load.

How much time you let pass between retries is a matter of choice. You could be clever and incorporate the concrete exception or HTTP status code into your decision like in boto3’s adaptive mode or you could simply double the time until the next retry. For example, the Python library httpx does this while its older sibling requests also allows for adding some randomization in.

In the context of serverless functions, you can’t retry indefinitely because you’ll hit a timeout (for example, the maximum timeout for an AWS lambda function is 15 minutes). Let us a take a look at how to relate a given timeout with the number of retries.

Relating a timeout requirement with the number of retries Link to heading

Following the simpler implementation of retries with backoffs in httpx, we immediately retry after the first failure, then wait $b$ seconds ($b$ is the configurable backoff factor) after the second and wait twice as long after each following failure until we’ve reached the maximum number $n$ of retries. We can compute the total backoff time in this case as

\[ 0 + b 2^0 + b 2^1 + \dotsb + b 2^{n-2} = \sum_{i=0}^{n-2} b 2^i. \]

This happens to be a geometric sum for which we have a closed formula: \[ \sum_{i=0}^{n-2} b 2^i = \frac{b (2^{n-1} - 1)}{2 - 1} = b(2^{n-1} - 1) \]

A necessary condition for your serverless function to complete successfully is that this backoff time is smaller than or equal to the timeout $t$. I’d like to call this the Retry-Timeout-Inequality:

\[ b(2^{n-1} - 1) \leq t. \]

The way we’ve written the inequality down is probably not the most useful one: Here, given the number of retries you’d get a lower bound for your timeout. What’s more interesting is fixing the timeout $t$ and then computing how many retries we may afford. Let’s rearrange to obtain $2^{n-1} \leq t/b + 1$. Then, applying the $2$-logarithm and using the logarithmic identities, we end up with the following.

\[\begin{align*} n &\leq \log_2(t/b + 1) + 1 \\ &= \log_2(b+t) - \log_2(b) + 1 \end{align*}\]

Let’s call this the logarithmic form of the Retry-Timeout-Inequality.

Example. Let’s assume our backoff factor is $1/2$ (as is the default for httpx). Then the right-hand-side of the logarithmic form simplifies even further:

\[\begin{align*} &\log_2(t+1/2) - \log_2(1/2) + 1 \\ &= \log_2\left(\frac{2t+1}{2}\right) + 2 \\ &= \log_2(2t + 1) + 1 \end{align*}\]

If we assume further that our timeout happens to be $t = 30$ seconds, we see that $2t + 1 = 61 < 64 = 2^6$ so that

\[\begin{align*} n &\leq \log_2(2t+1) + 1 \\ &< \log_2(2^6) + 1 \\ &= 7. \end{align*}\]

This means that we can retry 6 times without reaching the timeout, at least not from the backoff times alone.

Conclusion Link to heading

To sum it up and generalize the last example, you can compute the maximum number of retries when using httpx as follows.

Find the smallest power of two greater than $2t + 1$ where $t$ is your desired timeout in seconds. This means, find an integer $k$ such that $2^{k-1} < 2t + 1 < 2^k$.
Your maximum number of retries is $k$.