Unfortunately, exchanging data with external services doesn’t always go smoothly. A mundane example is that the external service could simply be not reachable at the moment you request data from it: It could be offline for maintenance or it could be under heavy load, unable to respond to your request in time. To make your own applications depending on external services more robust, they will need to deal with these temporary downtimes.
One way to do so is using retries with backoffs. To account for the external service being under heavy load, you back off and give it some time to recover until you retry. Naturally, this is an improvement over simply hammering the server with requests in case it’s already under heavy load.
How much time you let pass between retries is a matter of choice. You could be
clever and incorporate the concrete exception or HTTP status code into your
decision like in boto3
’s adaptive
mode
or you could
simply double the time until the next retry. For example, the Python library
httpx
does
this
while its older sibling requests
also allows for adding some randomization
in.
In the context of serverless functions, you can’t retry indefinitely because you’ll hit a timeout (for example, the maximum timeout for an AWS lambda function is 15 minutes). Let us a take a look at how to relate a given timeout with the number of retries.
Relating a timeout requirement with the number of retries Link to heading
Following the simpler implementation of retries with backoffs in httpx
, we
immediately retry after the first failure, then wait $b$ seconds ($b$ is the
configurable backoff factor) after the second and wait twice as long after
each following failure until we’ve reached the maximum number $n$ of retries. We
can compute the total backoff time in this case as
\[ 0 + b 2^0 + b 2^1 + \dotsb + b 2^{n-2} = \sum_{i=0}^{n-2} b 2^i. \]
This happens to be a geometric sum for which we have a closed formula: \[ \sum_{i=0}^{n-2} b 2^i = \frac{b (2^{n-1} - 1)}{2 - 1} = b(2^{n-1} - 1) \]
A necessary condition for your serverless function to complete successfully is that this backoff time is smaller than or equal to the timeout $t$. I’d like to call this the Retry-Timeout-Inequality:
\[ b(2^{n-1} - 1) \leq t. \]
The way we’ve written the inequality down is probably not the most useful one: Here, given the number of retries you’d get a lower bound for your timeout. What’s more interesting is fixing the timeout $t$ and then computing how many retries we may afford. Let’s rearrange to obtain $2^{n-1} \leq t/b + 1$. Then, applying the $2$-logarithm and using the logarithmic identities, we end up with the following.
\[\begin{align*} n &\leq \log_2(t/b + 1) + 1 \\ &= \log_2(b+t) - \log_2(b) + 1 \end{align*}\]
Let’s call this the logarithmic form of the Retry-Timeout-Inequality.
Example. Let’s assume our backoff factor is $1/2$ (as is the default for
httpx
). Then the right-hand-side of the logarithmic form simplifies even
further:
\[\begin{align*} &\log_2(t+1/2) - \log_2(1/2) + 1 \\ &= \log_2\left(\frac{2t+1}{2}\right) + 2 \\ &= \log_2(2t + 1) + 1 \end{align*}\]
If we assume further that our timeout happens to be $t = 30$ seconds, we see that $2t + 1 = 61 < 64 = 2^6$ so that
\[\begin{align*} n &\leq \log_2(2t+1) + 1 \\ &< \log_2(2^6) + 1 \\ &= 7. \end{align*}\]
This means that we can retry 6 times without reaching the timeout, at least not from the backoff times alone.
Conclusion Link to heading
To sum it up and generalize the last example, you can compute
the maximum number of retries when using httpx
as follows.
- Find the smallest power of two greater than $2t + 1$ where $t$ is your desired timeout in seconds. This means, find an integer $k$ such that $2^{k-1} < 2t + 1 < 2^k$.
- Your maximum number of retries is $k$.