Adjustments around bot long-polling behavior
Description
This PR addresses a few issues relating to the long-polling of bots.
- Update MAX_WORKER_TTL to be 300 seconds instead of 3600.
- Use a percentage of the given deadline (80%) instead of the entire value minus NETWORK_TIMEOUT (previously 1s).
- Increase NETWORK_TIMEOUT to 3 seconds from 1 second.

MAX_WORKER_TTL is lowered to 300 seconds to handle more gracefully the case where no request-timeout is specified on the client side. We have logic for when the deadline is None, but depending on the client language an unset request-timeout might end up being an arbitrarily large uint64 value instead. 300 was chosen because it was the previous default of MAX_JOB_BLOCK_TIME, which MAX_WORKER_TTL replaced.

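
As a rough illustration of why the lower cap matters, the sketch below clamps a client-supplied deadline to MAX_WORKER_TTL. It is not the BuildGrid implementation: the function name `clamp_deadline` is hypothetical, and only the constant name and its value come from this description.

```python
from typing import Optional

MAX_WORKER_TTL = 300  # seconds; the new value in this PR (previously 3600)


def clamp_deadline(deadline: Optional[float]) -> float:
    """Return a usable deadline in seconds, never above MAX_WORKER_TTL.

    Hypothetical helper: covers both an unset deadline (None) and a client
    that reports "unset" as a very large integer.
    """
    if deadline is None:
        return MAX_WORKER_TTL
    return min(deadline, MAX_WORKER_TTL)


if __name__ == "__main__":
    for value in (None, 30, 2**64 - 1):
        print(value, "->", clamp_deadline(value))
```

With the cap at 300, a client that sends the largest uint64 value is treated the same as one that sends no deadline at all, rather than being held for up to an hour.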
The adjustments around NETWORK_TIMEOUT were made to give BuildGrid more time to respond to requests when there is no work available. We've seen cases where, with a large threadpool, it often took longer than the 1s we previously allocated to stop waiting for work and finish the request. This would result in buildbox-worker crashing and potentially restarting, making the issue worse. By using a percentage of the given deadline and increasing the minimum value we allow (NETWORK_TIMEOUT), we ensure that BuildGrid has enough time to finish requests, barring more serious issues.
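
To make the difference concrete, the sketch below compares the response margin left by the previous approach (the deadline minus a fixed 1s) with the margin left by waiting for 80% of the deadline. The function and constant names are hypothetical, and the NETWORK_TIMEOUT minimum is left out because how it is applied is not spelled out here.

```python
DEADLINE_PERCENT = 80  # hypothetical name for the 80% factor described above


def old_wait(deadline: float, network_timeout: float = 1.0) -> float:
    """Previous behaviour: wait for the whole deadline minus a fixed timeout."""
    return deadline - network_timeout


def new_wait(deadline: float) -> float:
    """New behaviour: wait for a percentage of the deadline."""
    return deadline * DEADLINE_PERCENT / 100


if __name__ == "__main__":
    for deadline in (10, 60, 300):
        old_margin = deadline - old_wait(deadline)
        new_margin = deadline - new_wait(deadline)
        print(f"deadline={deadline}s old_margin={old_margin}s new_margin={new_margin}s")
```

The fixed margin stays at 1s no matter how long the deadline is, while the percentage-based margin grows with the deadline, which is what gives the server room to wind down a request under a large threadpool.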