
Troubleshoot job: Truncate logs based on characters

What does this MR do and why?

We should add truncation by character count for the GA release. Lines can be any length, so even a very low line limit could be unreliable.

Truncating by character was in the original RCA experiment, but the issue is that we don’t have a great way of calculating tokens on the Ruby side of things. Making an API request to the Anthropic endpoint is the only reliable way to count them.

What we know

  • In the experimental version of RCA, we found that using Google's token-to-character approximation (1:4) still left the logs too long.
  • Lines can be any length, so even a very low line limit could be unreliable.
    • For some logs 1000 lines is over 200k tokens
  • Anthropic Claude has a 200k token limit
  • Chat history counts towards the limit
    • We send up to 50 past messages, up to 20k chars each, so it would be pretty easy to go over 200k
  • We don’t have a great way of calculating tokens on the Ruby side of things
  • Anthropic claude-3.5's tokenizer is proprietary, but it looks roughly 4:1 (characters to tokens), like Google's (https://github.com/javirandor/anthropic-tokenizer).
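
To make the chat-history point concrete, here is a rough back-of-the-envelope sketch in Ruby. The method name and keyword arguments are hypothetical and only restate the figures listed above (50 messages, 20k characters each); the characters-per-token ratio is an approximation, not a real tokenizer.

```ruby
# Hypothetical helper: estimates the worst-case token cost of chat history
# from a characters-per-token ratio. This is an approximation only.
def estimated_history_tokens(messages:, chars_per_message:, chars_per_token:)
  (messages * chars_per_message) / chars_per_token
end

# Optimistic 4:1 ratio (4 characters per token)
estimated_history_tokens(messages: 50, chars_per_message: 20_000, chars_per_token: 4)
# => 250_000

# Conservative 1:1 ratio (1 character per token)
estimated_history_tokens(messages: 50, chars_per_message: 20_000, chars_per_token: 1)
# => 1_000_000
```

Under either ratio, a full chat history alone would exceed the 200k token limit before any log content is added.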

Proposal

Given that Google's guidance only worked flawlessly when we dropped to 1:1 (possibly because of the kinds of characters used in raw logs), we can start with 1 character : 1 token here. Later we can try to adjust this ratio further.

Reserving room for at least five 20k-character responses in the window means we can truncate the logs to 100k characters in the prompt.

i.e. 200,000 - (5*20,000) = 100,000
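
The calculation above could be sketched in Ruby roughly as follows. All constant and method names here are hypothetical and not taken from the actual change. The sketch also assumes that the end of the log is the part worth keeping, since failures usually surface near the end of a job; the description above does not say which end is kept.

```ruby
# Hypothetical constants; names are illustrative only.
TOKEN_LIMIT = 200_000          # Claude context limit cited above
RESERVED_RESPONSES = 5         # chat responses we want to leave room for
MAX_RESPONSE_CHARS = 20_000    # per-message character cap cited above
CHARS_PER_TOKEN = 1            # conservative 1:1 starting ratio

# Tokens reserved for responses, then the remaining budget converted to characters.
RESERVED_TOKENS = (RESERVED_RESPONSES * MAX_RESPONSE_CHARS) / CHARS_PER_TOKEN
MAX_LOG_CHARS = (TOKEN_LIMIT - RESERVED_TOKENS) * CHARS_PER_TOKEN

# Returns the log unchanged if it fits, otherwise its last max_chars characters.
# Keeping the tail is an assumption made for this sketch.
def truncate_log(log, max_chars = MAX_LOG_CHARS)
  return log if log.length <= max_chars

  log[log.length - max_chars, max_chars]
end

truncate_log("abcdef", 3) # => "def"
```

Because `String#length` counts characters rather than bytes, multi-byte characters in raw logs each count as one character against the budget.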

Related to: #474146 (closed)

Edited by Allison Browne
