Troubleshoot job: Truncate logs based on characters
What does this MR do and why?
This MR adds truncation by character count for the GA release. Lines can be any length, so even a very low line limit can be unreliable.
Truncating by character was part of the original RCA experiment, but the issue is that we don't have a great way of calculating tokens on the Ruby side of things. Making an API request to the Anthropic endpoint is the only reliable way to count tokens.
What we know
- In the experimental version of RCA, we found that using Google's 1:4 token-to-character approximation still led to the logs being too long.
- Lines can be any length so even a very low line number could be unreliable.
- For some logs, 1,000 lines is over 200k tokens.
- Anthropic Claude has a 200k-token context limit.
  - Chat history counts towards the limit.
  - We send up to 50 past messages of up to 20k characters each, so it would be easy to go over 200k.
- We don't have a great way of calculating tokens on the Ruby side of things.
- Anthropic Claude 3.5's tokenization model is proprietary, but it looks [roughly 4:1, like Google's](https://github.com/javirandor/anthropic-tokenizer).
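The character-based approximation described above can be sketched as a small helper. This is illustrative only (the method names and the `CHARS_PER_TOKEN` constant are hypothetical, not existing GitLab code), since Claude's real tokenizer is proprietary:

```ruby
# Hypothetical helper: estimate a token count from raw text using a
# configurable characters-per-token ratio. Claude's tokenizer is
# proprietary, so any ratio here is only an approximation.
CHARS_PER_TOKEN = 4.0 # the rough 4:1 ratio mentioned above

def estimated_tokens(text, chars_per_token: CHARS_PER_TOKEN)
  (text.length / chars_per_token).ceil
end

# The conservative fallback discussed below treats every character
# as one token (a 1:1 ratio):
def conservative_tokens(text)
  estimated_tokens(text, chars_per_token: 1.0)
end
```

Dropping to 1:1 trades wasted context window for safety: an estimate can never undercount, so a truncated prompt can never blow the limit.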
Proposal
Given that Google's guidance only worked reliably when we dropped to 1:1 (possibly because of the kinds of characters found in raw logs), we can start with 1 char : 1 token here and adjust further later.
Allowing for at least five 20k-character responses in the window means we can truncate the prompt to 100k characters.
i.e. 200,000 - (5*20,000) = 100,000
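The budget arithmetic above, plus the proposed truncation, could look roughly like this. The constant and method names are illustrative, not the actual implementation, and keeping the tail of the log is an assumption (the most recent lines usually contain the failure):

```ruby
# Sketch of the proposed character budget, using the limits quoted above.
CONTEXT_LIMIT_TOKENS = 200_000 # Claude's context limit
RESERVED_MESSAGES    = 5       # responses to leave room for
MAX_MESSAGE_CHARS    = 20_000  # per-message character cap

# With the conservative 1 char : 1 token assumption:
PROMPT_CHAR_BUDGET = CONTEXT_LIMIT_TOKENS - (RESERVED_MESSAGES * MAX_MESSAGE_CHARS)
# 200,000 - (5 * 20,000) = 100,000

def truncate_log(log, limit: PROMPT_CHAR_BUDGET)
  return log if log.length <= limit

  # Keep the tail of the log (assumption: the end of a failed job's
  # log is where the relevant error usually appears).
  log[-limit, limit]
end
```

Later, if we tune the ratio above 1:1, only `PROMPT_CHAR_BUDGET` needs to change; the truncation itself stays character-based.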
Related to: #474146 (closed)