Troubleshoot job: Truncate logs based on characters
What does this MR do and why?
This MR adds truncation by character count for the GA release. Lines can be any length, so even a very low line limit can be unreliable.
Truncating by character was part of the original RCA experiment, but the issue is that we don't have a great way of calculating tokens on the Ruby side of things. Making an API request to the Anthropic endpoint is the only reliable way to count tokens.
What we know
- In the experimental version of RCA, we found that using Google's 1:4 token-to-character approximation still led to the logs being too long.
- Lines can be any length so even a very low line number could be unreliable.
- For some logs, 1,000 lines is over 200k tokens.
- Anthropic Claude has a 200k-token context limit.
  - Chat history counts towards the limit.
  - We send up to 50 past messages of up to 20k characters each, so it would be easy to go over 200k.
- We don't have a great way of calculating tokens on the Ruby side of things.
- Anthropic Claude 3.5's tokenization model is proprietary, but it looks [roughly 4:1, like Google's](https://github.com/javirandor/anthropic-tokenizer).
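The character-based approximation described above can be sketched as a small helper. This is illustrative only (the method names and the `CHARS_PER_TOKEN` constant are hypothetical, not existing GitLab code), since Claude's real tokenizer is proprietary:

```ruby
# Hypothetical helper: estimate a token count from raw text using a
# configurable characters-per-token ratio. Claude's tokenizer is
# proprietary, so any ratio here is only an approximation.
CHARS_PER_TOKEN = 4.0 # the rough 4:1 ratio mentioned above

def estimated_tokens(text, chars_per_token: CHARS_PER_TOKEN)
  (text.length / chars_per_token).ceil
end

# The conservative fallback discussed below treats every character
# as one token (a 1:1 ratio):
def conservative_tokens(text)
  estimated_tokens(text, chars_per_token: 1.0)
end
```

Dropping to 1:1 trades wasted context window for safety: an estimate can never undercount, so a truncated prompt can never blow the limit.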
Proposal
Given that Google's guidance only worked reliably when we dropped to 1:1 (possibly because of the kinds of characters found in raw logs), we can start with 1 char : 1 token here and adjust further later.
Allowing for at least five 20k-character responses in the window means we can truncate the prompt to 100k characters.
i.e. 200,000 - (5*20,000) = 100,000
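The budget arithmetic above, plus the proposed truncation, could look roughly like this. The constant and method names are illustrative, not the actual implementation, and keeping the tail of the log is an assumption (the most recent lines usually contain the failure):

```ruby
# Sketch of the proposed character budget, using the limits quoted above.
CONTEXT_LIMIT_TOKENS = 200_000 # Claude's context limit
RESERVED_MESSAGES    = 5       # responses to leave room for
MAX_MESSAGE_CHARS    = 20_000  # per-message character cap

# With the conservative 1 char : 1 token assumption:
PROMPT_CHAR_BUDGET = CONTEXT_LIMIT_TOKENS - (RESERVED_MESSAGES * MAX_MESSAGE_CHARS)
# 200,000 - (5 * 20,000) = 100,000

def truncate_log(log, limit: PROMPT_CHAR_BUDGET)
  return log if log.length <= limit

  # Keep the tail of the log (assumption: the end of a failed job's
  # log is where the relevant error usually appears).
  log[-limit, limit]
end
```

Later, if we tune the ratio above 1:1, only `PROMPT_CHAR_BUDGET` needs to change; the truncation itself stays character-based.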
Related to: #474146 (closed)