Explore a way to handle test thresholds for older GitLab versions

As endpoints improve, we update GPT with tighter thresholds that reflect their new performance. One small consequence is that users who then run GPT against older GitLab versions will see those tests reported as failed, when the results are actually correct for the version their environment is running.

The task is to explore a way to handle this in the tests, possibly by detecting the target environment's GitLab version and adjusting the threshold accordingly.
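To make the version-detection idea concrete, here is a minimal sketch of the lookup logic only. It is written in Python for illustration and is not GPT's actual code. It assumes the version string has already been obtained somehow (for example from GitLab's `/api/v4/version` endpoint), and every name and number in it (`parse_version`, `threshold_for`, `ttfb_history`, the millisecond values) is hypothetical.

```python
from typing import Dict, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn a version string such as '13.2.0-ee' or '13.2.0-pre' into (13, 2, 0)."""
    # Drop any suffix after '-' (edition or pre-release marker).
    core = text.strip().split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def threshold_for(version: str, history: Dict[str, float], default: float) -> float:
    """Pick the threshold that applied at the given GitLab version.

    `history` maps "version the threshold was introduced in" to the threshold
    value. The entry with the highest version that is not newer than `version`
    wins. If the target is older than every entry, `default` is returned.
    """
    target = parse_version(version)
    eligible = [
        (parse_version(since), value)
        for since, value in history.items()
        if parse_version(since) <= target
    ]
    if not eligible:
        return default
    # Most recent applicable entry, compared by version tuple.
    return max(eligible, key=lambda entry: entry[0])[1]


# Hypothetical history for one endpoint: the threshold (in ms) was tightened
# twice as the endpoint improved. These numbers are made up for illustration.
ttfb_history = {"12.5.0": 1500.0, "13.1.0": 800.0}

old_env = threshold_for("12.10.3-ee", ttfb_history, default=2000.0)
new_env = threshold_for("13.2.0-pre", ttfb_history, default=2000.0)
ancient_env = threshold_for("12.0.0", ttfb_history, default=2000.0)
```

With a lookup like this, each test would keep a short history of its thresholds instead of a single value, so an older environment is judged against the threshold that was current for its version. Whether that history lives in the test files or in a separate data file is an open design question.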