Upgrade fleeting/taskscaler: fixes reservation/unavailability instance churn
## What does this MR do?
Similar to !4818 (merged), a few more cases where the calculation for required instances was incorrect when handling reservations and instances being deleted.
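To make the class of bug concrete, here is a minimal sketch (not taskscaler's actual code; the function name and parameters are hypothetical) of the kind of required-instance calculation this MR corrects: capacity on instances that are being deleted must not count as available, and reserved capacity must still be satisfied, otherwise the scaler under- or over-provisions.

```go
package main

import "fmt"

// requiredInstances is a hypothetical illustration of the calculation:
// how many new instances are needed to cover idle and reserved capacity,
// once capacity on deleting instances is excluded.
func requiredInstances(idle, reserved, capacityPerInstance, available, deleting int) int {
	// capacity still usable once deleting instances are excluded
	usable := available - deleting*capacityPerInstance
	needed := idle + reserved
	if usable >= needed {
		return 0
	}
	// round up to whole instances
	return (needed - usable + capacityPerInstance - 1) / capacityPerInstance
}

func main() {
	// 1 idle slot wanted, no reservations, capacity 1 per instance,
	// 1 slot nominally available but that instance is being deleted:
	// one replacement instance is required.
	fmt.Println(requiredInstances(1, 0, 1, 1, 1)) // prints 1

	// Same demand, but 2 slots available and nothing deleting:
	// no new instance needed.
	fmt.Println(requiredInstances(1, 0, 1, 2, 0)) // prints 0
}
```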
## Why was this MR needed?
When `idle_scale` was > 0, it was occasionally hard to completely remove instances that were no longer required: a new instance was often created in place of the instance being removed.
## What's the best way to test this MR?
taskscaler has updated tests for this: gitlab-org/fleeting/taskscaler!52 (merged)
Manual QA:
```toml
concurrent = 4

[[runners]]
  url = "https://gitlab.com"
  executor = "docker-autoscaler"

  [runners.docker]
    image = "busybox:latest"

  [runners.autoscaler]
    capacity_per_instance = 1
    max_use_count = 100
    max_instances = 5
    plugin = "aws"

    [runners.autoscaler.plugin_config]
      name = "linux-test"
      region = "us-west-2"

    [runners.autoscaler.connector_config]
      username = "ec2-user"
      key_path = "key.pem"
      use_static_credentials = true
      keepalive = "0s"
      timeout = "10m0s"
      use_external_addr = true

    [[runners.autoscaler.policy]]
      idle_count = 1
      idle_time = "5m0s"
      scale_factor = 0.0
      scale_factor_limit = 0
```
- Wait for the idle instance to come up.
- Create a job.
- Wait for a new idle instance to come up.
- Observe the second idle instance attempting removal after the job finishes.
To trigger the old behaviour before this fix, you might need to keep retrying until you hit a situation where `UnavailableCapacity` is positive, which prevented scaling down and brought up a new instance.
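The churn scenario above can be sketched as follows. This is a hypothetical illustration, not taskscaler's actual scale-up logic: the function names and the exact formula are assumptions, but they show why counting unavailable (draining) capacity against the pool makes a healthy pool look short and spawns a spurious replacement.

```go
package main

import "fmt"

// scaleUpNeededBuggy illustrates the pre-fix behaviour (hypothetical):
// subtracting unavailable capacity makes an instance that is merely
// being removed look like missing capacity, triggering a replacement.
func scaleUpNeededBuggy(idle, availableCapacity, unavailableCapacity int) bool {
	return availableCapacity-unavailableCapacity < idle
}

// scaleUpNeededFixed illustrates the intent after the fix (hypothetical):
// draining instances no longer count against available capacity.
func scaleUpNeededFixed(idle, availableCapacity int) bool {
	return availableCapacity < idle
}

func main() {
	// idle_count = 1, one instance still available, one instance draining:
	fmt.Println(scaleUpNeededBuggy(1, 1, 1)) // prints true: spurious scale-up, churn
	fmt.Println(scaleUpNeededFixed(1, 1))    // prints false: no replacement needed
}
```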