K6 K8s Operator Initializer not spawning runners

I have the K6 operator installed in a Kubernetes cluster and am using Helm charts to create the required ConfigMap and K6 custom resource to kick off a load test. I can run tests successfully in my local cluster, but when deploying to a staging cluster the tests never run.
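For reference, my Helm templates render roughly the following two resources. This is a minimal sketch rather than the actual chart: the script body is a placeholder, and the names and spec mirror the kubectl describe output further down.

# Sketch of the rendered ConfigMap that holds the k6 test script
# (the script contents here are a placeholder).
apiVersion: v1
kind: ConfigMap
metadata:
  name: load-test-configmap
  namespace: k6-operator
data:
  script: |
    // k6 script with the constant-arrival-rate scenario goes here
---
# Sketch of the K6 custom resource that points at the ConfigMap above.
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: ingestion-load-shedder-2022-09-2
  namespace: k6-operator
spec:
  parallelism: 10
  script:
    configMap:
      name: load-test-configmap
      file: script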

I was able to verify that the new custom resource was created and acknowledged, as I can see the initializer pod start:

> kubectl get pods
pod/ingestion-load-shedder-2022-09-2-initializer-bj8nq   0/1     Completed   0          10m

The initializer seems to complete straight away, and in its logs I can see the config for my test (non-relevant fields excluded):

{
  ...
  "scenarios": {
    "constant": {
      "executor": "constant-arrival-rate",
      "rate": 2500,
      "timeUnit": "1m0s",
      "duration": "5m0s",
      "preAllocatedVUs": 1000,
    }
  },
  "totalDuration": "5m30s",
  "maxVUs": 1000
}

but I don't see any runners. These are the logs I see in the manager pod:

> kubectl logs k6-operator-controller-manager-6cdcf8cd9c-7w4xp -c manager
...
2022-09-27T20:23:18.988Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "k6-operator/ingestion-load-shedder-2022-09-2"}
2022-09-27T20:23:18.988Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "ingestion-load-shedder-2022-09-2", "namespace": "k6-operator"}

For comparison, these are the manager logs when running the test locally:

2022-09-28T18:40:56.876Z	INFO	controllers.K6	Reconcile(); stage = 	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:40:56.877Z	INFO	controllers.K6	Initialize test	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:40:56.877Z	INFO	controllers.K6	Changing stage of K6 status to initialization	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:40:57.070Z	INFO	controllers.K6	Initializer job is ready to start with image `ghcr.io/grafana/operator:latest-runner` and command `[sh -c k6 archive --log-output=none /test/script -O ./script.archived.tar  && k6 inspect --execution-requirements --log-output=none ./script.archived.tar]`	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:40:58.573Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:03.574Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:08.575Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:13.575Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:18.574Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:23.872Z	INFO	controllers.K6	k6 inspect: {External:{Loadimpact:{Name: ProjectID:0}} TotalDuration:1m30s MaxVUs:1000 Thresholds:map[]}	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:23.872Z	INFO	controllers.K6	Changing stage of K6 status to initialized	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:23.970Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "ingestion-load-shedder-2022-09-1", "namespace": "k6-operator"}
2022-09-28T18:41:23.971Z	INFO	controllers.K6	Reconcile(); stage = initialized	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:23.971Z	INFO	controllers.K6	Creating test jobs	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:23.971Z	INFO	controllers.K6	Launching k6 test #1	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:23.972Z	INFO	controllers.K6	Runner job is ready to start with image `ghcr.io/grafana/operator:latest-runner` and command `[k6 run --quiet --execution-segment=0:1/2 --execution-segment-sequence=0,1/2,1 /test/script --address=0.0.0.0:6565 --paused --tag instance_id=1 --tag job_name=ingestion-load-shedder-2022-09-1-1]`	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:24.270Z	INFO	controllers.K6	Launching k6 test #2	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:24.271Z	INFO	controllers.K6	Runner job is ready to start with image `ghcr.io/grafana/operator:latest-runner` and command `[k6 run --quiet --execution-segment=1/2:1 --execution-segment-sequence=0,1/2,1 /test/script --address=0.0.0.0:6565 --paused --tag instance_id=2 --tag job_name=ingestion-load-shedder-2022-09-1-2]`	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:24.469Z	INFO	controllers.K6	Changing stage of K6 status to created	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:24.477Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "ingestion-load-shedder-2022-09-1", "namespace": "k6-operator"}
2022-09-28T18:41:24.477Z	INFO	controllers.K6	Reconcile(); stage = created	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:24.477Z	INFO	controllers.K6	Waiting for pods to get ready	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:24.478Z	INFO	controllers.K6	0/2 pods ready	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}
2022-09-28T18:41:29.479Z	INFO	controllers.K6	2/2 pods ready	{"k6": "k6-operator/ingestion-load-shedder-2022-09-1"}

Any ideas what may be causing this discrepancy?

Furthermore, running a describe on the K6 custom resource that was created shows:

Name:         ingestion-load-shedder-2022-09-2
Namespace:    k6-operator
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: load-testing
              meta.helm.sh/release-namespace: k6-operator
API Version:  k6.io/v1alpha1
Kind:         K6
Metadata:
  Creation Timestamp:  2022-09-27T20:22:54Z
  Generation:          1
  Managed Fields:
    API Version:  k6.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      ...
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-09-27T20:22:54Z
    API Version:  k6.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:stage:
    Manager:         manager
    Operation:       Update
    Time:            2022-09-27T20:22:54Z
  Resource Version:  1461641101
  UID:               c8774986-42d2-4390-b37d-edbbd158aa71
Spec:
  Parallelism:  10
  Runner:
    Env:
      Name:   API_KEY
      Value:  ...
    Metadata:
      Labels:
        Run - Identifier:  ingestion-load-shedder-2022-09-2
        Type:              k6-operator-runner
    Resources:
      Limits:
        Cpu:     400m
        Memory:  500Mi
      Requests:
        Cpu:     400m
        Memory:  500Mi
  Script:
    Config Map:
      File:  script
      Name:  load-test-configmap
  Separate:  false
  Starter:
    Metadata:
      Labels:
        Run - Identifier:  ingestion-load-shedder-2022-09-2
        Type:              k6-operator-starter
Status:
  Stage:  initialization
Events:   <none>

Of note is Status.Stage, which is still stuck at initialization after 22 hours.

Update: I've noticed that when attempting to run a test, the manager pod gets restarted, hence the truncated logs. To see the logs of the crashed pod I ran:

> k logs k6-operator-controller-manager-6cdcf8cd9c-7w4xp -c manager --previous
2022-09-28T20:26:08.099Z	INFO	controllers.K6	Reconcile(); stage = 	{"k6": "k6-operator/ingestion-load-shedder-2022-09-3"}
2022-09-28T20:26:08.099Z	INFO	controllers.K6	Initialize test	{"k6": "k6-operator/ingestion-load-shedder-2022-09-3"}
2022-09-28T20:26:08.099Z	INFO	controllers.K6	Changing stage of K6 status to initialization	{"k6": "k6-operator/ingestion-load-shedder-2022-09-3"}
2022-09-28T20:26:08.112Z	INFO	controllers.K6	Initializer job is ready to start with image `ghcr.io/grafana/operator:latest-runner` and command `[sh -c k6 archive --log-output=none /test/script -O ./script.archived.tar  && k6 inspect --execution-requirements --log-output=none ./script.archived.tar]`	{"k6": "k6-operator/ingestion-load-shedder-2022-09-3"}

That last command is crashing the pod for some reason, but I don't see any other errors.

Hi @jaimebarriga,
Welcome to the forum :wave:

Firstly, Helm is not yet fully supported by the k6-operator: we have an open PR for that, but it hasn't been merged yet. If you're relying on that branch, perhaps it makes sense to try without it as well and see if that helps?

In your first message you mentioned that the operator works correctly in one of your clusters but not in the other. What is the difference between these clusters?

I strongly suggest checking and verifying these two points :point_up:

Otherwise, it's a bit hard to follow all the steps at the moment :confused: If I understood correctly, your k6-operator-controller-manager pod crashed, right? If so, can you please paste the full logs (the ones you posted don't contain any errors) and also the state of that pod: exactly which error it terminated with and its event logs. If there are logs from the other k6 pods (initializer, starter, runner), please paste them as well.

PS I noticed you opened the bug report on GitHub as well: please avoid duplicating information :slight_smile:

Hi @olha,

Thank you very much for your response! I was able to figure out the issue. I had pasted the full logs (there were no error logs), which is what made this hard to debug, but when I ran:

> kubectl get pod k6-operator-controller-manager-6cdcf8cd9c-7w4xp --output=yaml
...
- containerID: containerd://7832b6bd26232b5dbfc6114fcff6e881e20b7ee7aea22e3d8383a1eb159a218a
    image: ghcr.io/grafana/operator:latest
    imageID: ghcr.io/grafana/operator@sha256:97fef2753dcef151bd5b29864074c3dffadd54837bc33b995cc6ea44779f9af0
    lastState:
      terminated:
        containerID: containerd://2166944558bea0c699c6e0814f0946e1b4253b44cb3f521ac9efd5ded6d5c6ab
        exitCode: 137
        finishedAt: "2022-09-28T20:26:11Z"
        reason: OOMKilled
        startedAt: "2022-09-28T19:32:52Z"
    name: manager
    ready: true
    restartCount: 6
    started: true
    state:
      running:
        startedAt: "2022-09-28T20:26:13Z"
...

I saw that the manager container had been OOMKilled. I increased its memory limit and it worked.
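For anyone else hitting this, the change boils down to giving the manager container of the k6-operator-controller-manager Deployment a higher memory limit. Below is a sketch of that kind of override; the request/limit numbers are illustrative placeholders (not the operator defaults), so size them for your own cluster.

# Illustrative resource override for the operator Deployment: raise the
# memory limit on the "manager" container. Values below are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k6-operator-controller-manager
  namespace: k6-operator
spec:
  template:
    spec:
      containers:
        - name: manager
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 512Mi

I applied it through the same mechanism that manages the operator manifests; something like a strategic merge patch with kubectl patch against the Deployment should work as well.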

Apologies for the duplicate information; I'll post my resolution there and close the bug.


Oh, that makes sense! Thanks for updating and closing the issue. Glad to hear it works for you :smile: