Conversation

@upodroid
Member

@upodroid upodroid commented Oct 14, 2025

/hold

Requires kubernetes/kops#17671 to be merged first

Closes #35699

A concurrency limit is in place while the boskos pool is growing; it will then be set to 4.

/cc @alaypatel07 @BenTheElder @hakman

Also, ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2 will replace ci-kubernetes-e2e-gce-100-node-dra-extended-resources-with-workload once it's green.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 14, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: upodroid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/config Issues or PRs related to code in /config area/jobs sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/testing Categorizes an issue or PR as relevant to SIG Testing. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Oct 14, 2025
base_ref: master
path_alias: k8s.io/kubernetes
- org: alaypatel07
- org: kubernetes
Contributor

Can you please drop this? It's intended to use my branch until we have merged kubernetes/perf-tests#3629.

- org: kubernetes
repo: perf-tests
base_ref: dra-extended-resources
base_ref: master
Contributor


Can you please drop this? It's intended to use my branch until we have merged kubernetes/perf-tests#3629.

Member Author


ok

value: "true"
- name: PROMETHEUS_PVC_STORAGE_CLASS
value: "ssd-csi"
- name: CLOUD_PROVIDER
Contributor


The DRA tests also require enabling certain feature gates. The config above sets:

            - --env=KUBE_FEATURE_GATES=DynamicResourceAllocation=true

In the case of kops, should we be setting KOPS_FEATURE_FLAGS with the above?

Member Author


The DRA feature should already be enabled automatically; kops enables all GA/beta feature gates by default.
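For completeness, if a gate ever did need to be forced on, kops lets you set Kubernetes feature gates per component in the cluster spec rather than via the job's env. A minimal sketch (field names from the kops Cluster spec; the gate shown is illustrative, since per the comment above it is enabled by default):

```yaml
# Hypothetical kops Cluster spec fragment: explicitly enabling a feature gate
# on the API server and kubelet. Not required for DynamicResourceAllocation,
# since kops enables GA/beta gates by default.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
spec:
  kubeAPIServer:
    featureGates:
      DynamicResourceAllocation: "true"
  kubelet:
    featureGates:
      DynamicResourceAllocation: "true"
```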

@upodroid
Member Author

This is ready to be merged. I'll cancel the hold once the kops PR is merged.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2025
@BenTheElder
Member

It might make sense to move forward with a kube-up-based job in the short term; the kops PR has been ongoing for a few weeks now.

@alaypatel07
Contributor

It might make sense to move forward with a kube-up-based job in the short term; the kops PR has been ongoing for a few weeks now.

+1. I would really like to push forward with a 5k-node DRA test; considering DRA GA'ed last release, having no scale test for it is a little concerning to me.

@upodroid
Member Author

upodroid commented Oct 28, 2025

I'll merge the other PR by the end of the week if the kops one is still held up

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 29, 2025
@upodroid
Member Author

This is ready to be merged

@alaypatel07
Contributor

Let's merge; I'll keep an eye on the results.

@alaypatel07
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2025
@alaypatel07
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2025
@k8s-ci-robot k8s-ci-robot merged commit e0a8788 into kubernetes:master Oct 29, 2025
6 checks passed
@k8s-ci-robot
Contributor

@upodroid: Updated the following 2 configmaps:

  • job-config configmap in namespace default at cluster test-infra-trusted using the following files:
    • key sig-scalability-periodic-dra.yaml using file config/jobs/kubernetes/sig-scalability/DRA/sig-scalability-periodic-dra.yaml
    • key sig-scalability-periodic-jobs.yaml using file config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml
    • key sig-scalability-release-blocking-jobs.yaml using file config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml
  • config configmap in namespace default at cluster test-infra-trusted using the following files:
    • key config.yaml using file config/prow/config.yaml

In response to this:

/hold

Requires kubernetes/kops#17671 to be merged first

Closes #35699

A concurrency limit is in place while the boskos pool is growing; it will then be set to 4.

/cc @alaypatel07 @BenTheElder @hakman

Also, ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2 will replace ci-kubernetes-e2e-gce-100-node-dra-extended-resources-with-workload once it's green.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pacoxu
Member

pacoxu commented Oct 30, 2025

https://testgrid.k8s.io/sig-scalability-dra#gce-dra-with-workload-master-scalability-5000 failed in last run.

"message": "plate:control-plane-us-east1-b--3o5tk3-1761793466\tok
Subnet:us-east1-scalability-k8s-local\tok
Network:scalability-k8s-local\tok
Not all resources deleted; waiting before reattempting deletion
\tDisk:b-etcd-events-scalability-k8s-local
\tDisk:b-etcd-main-scalability-k8s-local
Disk:b-etcd-main-scalability-k8s-local\tok
Disk:b-etcd-events-scalability-k8s-local\tok
Deleted kubectl config for scalability.k8s.local

Deleted cluster: \"scalability.k8s.local\"
I1030 04:33:08.055960   16769 gcs.go:105] gsutil -u k8s-infra-e2e-scale-5k-project rm -r gs://k8s-infra-e2e-scale-5k-project-state-212b
I1030 04:33:08.055984   16769 local.go:42] ⚙️ gsutil -u k8s-infra-e2e-scale-5k-project rm -r gs://k8s-infra-e2e-scale-5k-project-state-212b
Removing gs://k8s-infra-e2e-scale-5k-project-state-212b/...
I1030 04:33:09.148972   16769 gcs.go:105] gsutil -u k8s-infra-e2e-scale-5k-project rm -r gs://k8s-infra-e2e-scale-5k-project-staging-212b
I1030 04:33:09.148990   16769 local.go:42] ⚙️ gsutil -u k8s-infra-e2e-scale-5k-project rm -r gs://k8s-infra-e2e-scale-5k-project-staging-212b
BucketNotFoundException: 404 gs://k8s-infra-e2e-scale-5k-project-staging-212b bucket does not exist.
I1030 04:33:09.980969   16769 down.go:90] releasing boskos project
I1030 04:33:10.003680   16769 boskos.go:83] Boskos heartbeat func received signal to close
Error: exit status 255
+ EXIT_VALUE=1
+ set +o xtrace
Cleaning up after docker in docker.
================================================================================
Cleaning up after docker
Stopping Docker: dockerProgram process in pidfile '/var/run/docker-ssd.pid', 1 process(es), refused to die.
================================================================================
Done cleaning up after docker in docker.
{\"component\":\"entrypoint\",\"error\":\"wrapped process failed: exit status 1\",\"file\":\"sigs.k8s.io/prow/pkg/entrypoint/run.go:84\",\"func\":\"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun\",\"level\":\"error\",\"msg\":\"Error executing test process\",\"severity\":\"error\",\"time\":\"2025-10-30T04:33:30Z\"}
",

@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Oct 31, 2025
@alaypatel07
Contributor

I1103 03:53:58.922373   24318 warnings.go:110] "Warning: unknown field \"spec.selector.k8s-app\""
W1103 03:53:58.973455   24318 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: v1.Status{Failure, ServiceUnavailable}: no endpoints available for service "prometheus-k8s"
W1103 03:56:00.099056   24318 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: v1.Status{Failure, ServiceUnavailable}: error trying to reach service: read tcp 10.64.0.1:42726->10.64.0.2:9090: read: connection reset by peer
W1103 03:56:59.026127   24318 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: [binary v1.Status payload truncated]
https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-5000-node-dra-with-workload-ipalias-using-cl2/1985180145572384768/build-log.txt

@upodroid @BenTheElder the 5k DRA job has been failing consistently because the monitoring stack is not coming up.

    - allocatedResources:
        cpu: 2700m
        memory: 10Gi
      containerID: containerd://c2d602263deff3e69bb502e745bb6fcf038e28de3f8044c0eb8f793b801b4444
      image: gcr.io/k8s-testimages/quay.io/prometheus/prometheus:v2.40.0
      imageID: gcr.io/k8s-testimages/quay.io/prometheus/prometheus@sha256:eff669d70ee485a191a645caa269530fc4930d0b6c178390c1e1bb378fd200fc
      lastState:
        terminated:
          containerID: containerd://7f299e821c7506ffb863da4b82fd83db9d8bb6a62d6a5bc5bdf3b539763bb09f
          exitCode: 137
          finishedAt: "2025-11-03T04:04:49Z"
          message: |
          <redacted-for-brevity>
          reason: OOMKilled
          startedAt: "2025-11-03T04:04:10Z"
      name: prometheus

The prometheus pod is getting OOMKilled. How can we increase the resources?

@upodroid
Member Author

upodroid commented Nov 3, 2025

What is the DRA job doing that causes the OOM? We don't see it on the 5k job.

You need to bump the limits in the perf-tests repository.
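For illustration, bumping the limit would look something like this in the Prometheus resource used by the clusterloader2 monitoring stack. This is a sketch: the exact file location in perf-tests, the Prometheus Operator field layout, and the new limit value are all assumptions (the request values match the pod status quoted above).

```yaml
# Hypothetical fragment of the Prometheus custom resource deployed by the
# clusterloader2 monitoring stack: raising the memory limit for large
# (5k-node) runs. Values and file location are assumptions, not the
# actual perf-tests defaults.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  resources:
    requests:
      cpu: 2700m
      memory: 10Gi
    limits:
      memory: 30Gi
```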

@alaypatel07
Contributor

What is the DRA job doing that causes the OOM? We don't see it on the 5k job.

@upodroid the DRA job has not even started yet; the test execution workflow is:

  1. Set up Prometheus
  2. Check that Prometheus is healthy
  3. Set up dependencies
  4. Check that dependencies are healthy
  5. Start creating DRA workloads

We are hitting this issue in step 2.

You need to bump the limits in the perf-tests repository.

How many resources do we have on the monitoring VM, and to what number can I raise these memory limits?

@alaypatel07
Contributor

get the small scale one green first: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2/1985329983672815616

That small-scale job is not relevant to the 5k test; it uses the extended resources feature, and we have triaged it here: kubernetes/perf-tests#3641 (comment).

The small-scale job that is relevant to the 5k-node test is already green: https://testgrid.k8s.io/sig-scalability-dra#gce-dra-with-workload-master-scalability-100

It got scheduled on the control-plane node, which has 96 cores and 360GB of RAM:

https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-5000-node-dra-with-workload-ipalias-using-cl2/1985180145572384768/artifacts/cluster-info/monitoring/

+ [[ true == \t\r\u\e ]]
+ KUBETEST2_ARGS+=("--down")
+ export PROMETHEUS_KUBE_PROXY_SELECTOR_KEY=k8s-app
+ PROMETHEUS_KUBE_PROXY_SELECTOR_KEY=k8s-app
+ export PROMETHEUS_SCRAPE_APISERVER_ONLY=true
+ PROMETHEUS_SCRAPE_APISERVER_ONLY=true
+ export CL2_PROMETHEUS_TOLERATE_MASTER=true
+ CL2_PROMETHEUS_TOLERATE_MASTER=true
+ [[ gce == \a\w\s ]]

I see these in the 5k non-DRA test (https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-5000-ipalias-using-cl2/1985285839373996032/build-log.txt). I can try to use the same env variables, but we need the kubelet metrics for debugging kubelet/driver issues.
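If those same variables were carried over, the job's container env in the prow config would look roughly like this. A sketch only, using the variable names from the 5k non-DRA job's build log above; note the trade-off that scraping the apiserver only would drop the kubelet metrics this job needs.

```yaml
# Hypothetical prow job env fragment; names taken from the 5k non-DRA
# job's build log. PROMETHEUS_SCRAPE_APISERVER_ONLY=true reduces
# Prometheus load at 5k nodes but loses kubelet/driver metrics.
env:
  - name: PROMETHEUS_KUBE_PROXY_SELECTOR_KEY
    value: "k8s-app"
  - name: PROMETHEUS_SCRAPE_APISERVER_ONLY
    value: "true"
  - name: CL2_PROMETHEUS_TOLERATE_MASTER
    value: "true"
```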

@alaypatel07
Contributor

@pacoxu there was a misconfiguration in this PR that was fixed by #35854. Starting with today's run, the DRA test should actually be invoked.
