Conversation

@upodroid
Member

@upodroid upodroid commented Oct 14, 2025

/hold

Requires kubernetes/kops#17671 to be merged first

Closes #35699

A concurrency limit is in place while the boskos pool is growing; it will then be set to 4.

/cc @alaypatel07 @BenTheElder @hakman

Also, ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2 will replace ci-kubernetes-e2e-gce-100-node-dra-extended-resources-with-workload once it's green.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 14, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: upodroid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/config Issues or PRs related to code in /config area/jobs sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/testing Categorizes an issue or PR as relevant to SIG Testing. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Oct 14, 2025
base_ref: master
path_alias: k8s.io/kubernetes
- org: alaypatel07
- org: kubernetes
Contributor

Can you please drop this? It's intended to use my branch until we have merged kubernetes/perf-tests#3629.

- org: kubernetes
repo: perf-tests
base_ref: dra-extended-resources
base_ref: master
Contributor


Can you please drop this? It's intended to use my branch until we have merged kubernetes/perf-tests#3629.

Member Author


ok

value: "true"
- name: PROMETHEUS_PVC_STORAGE_CLASS
value: "ssd-csi"
- name: CLOUD_PROVIDER
Contributor


The DRA tests also require enabling certain feature gates. The config above sets:

            - --env=KUBE_FEATURE_GATES=DynamicResourceAllocation=true

In the case of kops, should we be setting KOPS_FEATURE_FLAGS with the above?

Member Author


The DRA feature should already be enabled automatically; kops enables all GA/beta feature gates by default.
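For completeness, if a gate ever did need to be forced on, kops lets you set Kubernetes feature gates per component in the cluster spec rather than via the job's env. A minimal sketch (field names from the kops Cluster spec; the gate shown is illustrative, since per the comment above it is enabled by default):

```yaml
# Hypothetical kops Cluster spec fragment: explicitly enabling a feature gate
# on the API server and kubelet. Not required for DynamicResourceAllocation,
# since kops enables GA/beta gates by default.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
spec:
  kubeAPIServer:
    featureGates:
      DynamicResourceAllocation: "true"
  kubelet:
    featureGates:
      DynamicResourceAllocation: "true"
```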

@upodroid
Member Author

This is ready to be merged. I'll cancel the hold once the kops PR is merged.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2025
@BenTheElder
Member

It might make sense to move forward with a kube-up-based job in the short term; the kops PR has been ongoing for a few weeks now.

@alaypatel07
Contributor

It might make sense to move forward with a kube-up-based job in the short term; the kops PR has been ongoing for a few weeks now.

+1. I would really like to push forward with a 5k-node DRA test; considering DRA GA'ed last release, having no scale test for it is a little concerning to me.

@upodroid
Member Author

upodroid commented Oct 28, 2025

I'll merge the other PR by the end of the week if the kops one is still held up

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 29, 2025
@upodroid
Member Author

This is ready to be merged

@alaypatel07
Contributor

Let's merge; I'll keep an eye on the results.

@alaypatel07
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2025
@alaypatel07
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2025
@k8s-ci-robot k8s-ci-robot merged commit e0a8788 into kubernetes:master Oct 29, 2025
6 checks passed
@k8s-ci-robot
Contributor

@upodroid: Updated the following 2 configmaps:

  • job-config configmap in namespace default at cluster test-infra-trusted using the following files:
    • key sig-scalability-periodic-dra.yaml using file config/jobs/kubernetes/sig-scalability/DRA/sig-scalability-periodic-dra.yaml
    • key sig-scalability-periodic-jobs.yaml using file config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml
    • key sig-scalability-release-blocking-jobs.yaml using file config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml
  • config configmap in namespace default at cluster test-infra-trusted using the following files:
    • key config.yaml using file config/prow/config.yaml

In response to this:

/hold

Requires kubernetes/kops#17671 to be merged first

Closes #35699

A concurrency limit is in place while the boskos pool is growing; it will then be set to 4.

/cc @alaypatel07 @BenTheElder @hakman

Also, ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2 will replace ci-kubernetes-e2e-gce-100-node-dra-extended-resources-with-workload once it's green.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pacoxu
Member

pacoxu commented Oct 30, 2025

https://testgrid.k8s.io/sig-scalability-dra#gce-dra-with-workload-master-scalability-5000 failed in last run.

"message": "plate:control-plane-us-east1-b--3o5tk3-1761793466\tok
Subnet:us-east1-scalability-k8s-local\tok
Network:scalability-k8s-local\tok
Not all resources deleted; waiting before reattempting deletion
\tDisk:b-etcd-events-scalability-k8s-local
\tDisk:b-etcd-main-scalability-k8s-local
Disk:b-etcd-main-scalability-k8s-local\tok
Disk:b-etcd-events-scalability-k8s-local\tok
Deleted kubectl config for scalability.k8s.local

Deleted cluster: \"scalability.k8s.local\"
I1030 04:33:08.055960   16769 gcs.go:105] gsutil -u k8s-infra-e2e-scale-5k-project rm -r gs://k8s-infra-e2e-scale-5k-project-state-212b
I1030 04:33:08.055984   16769 local.go:42] ⚙️ gsutil -u k8s-infra-e2e-scale-5k-project rm -r gs://k8s-infra-e2e-scale-5k-project-state-212b
Removing gs://k8s-infra-e2e-scale-5k-project-state-212b/...
I1030 04:33:09.148972   16769 gcs.go:105] gsutil -u k8s-infra-e2e-scale-5k-project rm -r gs://k8s-infra-e2e-scale-5k-project-staging-212b
I1030 04:33:09.148990   16769 local.go:42] ⚙️ gsutil -u k8s-infra-e2e-scale-5k-project rm -r gs://k8s-infra-e2e-scale-5k-project-staging-212b
BucketNotFoundException: 404 gs://k8s-infra-e2e-scale-5k-project-staging-212b bucket does not exist.
I1030 04:33:09.980969   16769 down.go:90] releasing boskos project
I1030 04:33:10.003680   16769 boskos.go:83] Boskos heartbeat func received signal to close
Error: exit status 255
+ EXIT_VALUE=1
+ set +o xtrace
Cleaning up after docker in docker.
================================================================================
Cleaning up after docker
Stopping Docker: dockerProgram process in pidfile '/var/run/docker-ssd.pid', 1 process(es), refused to die.
================================================================================
Done cleaning up after docker in docker.
{\"component\":\"entrypoint\",\"error\":\"wrapped process failed: exit status 1\",\"file\":\"sigs.k8s.io/prow/pkg/entrypoint/run.go:84\",\"func\":\"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun\",\"level\":\"error\",\"msg\":\"Error executing test process\",\"severity\":\"error\",\"time\":\"2025-10-30T04:33:30Z\"}
",

@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Oct 31, 2025
@alaypatel07
Contributor

I1103 03:53:58.922373   24318 warnings.go:110] "Warning: unknown field \"spec.selector.k8s-app\""
W1103 03:53:58.973455   24318 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: v1.Status{Failure, ServiceUnavailable}: no endpoints available for service "prometheus-k8s"
W1103 03:56:00.099056   24318 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: v1.Status{Failure, ServiceUnavailable}: error trying to reach service: read tcp 10.64.0.1:42726->10.64.0.2:9090: read: connection reset by peer
W1103 03:56:59.026127   24318 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: [binary v1.Status payload truncated]
https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-5000-node-dra-with-workload-ipalias-using-cl2/1985180145572384768/build-log.txt

@upodroid @BenTheElder the 5k DRA job has been failing consistently because the monitoring stack is not coming up.

    - allocatedResources:
        cpu: 2700m
        memory: 10Gi
      containerID: containerd://c2d602263deff3e69bb502e745bb6fcf038e28de3f8044c0eb8f793b801b4444
      image: gcr.io/k8s-testimages/quay.io/prometheus/prometheus:v2.40.0
      imageID: gcr.io/k8s-testimages/quay.io/prometheus/prometheus@sha256:eff669d70ee485a191a645caa269530fc4930d0b6c178390c1e1bb378fd200fc
      lastState:
        terminated:
          containerID: containerd://7f299e821c7506ffb863da4b82fd83db9d8bb6a62d6a5bc5bdf3b539763bb09f
          exitCode: 137
          finishedAt: "2025-11-03T04:04:49Z"
          message: |
          <redacted-for-brevity>
          reason: OOMKilled
          startedAt: "2025-11-03T04:04:10Z"
      name: prometheus

The prometheus pod is getting OOMKilled. How can we increase the resources?

@upodroid
Member Author

upodroid commented Nov 3, 2025

What is the DRA job doing that causes the OOM? We don't see it on the 5k job.

You need to bump the limits in the perf-tests repository.
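For illustration, bumping the limit would look something like this in the Prometheus resource used by the clusterloader2 monitoring stack. This is a sketch: the exact file location in perf-tests, the Prometheus Operator field layout, and the new limit value are all assumptions (the request values match the pod status quoted above).

```yaml
# Hypothetical fragment of the Prometheus custom resource deployed by the
# clusterloader2 monitoring stack: raising the memory limit for large
# (5k-node) runs. Values and file location are assumptions, not the
# actual perf-tests defaults.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  resources:
    requests:
      cpu: 2700m
      memory: 10Gi
    limits:
      memory: 30Gi
```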

@alaypatel07
Contributor

What is the DRA job doing that causes the OOM? We don't see it on the 5k job.

@upodroid the DRA job has not even started yet; the test execution workflow is:

  1. Set up Prometheus
  2. Check that Prometheus is healthy
  3. Set up dependencies
  4. Check that dependencies are healthy
  5. Start creating DRA workloads

We are hitting this issue in step 2.

You need to bump the limits in the perf-tests repository.

How many resources do we have on the monitoring VM, and to what number can I raise these memory limits?

@alaypatel07
Contributor

get the small scale one green first: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2/1985329983672815616

That small-scale job is not relevant to the 5k test; it uses the extended resources feature, and we have triaged it here: kubernetes/perf-tests#3641 (comment).

The small-scale job that is relevant to the 5k-node test is already green: https://testgrid.k8s.io/sig-scalability-dra#gce-dra-with-workload-master-scalability-100

It got scheduled on the control-plane node, which has 96 cores and 360GB of RAM:

https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-5000-node-dra-with-workload-ipalias-using-cl2/1985180145572384768/artifacts/cluster-info/monitoring/

+ [[ true == \t\r\u\e ]]
+ KUBETEST2_ARGS+=("--down")
+ export PROMETHEUS_KUBE_PROXY_SELECTOR_KEY=k8s-app
+ PROMETHEUS_KUBE_PROXY_SELECTOR_KEY=k8s-app
+ export PROMETHEUS_SCRAPE_APISERVER_ONLY=true
+ PROMETHEUS_SCRAPE_APISERVER_ONLY=true
+ export CL2_PROMETHEUS_TOLERATE_MASTER=true
+ CL2_PROMETHEUS_TOLERATE_MASTER=true
+ [[ gce == \a\w\s ]]

I see these in the 5k non-DRA test (https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-5000-ipalias-using-cl2/1985285839373996032/build-log.txt). I can try to use the same env variables, but we need the kubelet metrics for debugging kubelet/driver issues.
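If those same variables were carried over, the job's container env in the prow config would look roughly like this. A sketch only, using the variable names from the 5k non-DRA job's build log above; note the trade-off that scraping the apiserver only would drop the kubelet metrics this job needs.

```yaml
# Hypothetical prow job env fragment; names taken from the 5k non-DRA
# job's build log. PROMETHEUS_SCRAPE_APISERVER_ONLY=true reduces
# Prometheus load at 5k nodes but loses kubelet/driver metrics.
env:
  - name: PROMETHEUS_KUBE_PROXY_SELECTOR_KEY
    value: "k8s-app"
  - name: PROMETHEUS_SCRAPE_APISERVER_ONLY
    value: "true"
  - name: CL2_PROMETHEUS_TOLERATE_MASTER
    value: "true"
```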

@alaypatel07
Contributor

@pacoxu there was a misconfiguration in this PR that was fixed by #35854. Starting with today's run, the DRA test should actually be invoked.
