create new 5k dra job #35700
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: upodroid. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
```yaml
base_ref: master
path_alias: k8s.io/kubernetes
- org: alaypatel07
- org: kubernetes
```
Can you please drop this? It's intended to use my branch until we have merged kubernetes/perf-tests#3629.
```yaml
- org: kubernetes
  repo: perf-tests
  base_ref: dra-extended-resources
base_ref: master
```
Can you please drop this? It's intended to use my branch until we have merged kubernetes/perf-tests#3629.
ok
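For context, a minimal sketch of what the resulting extra_refs stanza might look like once the personal-branch override is dropped (field values taken from the snippets above; all other job fields omitted):

```yaml
extra_refs:
- org: kubernetes
  repo: kubernetes
  base_ref: master
  path_alias: k8s.io/kubernetes
- org: kubernetes
  repo: perf-tests
  base_ref: master
```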
config/jobs/kubernetes/sig-scalability/DRA/sig-scalability-periodic-dra.yaml
```yaml
value: "true"
- name: PROMETHEUS_PVC_STORAGE_CLASS
  value: "ssd-csi"
- name: CLOUD_PROVIDER
```
The DRA tests also require enabling certain feature gates. In the config above this is set:
- --env=KUBE_FEATURE_GATES=DynamicResourceAllocation=true
In the case of kops, should we be setting KOPS_FEATURE_FLAGS with the above as well?
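For illustration, a minimal sketch of one way that gate could be wired into the periodic's container env (the job name, image, and field layout here are placeholders, not taken from this PR):

```yaml
periodics:
- name: example-dra-periodic    # hypothetical name, for illustration only
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest   # placeholder image
      env:
      - name: KUBE_FEATURE_GATES
        value: "DynamicResourceAllocation=true"
```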
The DRA feature should be auto-enabled already; kops enables all GA/beta feature gates by default.
This is ready to be merged. I'll cancel the hold once the kops PR is merged.

It might make sense to move forward with a kube-up-based job in the short term; the kops PR has been ongoing for a few weeks now.

+1, I would really like to push forward with a 5k-node DRA test. Considering DRA GA'ed last release, the fact that we don't have any scale test is a little concerning to me.
I'll merge the other PR by the end of the week if the kops one is still held up.

This is ready to be merged.

Let's merge; I'll keep an eye on the results.

/lgtm

/hold cancel
@upodroid: Updated the following 2 configmaps:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

https://testgrid.k8s.io/sig-scalability-dra#gce-dra-with-workload-master-scalability-5000 failed in the last run.

@upodroid @BenTheElder the 5k DRA job has been failing consistently because the monitoring stack does not come up: the Prometheus pod is getting OOMKilled. How can we increase its resources?

What is the DRA job doing that causes the OOM? We don't see it on the 5k job. You need to bump the limits in the perf-tests repository.
@upodroid the DRA job has not even started yet; the workflow of test execution is:
We are hitting this issue in step 2.
How much memory do we have on the monitoring VM, and to what number can I increase these memory limits?
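For reference, a minimal sketch of where such a limit would typically be raised, assuming the monitoring stack is deployed via the prometheus-operator (the object name, namespace, and numbers below are placeholders; the actual manifests live in the perf-tests repository):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                # placeholder name
  namespace: monitoring    # placeholder namespace
spec:
  resources:
    requests:
      memory: 10Gi         # placeholder values; the right numbers depend on
    limits:                # how much memory the monitoring VM actually has
      memory: 100Gi
```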
Get the small-scale one green first: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2/1985329983672815616. It got scheduled on the control-plane node, which has 96 cores and 360GB of RAM.

The small-scale one is not relevant to the 5k test; it is using the extended-resources feature, and we have triaged it here: kubernetes/perf-tests#3641 (comment). The small-scale job that is relevant to the 5k-node test is already green: https://testgrid.k8s.io/sig-scalability-dra#gce-dra-with-workload-master-scalability-100

I see these in the 5k non-DRA test: https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-5000-ipalias-using-cl2/1985285839373996032/build-log.txt. I can try to use the same env variables, but we need the kubelet metrics for collecting kubelet/driver issues.
/hold
Requires kubernetes/kops#17671 to be merged first
Closes #35699
A concurrency limit is in place while the boskos pool is growing; after that, it will be capped at 4.
/cc @alaypatel07 @BenTheElder @hakman
Also, ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2 will replace ci-kubernetes-e2e-gce-100-node-dra-extended-resources-with-workload once it's green.
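As a rough illustration of the concurrency setting mentioned above (a sketch only: the job name, interval, and field layout are placeholders, not the exact contents of this PR), a Prow periodic caps concurrent runs via max_concurrency:

```yaml
periodics:
- name: ci-kubernetes-e2e-kops-gce-5000-dra-with-workload   # hypothetical name
  interval: 24h                                             # placeholder interval
  max_concurrency: 1   # temporary cap while the boskos pool grows;
                       # per the description above, this becomes 4 later
```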