Skip to content

Conversation

pratt4
Copy link
Collaborator

@pratt4 pratt4 commented Jul 21, 2025

Closes #1394

added retries for pipeline_validation.sh

@pratt4 pratt4 force-pushed the 1394-retry-pipleline-validation branch 2 times, most recently from 85fab33 to 2e630b3 Compare July 21, 2025 17:31
@pratt4 pratt4 marked this pull request as draft July 21, 2025 18:33
@pratt4 pratt4 force-pushed the 1394-retry-pipleline-validation branch from 2e630b3 to 973f5cf Compare July 22, 2025 15:40
@codecov-commenter
Copy link

codecov-commenter commented Jul 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.53%. Comparing base (0c0cdde) to head (42d37fa).

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #1411   +/-   ##
=========================================
  Coverage     46.53%   46.53%           
  Complexity      663      663           
=========================================
  Files            90       90           
  Lines          5823     5823           
  Branches        800      800           
=========================================
  Hits           2710     2710           
  Misses         2824     2824           
  Partials        289      289           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pratt4 pratt4 marked this pull request as ready for review July 22, 2025 18:00
@pratt4
Copy link
Collaborator Author

pratt4 commented Jul 22, 2025

Hi @bashir2
can you please review this pr whenever you get a moment
Thanks!

@bashir2 bashir2 self-requested a review July 25, 2025 19:15
Copy link
Collaborator

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pratt4 for fixing this; just some minor suggestions.

@pratt4
Copy link
Collaborator Author

pratt4 commented Jul 26, 2025

Hi @bashir2

Thanks for the review!
I'd appreciate it if you could clarify something related to the Docker setup:

In e2e-tests/Dockerfile, it copies the JAR to the root level (/parquet-tools-1.11.1.jar), but the pipeline_validation.sh script refers to ./controller-spark/parquet-tools-1.11.1.jar.
Could you confirm how this has been working outside of Cloud Build environment?

Is there a specific scenario I'm not aware of where the root-level JAR is actually being used? I'm asking this to decide the best place to copy the new parquet_utils.sh file.
or
please let me know if I'm missing something here

Thanks again!

@pratt4 pratt4 requested a review from bashir2 July 30, 2025 16:51
@bashir2
Copy link
Collaborator

bashir2 commented Jul 31, 2025

Hi @bashir2

Thanks for the review! I'd appreciate it if you could clarify something related to the Docker setup:

In e2e-tests/Dockerfile, it copies the JAR to the root level (/parquet-tools-1.11.1.jar), but the pipeline_validation.sh script refers to ./controller-spark/parquet-tools-1.11.1.jar. Could you confirm how this has been working outside of Cloud Build environment?

Is there a specific scenario I'm not aware of where the root-level JAR is actually being used? I'm asking this to decide the best place to copy the new parquet_utils.sh file. or please let me know if I'm missing something here

Thanks again!

You are right @pratt4 that the parquet-tools-1.11.1.jar file that is being copied in the e2e Dockerfiles is not actually used in the run. Instead parquet-tools-1.11.1.jar is grabbed from its original repo location at e2e-tests/controller-spark. To understand how it is working, please take a look at this Cloud Build documentation and the usage of the /workspace volume that I mentioned.

To show how this works (and also to propose a solution), I created this sample PR and commented out those COPY instructions. As you can see the e2e build passed. So to simplify things, let's forget about e2e docker images being self-contained and not do any COPY . Instead assume that we always run off of the actual repo that is copied to /workspace. That way we know where each of the scripts and parquet-tools-1.11.1.jar are. Please also document this as a comment in e2e Dockerfiles.

@pratt4
Copy link
Collaborator Author

pratt4 commented Aug 1, 2025

You are right @pratt4 that the parquet-tools-1.11.1.jar file that is being copied in the e2e Dockerfiles is not actually used in the run. Instead parquet-tools-1.11.1.jar is grabbed from its original repo location at e2e-tests/controller-spark. To understand how it is working, please take a look at this Cloud Build documentation and the usage of the /workspace volume that I mentioned.

To show how this works (and also to propose a solution), I created this sample PR and commented out those COPY instructions. As you can see the e2e build passed. So to simplify things, let's forget about e2e docker images being self-contained and not do any COPY . Instead assume that we always run off of the actual repo that is copied to /workspace. That way we know where each of the scripts and parquet-tools-1.11.1.jar are. Please also document this as a comment in e2e Dockerfiles.

Thanks @bashir2 for the detailed answer! Learned a lot from thiss.

I've made all the suggested changes.

However, my commit (this one) previously passed all pipeline checks which has all the changes (build link).

But after updating the PR branch by merging master into it, the pipeline started failing due to the patient count issue (Parquet tools returning empty counts for Patient).

I've retried restarting the build, but it continues to fail.

Could this be related to the recent introduction of the getCheckPatientEndpoint() guard rail or something else?

@bashir2
Copy link
Collaborator

bashir2 commented Aug 6, 2025

/gcbrun

@bashir2
Copy link
Collaborator

bashir2 commented Aug 6, 2025

But after updating the PR branch by merging master into it, the pipeline started failing due to the patient count issue (Parquet tools returning empty counts for Patient).

I've retried restarting the build, but it continues to fail.

Could this be related to the recent introduction of the getCheckPatientEndpoint() guard rail or something else?

Thanks for the changes @pratt4; regarding the e2e failures, I don't think it is because of the new CheckPatientEndpoint flag as there are recent PRs that passed e2e, e.g., #1424 and #1425. I triggered the e2e tests a few times on your PR today and it seems that it always fails at on one of the following steps:

  • Step 11: Run E2E Test for FHIR-search mode with HAPI source
  • Step 13: Run E2E Test for JDBC mode with HAPI source

And it always has an issue with the Patient count. I am not sure why this is happening and we need to debug this. At this point the only possibly useful suggestion that I have are the ones in my review below.

@pratt4 pratt4 force-pushed the 1394-retry-pipleline-validation branch from edaa07a to 42d37fa Compare August 8, 2025 02:53
@pratt4
Copy link
Collaborator Author

pratt4 commented Aug 8, 2025

@bashir2 thanks for the review and suggestions!!
I have made the suggested changes

few things I noticed

  1. 25358b3 for this commit I increased the retry count and that build passed, but I don’t think the retries were the reason(not 100% sure). The logs showed the correct Patient count appearing early;

one possible reason i could think is there was an environment/cache hiccup that cleared once the new commit forced a fresh workspace.

  1. One more place I noticed flaky ness was in test_fhir_sink
    for the commit edaa07a I observed a one-off mismatch (off by 1) in test_fhir_sink, I can add a small retry/backoff around that count as well. (this was fixed by restarting)
image

if you suggest its worth fixing then i prefer to add the test_fhir_sink retries in a separate PR to keep this one focused. Happy to include it here instead if you’d rather keep it together.

Thanks again!!

@bashir2
Copy link
Collaborator

bashir2 commented Aug 8, 2025

  1. One more place I noticed flaky ness was in test_fhir_sink
    for the commit edaa07a I observed a one-off mismatch (off by 1) in test_fhir_sink, I can add a small retry/backoff around that count as well. (this was fixed by restarting)
image if you suggest its worth fixing then i prefer to add the test_fhir_sink retries in a separate PR to keep this one focused. Happy to include it here instead if you’d rather keep it together.

Good point about the test_fhir_sink issue; indeed it can be an issue because sometimes it takes a while for HAPI to return the correct count. As you implied, it is a separate issue so let's leave it out of this PR. I'll send you a PR that possibly fixes that problem using a different approach.

Copy link
Collaborator

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pratt4 for you diligence and tests for this PR.

@bashir2 bashir2 merged commit 6f7a555 into google:master Aug 8, 2025
6 checks passed
@bashir2
Copy link
Collaborator

bashir2 commented Aug 8, 2025

  1. One more place I noticed flaky ness was in test_fhir_sink
    for the commit edaa07a I observed a one-off mismatch (off by 1) in test_fhir_sink, I can add a small retry/backoff around that count as well. (this was fixed by restarting)
image if you suggest its worth fixing then i prefer to add the test_fhir_sink retries in a separate PR to keep this one focused. Happy to include it here instead if you’d rather keep it together.

Good point about the test_fhir_sink issue; indeed it can be an issue because sometimes it takes a while for HAPI to return the correct count. As you implied, it is a separate issue so let's leave it out of this PR. I'll send you a PR that possibly fixes that problem using a different approach.

@pratt4 can you please take a look at #1426 whenever you can?

@pratt4
Copy link
Collaborator Author

pratt4 commented Aug 8, 2025

Sure @bashir2

Also please feel free to assign any bash related issues/cleanups to me...
Happy to help....

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor pipeline_validation.sh to use shared retry_rowcount() helper
3 participants