[CI] Validate scraped push commits via GitHub API #514


Merged: 2 commits merged into llvm:main on Jul 26, 2025

Conversation

@jriv01 (Contributor) commented Jul 21, 2025

As the GitHub Archive BigQuery dataset is known to be lossy, we likely overestimate the number of commits made to llvm/llvm-project without an associated pull request.

To remedy this, we can make calls to the GitHub Events API. While we want to avoid using the API to get information about every single commit made to LLVM, we can narrow our calls down to only the commits that have no pull request data available via BigQuery. From those calls, we can determine whether a "push" commit actually does have a pull request and, if it does, whether that pull request was approved.
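A minimal sketch of that check against GitHub's public REST endpoints, assuming the `requests` library and a `GITHUB_TOKEN` environment variable (the function names are placeholders, not this PR's code):

```python
# Hypothetical sketch: for each commit SHA that BigQuery reports as having no
# PR, ask the GitHub REST API for associated pull requests and, if one exists,
# whether it received an approving review.
import os
import requests

REPO = "llvm/llvm-project"
HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
}

def pull_request_for_commit(sha: str) -> dict | None:
    """Return the first PR associated with a commit, or None for a direct push."""
    url = f"https://api.github.com/repos/{REPO}/commits/{sha}/pulls"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    pulls = response.json()
    return pulls[0] if pulls else None

def was_approved(pr: dict) -> bool:
    """Check whether the PR received at least one approving review."""
    url = f"https://api.github.com/repos/{REPO}/pulls/{pr['number']}/reviews"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return any(review["state"] == "APPROVED" for review in response.json())
```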

@jriv01 (Contributor, Author) commented Jul 21, 2025

@lnihlen

@boomanaiden154 (Contributor) left a comment

Have you looked at using the GitHub GraphQL API at all?

I think you could get away with one or two API requests per day, query exactly the information you need (pull request associations, and maybe the commit description to check for PRs merged with SPR), and use barely any API quota per request.
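A sketch of the kind of single-commit query this suggests (the schema fields come from GitHub's public GraphQL API; the Python wrapper, helper name, and token handling are illustrative assumptions, not this PR's implementation):

```python
# Sketch of one GraphQL query per commit: fetch the associated PRs and the
# commit message body (to spot SPR-style merges) in a single request.
import os
import requests

QUERY = """
query($oid: GitObjectID!) {
  repository(owner: "llvm", name: "llvm-project") {
    object(oid: $oid) {
      ... on Commit {
        messageBody
        associatedPullRequests(first: 1) {
          nodes { number reviewDecision }
        }
      }
    }
  }
}
"""

def check_commit(sha: str) -> dict:
    """Return the commit's message body and associated PRs, if any."""
    response = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"oid": sha}},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"]["repository"]["object"]
```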

@jriv01 (Contributor, Author) commented Jul 25, 2025

> Have you looked at using the GitHub GraphQL API at all?

After looking into it, I agree that we could use the GraphQL API instead of the REST API. We would need at most two requests per day, assuming we query somewhere between 60 and 80 commits, and we'd stay well within the node and rate limits. I'll switch us over.
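As a sketch of how 60 to 80 commits could fit into one request, standard GraphQL field aliases let a single query cover one `object` per SHA (the function name and structure here are illustrative assumptions, not this PR's code):

```python
# Illustrative batching via GraphQL field aliases: one aliased `object` field
# per commit SHA, so a day's worth of commits fits in a single request while
# staying far below GitHub's node limit.
def build_batched_query(shas: list[str]) -> str:
    fields = "\n".join(
        f'c{i}: object(oid: "{sha}") {{ ... on Commit {{'
        f" associatedPullRequests(first: 1) {{ nodes {{ number reviewDecision }} }} }} }}"
        for i, sha in enumerate(shas)
    )
    return (
        'query { repository(owner: "llvm", name: "llvm-project") {'
        f" {fields} }} }}"
    )
```

The response can then be matched back to individual SHAs via the `c0`, `c1`, ... aliases.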

Separate thought, but since we can query the API with so few requests anyway, we may be able to use the GraphQL API as our primary source of information instead of GitHub Archive BigQuery. That way we'd also avoid having to verify and amend missing data. But I can look into that more and follow up in another PR; for now it's probably best to get the job working and start collecting data.

@boomanaiden154 (Contributor) commented

> Separate thought, but since we can query the API with so few requests anyway, we may be able to use the GraphQL API as our primary source of information instead of GitHub Archive BigQuery. That way we'd also avoid having to verify and amend missing data.

Yeah, that's the direction I was thinking. It makes things a lot simpler and ensures the data is up to date. That assumes the GitHub API doesn't return nonsense (which is more likely than you would think), but for commits to main without any filtering, I think we should be fine.

@boomanaiden154 (Contributor) left a comment

Two nits that need addressing; otherwise LGTM.

@boomanaiden154 merged commit a16b133 into llvm:main on Jul 26, 2025
5 checks passed