-
Notifications
You must be signed in to change notification settings - Fork 111
[CI] Validate scraped push commits via GitHub API #514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you looked at all at using the Github GraphQL API?
I think you could get away with one or two API requests per day, query the exact information you need (pull request associativity, maybe description to check for PRs merged with SPR) and use barely any API quota per request.
After looking into it, I agree that we could use the GraphQL API instead of the REST API. We would at most need 2 requests per day assuming we need to query somewhere between 60 - 80 commits and we'd stay well within the node & rate limits. Will switch us over. Separate thought but since we can query the API with so few requests anyway, we may be able to just use the GraphQL API as our primary source of information instead of the GitHub Archive BigQuery. We'd also be able to avoid having to verify and amend missing data that way. But I can look into that more and follow up in another PR, it's probably best to just get the job working and start collecting data for now. |
Yeah, that's the direction I was thinking. It makes things a lot simpler and ensures the data is up to date. That assumes the Github API doesn't return nonsense (which is more likely than you would think), but for commits to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two nits that need address, otherwise LGTM.
As GitHubArchive BigQuery is known to be lossy, it is likely that we currently overestimate the number of commits made to llvm/llvm-project without an associated pull request.
To remedy this, we can make calls to the GitHub Event API. While we want to avoid using the API to get information regarding every single commit made to LLVM, we can narrow our calls down to only commits that don't have any pull request data available via BigQuery. From those calls, we can determine if a "push" commit actually does have a pull request and, if it does, whether or not it has been approved.