Leverage gharchive.org dataset #20

@brillout

In the dataset, commits are included in the events called PushEvent (https://developer.github.com/v3/activity/events/types/#pushevent).

Among other fields, a PushEvent contains:

| Key | Type | Description |
| --- | --- | --- |
| commits | array | An array of commit objects describing the pushed commits. (The array includes a maximum of 20 commits. If necessary, you can use the Commits API to fetch additional commits. This limit is applied to timeline events only and isn't applied to webhook deliveries.) |
| commits[][sha] | string | The SHA of the commit. |
| commits[][message] | string | The commit message. |
| commits[][author] | object | The git author of the commit. |
| commits[][author][name] | string | The git author's name. |
| commits[][author][email] | string | The git author's email address. |
| commits[][url] | url | URL that points to the commit API resource. |
| commits[][distinct] | boolean | Whether this commit is distinct from any that have been pushed before. |
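
To make the fields above concrete, here is a minimal sketch of pulling commits out of PushEvents. It assumes each gharchive.org record is one JSON object per line with top-level `type`, `repo.name`, `created_at` and `payload.commits` keys; that layout and the helper name `iter_push_commits` are assumptions for illustration, not something confirmed in this issue.

```python
import json


def iter_push_commits(lines):
    """Yield (repo_name, pushed_at, commit) for each commit in PushEvent lines.

    Assumes one JSON event per line (assumed archive layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "PushEvent":
            continue  # only PushEvents carry a commits array
        repo_name = event.get("repo", {}).get("name")
        pushed_at = event.get("created_at")
        for commit in event.get("payload", {}).get("commits", []):
            yield repo_name, pushed_at, commit


# Tiny hand-written sample (not real data).
sample_lines = [
    json.dumps({
        "type": "PushEvent",
        "repo": {"name": "octocat/hello"},
        "created_at": "2019-01-01T12:00:00Z",
        "payload": {"commits": [{
            "sha": "abc123",
            "message": "Fix typo",
            "author": {"name": "Mona", "email": "mona@example.com"},
            "distinct": True,
        }]},
    }),
    json.dumps({"type": "WatchEvent", "repo": {"name": "octocat/hello"},
                "created_at": "2019-01-01T12:05:00Z", "payload": {}}),
]

rows = list(iter_push_commits(sample_lines))
for repo_name, pushed_at, commit in rows:
    print(repo_name, pushed_at, commit["sha"], commit["author"]["email"])
```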

There are several limitations:

  • A PushEvent contains a maximum of 20 commits. Any commit above this limit is simply missing from the dataset. Most PushEvents don't hit that limit and contain all their commits (something like 99%). The problem is initial pushes, which can contain several thousand commits. (E.g. a private repo moving to GitHub would have a first PushEvent with a high number of commits.) Missing out on these commits is not okay. We could use the GitHub API for initial PushEvents that have 20 commits (and potentially thousands of truncated commits). Missing out on commits of subsequent PushEvents is probably ok.
  • Commit dates are missing, but we do have the push date. So we could take the push date as a coarse approximation of the commit date (assuming that, most of the time, a git push happens within roughly the same time frame as the commits it contains). But we shouldn't use this approximation for an initial PushEvent that has 20 commits (and potentially thousands of truncated commits).
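
The two limitations above could be handled along these lines. This is only a sketch: it assumes the event dict has `created_at` in the form `2019-01-01T12:00:00Z` and a `payload.commits` list, and the function names and the `is_first_push` flag are hypothetical. How to know that an event is the first push of a repo is left to the caller.

```python
from datetime import datetime, timezone

MAX_COMMITS_PER_EVENT = 20  # limit stated in the PushEvent documentation


def hits_commit_limit(event):
    """True if the commits array is full, so further commits may be truncated."""
    commits = event.get("payload", {}).get("commits", [])
    return len(commits) >= MAX_COMMITS_PER_EVENT


def approximate_commit_dates(event, is_first_push):
    """Map each commit SHA to the push date as a coarse commit date.

    Returns None for a first push that hits the limit: such an event may hide
    thousands of older commits, so the caller should fall back to the GitHub
    API instead of trusting the push date.
    """
    if is_first_push and hits_commit_limit(event):
        return None
    pushed_at = datetime.strptime(
        event["created_at"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {c["sha"]: pushed_at for c in event["payload"]["commits"]}
```

A subsequent push that hits the limit still gets approximate dates here, on the reasoning that losing a few of its commits is probably ok.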

We could still use the dataset to get a list of repos per user. I expect this list of repos to be mostly exhaustive because:

  • I expect most repos to start public. (We can easily get stats for the ratio of repos that start private versus public, by checking whether the first PushEvent hits the 20-commit limit.)
  • If you contributed to a private repo, chances are fairly good that you keep contributing to it after it goes open source.
  • Small contributions (only a couple of commits) are very unlikely to be missing. (Small contribs most likely only happen in public repos, and it is very unlikely that a small contribution is lost because the commit array of a subsequent PushEvent was truncated.)
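
A rough sketch of both ideas, the repo list per user and the private/public ratio, could look like this. It assumes events are already parsed into dicts and sorted chronologically, and it identifies a user by commit author email, which is one possible choice rather than anything decided in this issue. All function names are hypothetical.

```python
from collections import defaultdict

MAX_COMMITS_PER_EVENT = 20


def repos_per_author(events):
    """Return {author_email: set of repo names} from PushEvent commits."""
    repos = defaultdict(set)
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        repo_name = event["repo"]["name"]
        for commit in event.get("payload", {}).get("commits", []):
            email = commit.get("author", {}).get("email")
            if email:
                repos[email].add(repo_name)
    return dict(repos)


def share_of_capped_first_pushes(events):
    """Fraction of repos whose first PushEvent hits the 20-commit limit.

    Used as a proxy for "probably started private". Assumes `events` is in
    chronological order. Returns None if there are no PushEvents.
    """
    first_push_capped = {}
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        repo_name = event["repo"]["name"]
        if repo_name not in first_push_capped:
            commits = event.get("payload", {}).get("commits", [])
            first_push_capped[repo_name] = len(commits) >= MAX_COMMITS_PER_EVENT
    if not first_push_capped:
        return None
    return sum(first_push_capped.values()) / len(first_push_capped)
```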

We can also use the dataset for repos that have a first PushEvent with fewer than 20 commits. If the first PushEvent has fewer than 20 commits, then we can be confident that the repo started public. Missing out on a couple of commits is then probably ok: the approximate commit stats would likely be good enough to categorize users as "maintainer"/"gold contrib"/"silver contrib"/"bronze contrib" and show a contribution timeline.
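
As a sketch of that categorization: count commits per (repo, author) only for repos whose first PushEvent stays under the limit, then map the count to a tier. The thresholds below are invented placeholders, since the issue does not define where one tier ends and the next begins. Events are assumed to be parsed dicts in chronological order, and the function names are hypothetical.

```python
from collections import Counter

MAX_COMMITS_PER_EVENT = 20

# Hypothetical thresholds (minimum commit count, label); not from the issue.
HYPOTHETICAL_TIERS = [
    (100, "maintainer"),
    (30, "gold contrib"),
    (10, "silver contrib"),
    (1, "bronze contrib"),
]


def categorize(commit_count):
    """Return the tier label for a commit count, or None for zero commits."""
    for minimum, label in HYPOTHETICAL_TIERS:
        if commit_count >= minimum:
            return label
    return None


def commit_counts(events):
    """Count commits per (repo, author_email) for repos that started public.

    A repo is treated as "started public" when its first PushEvent has fewer
    than 20 commits. Each (repo, sha) pair is counted once.
    """
    started_public = {}
    seen = set()
    counts = Counter()
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        repo_name = event["repo"]["name"]
        commits = event.get("payload", {}).get("commits", [])
        if repo_name not in started_public:
            started_public[repo_name] = len(commits) < MAX_COMMITS_PER_EVENT
        if not started_public[repo_name]:
            continue  # first push hit the limit: stats would be unreliable
        for commit in commits:
            key = (repo_name, commit["sha"])
            email = commit.get("author", {}).get("email")
            if key in seen or not email:
                continue
            seen.add(key)
            counts[(repo_name, email)] += 1
    return counts
```

Absolute counts are the simplest option; a share of the repo's total commits might separate maintainers from contributors better, at the cost of being more sensitive to truncated pushes.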

Labels: enhancement (New feature or request)
