Leverage gharchive.org dataset

In the dataset, commits are included in events of type PushEvent (https://developer.github.com/v3/activity/events/types/#pushevent).

Among other fields, a PushEvent contains:

| Key | Type | Description |
| --- | --- | --- |
| commits | array | An array of commit objects describing the pushed commits. (The array includes a maximum of 20 commits. If necessary, you can use the Commits API to fetch additional commits. This limit is applied to timeline events only and isn't applied to webhook deliveries.) |
| commits[][sha] | string | The SHA of the commit. |
| commits[][message] | string | The commit message. |
| commits[][author] | object | The git author of the commit. |
| commits[][author][name] | string | The git author's name. |
| commits[][author][email] | string | The git author's email address. |
| commits[][url] | url | URL that points to the commit API resource. |
| commits[][distinct] | boolean | Whether this commit is distinct from any that have been pushed before. |
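
To make the table concrete, here is a minimal sketch of flattening PushEvent commits into one record per commit. It assumes the archive is read as newline-delimited JSON with one event per line, and that each event carries top-level `type`, `repo.name` and `created_at` fields next to `payload.commits`; those outer field names are assumptions not listed in the table above, and `iter_push_commits` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_push_commits(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one flat record per commit found in PushEvent lines.

    Assumes each non-empty line is one JSON-encoded event.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "PushEvent":
            continue
        repo_name = event.get("repo", {}).get("name")  # assumed field
        pushed_at = event.get("created_at")            # assumed field
        for commit in event.get("payload", {}).get("commits", []):
            author = commit.get("author") or {}
            yield {
                "repo": repo_name,
                "pushed_at": pushed_at,
                "sha": commit.get("sha"),
                "author_name": author.get("name"),
                "author_email": author.get("email"),
                "distinct": commit.get("distinct"),
            }
```

For a downloaded, gzip-compressed archive file, something like `iter_push_commits(gzip.open(path, "rt", encoding="utf-8"))` should work, where `path` is whatever local file name you saved the archive under.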

There are several limitations:
- A PushEvent contains a maximum of 20 commits, so any commit beyond this limit is simply missing from the dataset. Most PushEvents (something like 99%) don't hit the limit and contain all their commits. The problem is initial pushes, which can contain several thousand commits (e.g. a private repo moving to GitHub would have a first PushEvent with a high number of commits). Missing out on these commits is not okay. We could use the GitHub API for initial PushEvents that have 20 commits (and potentially thousands of truncated commits). Missing out on commits of subsequent PushEvents is probably ok.
- Commit dates are missing, but we do have the push date. We could take the push date as a coarse approximation of the commit date (assuming that, most of the time, a git push happens within roughly the same time frame as its commits). But we shouldn't make this approximation for an initial PushEvent that has 20 commits (and potentially thousands of truncated commits).
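
The two limitations above can be sketched as follows. This is only an illustration: `is_possibly_truncated` and `approximate_commit_dates` are hypothetical helper names, the event's `created_at` field is assumed to hold the push date, and the caller is assumed to know whether an event is the first push of its repo.

```python
MAX_COMMITS_PER_EVENT = 20  # limit stated in the PushEvent documentation


def is_possibly_truncated(event: dict) -> bool:
    """A push that fills the commits array may have lost further commits."""
    commits = event.get("payload", {}).get("commits", [])
    return len(commits) >= MAX_COMMITS_PER_EVENT


def approximate_commit_dates(event: dict, is_first_push: bool) -> list:
    """Return (sha, approximate_date) pairs, using the push date as a proxy.

    Returns an empty list for a first push that hit the limit, because the
    push date says little about when those commits were actually made.
    """
    if is_first_push and is_possibly_truncated(event):
        return []
    pushed_at = event["created_at"]  # assumed to be the push timestamp
    commits = event.get("payload", {}).get("commits", [])
    return [(commit["sha"], pushed_at) for commit in commits]
```

Events for which this returns an empty list are the candidates for a follow-up lookup through the GitHub API.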

We could still use the dataset to get a list of repos per user. I expect this list of repos to be mostly exhaustive because:
 - I expect most repos to start public. (We can easily get stats for the ratio of repos that start private to repos that start public by checking whether the first PushEvent hits the 20-commit limit.)
 - If you contributed to a private repo, chances are reasonably high that you keep contributing to it after it goes open source.
 - Small contributions (only a couple of commits) are very unlikely to be missing. (Small contributions most likely only happen in public repos, and it is very unlikely that a small contribution is lost because the commits array of a subsequent PushEvent was truncated.)
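
A rough sketch of building the list of repos per user from parsed events. Keying users by commit author email is an assumption made for illustration (the event's actor would be another option), as are the top-level `type` and `repo.name` fields; `repos_per_author` is a hypothetical name.

```python
from collections import defaultdict
from typing import Iterable


def repos_per_author(events: Iterable[dict]) -> dict:
    """Map each commit author email to the set of repos they pushed commits to."""
    repos = defaultdict(set)
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        repo_name = event.get("repo", {}).get("name")  # assumed field
        if not repo_name:
            continue
        for commit in event.get("payload", {}).get("commits", []):
            email = (commit.get("author") or {}).get("email")
            if email:
                repos[email].add(repo_name)
    return dict(repos)
```

Because only the presence of at least one commit per repo matters here, a truncated commits array only hides a repo when every commit by that author in that repo was cut off.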

We can also use the dataset for repos whose first PushEvent has fewer than 20 commits. In that case we can be confident that the repo started public. Missing out on a couple of commits is then probably ok: the approximate commit stats would likely be good enough to categorize users as "maintainer"/"gold contrib"/"silver contrib"/"bronze contrib" and to show a contribution timeline.
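
One possible shape for this, as a sketch only. The share-of-commits thresholds in `TIERS` are entirely hypothetical placeholders (nothing above defines where the category boundaries lie), as are the function names. The code assumes `created_at` is an ISO 8601 UTC timestamp, so that string comparison orders events chronologically, and it skips commits flagged as not distinct to avoid counting the same commit twice.

```python
from collections import Counter

MAX_COMMITS_PER_EVENT = 20  # limit stated in the PushEvent documentation

# Hypothetical thresholds on an author's share of counted commits.
TIERS = [(0.5, "maintainer"), (0.1, "gold contrib"), (0.01, "silver contrib")]


def started_public(push_events: list) -> bool:
    """True if the earliest PushEvent of a repo has fewer than 20 commits."""
    if not push_events:
        return False
    first = min(push_events, key=lambda e: e["created_at"])
    commits = first.get("payload", {}).get("commits", [])
    return len(commits) < MAX_COMMITS_PER_EVENT


def commit_counts(push_events: list) -> Counter:
    """Count distinct commits per author email across a repo's PushEvents."""
    counts = Counter()
    for event in push_events:
        for commit in event.get("payload", {}).get("commits", []):
            if not commit.get("distinct", True):
                continue  # already pushed before; avoid double counting
            email = (commit.get("author") or {}).get("email")
            if email:
                counts[email] += 1
    return counts


def categorize(counts: Counter) -> dict:
    """Assign each author a tier based on their share of counted commits."""
    total = sum(counts.values())
    result = {}
    for email, n in counts.items():
        share = n / total
        result[email] = next(
            (label for threshold, label in TIERS if share >= threshold),
            "bronze contrib",
        )
    return result
```

Running `started_public` over all repos would also give the private/public start ratio mentioned earlier.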

Leverage gharchive.org dataset #20
