11 changes: 2 additions & 9 deletions plot/crawl_size.py
@@ -171,18 +171,11 @@ def plot(self):
                        'Pages / Unique Items Cumulative',
                        'crawlsize/cumulative.png',
                        data_export_csv='crawlsize/cumulative.csv')
-        # -- new items per crawl
-        row_types = ['page', 'url estim. new',
-                     'digest estim. new']
-        self.size_plot(self.size_by_type, row_types, ' new$',
-                       'New Items per Crawl (not observed in prior crawls)',
-                       'Pages / New Items', 'crawlsize/monthly_new.png',
-                       data_export_csv='crawlsize/monthly_new.csv')
         # -- new URLs per crawl
         row_types = ['url estim. new']
-        self.size_plot(self.size_by_type, row_types, ' new$',
+        self.size_plot(self.size_by_type, row_types, '',
                        'New URLs per Crawl (not observed in prior crawls)',
-                       '', 'crawlsize/monthly_new_urls.png',
+                       'New URLs', 'crawlsize/monthly_new.png',
                        data_export_csv='crawlsize/monthly_new.csv')
         # -- cumulative URLs over last N crawls (this and preceding N-1 crawls)
         row_types = ['url', '1 crawl',  # 'url' replaced by '1 crawl'
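In the hunk above, the third positional argument to `size_plot` changes from the regex `' new$'` to an empty string. As a hedged illustration only (this helper is hypothetical and not code from `crawl_size.py`), assuming that argument is a pattern stripped from row-type names when building labels, the effect of the change would be:

```python
import re

def legend_label(row_type: str, strip_pattern: str) -> str:
    """Hypothetical helper: strip a regex suffix from a row-type name
    to produce a label (an assumption about size_plot's behavior)."""
    if strip_pattern:
        return re.sub(strip_pattern, '', row_type)
    return row_type

# With the old pattern ' new$' the suffix is removed:
#   legend_label('url estim. new', ' new$')  -> 'url estim.'
# With the new empty pattern the row-type name is kept verbatim:
#   legend_label('url estim. new', '')       -> 'url estim. new'
```

Keeping the full name `url estim. new` is consistent with the CSV header change further down in this pull request.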
6 changes: 3 additions & 3 deletions plots/crawlsize.md
@@ -27,11 +27,11 @@ Every monthly crawl is a sample of the web and we try to make every monthly snap
 
 ![Cumulative size of monthly crawl archives since 2013](./crawlsize/cumulative.png)
 
-The next plot shows the difference in the cumulative size to the preceding crawl. In other words, the amount of new URLs or new content not observed in any of the preceding monthly crawls.
+The next plot shows the difference in the cumulative number of URLs compared to the preceding crawl, that is, the number of new URLs not observed in any of the preceding crawls.
 
-![New Items per Crawl, not observed in prior crawls](./crawlsize/monthly_new.png)
+![New URLs per Crawl, not observed in prior crawls](./crawlsize/monthly_new.png)
 
-([New items per crawl as CSV](./crawlsize/monthly_new.csv))
+([New URLs per crawl as CSV](./crawlsize/monthly_new.csv))
 
 How many unique items (in terms of URLs or unique content by digest) are covered by the last n crawls? The coverage over certain time intervals went down in early 2015 when continuous donations of verified seeds stopped. Since autumn 2016 we have been able to extend the crawl on our own, and we try to increase the coverage for the last n crawls.
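The markdown paragraph changed above describes new URLs per crawl as the difference between consecutive cumulative unique-URL counts. A minimal sketch of that calculation (a hypothetical helper, not part of this repository):

```python
def new_per_crawl(cumulative_counts):
    """Given (crawl, cumulative unique URLs) pairs in chronological order,
    return (crawl, estimated new URLs) pairs: each crawl's contribution
    is the increase over the preceding cumulative total."""
    new_counts = []
    previous_total = 0
    for crawl, total in cumulative_counts:
        new_counts.append((crawl, total - previous_total))
        previous_total = total
    return new_counts

# Example with made-up totals:
#   new_per_crawl([('2013-20', 100), ('2013-48', 150)])
#   -> [('2013-20', 100), ('2013-48', 50)]
```

This is the quantity plotted in `crawlsize/monthly_new.png` and exported to `crawlsize/monthly_new.csv`, with the cumulative counts themselves being estimates of unique items.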
2 changes: 1 addition & 1 deletion plots/crawlsize/monthly_new.csv
@@ -1,4 +1,4 @@
-crawl,url estim.
+crawl,url estim. new
 CC-MAIN-2008-2009,1799114116
 CC-MAIN-2009-2010,2025520640
 CC-MAIN-2012,2875802047
Binary file modified plots/crawlsize/monthly_new.png
Binary file removed plots/crawlsize/monthly_new_urls.png
Binary file not shown.