Description
(As opposed to inputting the same domain as a command-line argument, which does work as expected.)
Description of behavior:
- When a wildcarded domain is input as a command-line argument to graph.py, it is processed as expected.
- When it is input from the input file, the wildcard is not processed correctly. This happens both when:
a. The domain is the single domain in the input file
b. There is another domain in the input file.
Case 1: Correct behavior - wildcarded domain is input from command line:
python graph.py "*.maori.nz" | tee '*.maori.nz.log'
produces non-zero output:
$ cat *.maori.nz_*csv
"crawl","sum_fetch_200","sum_fetch_200_lote"
"CC-MAIN-2021-49",1817,316
"CC-MAIN-2022-05",2768,364
"CC-MAIN-2022-21",3272,211
"CC-MAIN-2022-27",3950,284
"CC-MAIN-2022-33",1564,117
"CC-MAIN-2022-40",5412,802
...
Case 2.a: Wrong behavior - wildcarded domain is given from an input file, which has only that domain:
$ cat maori_hosts_subdomain.txt
*.maori.nz
produces empty output (the output should have been non-empty, see Case 1)
Command: python graph.py -f maori_hosts_subdomain.txt | tee maori_hosts_subdomain.txt.log
...
$ cat maori_hosts_subdomain.txt_*csv
"crawl","sum_fetch_200"
"crawl","sum_fetch_200","sum_fetch_200_lote"
"crawl","sum_fetch_200","sum_fetch_200_lote","sum_nutch_fetched","sum_nutch_unfetched","sum_nutch_gone","sum_nutch_redirTemp","sum_nutch_redirPerm","sum_nutch_notModified"
Case 2.b: Wrong behavior - wildcarded domain is given from an input file, which has another domain as well:
$ cat maori_hosts.txt
maori.nz
*.maori.nz
The wildcarded domain is not processed correctly, and thus is not reflected in the output.
The output contains information from the first domain only.
Command: python graph.py -f maori_hosts.txt | tee maori_hosts.txt.log
$ cat maori_hosts_*csv
"crawl","sum_fetch_200"
"crawl","sum_fetch_200","sum_fetch_200_lote"
"crawl","sum_fetch_200","sum_fetch_200_lote","sum_nutch_fetched","sum_nutch_unfetched","sum_nutch_gone","sum_nutch_redirTemp","sum_nutch_redirPerm","sum_nutch_notModified"
"crawl","sum_fetch_200"
"CC-MAIN-2021-49",0
"CC-MAIN-2022-05",0
"CC-MAIN-2022-21",0
"CC-MAIN-2022-27",0
"CC-MAIN-2022-33",0
"CC-MAIN-2022-40",0
...
"crawl","sum_fetch_200","sum_fetch_200_lote"
"CC-MAIN-2021-49",0,0
"CC-MAIN-2022-05",0,0
"CC-MAIN-2022-21",0,0
"CC-MAIN-2022-27",0,0
"CC-MAIN-2022-33",0,0
"CC-MAIN-2022-40",0,0
...
"crawl","sum_fetch_200","sum_fetch_200_lote","sum_nutch_fetched","sum_nutch_unfetched","sum_nutch_gone","sum_nutch_redirTemp","sum_nutch_redirPerm","sum_nutch_notModified"
"CC-MAIN-2021-49",0,0,0,0,0,0,0,0
"CC-MAIN-2022-05",0,0,0,0,0,0,0,0
"CC-MAIN-2022-21",0,0,0,0,0,0,0,0
"CC-MAIN-2022-27",0,0,0,0,0,0,0,0
"CC-MAIN-2022-33",0,0,0,0,0,0,0,0
"CC-MAIN-2022-40",0,0,0,0,0,0,0,0
...
Diagnosing the cause of the behavior:
The SQL that is formed when a wildcarded domain is given through a file is not correct.
Starting with this as the input file:
$ cat maori_hosts.txt
maori.nz
*.maori.nz
This is the SQL formed by graph.py:
python graph.py -f maori_hosts.txt | tee maori_hosts.txt.log
DEBUG::surt line from file: maori.nz
DEBUG::found surt_host_name: nz,maori
DEBUG::appending
DEBUG::surt line from file: *.maori.nz
DEBUG::found surt_host_name: nz,maori,
DEBUG::appending
DEBUG::surt_host_name: ['nz,maori', 'nz,maori,']
DEBUG::config: {'sum': ('crawl', 'fetch_200'), 'sum_lote': ('crawl', 'fetch_200', 'fetch_200_lote'), 'sum_nutch': ('crawl', 'fetch_200', 'fetch_200_lote', 'nutch_fetched', 'nutch_unfetched', 'nutch_gone', 'nutch_redirTemp', 'nutch_redirPerm', 'nutch_notModified')}
DEBUG::check_sums: True
DEBUG::sql for many hosts:
SELECT
crawl, CAST(SUM(fetch_200) AS INT64) AS sum_fetch_200
FROM host_index
WHERE contains(ARRAY ['nz,maori','nz,maori,'], surt_host_name) AND url_host_tld = 'nz'
GROUP BY crawl
ORDER BY crawl ASC
When run in DuckDB, this SQL returns:
crawl surt_host_name
0 CC-MAIN-2022-49 nz,maori
1 CC-MAIN-2022-21 nz,maori
2 CC-MAIN-2023-06 nz,maori
3 CC-MAIN-2024-30 nz,maori
4 CC-MAIN-2025-18 nz,maori
5 CC-MAIN-2023-40 nz,maori
6 CC-MAIN-2024-22 nz,maori
7 CC-MAIN-2024-38 nz,maori
8 CC-MAIN-2023-50 nz,maori
9 CC-MAIN-2022-27 nz,maori
10 CC-MAIN-2025-13 nz,maori
11 CC-MAIN-2023-14 nz,maori
12 CC-MAIN-2024-51 nz,maori
13 CC-MAIN-2024-42 nz,maori
14 CC-MAIN-2022-33 nz,maori
15 CC-MAIN-2022-40 nz,maori
16 CC-MAIN-2025-05 nz,maori
17 CC-MAIN-2023-23 nz,maori
18 CC-MAIN-2024-33 nz,maori
19 CC-MAIN-2024-18 nz,maori
20 CC-MAIN-2021-49 nz,maori
21 CC-MAIN-2024-10 nz,maori
22 CC-MAIN-2025-08 nz,maori
23 CC-MAIN-2024-26 nz,maori
24 CC-MAIN-2022-05 nz,maori
25 CC-MAIN-2024-46 nz,maori
-------
This is not correct, as it doesn't include the wildcarded subdomains.
I think the input *.maori.nz is mapped to the surt_host_name nz,maori, (with a trailing comma), and because the query only matches surt_host_name values exactly, this does not cover the wildcarded subdomains.
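To illustrate the suspected cause, here is a minimal sketch in plain Python. It is not code from graph.py. The function names are hypothetical, and the SURT form `nz,maori,www` for www.maori.nz is an assumption based on the pattern in the DEBUG output above (maori.nz becomes nz,maori). The first function mimics the exact-membership test that `contains(ARRAY [...], surt_host_name)` performs. The second treats entries ending in a comma as prefixes.

```python
# Hypothetical illustration of the suspected cause; not code from graph.py.

# Values taken from the DEBUG output ("surt_host_name: [...]").
hosts_from_file = ["nz,maori", "nz,maori,"]


def exact_match(surt_host_name: str) -> bool:
    """Mimic contains(ARRAY [...], surt_host_name): exact membership only."""
    return surt_host_name in hosts_from_file


def prefix_match(surt_host_name: str) -> bool:
    """Treat entries ending in ',' as wildcard prefixes, others as exact."""
    for entry in hosts_from_file:
        if entry.endswith(","):
            if surt_host_name.startswith(entry):
                return True
        elif surt_host_name == entry:
            return True
    return False


# Assumed SURT form of the subdomain www.maori.nz.
subdomain = "nz,maori,www"

print(exact_match(subdomain))    # False: no host is literally named "nz,maori,"
print(prefix_match(subdomain))   # True: "nz,maori,www" starts with "nz,maori,"
print(exact_match("nz,maori"))   # True: the bare domain still matches exactly
```

Under these assumptions, the wildcard entry can never match any row when compared for equality, which would be consistent with the results above containing only nz,maori.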
This bug exists only when the wildcarded subdomain (*.maori.nz) is read from a file.
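One possible direction for a fix, sketched under assumptions: split the SURT host names read from the file into exact hosts and wildcard prefixes (entries ending in a comma), and generate a separate prefix condition for each wildcard. `build_where_clause` is a hypothetical helper, not an existing function in graph.py. The sketch uses DuckDB's `starts_with` string function rather than `LIKE`, so that `_` or `%` in a host name is not interpreted as a pattern character.

```python
# Hypothetical sketch of a possible fix; not the actual graph.py implementation.


def build_where_clause(surt_host_names: list[str]) -> str:
    """Build a SQL condition from SURT host names.

    Entries ending in ',' are treated as wildcard prefixes;
    all other entries must match exactly.
    """

    def quote(value: str) -> str:
        # Escape single quotes so the value is a safe SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    exact = [h for h in surt_host_names if not h.endswith(",")]
    prefixes = [h for h in surt_host_names if h.endswith(",")]

    conditions = []
    if exact:
        array_items = ",".join(quote(h) for h in exact)
        conditions.append(f"contains(ARRAY [{array_items}], surt_host_name)")
    for prefix in prefixes:
        conditions.append(f"starts_with(surt_host_name, {quote(prefix)})")

    if not conditions:
        # No hosts given: return a condition that matches nothing.
        return "FALSE"
    return "(" + " OR ".join(conditions) + ")"


print(build_where_clause(["nz,maori", "nz,maori,"]))
```

The returned condition would replace the `contains(...)` part of the generated query, with the existing `AND url_host_tld = 'nz'` filter appended unchanged.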