graph.py: Inputting a wildcarded domain from the file input interface doesn't work correctly #5

@handecelikkanat

(As opposed to inputting the same domain as a command-line argument, which does work as expected.)

Description of behavior:

  1. When a wildcarded domain is input as a command-line argument to graph.py, it is processed as expected.
  2. When it is input from the input file, the wildcard is not processed correctly. This happens both when:
    a. The domain is the single domain in the input file
    b. There is another domain in the input file.

Case 1: Correct behavior - wildcarded domain is input from command line:

python graph.py "*.maori.nz" | tee '*.maori.nz.log'

produces non-zero output:

$ cat *.maori.nz_*csv
"crawl","sum_fetch_200","sum_fetch_200_lote"
"CC-MAIN-2021-49",1817,316
"CC-MAIN-2022-05",2768,364
"CC-MAIN-2022-21",3272,211
"CC-MAIN-2022-27",3950,284
"CC-MAIN-2022-33",1564,117
"CC-MAIN-2022-40",5412,802
...

Case 2.a: Wrong behavior - wildcarded domain is given from an input file, which has only that domain:

$ cat maori_hosts_subdomain.txt
*.maori.nz

produces empty output (the output should have been non-empty, see Case 1)

Command: python graph.py -f maori_hosts_subdomain.txt | tee maori_hosts_subdomain.txt.log
...
$ cat maori_hosts_subdomain.txt_*csv
"crawl","sum_fetch_200"
"crawl","sum_fetch_200","sum_fetch_200_lote"
"crawl","sum_fetch_200","sum_fetch_200_lote","sum_nutch_fetched","sum_nutch_unfetched","sum_nutch_gone","sum_nutch_redirTemp","sum_nutch_redirPerm","sum_nutch_notModified"

Case 2.b: Wrong behavior - wildcarded domain is given from an input file, which has another domain as well:

$ cat maori_hosts.txt
maori.nz
*.maori.nz

The wildcarded domain is not processed correctly, so it is not reflected in the output.
The output contains the information from the first domain only.

Command: python graph.py -f maori_hosts.txt | tee maori_hosts.txt.log

$ cat maori_hosts_*csv
"crawl","sum_fetch_200"
"crawl","sum_fetch_200","sum_fetch_200_lote"
"crawl","sum_fetch_200","sum_fetch_200_lote","sum_nutch_fetched","sum_nutch_unfetched","sum_nutch_gone","sum_nutch_redirTemp","sum_nutch_redirPerm","sum_nutch_notModified"
"crawl","sum_fetch_200"
"CC-MAIN-2021-49",0
"CC-MAIN-2022-05",0
"CC-MAIN-2022-21",0
"CC-MAIN-2022-27",0
"CC-MAIN-2022-33",0
"CC-MAIN-2022-40",0
...
"crawl","sum_fetch_200","sum_fetch_200_lote"
"CC-MAIN-2021-49",0,0
"CC-MAIN-2022-05",0,0
"CC-MAIN-2022-21",0,0
"CC-MAIN-2022-27",0,0
"CC-MAIN-2022-33",0,0
"CC-MAIN-2022-40",0,0
...
"crawl","sum_fetch_200","sum_fetch_200_lote","sum_nutch_fetched","sum_nutch_unfetched","sum_nutch_gone","sum_nutch_redirTemp","sum_nutch_redirPerm","sum_nutch_notModified"
"CC-MAIN-2021-49",0,0,0,0,0,0,0,0
"CC-MAIN-2022-05",0,0,0,0,0,0,0,0
"CC-MAIN-2022-21",0,0,0,0,0,0,0,0
"CC-MAIN-2022-27",0,0,0,0,0,0,0,0
"CC-MAIN-2022-33",0,0,0,0,0,0,0,0
"CC-MAIN-2022-40",0,0,0,0,0,0,0,0
...

Diagnosing the cause of the behavior:

The SQL that is generated when a wildcarded domain is given through a file is not correct.

Starting with this as the input file:

$ cat maori_hosts.txt
maori.nz
*.maori.nz

This is the SQL generated by graph.py:

python graph.py -f maori_hosts.txt | tee maori_hosts.txt.log
DEBUG::surt line from file: maori.nz
DEBUG::found surt_host_name: nz,maori
DEBUG::appending
DEBUG::surt line from file: *.maori.nz
DEBUG::found surt_host_name: nz,maori,
DEBUG::appending
DEBUG::surt_host_name: ['nz,maori', 'nz,maori,']
DEBUG::config: {'sum': ('crawl', 'fetch_200'), 'sum_lote': ('crawl', 'fetch_200', 'fetch_200_lote'), 'sum_nutch': ('crawl', 'fetch_200', 'fetch_200_lote', 'nutch_fetched', 'nutch_unfetched', 'nutch_gone', 'nutch_redirTemp', 'nutch_redirPerm', 'nutch_notModified')}
DEBUG::check_sums: True
DEBUG::sql for many hosts: 
SELECT
  crawl, CAST(SUM(fetch_200) AS INT64) AS sum_fetch_200
FROM host_index
WHERE contains(ARRAY ['nz,maori','nz,maori,'], surt_host_name) AND url_host_tld = 'nz'
GROUP BY crawl
ORDER BY crawl ASC

When run in DuckDB, this SQL returns:

              crawl surt_host_name
0   CC-MAIN-2022-49       nz,maori
1   CC-MAIN-2022-21       nz,maori
2   CC-MAIN-2023-06       nz,maori
3   CC-MAIN-2024-30       nz,maori
4   CC-MAIN-2025-18       nz,maori
5   CC-MAIN-2023-40       nz,maori
6   CC-MAIN-2024-22       nz,maori
7   CC-MAIN-2024-38       nz,maori
8   CC-MAIN-2023-50       nz,maori
9   CC-MAIN-2022-27       nz,maori
10  CC-MAIN-2025-13       nz,maori
11  CC-MAIN-2023-14       nz,maori
12  CC-MAIN-2024-51       nz,maori
13  CC-MAIN-2024-42       nz,maori
14  CC-MAIN-2022-33       nz,maori
15  CC-MAIN-2022-40       nz,maori
16  CC-MAIN-2025-05       nz,maori
17  CC-MAIN-2023-23       nz,maori
18  CC-MAIN-2024-33       nz,maori
19  CC-MAIN-2024-18       nz,maori
20  CC-MAIN-2021-49       nz,maori
21  CC-MAIN-2024-10       nz,maori
22  CC-MAIN-2025-08       nz,maori
23  CC-MAIN-2024-26       nz,maori
24  CC-MAIN-2022-05       nz,maori
25  CC-MAIN-2024-46       nz,maori
-------

This is not correct, as it doesn't include the wildcarded subdomains.

I think the input *.maori.nz is mapped to the surt_host_name 'nz,maori,' (note the trailing comma). The query then tests that string for exact membership in the array, so it does not cover the wildcarded subdomains.
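To illustrate the suspected mismatch, here is a minimal sketch in plain Python. It is not graph.py's actual code: the helper `to_surt_host_name` and the sample index rows are hypothetical, and the mapping is only inferred from the DEBUG lines above (`maori.nz` becomes `nz,maori`, `*.maori.nz` becomes `nz,maori,`).

```python
from typing import List, Tuple


def to_surt_host_name(host: str) -> Tuple[str, bool]:
    """Hypothetical helper: map a host pattern to (surt_name, is_wildcard).

    Mirrors the mapping seen in the debug log, not graph.py itself.
    """
    host = host.strip().lower()
    is_wildcard = host.startswith("*.")
    if is_wildcard:
        host = host[2:]
    # Reverse the dot-separated labels and join them with commas.
    surt = ",".join(reversed(host.split(".")))
    if is_wildcard:
        # Trailing comma, as in "nz,maori," from the debug output.
        surt += ","
    return surt, is_wildcard


# Made-up index rows, for illustration only.
rows: List[str] = ["nz,maori", "nz,maori,www", "nz,maori,blog"]

wanted = [to_surt_host_name(h)[0] for h in ["maori.nz", "*.maori.nz"]]

# Exact membership, analogous to contains(ARRAY[...], surt_host_name).
exact = [r for r in rows if r in wanted]

# Prefix semantics for entries that end in a comma.
prefix = [
    r for r in rows
    if r in wanted or any(w.endswith(",") and r.startswith(w) for w in wanted)
]

print(wanted)
print(exact)
print(prefix)
```

With exact membership only the apex host matches, because no row is literally equal to `nz,maori,`. With prefix semantics the subdomain rows match as well.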

This bug exists only when the wildcarded subdomain (*.maori.nz) is read from a file.
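One possible direction for a fix, sketched under assumptions: split the names read from the file into exact names and wildcard prefixes, and emit a prefix condition for the latter. The function `build_host_filter` is hypothetical and not part of graph.py, and the sketch assumes the database driver accepts `?` placeholders.

```python
from typing import List, Tuple


def build_host_filter(surt_names: List[str]) -> Tuple[str, List[str]]:
    """Hypothetical sketch: build a WHERE fragment plus its parameters.

    Names ending in ',' are treated as wildcard prefixes (any subdomain);
    all other names are matched exactly.
    """
    if not surt_names:
        raise ValueError("at least one surt_host_name is required")

    exact = [s for s in surt_names if not s.endswith(",")]
    prefixes = [s for s in surt_names if s.endswith(",")]

    clauses: List[str] = []
    params: List[str] = []
    if exact:
        placeholders = ", ".join("?" for _ in exact)
        clauses.append(f"surt_host_name IN ({placeholders})")
        params.extend(exact)
    for p in prefixes:
        # Note: '_' and '%' inside p would act as LIKE wildcards.
        clauses.append("surt_host_name LIKE ?")
        params.append(p + "%")

    return "(" + " OR ".join(clauses) + ")", params


where, params = build_host_filter(["nz,maori", "nz,maori,"])
print(where)
print(params)
```

The resulting fragment would replace only the `contains(ARRAY [...], surt_host_name)` condition; the `url_host_tld = 'nz'` condition from the generated query would stay as it is. Whether the command-line path already does something similar has not been checked here.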
