graph.py: Inputting a wildcarded domain through the file input interface doesn't work correctly

(As opposed to inputting the same domain as a command-line argument, which does work as expected.)

### Description of behavior:

1. When a wildcarded domain is input as a command-line argument to `graph.py`, it is processed as expected.
2. When it is input from the input file, the wildcard is not processed correctly. This happens both when:
   - (a) the domain is the single domain in the input file
   - (b) there is another domain in the input file.

#### Case 1: Correct behavior - wildcarded domain is input from command line:

```
python graph.py "*.maori.nz" | tee '*.maori.nz.log'
```

produces non-zero output:
```
$ cat *.maori.nz_*csv
"crawl","sum_fetch_200","sum_fetch_200_lote"
"CC-MAIN-2021-49",1817,316
"CC-MAIN-2022-05",2768,364
"CC-MAIN-2022-21",3272,211
"CC-MAIN-2022-27",3950,284
"CC-MAIN-2022-33",1564,117
"CC-MAIN-2022-40",5412,802
...
``` 


#### Case 2.a: Wrong behavior - wildcarded domain is given from an input file, which has only that domain:

```
$ cat maori_hosts_subdomain.txt
*.maori.nz
```

Running `graph.py` on this file produces empty output (**the output should have been non-empty, see Case 1**):
```
Command: python graph.py -f maori_hosts_subdomain.txt | tee maori_hosts_subdomain.txt.log
...
$ cat maori_hosts_subdomain.txt_*csv
"crawl","sum_fetch_200"
"crawl","sum_fetch_200","sum_fetch_200_lote"
"crawl","sum_fetch_200","sum_fetch_200_lote","sum_nutch_fetched","sum_nutch_unfetched","sum_nutch_gone","sum_nutch_redirTemp","sum_nutch_redirPerm","sum_nutch_notModified"
```


#### Case 2.b: Wrong behavior - wildcarded domain is given from an input file, which has another domain as well:

```
$ cat maori_hosts.txt
maori.nz
*.maori.nz
```

The wildcarded domain is not processed correctly, so it is not reflected in the output.  
The output contains the information from the first domain only.

```
Command: python graph.py -f maori_hosts.txt | tee maori_hosts.txt.log

$ cat maori_hosts_*csv
"crawl","sum_fetch_200"
"crawl","sum_fetch_200","sum_fetch_200_lote"
"crawl","sum_fetch_200","sum_fetch_200_lote","sum_nutch_fetched","sum_nutch_unfetched","sum_nutch_gone","sum_nutch_redirTemp","sum_nutch_redirPerm","sum_nutch_notModified"
"crawl","sum_fetch_200"
"CC-MAIN-2021-49",0
"CC-MAIN-2022-05",0
"CC-MAIN-2022-21",0
"CC-MAIN-2022-27",0
"CC-MAIN-2022-33",0
"CC-MAIN-2022-40",0
...
"crawl","sum_fetch_200","sum_fetch_200_lote"
"CC-MAIN-2021-49",0,0
"CC-MAIN-2022-05",0,0
"CC-MAIN-2022-21",0,0
"CC-MAIN-2022-27",0,0
"CC-MAIN-2022-33",0,0
"CC-MAIN-2022-40",0,0
...
"crawl","sum_fetch_200","sum_fetch_200_lote","sum_nutch_fetched","sum_nutch_unfetched","sum_nutch_gone","sum_nutch_redirTemp","sum_nutch_redirPerm","sum_nutch_notModified"
"CC-MAIN-2021-49",0,0,0,0,0,0,0,0
"CC-MAIN-2022-05",0,0,0,0,0,0,0,0
"CC-MAIN-2022-21",0,0,0,0,0,0,0,0
"CC-MAIN-2022-27",0,0,0,0,0,0,0,0
"CC-MAIN-2022-33",0,0,0,0,0,0,0,0
"CC-MAIN-2022-40",0,0,0,0,0,0,0,0
...
```


### Diagnosing the cause of the behavior:

The SQL that is formed when a wildcarded domain is given through a file is not correct.

Starting with this as the input file:
```
$ cat maori_hosts.txt
maori.nz
*.maori.nz
```

This is the SQL formed by `graph.py`:
```
python graph.py -f maori_hosts.txt | tee maori_hosts.txt.log
DEBUG::surt line from file: maori.nz
DEBUG::found surt_host_name: nz,maori
DEBUG::appending
DEBUG::surt line from file: *.maori.nz
DEBUG::found surt_host_name: nz,maori,
DEBUG::appending
DEBUG::surt_host_name: ['nz,maori', 'nz,maori,']
DEBUG::config: {'sum': ('crawl', 'fetch_200'), 'sum_lote': ('crawl', 'fetch_200', 'fetch_200_lote'), 'sum_nutch': ('crawl', 'fetch_200', 'fetch_200_lote', 'nutch_fetched', 'nutch_unfetched', 'nutch_gone', 'nutch_redirTemp', 'nutch_redirPerm', 'nutch_notModified')}
DEBUG::check_sums: True
DEBUG::sql for many hosts: 
SELECT
  crawl, CAST(SUM(fetch_200) AS INT64) AS sum_fetch_200
FROM host_index
WHERE contains(ARRAY ['nz,maori','nz,maori,'], surt_host_name) AND url_host_tld = 'nz'
GROUP BY crawl
ORDER BY crawl ASC
```

When run in DuckDB, this SQL returns:
```
              crawl surt_host_name
0   CC-MAIN-2022-49       nz,maori
1   CC-MAIN-2022-21       nz,maori
2   CC-MAIN-2023-06       nz,maori
3   CC-MAIN-2024-30       nz,maori
4   CC-MAIN-2025-18       nz,maori
5   CC-MAIN-2023-40       nz,maori
6   CC-MAIN-2024-22       nz,maori
7   CC-MAIN-2024-38       nz,maori
8   CC-MAIN-2023-50       nz,maori
9   CC-MAIN-2022-27       nz,maori
10  CC-MAIN-2025-13       nz,maori
11  CC-MAIN-2023-14       nz,maori
12  CC-MAIN-2024-51       nz,maori
13  CC-MAIN-2024-42       nz,maori
14  CC-MAIN-2022-33       nz,maori
15  CC-MAIN-2022-40       nz,maori
16  CC-MAIN-2025-05       nz,maori
17  CC-MAIN-2023-23       nz,maori
18  CC-MAIN-2024-33       nz,maori
19  CC-MAIN-2024-18       nz,maori
20  CC-MAIN-2021-49       nz,maori
21  CC-MAIN-2024-10       nz,maori
22  CC-MAIN-2025-08       nz,maori
23  CC-MAIN-2024-26       nz,maori
24  CC-MAIN-2022-05       nz,maori
25  CC-MAIN-2024-46       nz,maori
-------
```

This is not correct, as it doesn't include the wildcarded subdomains.

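I think the input `*.maori.nz` is mapped to the surt_host_name `nz,maori,`, which is then placed in the `ARRAY [...]` list and compared for exact equality by `contains(ARRAY [...], surt_host_name)`. An exact match on `nz,maori,` does not cover the wildcarded subdomains, so only `nz,maori` rows come back.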
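To illustrate the suspected mismatch, here is a small pure-Python model of the two matching modes. It is only a sketch: the subdomain SURT names (`nz,maori,www`, `nz,maori,blog`) are made up for illustration, and the assumption that `contains(ARRAY [...], surt_host_name)` behaves like exact list membership is inferred from the query result above, not from the `graph.py` source.

```python
# Hypothetical SURT host names, standing in for rows of host_index.
rows = ["nz,maori", "nz,maori,www", "nz,maori,blog"]

# What graph.py derives from the input file (copied from the DEBUG output).
from_file = ["nz,maori", "nz,maori,"]

# Assumed behavior of contains(ARRAY [...], surt_host_name): exact membership.
exact_hits = [r for r in rows if r in from_file]

# What the wildcard is meant to express: every name under the prefix.
prefix_hits = [r for r in rows if r.startswith("nz,maori,")]

print(exact_hits)   # ['nz,maori']
print(prefix_hits)  # ['nz,maori,www', 'nz,maori,blog']
```

Under that assumption, the `nz,maori,` entry in the list can never match a real host, which would explain why only `nz,maori` rows appear.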

This bug exists only when the wildcarded subdomain (`*.maori.nz`) is read from a file.
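One possible direction for a fix, sketched under assumptions: keep exact hosts in the `ARRAY [...]` list and turn each wildcarded host into a prefix condition. The helper names `to_surt` and `build_where` are hypothetical and do not come from `graph.py`, and I have not checked how the command-line path builds its query, so treat this as an illustration rather than the actual implementation.

```python
def to_surt(host: str) -> tuple[str, bool]:
    """Return (surt_host_name, is_wildcard) for a pattern such as '*.maori.nz'."""
    host = host.strip().lower()
    is_wildcard = host.startswith("*.")
    if is_wildcard:
        host = host[2:]
    surt = ",".join(reversed(host.split(".")))
    # Keep the trailing comma for wildcards so the prefix stops at a label boundary.
    return (surt + "," if is_wildcard else surt), is_wildcard


def build_where(hosts: list[str]) -> str:
    """Build a WHERE fragment: exact hosts via contains(), wildcards via LIKE."""
    exact, prefixes = [], []
    for h in hosts:
        surt, is_wildcard = to_surt(h)
        (prefixes if is_wildcard else exact).append(surt)
    clauses = []
    if exact:
        quoted = ",".join(f"'{s}'" for s in exact)
        clauses.append(f"contains(ARRAY [{quoted}], surt_host_name)")
    # Sketch only: real code should bind parameters and escape LIKE wildcards.
    clauses += [f"surt_host_name LIKE '{p}%'" for p in prefixes]
    if not clauses:
        raise ValueError("no hosts given")
    return "(" + " OR ".join(clauses) + ")"


print(build_where(["maori.nz", "*.maori.nz"]))
# (contains(ARRAY ['nz,maori'], surt_host_name) OR surt_host_name LIKE 'nz,maori,%')
```

The resulting fragment would take the place of the `contains(...)` condition in the query shown above, with the `url_host_tld = 'nz'` filter left unchanged.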
