Commit d0e7560

add option to --no-check-certs use at own risk (#89)
* add option to --no-check-certs use at own risk
* support to target single file, and HEAD request
* bug: update selenium to use newer interface

The current failures are a result of an update to selenium, so the instantiation of our driver fails, returns as None, and then all the requests are done with only requests. As the web matures (and sites do not want scraping) it is less likely this approach will work - we need the driver. This change will update the selenium UI to ensure the driver works and restore functionality. I will follow up with any tweaks needed for the CI (working locally for me).

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
1 parent 7dbd7ac commit d0e7560

19 files changed

+156
-96
lines changed

.github/workflows/main.yml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ jobs:
     name: Build Container
     steps:
     - name: Checkout
-      uses: actions/checkout@v3
+      uses: actions/checkout@v4
 
     - name: Build
       run: |

.github/workflows/test.yml

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ jobs:
1111
formatting:
1212
runs-on: ubuntu-latest
1313
steps:
14-
- uses: actions/checkout@v3
14+
- uses: actions/checkout@v4
1515

1616
- name: Setup black environment
1717
run: conda create --quiet --name black pyflakes
@@ -28,7 +28,7 @@ jobs:
2828
needs: formatting
2929
runs-on: ubuntu-latest
3030
steps:
31-
- uses: actions/checkout@v3
31+
- uses: actions/checkout@v4
3232

3333
- name: Setup mypy environment
3434
run: conda create --quiet --name type_checking mypy
@@ -45,15 +45,16 @@ jobs:
4545
needs: type_checking
4646
runs-on: ubuntu-latest
4747
steps:
48-
- uses: actions/checkout@v3
48+
- uses: actions/checkout@v4
4949
- name: Setup testing environment
5050
run: conda create --quiet --name testing pytest
5151

5252
- name: Download ChromeDriver
5353
run: |
54-
wget https://chromedriver.storage.googleapis.com/107.0.5304.18/chromedriver_linux64.zip
55-
unzip chromedriver_linux64.zip
56-
rm chromedriver_linux64.zip
54+
# Note if you use locally, must match
55+
wget https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/121.0.6167.85/linux64/chromedriver-linux64.zip
56+
unzip chromedriver-linux64.zip
57+
rm chromedriver-linux64.zip
5758
5859
- name: Test
5960
run: |

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ and **Merged pull requests**. Critical items to know are:
 Referenced versions in headers are tagged on Github, in parentheses are for pypi.
 
 ## [vxx](https://github.com/urlstechie/urlschecker-python/tree/master) (master)
+- allow variable to skip checking certificates (0.0.35)
 - switch back to pypi release of fake-useragent (0.0.34)
 - preparing to install from git for fake-useragent (0.0.33)
 - serial option for debugging (0.0.32)

Dockerfile

Lines changed: 3 additions & 3 deletions
@@ -26,9 +26,9 @@ RUN /bin/bash -c "source activate urlchecker && \
     pip install --upgrade certifi && \
     pip install .[all]"
 # Download chrome driver for selenium
-RUN /bin/bash -c "wget https://chromedriver.storage.googleapis.com/107.0.5304.18/chromedriver_linux64.zip && \
-    unzip chromedriver_linux64.zip && \
-    rm chromedriver_linux64.zip"
+RUN /bin/bash -c "wget https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/121.0.6167.85/linux64/chromedriver-linux64.zip && \
+    unzip -o chromedriver-linux64.zip && \
+    rm chromedriver-linux64.zip"
 RUN echo "source activate urlchecker" > ~/.bashrc
 ENV PATH /code:/opt/conda/envs/urlchecker/bin:${PATH}
 ENTRYPOINT ["urlchecker"]

README.md

Lines changed: 20 additions & 27 deletions
@@ -56,49 +56,42 @@ for files. In this case, you can use urlchecker check:
 
 ```bash
 $ urlchecker check --help
-usage: urlchecker check [-h] [-b BRANCH] [--subfolder SUBFOLDER] [--cleanup]
-                        [--force-pass] [--no-print] [--file-types FILE_TYPES]
-                        [--files FILES] [--exclude-urls EXCLUDE_URLS]
-                        [--exclude-patterns EXCLUDE_PATTERNS]
-                        [--exclude-files EXCLUDE_FILES] [--save SAVE]
-                        [--retry-count RETRY_COUNT] [--timeout TIMEOUT]
+```
+```console
+usage: urlchecker check [-h] [-b BRANCH] [--subfolder SUBFOLDER] [--cleanup] [--serial] [--no-check-certs]
+                        [--force-pass] [--no-print] [--verbose] [--file-types FILE_TYPES] [--files FILES]
+                        [--exclude-urls EXCLUDE_URLS] [--exclude-patterns EXCLUDE_PATTERNS]
+                        [--exclude-files EXCLUDE_FILES] [--save SAVE] [--retry-count RETRY_COUNT] [--timeout TIMEOUT]
                         path
 
 positional arguments:
   path                  the local path or GitHub repository to clone and check
 
-optional arguments:
+options:
   -h, --help            show this help message and exit
   -b BRANCH, --branch BRANCH
-                        if cloning, specify a branch to use (defaults to
-                        master)
+                        if cloning, specify a branch to use (defaults to main)
   --subfolder SUBFOLDER
-                        relative subfolder path within path (if not specified,
-                        we use root)
-  --cleanup             remove root folder after checking (defaults to False,
-                        no cleaup)
-  --force-pass          force successful pass (return code 0) regardless of
-                        result
-  --no-print            Skip printing results to the screen (defaults to
-                        printing to console).
+                        relative subfolder path within path (if not specified, we use root)
+  --cleanup             remove root folder after checking (defaults to False, no cleaup)
+  --serial              run checks in serial (no multiprocess)
+  --no-check-certs      Allow urls to validate that fail certificate checks
+  --force-pass          force successful pass (return code 0) regardless of result
+  --no-print            Skip printing results to the screen (defaults to printing to console).
+  --verbose             Print file names for failed urls in addition to the urls.
   --file-types FILE_TYPES
-                        comma separated list of file extensions to check
-                        (defaults to .md,.py)
-  --files FILES         comma separated list of exact files or patterns to
-                        check.
+                        comma separated list of file extensions to check (defaults to .md,.py)
+  --files FILES         comma separated list of exact files or patterns to check.
   --exclude-urls EXCLUDE_URLS
                         comma separated links to exclude (no spaces)
   --exclude-patterns EXCLUDE_PATTERNS
-                        comma separated list of patterns to exclude (no
-                        spaces)
+                        comma separated list of patterns to exclude (no spaces)
   --exclude-files EXCLUDE_FILES
-                        comma separated list of files and patterns to exclude
-                        (no spaces)
+                        comma separated list of files and patterns to exclude (no spaces)
   --save SAVE           Path to a csv file to save results to.
   --retry-count RETRY_COUNT
                         retry count upon failure (defaults to 2, one retry).
-  --timeout TIMEOUT     timeout (seconds) to provide to the requests library
-                        (defaults to 5)
+  --timeout TIMEOUT     timeout (seconds) to provide to the requests library (defaults to 5)
 ```
 
 You have a lot of flexibility to define patterns of urls or files to skip,

setup.py

Lines changed: 0 additions & 2 deletions
@@ -67,7 +67,6 @@ def get_reqs(lookup=None, key="INSTALL_REQUIRES"):
     INSTALL_REQUIRES = get_reqs(lookup)
     TESTS_REQUIRES = get_reqs(lookup, "TESTS_REQUIRES")
     INSTALL_REQUIRES_ALL = get_reqs(lookup, "INSTALL_REQUIRES_ALL")
-    SELENIUM_REQUIRES = get_reqs(lookup, "SELENIUM_REQUIRES")
 
     setup(
         name=NAME,
@@ -90,7 +89,6 @@ def get_reqs(lookup=None, key="INSTALL_REQUIRES"):
         tests_require=TESTS_REQUIRES,
         extras_require={
             "all": INSTALL_REQUIRES_ALL,
-            "selenium": SELENIUM_REQUIRES,
         },
         classifiers=[
             "Intended Audience :: Developers",

urlchecker/client/__init__.py

Lines changed: 8 additions & 1 deletion
@@ -2,7 +2,7 @@
 
 """
 
-Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
+Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
 
 This source code is licensed under the terms of the MIT license.
 For a copy, see <https://opensource.org/licenses/MIT>.
@@ -76,6 +76,13 @@ def get_parser():
         default=False,
         action="store_true",
     )
+    check.add_argument(
+        "--no-check-certs",
+        dest="no_check_certs",
+        help="Allow urls to validate that fail certificate checks",
+        default=False,
+        action="store_true",
+    )
     check.add_argument(
         "--force-pass",
         help="force successful pass (return code 0) regardless of result",
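For readers who want to try the new flag in isolation, the following is a minimal sketch of how a `store_true` option like `--no-check-certs` behaves under `argparse`. The `build_parser` helper and the stripped-down subcommand layout are assumptions made for illustration; only the `add_argument` call mirrors the hunk above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stripped-down parser (hypothetical helper, not the project's get_parser)."""
    parser = argparse.ArgumentParser(prog="urlchecker")
    subparsers = parser.add_subparsers(dest="command")
    check = subparsers.add_parser("check")
    check.add_argument("path")
    # Same arguments as the added hunk: the flag is off unless given explicitly.
    check.add_argument(
        "--no-check-certs",
        dest="no_check_certs",
        help="Allow urls to validate that fail certificate checks",
        default=False,
        action="store_true",
    )
    return parser


parser = build_parser()
default_args = parser.parse_args(["check", "."])
relaxed_args = parser.parse_args(["check", ".", "--no-check-certs"])
print(default_args.no_check_certs, relaxed_args.no_check_certs)
```

Because the default is `False`, existing invocations keep verifying certificates; the relaxed behavior is strictly opt-in.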

urlchecker/client/check.py

Lines changed: 3 additions & 2 deletions
@@ -1,7 +1,6 @@
 """
 
-client/github.py: entrypoint for interaction with a GitHub repostiory.
-Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
+Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
 
 This source code is licensed under the terms of the MIT license.
 For a copy, see <https://opensource.org/licenses/MIT>.
@@ -73,6 +72,7 @@ def main(args, extra):
     print(" urls excluded: %s" % exclude_urls)
     print(" url patterns excluded: %s" % exclude_patterns)
     print(" file patterns excluded: %s" % exclude_files)
+    print(" no check certs: %s" % args.no_check_certs)
     print(" force pass: %s" % args.force_pass)
     print(" retry count: %s" % args.retry_count)
     print(" save: %s" % args.save)
@@ -90,6 +90,7 @@ def main(args, extra):
     check_results = checker.run(
         exclude_urls=exclude_urls,
         exclude_patterns=exclude_patterns,
+        no_check_certs=args.no_check_certs,
         retry_count=args.retry_count,
         timeout=args.timeout,
     )
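The hunks shown here only pass `no_check_certs` along; how the request layer consumes it is not visible in this part of the diff. As a rough sketch, assuming the flag ends up controlling the `verify` argument of the `requests` library (which the README mentions for `--timeout`), the translation could look like the code below. The `build_request_kwargs` helper is hypothetical and not part of the project.

```python
from typing import Any, Dict


def build_request_kwargs(timeout: int = 5, no_check_certs: bool = False) -> Dict[str, Any]:
    """Translate checker options into keyword arguments for requests.get or requests.head.

    Hypothetical helper: the real mapping lives in code this diff does not show.
    """
    return {
        "timeout": timeout,
        # requests verifies TLS certificates by default; verify=False disables that.
        "verify": not no_check_certs,
    }


strict = build_request_kwargs()
relaxed = build_request_kwargs(timeout=10, no_check_certs=True)
print(strict, relaxed)
```

Disabling verification makes URLs with expired or self-signed certificates pass, which is why the commit title says "use at own risk".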

urlchecker/core/check.py

Lines changed: 24 additions & 16 deletions
@@ -1,6 +1,6 @@
 """
 
-Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
+Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
 
 This source code is licensed under the terms of the MIT license.
 For a copy, see <https://opensource.org/licenses/MIT>.
@@ -12,7 +12,7 @@
 import random
 import re
 import sys
-from typing import Dict, List
+from typing import Optional, Dict, List
 
 from urlchecker.core import fileproc
 from urlchecker.core.urlproc import UrlCheckResult
@@ -27,11 +27,11 @@ class UrlChecker:
 
     def __init__(
         self,
-        path: str = None,
-        file_types: List[str] = None,
-        exclude_files: List[str] = None,
+        path: Optional[str] = None,
+        file_types: Optional[List[str]] = None,
+        exclude_files: Optional[List[str]] = None,
         print_all: bool = True,
-        include_patterns: List[str] = None,
+        include_patterns: Optional[List[str]] = None,
         serial: bool = False,
     ):
         """
@@ -73,12 +73,16 @@ def __init__(
             if not os.path.exists(path):
                 sys.exit("%s does not exist." % path)
 
-            self.file_paths = fileproc.get_file_paths(
-                base_path=path,
-                file_types=self.file_types,
-                exclude_files=self.exclude_files,
-                include_patterns=self.include_patterns,
-            )
+            # Case 1: a single file
+            if os.path.isfile(path):
+                self.file_paths = [os.path.abspath(path)]
+            else:
+                self.file_paths = fileproc.get_file_paths(
+                    base_path=path,
+                    file_types=self.file_types,
+                    exclude_files=self.exclude_files,
+                    include_patterns=self.include_patterns,
+                )
 
     def __str__(self) -> str:
         if self.path:
@@ -92,7 +96,7 @@ def save_results(
         self,
         file_path: str,
         sep: str = ",",
-        header: List[str] = None,
+        header: Optional[List[str]] = None,
         relative_paths: bool = True,
     ) -> str:
         """
@@ -161,11 +165,12 @@ def save_results(
 
     def run(
         self,
-        file_paths: List[str] = None,
-        exclude_patterns: List[str] = None,
-        exclude_urls: List[str] = None,
+        file_paths: Optional[List[str]] = None,
+        exclude_patterns: Optional[List[str]] = None,
+        exclude_urls: Optional[List[str]] = None,
         retry_count: int = 2,
         timeout: int = 5,
+        no_check_certs: bool = False,
     ) -> Dict[str, set]:
         """
         Run the url checker given a path, excluded patterns for urls/files
@@ -179,6 +184,7 @@ def run(
             - exclude_patterns (list) : list of excluded patterns for urls.
             - retry_count (int) : number of retries on failed first check. Default=2.
             - timeout (int) : timeout to use when waiting on check feedback. Default=5.
+            - no_check_certs (bool) : do not check certificates
 
         Returns:
             dictionary with each of list of urls for "failed" and "passed."
@@ -210,6 +216,7 @@ def run(
             kwargs = {
                 "file_name": file_name,
                 "exclude_patterns": exclude_patterns,
+                "no_check_certs": no_check_certs,
                 "exclude_urls": exclude_urls,
                 "print_all": self.print_all,
                 "retry_count": retry_count,
@@ -257,6 +264,7 @@ def check_task(*args, **kwargs):
         retry_count=kwargs.get("retry_count", 2),
         timeout=kwargs.get("timeout", 5),
         port=kwargs.get("port"),
+        no_check_certs=kwargs.get("no_check_certs"),
     )
 
     # Update flattened results
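The single-file branch added to `UrlChecker.__init__` can be illustrated on its own. The sketch below is not the project's code: `resolve_file_paths` is a hypothetical helper, and the `os.walk` loop is only a stand-in for `fileproc.get_file_paths`, whose implementation is not shown in this diff.

```python
import os
import tempfile
from typing import List, Optional


def resolve_file_paths(path: str, file_types: Optional[List[str]] = None) -> List[str]:
    """Return the files to check for a path that is a single file or a directory."""
    file_types = file_types or [".md", ".py"]
    # Case 1: a single file is used as-is, with no extension filtering.
    if os.path.isfile(path):
        return [os.path.abspath(path)]
    # Case 2: a directory is walked (stand-in for fileproc.get_file_paths).
    found = []
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            if any(name.endswith(ext) for ext in file_types):
                found.append(os.path.abspath(os.path.join(root, name)))
    return found


with tempfile.TemporaryDirectory() as tmp:
    readme = os.path.join(tmp, "README.md")
    notes = os.path.join(tmp, "notes.txt")
    for filename in (readme, notes):
        with open(filename, "w") as handle:
            handle.write("https://example.com\n")
    single = resolve_file_paths(notes)
    walked = resolve_file_paths(tmp)

print(single, walked)
```

One consequence worth noting: when a single file is targeted, the file type and exclusion filters are bypassed, so a `.txt` file is checked even though it would be skipped during a directory walk.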

urlchecker/core/exclude.py

Lines changed: 5 additions & 3 deletions
@@ -1,17 +1,19 @@
 """
 
-Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
+Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
 
 This source code is licensed under the terms of the MIT license.
 For a copy, see <https://opensource.org/licenses/MIT>.
 
 """
 
-from typing import List
+from typing import Optional, List
 
 
 def excluded(
-    url: str, exclude_urls: List[str] = None, exclude_patterns: List[str] = None
+    url: str,
+    exclude_urls: Optional[List[str]] = None,
+    exclude_patterns: Optional[List[str]] = None,
 ) -> bool:
     """
     Check if link is in the excluded URLs or patterns to ignore.
