- Create Python virtual environment & move into the directory
 
python -m venv nlp_webscraper
cd nlp_webscraper
- Download clientextraction, companycrawler & requirements.txt into the nlp_webscraper folder
- Activate the virtual environment inside the nlp_webscraper folder by running activate.bat
 
scripts\activate.bat
- Install dependencies
 
pip install -r requirements.txt
- Run example.py
 
python example.py
clients.csv will be output, containing the extracted client data
Python 3.9
absl-py==0.15.0
aiofiles==0.8.0
anyio==3.6.1
asttokens==2.0.5
astunparse==1.6.3
async-generator==1.10
attrs==21.4.0
backcall==0.2.0
cachetools==5.2.0
certifi==2022.6.15
cffi==1.15.1
chardet==3.0.4
charset-normalizer==2.0.12
ci-info==0.2.0
clang==5.0
click==8.1.3
colorama==0.4.5
configobj==5.0.6
configparser==5.2.0
cryptography==37.0.4
Cython==0.29.28
decorator==5.1.1
etelemetry==0.3.0
executing==0.8.3
filelock==3.7.1
flatbuffers==1.12
frontend==0.0.3
future==0.18.2
gast==0.4.0
gensim==4.2.0
google-auth==2.9.1
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googletrans==4.0.0rc1
grpcio==1.47.0
h11==0.9.0
h2==3.2.0
h5py==3.1.0
hpack==3.0.0
hstspreload==2022.7.10
httpcore==0.9.1
httplib2==0.20.4
httpx==0.13.3
hyperframe==5.2.0
idna==2.10
importlib-metadata==4.12.0
ipython==8.4.0
isodate==0.6.1
itsdangerous==2.1.2
jedi==0.18.1
Jinja2==3.1.2
joblib==1.1.0
jsonpickle==2.2.0
keras==2.8.0
Keras-Preprocessing==1.1.2
langdetect==1.0.9
libclang==14.0.1
looseversion==1.0.1
lxml==4.9.1
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib-inline==0.1.3
networkx==2.8
nibabel==4.0.1
nipype==1.8.3
nltk==3.6.7
numpy==1.23.1
oauthlib==3.2.0
opencv-python==4.5.3.56
opt-einsum==3.3.0
outcome==1.2.0
packaging==21.3
pandas==1.3.1
parso==0.8.3
pathlib==1.0.1
pickleshare==0.7.5
Pillow==9.2.0
prompt-toolkit==3.0.30
protobuf==3.19.4
prov==2.0.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybrowsers==0.5.1
pycparser==2.21
pydot==1.4.2
Pygments==2.12.0
PyMuPDF==1.20.1
pyOpenSSL==22.0.0
pyparsing==3.0.9
PyPDF2==2.7.0
PySocks==1.7.1
pytesseract==0.3.9
python-dateutil==2.8.2
python-dotenv==0.20.0
pytz==2022.1
pyvis==0.2.0
pywin32==304
pyxnat==1.4
rdflib==6.1.1
regex==2022.7.9
requests==2.27.1
requests-oauthlib==1.3.1
rfc3986==1.5.0
rsa==4.9
scipy==1.8.1
selenium==4.3.0
simplejson==3.17.6
six==1.15.0
smart-open==6.0.0
sniffio==1.2.0
sortedcontainers==2.4.0
stack-data==0.3.0
starlette==0.20.4
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.8.0
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
tqdm==4.64.0
traitlets==5.3.0
traits==6.3.2
trio==0.21.0
trio-websocket==0.9.2
typing-extensions==4.3.0
urllib3==1.26.10
uvicorn==0.18.2
wcwidth==0.2.5
webdriver-manager==3.8.0
Werkzeug==2.1.2
wrapt==1.12.1
wsproto==1.1.0
zipp==3.8.1
- Download Tesseract installer
 - Run installer to install Tesseract-OCR executable
 - Modify "pytesseract-path" in companycrawler/json/functions-config.json so that it points to the installed Tesseract executable
Crawl single company
from companycrawler.crawler import CompanyCrawler
    
CC = CompanyCrawler(save_webtree=False, save_network_graph=True)
CC.crawl_company(
    root='https://www.intermodalics.eu/', 
    company='intermodalics', 
    save_dir='saved_data',
    max_depth=2
)

Crawl list of companies in Excel
from companycrawler.crawler import CompanyCrawler
import pandas as pd
# define variables
save_dir = 'saved_data'
max_depth = 2
CC = CompanyCrawler(save_webtree=False, save_network_graph=True)
# load & clean excel data from 'companies-software.xlsx'
df = pd.read_excel('companies-software.xlsx')
df.dropna(axis=0, inplace=True, subset=['actual_url'])
df.reset_index(drop=True, inplace=True)
# enumerate rows in excel
for i, row in df.iterrows():
    CC.crawl_company(
        root=row['actual_url'], 
        company=row['Company Name'], 
        save_dir=save_dir,
        max_depth=max_depth
    )

ReverseSearch.get_driver()
- Sets Selenium webdriver options & returns the webdriver object
 
ReverseSearch.start()
- Start Selenium webdriver
 
ReverseSearch.filter_search_value(str: search_value)
- Return False if search_value contains an invalid word like "dictionary", "horizontal"
ReverseSearch.filter_header(str: header)
- Return False if title of 1st search result contains invalid word (usually company names like "LinkedIn", "FontAwesome")
 
ReverseSearch.rear_strip(String: s)
- Removes non-alphanumeric characters from the rear, like "..."
 
ReverseSearch.get_num_results(String: s)
- Returns number of reverse search results based on text of DOM element #result-stats
 
ReverseSearch.clean_str(String: s)
- Lowercase
 - Replaces escape character "%20" commonly found in URLs
 - Remove non-alphanumeric & underscore characters
 - Removes numerals & floats
 
ReverseSearch.search(String: url, String: company="")
- Returns {
'url': //image url
'url_tail': //cleaned image filename
'header': //title of first search result
'body': //body text of 1st search result
'search_value': //value in search box as interpreted by Google
'results': //number of search results
} 
ReverseSearch.random_wait(Float: lower=0.5, Float: upper=2)
- Wait for a random duration between lower & upper
ReverseSearch.reset()
- Close webdriver
 
RS = ReverseSearch()
RS.start()
results = RS.search('https://images.squarespace-cdn.com/content/v1/5ab393009f87708addd204e2/1523980229490-KB8R24FUGXC8X6DDZ7EC/colruyt_groupB.png?format=300w', 'Intermodalics')
print(results)
RS.reset()

WebTree(Boolean: save=False)
- Save file as <gen_link()>.json if save is True
 
WebTree.start()
- Start Selenium webdriver
 
WebTree.store(String: url)
- Store URL in list to crawl all at once
 
WebTree.run_all()
- Generator that yields url, get_cluster(url) for each stored URL
 
WebTree.is_src(String: src)
- Returns whether src is an image
 
WebTree.get_src(Object: elem)
- Return "<image_url> <image_alt>"
WebTree.get_clusters(String: url)
- Detect & return image clusters (list of list of image URLs) from a web page
 
WebTree.build_tree(String: url)
- Map out web tree of a web page
 - If self.save, save web tree as JSON file
 
WebTree.reset()
- Close webdriver
 
Map out the web tree of https://www.intermodalics.eu/
WT = WebTree(save=True)
WT.start()
clusters = WT.get_clusters('https://www.intermodalics.eu/')
print(clusters)
WT.reset()

Map out the web trees of https://www.intermodalics.eu/ and https://www.intermodalics.eu/visual-positioning-slam-navigation
WT = WebTree(save=True)
WT.start()
WT.store('https://www.intermodalics.eu/')
WT.store('https://www.intermodalics.eu/visual-positioning-slam-navigation')
generator = WT.run_all()
for page, clusters in generator:
    print('Image clusters of', page)
    for image_url in clusters:
        print(image_url)
        
WT.reset()

LogoDetector(String: saved_model)
saved_model: path to saved CNN model relative to where this object is being called from
LogoDetector.prepare_img(String: src)
- Download image
 - Convert image to RGB
 - Resize according to self.dims (100,100,3)
 - Return image data, download path (so it can be deleted after the detection model runs)
 
LogoDetector.predict(List: srcs, Boolean: verbose)
- Runs CNN logo detection model on each image in srcs
 - Returns 1D list of probabilities of each image being a logo
 - Print scores if verbose
 
LD = LogoDetector()
predictions = LD.predict([
    'https://images.squarespace-cdn.com/content/v1/5ab393009f87708addd204e2/1523980229490-KB8R24FUGXC8X6DDZ7EC/colruyt_groupB.png?format=300w',
    'https://images.squarespace-cdn.com/content/v1/5ab393009f87708addd204e2/1522415419883-R8K5KQVMGX48TPWP58X0/b49602d4-9b0a-24f3-8260-933b31b8d160_COM_6calibrations_2018-01-24-13-55-00+-+dev+room.png?format=500w'
])
print(predictions) # [0.8967107, 0.07239765]

GoogleTranslate.get_chunk()
- Return chunk of string of length self.max_char
 
GoogleTranslate.load_lines(String: text)
- Store text as sentences in self.lines
GoogleTranslate.translate(String: text)
- Detect language. If not EN, translate chunk by chunk using the googletrans API
 
GT = GoogleTranslate()
text_file = 'scraped_page.txt' # placeholder: path to any UTF-8 .txt file to translate
f = open(text_file, 'r', encoding='utf-8')
text = f.read()
f.close()
translation = GT.translate(text)
print(translation)

plot_network(String: filename, Object: edges)
filename: Save network graph as filename.html & edge list as filename.csv
edges: <target>:<source> key pairs where <target> = sublink found on <source> page
# target:source
edges = {
    'https://www.intermodalics.eu/what-we-do': 'https://www.intermodalics.eu/',
    'https://www.intermodalics.eu/join-us': 'https://www.intermodalics.eu/',
    'https://www.intermodalics.eu/team': 'https://www.intermodalics.eu/',
    'https://www.intermodalics.eu/senior-software-developer-robotics': 'https://www.intermodalics.eu/join-us'
}
plot_network('my_network_graph', edges)

PDFReader.add(String: url)
- Adds PDF URL to self.pdfs
PDFReader.cleanText(String: url)
- Adds PDF URL to self.pdfs
 - Cleans text:
   - Lowercase
   - Remove non-alphanumeric & underscore chars
   - Remove consecutive newlines & lines with only 1 character
 
PDFReader.extract_text(String: path)
- Converts PDF file at path to text
 - For every page, read all text + append image_to_text at end
 - Images should be pre-downloaded in self.pdf_dir
 - Once completed, delete PDF
 
PDFReader.save_imgs(String: path)
- Downloads all images from PDF file at path into self.pdf_dir
PDFReader.read_all_pdfs()
- For each url in self.pdfs, downloads PDF at url and saves PDF images in self.pdf_dir
 - Generator. Yields: {
'url': //PDF url
'text': //PDF text (including image_to_text)
} 
PDFReader.reset()
- Empties self.pdfs
PR = PDFReader()
PR.add('https://www.memoori.com/wp-content/uploads/2017/10/The-Future-Workplace-2017-Synopsis.pdf')
generator = PR.read_all_pdfs()
for obj in generator:
    print(obj['url'])
    print(obj['text'])
PR.reset()

CompanyCrawler(String: dictionary="companycrawler/json/dictionary.json", Boolean: save_webtree=False, Boolean: save_network_graph=True)
dictionary: file path of dictionary.json, a keyword store, relative to where this object is being called from
save_webtree: save webtree data as JSON if True
save_network_graph: save network graph as <company_name>.HTML & <company_name>.csv if True
CompanyCrawler.get_driver()
- Start Selenium webdriver
 
CompanyCrawler.check_link(String: url, String: base)
url: an <a> element's href
base: URL of the web page from which the <a> is taken
CompanyCrawler.check_img(Integer: depth, String: url)
- Check if image is from root URL or in a valid web segment like /customers
 - Return True only if valid because this is for client logo detection
 
CompanyCrawler.get_hrefs()
- Return the href attribute of all <a> elements in the driver's current webpage
 
CompanyCrawler.get_logos(String: company)
- Get image clusters from WebTree's stored URLs for a particular company
 - Run logo detection model & identify image clusters with average logo probability > 0.5
 - Reverse search on filtered images and append results to self.clients
CompanyCrawler.process_pdfs()
- Run read_all_pdfs() on stored PDF URLs in PDFReader() to extract all PDF data
CompanyCrawler.crawl_site(String: url, Integer: depth, Boolean: expand)
url: URL to crawl
depth: Current web depth from root
expand: Whether max crawling depth has been reached. If not reached, continue adding sublinks to self.sites to crawl
- Crawls url to:
  - add itself to self.pdf_reader if it is a PDF
  - extract HTML content (translate if applicable)
  - store images in self.web_tree if they pass self.check_link()
  - add sublinks to self.sites & self.edges if expand=True
 
CompanyCrawler.crawl_company(String: root, String: company, String: save_dir, Integer: max_depth)
root: Base URL
company: Name of company being crawled
save_dir: Directory path to save all crawled data at
max_depth: Max crawling depth from root
- Process:
  - Clear cache
  - Collate & crawl all sublinks up to max depth. In the process:
    - Add potential client logos to self.web_tree
    - Add PDFs to self.pdf_reader
    - Save HTML data of all crawled pages as .txt files
  - Run self.get_logos() to get client data:
    - WebTree.run_all() to build web trees, solve image clusters
    - Pass image clusters to self.logo_detector to filter clusters with average logo probability > 0.5
    - Conduct reverse search using self.reverse_search.search() to acquire client data from image URLs
 
 
CC = CompanyCrawler(dictionary='/json/dictionary.json') # Adjust filepath depending on relative location of parent process
CC.crawl_company(root='http://aisle411.com/', company='Aisle411', save_dir='../', max_depth=2)

img_to_text(String: path)
- Converts the image at local path to text using PyTesseract
gen_path(String: ext="")
- Takes in extension ext (e.g., ".jpg") and outputs a random vacant filename of type ext
download_url(String: url, String: save_path)
- Download file at url to save_path (PDFs, images, etc.)
 
find_ext(String: path)
- Returns extension if file at path is an image
 
is_pdf(String: url)
- Returns whether the file at url is a PDF
url_rstrip(String: s)
- Deletes trailing '/' and '#' from s
print_header(String: header)
- Displays header with styling
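
The snippet below is a minimal usage sketch of the helper functions documented above. The import path is an assumption (adjust it to wherever these helpers live inside the companycrawler package), and the URL is a placeholder:

```python
# Assumed import path - adjust to the actual module that defines these helpers
from companycrawler.functions import download_url, gen_path, img_to_text, is_pdf, url_rstrip

url = 'https://www.intermodalics.eu/some-logo.png'  # placeholder image URL
if not is_pdf(url):
    save_path = gen_path('.png')   # random vacant filename ending in .png
    download_url(url, save_path)   # download the image locally
    print(img_to_text(save_path))  # OCR the downloaded image with PyTesseract

print(url_rstrip('https://www.intermodalics.eu/'))  # trailing '/' removed
```
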
from clientextraction.json_extraction import clients_from_json, print_clients
import pandas as pd # for saving client data to csv
import os
# set variables
data_path = 'saved_data' # directory containing all scraped client data
companies = os.listdir(data_path) # list of companies to extract clients from. Requires existing <data_path>/<company>/clients.json
client_output_path = 'clients.csv' # output file
results = {
    'url': [],
    'page': [],
    'alt': [],
    'url_tail': [],
    'common': []
}
# for each company data folder found in data_path, 
# extract client data from each company's clients.json
for company in companies:
    clients = clients_from_json(os.path.join(data_path, company, 'clients.json'), company)
    results['url'].extend(clients['url'])
    results['page'].extend(clients['page'])
    results['alt'].extend(clients['alt'])
    results['url_tail'].extend(clients['url_tail'])
    results['common'].extend(clients['common'])
    #print_clients(clients)
# save client data to csv at client_output_path
client_df = pd.DataFrame(results)
client_df.to_csv(client_output_path, index=False, encoding='utf-8-sig')

clean(String: s)
- Lowercase
 - Remove chars that are (non-alphanumeric && not spaces && not periods) OR (underscores) // r'[^.\w\s]|_'
 - Remove excess spaces
 
form_sentences(String: s, Integer: min_tokens=3, Integer: min_token_len=1)
- Split s by \n
 - Remove sentences where number of words < min_tokens
 - Remove words where length of word < min_token_len
find_ext(String: path)
- Return extension if file at path is an image, else ""
get_url_tail(String: s)
- Return cleaned, tail segment of a URL
 - Mainly for getting image filename from a URL
 
exclude_words(String: words, List: exclude_list)
- Remove words from a list that are found inside exclude_list
 
print_header(String: header)
- Displays header with styling
 
print_clients(Object: results)
- Displays list of clients with styling
 
clients_from_json(String: file, String: company)
- Extracts clients from the client data at file, based on:
 - Reverse search results
 - Reverse search value
 - Alt text
 - Image filename (or URL tail)
 - Frequency list of keywords (including bigrams & trigrams)
 
 
get_orgs(String: file)
file: path to .txt file to carry out NLP extraction on
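
Below is a short sketch of how the text-cleaning helpers above could be used on their own. The import path is an assumption based on the earlier example (these helpers may live in a different module inside clientextraction), and the sample strings are made up:

```python
# Assumed import path - adjust if these helpers are defined elsewhere in clientextraction
from clientextraction.json_extraction import clean, form_sentences, get_url_tail

raw = "Our Customers!\nColruyt Group\nRead more about our partners..."
print(clean(raw))                         # lowercased, punctuation/underscores stripped, spaces tidied
print(form_sentences(raw, min_tokens=3))  # drops lines with fewer than 3 words
print(get_url_tail('https://example.com/img/colruyt_groupB.png?format=300w'))  # cleaned image filename
```
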
- Selenium framework
 - [get_sublinks.py] extracts all sublinks up to a specified depth from the root node
 - [Network Graphs/*.html] Plots the sublinks in a network graph (download Network Graphs/*.html and run it on localhost)
 - [Edgelist/*.csv] Generates csv with all graph edges for tracking of sublinks
 - [company_website_searcher] Finds company website based on company name. Requires manual checking though
 - [Companies/companies-sensor.xlsx] - actual company websites for software
 - [Companies/companies-software.xlsx] - actual company websites for sensors (missing for Paracosm)
 
- Added functions to cut down on the number of similar sites visited with the same content by comparing the md5 hash of a self-generated html-id <length of DOM><first 5 char><middle 9 char><last 5 char> for quicker hashing (see the sketch below)
 - Translates websites which are in other languages to English after scraping the data
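
A minimal sketch of that fingerprinting idea is shown below. The exact slicing and how it hooks into the crawler are assumptions; only the <length of DOM><first 5 char><middle 9 char><last 5 char> recipe comes from the notes above:

```python
import hashlib

def page_fingerprint(dom: str) -> str:
    # Build the html-id described above, then md5-hash it so
    # near-duplicate pages can be compared cheaply.
    mid = len(dom) // 2
    html_id = f"{len(dom)}{dom[:5]}{dom[mid - 4:mid + 5]}{dom[-5:]}"
    return hashlib.md5(html_id.encode('utf-8')).hexdigest()

seen = set()
pages = [
    '<html><body>page one</body></html>',
    '<html><body>page one</body></html>',  # duplicate content -> same fingerprint, skipped
    '<html><body>page two</body></html>',
]
for dom in pages:
    fp = page_fingerprint(dom)
    if fp in seen:
        continue  # skip already-visited content
    seen.add(fp)
print(len(seen))  # 2
```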
 
- [pdf_reader.py] Reads PDF text + extract text from PDF images using Tesseract OCR
 - [reverse_search.py] Exploring automated Google reverse image search on brand images to identify customers which are represented in image form
 
- Revamped reverse search algorithm
- Filters results based on number of reverse search hits
 - Filters out false positives
 
 - Improved entity detection system
- 3 Tiers of detection
- Tier 1: Alt text (img meta data) > Header (reverse search) > Body (reverse search)
 - Tier 2: Identify common words from image file name & reverse search value
 - Tier 3: Purely file name
 
 
 
- Web tree
- Map out web elements in a tree to identify image clusters with a breadth-first approach
 - Motivation: Images, especially brand logos, come in clusters. By pruning irrelevant image clusters, we can sharply improve our client extraction accuracy
 - Used in conjunction with our CNN logo detection models
 
 - Logo detection
- Filters out irrelevant image clusters so we can reduce false positives during client extraction
 - Convolutional Neural Network model
 - ~80% accuracy
 - Trained on images scraped from the given 120 company websites
 - Evaluates the probability of each image cluster being the client logo cluster
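
The repository does not spell out the CNN itself, so the following is only a generic sketch of what a binary logo / not-logo classifier over the 100x100x3 inputs used by LogoDetector.prepare_img() could look like; it is not the actual architecture or weights trained on the 120 company websites:

```python
from tensorflow.keras import layers, models

# Illustrative binary classifier: image in, probability of "is a logo" out
model = models.Sequential([
    layers.Input(shape=(100, 100, 3)),
    layers.Conv2D(16, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```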
 
 
- Get surrounding text of keywords
 - Experimenting with Python NLP library gensim to filter text as well
 - Tested client extraction pipeline on all sensor companies
 
- Replaced deprecated Selenium code with updated versions
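
As a reference for this migration, Selenium 4.3 (the version pinned in requirements.txt) removed the old find_element_by_* helpers in favour of find_element/find_elements with By locators. A small before/after sketch (the URL and locator are placeholders, and a chromedriver is assumed to be available, e.g. via webdriver-manager):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver on PATH or managed by webdriver-manager
driver.get('https://www.intermodalics.eu/')

# Deprecated style, removed in Selenium 4.3:
# links = driver.find_elements_by_tag_name('a')

# Updated equivalent:
links = driver.find_elements(By.TAG_NAME, 'a')
hrefs = [link.get_attribute('href') for link in links]
print(len(hrefs))

driver.quit()
```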
 


