
Web Scraping Course - SimpliLearn

Introduction to the code and Beautiful Soup in intro.ipynb

<!DOCTYPE html>
<html>
<body>
<div class="organizationlist">
  <ul id="HR">
    <li class="HRmanager">
      <div class="name">Jack</div>
      <div class="ID">101</div>
    </li>
    <li class="ITmanager">
      <div class="name">Daren</div>
      <div class="ID">65</div>
    </li>
  </ul>
</div>
</body>
</html>

<div> ---> this tag's class attribute is what CSS styling (and our search filters) target


|-BeautifulSoup                          # Parent
|--html                                  # Direct child  |
|---body                                                 |
|----div: class=organizationlist                         |
|-----ul: id=HR                                          |  Descendants
|------li: class=HRmanager                               |
|-------div: class=name                                  |
|-------div: class=ID                                    |
|------li: class=ITmanager                               |
|-------div: class=name                                  |
|-------div: class=ID                                    V

Searching the Tree with Filters

Search filters let us extract specific information from the parsed document. They act as search criteria, matching elements in the HTML structure.

Types of Filters:

  • String: Simplest filter. Searches tags matching the given string.
  • List: Filter by matching any string in the list.
  • Regular Expressions: Use regex for pattern matching.
  • Function: Custom filter logic using user-defined functions.
  • Keyword Arguments: Use tag attributes as arguments (e.g., id, class_).
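
A minimal sketch demonstrating each filter type against the sample HTML from the top of these notes:

import re
from bs4 import BeautifulSoup

html_doc = ('<div class="organizationlist"><ul id="HR">'
            '<li class="HRmanager"><div class="name">Jack</div><div class="ID">101</div></li>'
            '<li class="ITmanager"><div class="name">Daren</div><div class="ID">65</div></li>'
            '</ul></div>')
soup = BeautifulSoup(html_doc, 'html.parser')

# String: tags whose name matches the string exactly
print(soup.find_all('li'))

# List: tags matching any name in the list
print(soup.find_all(['ul', 'li']))

# Regular expression: tag names starting with "d" (matches every <div>)
print(soup.find_all(re.compile('^d')))

# Function: any custom logic over the tag object
print(soup.find_all(lambda tag: tag.has_attr('class') and not tag.has_attr('id')))

# Keyword arguments: filter on attributes (class_ because "class" is reserved)
print(soup.find_all(class_='HRmanager'))
print(soup.find_all(id='HR'))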

Searching Methods:

find_all()

Searches and retrieves all matching tag descendants.

soup.find_all(name, attrs, recursive, string, limit, **kwargs)

find()

Returns the first matching result. Use when only one element is needed.
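
A short sketch of find_all()'s parameters and find(), again using the sample HTML (the limit and recursive values are illustrative):

from bs4 import BeautifulSoup

html_doc = ('<div class="organizationlist"><ul id="HR">'
            '<li class="HRmanager"><div class="name">Jack</div><div class="ID">101</div></li>'
            '<li class="ITmanager"><div class="name">Daren</div><div class="ID">65</div></li>'
            '</ul></div>')
soup = BeautifulSoup(html_doc, 'html.parser')

# attrs: match attributes via a dict; limit: cap the number of results
print(soup.find_all('div', attrs={'class': 'name'}))   # both name <div>s
print(soup.find_all('li', limit=1))                    # stop after the first <li>

# recursive=False: only direct children, not all descendants
print(soup.find('ul', id='HR').find_all('li', recursive=False))

# find() returns the first match (or None) instead of a list
print(soup.find('li', class_='ITmanager').find('div', class_='name').get_text())  # Daren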


select() (CSS selectors)

soup.select("div.class")
soup.select("ul#HR > li")

select_one()

Same as select() but returns only the first match.

get_text()

Used to extract inner text from a tag.

tag.get_text(strip=True)

attrs

To extract an attribute's value (dictionary-style access).

tag['href']
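
A quick sketch combining get_text() and attribute access on the sample HTML (there is no href in the sample, so the .get() call below shows the safe-access pattern instead):

from bs4 import BeautifulSoup

html_doc = ('<div class="organizationlist"><ul id="HR">'
            '<li class="HRmanager"><div class="name">Jack</div><div class="ID">101</div></li>'
            '<li class="ITmanager"><div class="name">Daren</div><div class="ID">65</div></li>'
            '</ul></div>')
soup = BeautifulSoup(html_doc, 'html.parser')

li = soup.find('li', class_='HRmanager')
print(li.get_text(strip=True))    # 'Jack101': all inner text, whitespace stripped
print(li['class'])                # ['HRmanager']: class is multi-valued, so a list
print(soup.find('ul')['id'])      # 'HR'
print(li.get('href'))             # None: .get() avoids a KeyError for missing attributes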

Tree Navigation:

  • .contents: A tag's children as a list.
  • .children: Iterator version of .contents.
  • .descendants: All children, recursively.
  • .parent: Immediate parent.
  • .parents: Generator of all parent elements.
  • .next_sibling / .previous_sibling: Navigate to adjacent nodes.
  • .next_element / .previous_element: Next element in the tree, even if not a sibling.
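
A minimal sketch of tree navigation on the sample HTML (the compact markup has no whitespace between tags, so every sibling here is a tag rather than a string):

from bs4 import BeautifulSoup

html_doc = ('<div class="organizationlist"><ul id="HR">'
            '<li class="HRmanager"><div class="name">Jack</div><div class="ID">101</div></li>'
            '<li class="ITmanager"><div class="name">Daren</div><div class="ID">65</div></li>'
            '</ul></div>')
soup = BeautifulSoup(html_doc, 'html.parser')

ul = soup.find('ul', id='HR')
print(ul.contents)                        # children as a list: the two <li> tags
print([c['class'] for c in ul.children])  # iterator over the same children
print(len(list(ul.descendants)))          # 10: every nested tag and string, recursively

hr = soup.find('li', class_='HRmanager')
print(hr.parent['id'])                    # 'HR': immediate parent
print([p.name for p in hr.parents])       # ['ul', 'div', '[document]']
print(hr.next_sibling['class'])           # ['ITmanager']: adjacent node
print(hr.next_element.name)               # 'div': next node in document order, not a sibling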



Modifying the Tree:

  • .append(): Add child to tag
  • .insert(): Insert element at given position
  • .clear(): Remove all contents
  • .extract(): Remove tag from tree and return it
  • .decompose(): Like .extract(), but destroys the tag instead of returning it
  • .replace_with(): Replace tag with another
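
A small sketch of tree modification on the sample HTML (the inserted tag and replacement text are made up for illustration):

from bs4 import BeautifulSoup

html_doc = ('<div class="organizationlist"><ul id="HR">'
            '<li class="HRmanager"><div class="name">Jack</div><div class="ID">101</div></li>'
            '<li class="ITmanager"><div class="name">Daren</div><div class="ID">65</div></li>'
            '</ul></div>')
soup = BeautifulSoup(html_doc, 'html.parser')
ul = soup.find('ul', id='HR')

# .append(): add a new child at the end of the <ul>
new_li = soup.new_tag('li')
new_li['class'] = 'Intern'
new_li.string = 'Sam'
ul.append(new_li)

# .insert(): place a node at a specific child position
ul.insert(0, soup.new_tag('li'))

# .replace_with(): swap a node for another node or string
soup.find('div', class_='name').replace_with('Jacqueline')

# .extract(): remove from the tree and return it for later use
removed = soup.find('li', class_='ITmanager').extract()

# .decompose(): remove AND destroy; returns nothing useful
soup.find('div', class_='ID').decompose()

# .clear(): empty a tag's contents but keep the tag itself
removed.clear()
print(soup)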

Handling Encodings & Parsers

  • soup = BeautifulSoup(html_doc, 'html.parser') → built into Python, no extra install
  • soup = BeautifulSoup(html_doc, 'lxml') → faster & more lenient (needs pip install lxml)

Beautiful Soup converts every document to Unicode, auto-detecting the input encoding when it is fed raw bytes.
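
A small sketch of that encoding detection (the byte string is made up for illustration):

from bs4 import BeautifulSoup

# Illustrative bytes in a non-UTF-8 encoding
raw = '<html><body><p>Café</p></body></html>'.encode('latin-1')

soup = BeautifulSoup(raw, 'html.parser')
print(soup.original_encoding)   # encoding detected from the raw bytes (e.g. a windows-1252 guess)
print(soup.p.get_text())        # 'Café': output is always Unicode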

Dealing with JSON & APIs:

Some modern sites return JSON instead of HTML.

  • Use requests.get(url).json()
  • Navigate like a normal dictionary
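
A minimal sketch, assuming a hypothetical endpoint that returns JSON (the URL and field names are made up):

import requests

# Hypothetical API endpoint; replace with a real one
url = 'https://api.example.com/cases?page=1'

response = requests.get(url, timeout=10)
response.raise_for_status()          # fail loudly on HTTP errors
data = response.json()               # parse the JSON body into Python objects

# Navigate like a normal dictionary/list (field names assumed for illustration)
for item in data.get('results', []):
    print(item.get('title'), item.get('id'))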

Handling Pagination

Loop through next page links or numbered pages programmatically.

# next_url starts as the first page's URL
while next_url:
    response = requests.get(next_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    next_link = soup.select_one("a.next")
    next_url = next_link['href'] if next_link else None   # stop when there is no next link

Selenium WebDriver (Dynamic Scraping)

For JS-rendered websites, forms, and automation — covered in selenium_scraper.ipynb

Installation:

pip install selenium webdriver-manager

Basic Setup:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
# webdriver-manager downloads and caches a chromedriver matching your Chrome version
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Open URL
driver.get("https://example.com")

Interaction:

element = driver.find_element(By.CLASS_NAME, "some-class")
element.click()
element.send_keys("text")

Waits:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the element exists in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "some_id"))
)

Extracting HTML with Selenium:

from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")

Handling Cookies & Headers

headers = {
  'User-Agent': 'Mozilla/5.0...'
}
response = requests.get(url, headers=headers)
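
The snippet above covers headers; for cookies, a requests.Session keeps them across requests automatically. A minimal sketch (URLs and the cookie value are illustrative):

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0...'})

# Cookies set by the server (e.g. after a login) persist on the session
session.get('https://example.com/login')
print(session.cookies.get_dict())        # inspect stored cookies

# Cookies can also be sent explicitly on a single request
requests.get('https://example.com', cookies={'sessionid': 'abc123'})  # value is made up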

Anti-Scraping & Bypass Techniques:

  • Rotate User-Agent
  • Add delay/random wait between requests
  • Use Proxies or Tor
  • Headless browsers
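
A minimal sketch of the first two techniques, rotating the User-Agent and adding random delays (the UA strings and URLs are illustrative):

import random
import time
import requests

# Illustrative pool of User-Agent strings; use real, current ones in practice
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (X11; Linux x86_64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]

urls = ['https://example.com/page1', 'https://example.com/page2']  # made-up URLs

for url in urls:
    headers = {'User-Agent': random.choice(user_agents)}   # rotate UA per request
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)
    time.sleep(random.uniform(2, 5))    # random wait between requests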

Projects & Experiments:

  • intro.ipynb: Basic Parsing & Tree Navigation ✅
  • lab1.ipynb: Tree Searching with Filters ✅
  • selenium_scraper.ipynb: JS Website Scraping & Automation 🔜
  • Project: Case Summarizer & Classifier using data from indiankanoon.org


Next: Create a scraper for Indian Kanoon using requests, BeautifulSoup, and Selenium (for dynamic content and login pages).


About

This repo contains my web scraping practice, done for the data extraction part of my BTP project.
