This project demonstrates the design and implementation of a basic Web Search Engine, showcasing key components such as web crawling, indexing, and query processing.
The search engine operates by crawling websites, parsing their content, and indexing relevant information for efficient retrieval. It respects the robots.txt
protocol and extracts metadata such as:
- Page titles
- HTML content
- JavaScript and CSS
- Headings
- Meta tags
The engine is built from the following components:
- Web Crawler: Navigates through websites, obeying robots.txt directives.
- Indexer: Processes and stores page content for fast lookup.
- Query Processor: Handles user queries and retrieves relevant results.
- Metadata Extraction: Captures structured data from HTML for improved relevance.
The engine starts by crawling web pages:
- Downloads HTML content from URLs.
- Follows links to discover new pages.
- Respects robots.txt rules to avoid restricted areas.
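To make the robots.txt step concrete, here is a minimal sketch of how a crawler might decide whether a path is off limits. It is not taken from the project's source: the names `parseDisallowRules` and `isPathAllowed` are hypothetical, and the parser is deliberately simplified. It only reads `Disallow` lines in the `User-agent: *` group, and it ignores `Allow` lines, wildcards, and agent-specific groups.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: strips ASCII whitespace from both ends of a string.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type begin = s.find_first_not_of(ws);
    if (begin == std::string::npos) return "";
    const std::string::size_type end = s.find_last_not_of(ws);
    return s.substr(begin, end - begin + 1);
}

// Hypothetical helper: lowercases an ASCII string.
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Collects the Disallow prefixes listed under "User-agent: *".
// Simplified on purpose: Allow lines and wildcards are not handled.
std::vector<std::string> parseDisallowRules(const std::string& robotsTxt) {
    std::vector<std::string> rules;
    std::istringstream in(robotsTxt);
    std::string line;
    bool inWildcardGroup = false;
    while (std::getline(in, line)) {
        const std::string::size_type hash = line.find('#');
        if (hash != std::string::npos) line.erase(hash);  // drop comments
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;         // not a "field: value" line
        const std::string field = toLower(trim(line.substr(0, colon)));
        const std::string value = trim(line.substr(colon + 1));
        if (field == "user-agent") {
            inWildcardGroup = (value == "*");
        } else if (field == "disallow" && inWildcardGroup && !value.empty()) {
            rules.push_back(value);
        }
    }
    return rules;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
bool isPathAllowed(const std::vector<std::string>& rules, const std::string& path) {
    for (const std::string& prefix : rules) {
        if (path.compare(0, prefix.size(), prefix) == 0) return false;
    }
    return true;
}
```

A crawler along these lines would fetch `/robots.txt` once per host, cache the parsed rules, and call `isPathAllowed` before downloading each discovered link.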
It extracts:
- Page titles
- Headings (<h1>, <h2>, etc.)
- Meta tags
- JavaScript and CSS references
- Raw HTML content
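The extraction step can be illustrated with a short sketch that pulls the title and headings out of an HTML string. This is an assumption about one possible approach, not the project's actual extractor: `PageMetadata` and `extractMetadata` are hypothetical names, and regular expressions are used only for brevity. A production crawler would more likely use a real HTML parser, since regexes do not cope with malformed or deeply nested markup.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical container for the fields extracted from one page.
struct PageMetadata {
    std::string title;
    std::vector<std::string> headings;
};

// Extracts the <title> text and the inner text of every <h1>..<h6> element.
PageMetadata extractMetadata(const std::string& html) {
    PageMetadata meta;
    // [\s\S] matches any character including newlines; *? keeps the match short.
    static const std::regex titleRe(R"(<title[^>]*>([\s\S]*?)</title>)",
                                    std::regex::icase);
    // \1 forces the closing tag to use the same heading level as the opening tag.
    static const std::regex headingRe(R"(<h([1-6])[^>]*>([\s\S]*?)</h\1>)",
                                      std::regex::icase);

    std::smatch match;
    if (std::regex_search(html, match, titleRe)) {
        meta.title = match[1].str();
    }
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), headingRe); it != end; ++it) {
        meta.headings.push_back((*it)[2].str());
    }
    return meta;
}
```

Meta tags and script or stylesheet references could be collected the same way, with one pattern per tag type.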
After crawling, the engine processes and stores content in an index:
- Tokenization: Breaks text into searchable words.
- Normalization: Lowercases text, removes punctuation, etc.
- Metadata Storage: Links each word to the page(s) it appears in, with frequency and position.
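The three indexing steps map naturally onto an inverted index. The sketch below is a minimal illustration under stated assumptions, not the project's real data structure: `tokenize`, `indexPage`, `Postings`, and `InvertedIndex` are hypothetical names, page ids are plain integers, and only ASCII text is handled.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Tokenization and normalization in one pass: letters and digits are
// lowercased and accumulated; any other character ends the current token.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// term -> page id -> token positions of the term within that page.
// The term frequency for a page is the size of its position list.
using Postings = std::map<int, std::vector<std::size_t>>;
using InvertedIndex = std::unordered_map<std::string, Postings>;

// Adds every token of a page to the index together with its position.
void indexPage(InvertedIndex& index, int pageId, const std::string& text) {
    const std::vector<std::string> tokens = tokenize(text);
    for (std::size_t pos = 0; pos < tokens.size(); ++pos) {
        index[tokens[pos]][pageId].push_back(pos);
    }
}
```

Storing positions rather than bare counts costs more memory, but it keeps phrase queries and snippet generation possible later.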
When a user enters a search query:
- Parses the query.
- Looks up relevant terms in the index.
- Ranks pages based on relevance (e.g., term frequency, metadata match).
- Returns results with titles, snippets, and links.
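As a rough illustration of the lookup and ranking steps, the sketch below scores pages by summed term frequency. It assumes a simplified index that maps each term to per-page counts; `TermFrequencies` and `rankPages` are hypothetical names, and raw term frequency is only one of the relevance signals mentioned above. Metadata matching and snippet generation are left out.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed index shape: term -> page id -> number of occurrences.
using TermFrequencies = std::unordered_map<std::string, std::map<int, int>>;

// Returns (page id, score) pairs, highest score first. The score is the
// sum of the frequencies of every query term found on the page.
std::vector<std::pair<int, int>> rankPages(const TermFrequencies& index,
                                           const std::string& query) {
    std::map<int, int> scores;
    std::istringstream in(query);
    std::string term;
    while (in >> term) {  // parse: split the query on whitespace
        std::transform(term.begin(), term.end(), term.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        const TermFrequencies::const_iterator it = index.find(term);
        if (it == index.end()) continue;  // term not indexed anywhere
        for (const std::pair<const int, int>& entry : it->second) {
            scores[entry.first] += entry.second;
        }
    }
    std::vector<std::pair<int, int>> ranked(scores.begin(), scores.end());
    // stable_sort keeps ascending page-id order among pages with equal scores.
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                         return a.second > b.second;
                     });
    return ranked;
}
```

The query is normalized the same way as indexed text would be (lowercased here), since a mismatch between the two is a common reason for missed results.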
The project is built with:
- C++ for core engine logic
- HTML/CSS/JavaScript for front-end interface
- Custom Semaphores for concurrency control
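The README does not say how the custom semaphores are implemented. One common way to build a counting semaphore in standard C++ is on top of `std::mutex` and `std::condition_variable`, as sketched below. The class name `CountingSemaphore` and its methods are hypothetical and are not taken from the project.

```cpp
#include <condition_variable>
#include <mutex>

// A minimal counting semaphore: at most `initial` holders at a time.
class CountingSemaphore {
public:
    explicit CountingSemaphore(int initial) : count_(initial) {}

    // Blocks until a permit is available, then takes it.
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

    // Takes a permit only if one is available right now; never blocks.
    bool tryAcquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == 0) return false;
        --count_;
        return true;
    }

    // Returns a permit and wakes one waiting thread, if any.
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_;
};
```

In a crawler, a semaphore like this could cap the number of simultaneous downloads: each worker calls `acquire()` before fetching a page and `release()` afterwards.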
To run the project locally:
- Clone the repository:
git clone https://github.com/mihaimoga/WebSearchEngine.git
- Build the project using your preferred C++ compiler or IDE.
- Launch the crawler and indexer modules.
- Use the front-end interface to perform search queries.
Visit text-mining.ro for a live demonstration.
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.