Building a Modular Web Scraping Framework
In today’s data-driven environment, reliable tools for extracting information from online sources are essential. This project provides a modular web scraping framework that combines several scraping techniques to collect data from static pages, dynamic content, real-time feeds, and even full-text articles.
Overview
The framework consists of multiple scraping modules:
- Google Scraper: Retrieves search results and snippets directly from Google.
- Dynamic Scraper: Uses Selenium with undetected‑chromedriver to handle JavaScript-rendered pages.
- Realtime Scraper: Collects data from live feeds (such as RSS or Atom) using background scheduling.
- Topic Scraper: Extracts URLs and content from search results, and then uses Newspaper3k to download and parse full articles.
- Combined Scraper: Integrates the above functionalities by first performing a Google search for relevant article URLs and then processing the found pages with Newspaper3k.
This design allows users to choose the appropriate module based on their requirements or to run a complete pipeline for comprehensive data collection.
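As a quick illustration, a complete pipeline run might look like the sketch below. The module path, class name, and method signatures are hypothetical, since the project's actual API may differ:

    # Hypothetical usage sketch; names and signatures are illustrative only.
    from combined_scraper import CombinedScraper  # assumed module path

    # Search Google for matching articles, then parse each hit with Newspaper3k.
    scraper = CombinedScraper(query="renewable energy policy", max_results=10)
    for article in scraper.run():
        print(article["title"], article["url"])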
Key Technical Details
Selenium Manager and Browser Automation
Modern web pages are often dynamic and require a browser that can execute JavaScript. For this purpose, our project employs Selenium in combination with undetected‑chromedriver. The custom SeleniumManager module plays a crucial role by:
- Detecting the Installed Chrome Version:
It uses system-specific methods (the Windows Registry on Windows, command-line checks on macOS/Linux) to determine the installed Chrome version, ensuring that a matching driver is used.
- Handling Driver Compatibility Issues:
If a version mismatch occurs, the code catches the error, extracts the required version from the error message, and then attempts to reinitialize the driver with the correct version.
Below is a snippet demonstrating the driver initialization logic:
    import re

    import undetected_chromedriver as uc
    from selenium.common.exceptions import SessionNotCreatedException

    def _init_driver(self):
        # Fall back to a default major version if detection fails.
        chrome_version = self._get_chrome_version() or 108
        try:
            driver = uc.Chrome(options=self.options, version_main=chrome_version)
        except SessionNotCreatedException as e:
            # The driver reports the browser's actual version in its error
            # message; extract it and retry with a matching driver.
            version_match = re.search(r"Current browser version is (\d+)", str(e))
            if version_match:
                driver = uc.Chrome(
                    options=self.options, version_main=int(version_match.group(1))
                )
            else:
                raise
        return driver
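The version-detection helper referenced above is not shown; a minimal sketch of how it might be implemented, following the system-specific approach the project describes (Windows Registry on Windows, command-line checks elsewhere), could be:

    import platform
    import re
    import subprocess

    def _get_chrome_version(self):
        """Return Chrome's major version as an int, or None if undetectable."""
        try:
            if platform.system() == "Windows":
                import winreg  # Standard library, Windows only.
                key = winreg.OpenKey(
                    winreg.HKEY_CURRENT_USER, r"Software\Google\Chrome\BLBeacon"
                )
                version, _ = winreg.QueryValueEx(key, "version")
            else:
                # Assumed binary locations; adjust for your installation.
                binary = (
                    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
                    if platform.system() == "Darwin"
                    else "google-chrome"
                )
                output = subprocess.check_output([binary, "--version"], text=True)
                match = re.search(r"(\d+)\.", output)
                if not match:
                    return None
                version = match.group(1)
            return int(str(version).split(".")[0])
        except Exception:
            return None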
Content Extraction and Parsing
For parsing static HTML content, the framework uses BeautifulSoup. Modules like the Google Scraper quickly extract text snippets and links. When deeper content is required, the Combined Scraper leverages Newspaper3k, which handles full-text extraction and metadata retrieval from news articles.
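A rough sketch of both layers, using a placeholder URL (real Google result markup changes frequently and usually needs dedicated selectors):

    import requests
    from bs4 import BeautifulSoup
    from newspaper import Article

    # Static extraction with BeautifulSoup: pull links out of an HTML page.
    html = requests.get("https://example.com/results", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = [a["href"] for a in soup.find_all("a", href=True)]

    # Deep extraction with Newspaper3k: download and parse a full article.
    article = Article(links[0])
    article.download()
    article.parse()
    print(article.title)
    print(article.text[:200])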
Natural Language Processing
The project also includes basic NLP capabilities using spaCy. The integration offers functions to:
- Lemmatize Text: Convert words to their base forms.
- Remove Stopwords: Clean the text of common, uninformative words.
- Generate Summaries: Provide short summaries of long articles.
This processing step helps in turning raw text into cleaner, more analyzable data.
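The project's exact helper names are not shown here; a minimal sketch of how such functions can be built on spaCy (assuming the small English model en_core_web_sm is installed) might look like:

    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")

    def lemmatize(text):
        # Convert each token to its base form.
        return " ".join(token.lemma_ for token in nlp(text))

    def remove_stopwords(text):
        # Drop common, uninformative words and punctuation.
        return " ".join(t.text for t in nlp(text) if not t.is_stop and not t.is_punct)

    def summarize(text, n_sentences=3):
        # Naive extractive summary: rank sentences by word frequency.
        doc = nlp(text)
        freqs = Counter(
            t.lemma_.lower() for t in doc if t.is_alpha and not t.is_stop
        )
        scored = sorted(
            doc.sents,
            key=lambda s: sum(freqs.get(t.lemma_.lower(), 0) for t in s),
            reverse=True,
        )
        return " ".join(s.text for s in scored[:n_sentences])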
Handling Anti-Bot Measures
Many modern websites implement measures like CAPTCHAs to prevent automated access. To reduce detection:
- Realistic Browser Behavior:
The framework sets dynamic user-agent strings and adjusts browser flags (for example, --disable-blink-features=AutomationControlled) to simulate human-like browsing behavior.
- Retry Mechanism:
When a CAPTCHA is encountered, the system waits and retries, improving the chance of successful data extraction over multiple attempts (a sketch follows the options example below).
Example of setting realistic browser options:
    options = uc.ChromeOptions()  # with undetected_chromedriver imported as uc
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36")
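And the retry mechanism mentioned above might be sketched as follows; the CAPTCHA heuristic and timings are assumptions rather than the project's exact logic:

    import time

    def fetch_with_retry(driver, url, max_attempts=3, wait_seconds=30):
        # Reload the page until the CAPTCHA disappears or attempts run out.
        for attempt in range(1, max_attempts + 1):
            driver.get(url)
            # Crude heuristic: look for the word "captcha" in the page source.
            if "captcha" not in driver.page_source.lower():
                return driver.page_source
            time.sleep(wait_seconds * attempt)  # back off a little more each time
        raise RuntimeError(f"CAPTCHA still present after {max_attempts} attempts")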
Practical Applications
This framework is intended for a variety of use cases, including:
- News Aggregation: Collecting and summarizing recent articles.
- Market Research: Monitoring trends and extracting relevant information.
- Academic Projects: Building datasets for natural language processing and machine learning.
- Content Curation: Pulling content from multiple sources to create a comprehensive resource hub.
Conclusion
The modular web scraping framework presented here is designed with practicality and flexibility in mind. By combining various scraping techniques, robust browser automation, and basic NLP, the project offers a comprehensive toolset for reliable data extraction from a wide range of online sources.
Whether you need to quickly fetch search results or deeply parse full-length articles, this framework provides the necessary building blocks. Feel free to explore the code, customize the modules, and integrate them into your own data workflows.