
Web scraping is a method for extracting data from websites, whether for price monitoring, market analysis, or much more. Python is the preferred language for the job thanks to libraries like requests and BeautifulSoup. In this article, we will cover the basics of setting up a scraper, from installation to proxy management.
Installing the necessary libraries
Python has a rich ecosystem of libraries, so there's no need to reinvent the wheel: requests handles HTTP, requests[socks] adds SOCKS proxy support, beautifulsoup4 parses HTML, and lxml provides a fast parser backend.
Let's install them:
pip install requests requests[socks] beautifulsoup4 lxml
Step 1: Analyze the site structure
Start with a manual step. Use your browser (right-click > Inspect) to identify the HTML tags containing the information you want to extract. For example, on an e-commerce site, prices might be found in tags like <span class="price"> or <div class="product-price">.
Also, check the site's robots.txt file (e.g., https://example.com/robots.txt) to know the areas allowed for scraping (failing to comply with this can be considered abusive, especially if scraping sensitive or large-scale data).
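You can also perform this check from Python with the standard library's urllib.robotparser. Here is a minimal sketch against the example page used later in this article:
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://pythonium.net/robots.txt")
robots.read()  # downloads and parses the robots.txt file

url = "https://pythonium.net/blog/how-to-do-web-scraping-with-python"
print("Allowed:", robots.can_fetch("*", url))  # True if the generic "*" user-agent may fetch this URL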
Step 2: Send an HTTP request and parse the page
Here’s a simple example to retrieve the HTML content of a page:
import requests
from bs4 import BeautifulSoup
# Here’s the page I’m using for this example
url = "https://pythonium.net/blog/how-to-do-web-scraping-with-python"
headers = {
    # A realistic User-Agent avoids the default "python-requests" signature
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
print("Page title:", soup.title.text)
Step 3: Extract data
Now, let’s imagine you want to extract the product prices displayed on the page. These could be contained in tags like <span class="price">
:
prices = soup.find_all("span", class_="price")
for price in prices:
    print(price.text.strip())
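If the markup is more deeply nested, BeautifulSoup also accepts CSS selectors through select() and select_one(). The class names below (product-card, product-name, product-price) are invented for the illustration:
# Hypothetical markup: each product sits in a <div class="product-card">
# containing a name and a price
for card in soup.select("div.product-card"):
    name = card.select_one(".product-name")
    price = card.select_one(".product-price")
    if name and price:
        print(name.text.strip(), "->", price.text.strip())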
Step 4: Handle pagination
If the site uses URLs like ?page=2, you can automate the navigation (note: my page doesn’t have pagination, this is just for the example…):
for page in range(1, 6):  # Scrape the first 5 pages
    url = f"https://pythonium.net/blog/how-to-do-web-scraping-with-python?page={page}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    articles = soup.find_all("h2", class_="post-title")
    for article in articles:
        print(article.text.strip())
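If the number of pages is unknown in advance, one common approach is to keep requesting pages until one comes back empty. A sketch under that assumption:
page = 1
while True:
    url = f"https://pythonium.net/blog/how-to-do-web-scraping-with-python?page={page}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    articles = soup.find_all("h2", class_="post-title")
    if not articles:  # no results on this page: we have reached the end
        break
    for article in articles:
        print(article.text.strip())
    page += 1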
Step 5: Use a proxy
To avoid blocks, you can route your requests through an HTTP or SOCKS proxy:
proxies = {
    "http": "http://user:pass@proxy-ip:port",
    "https": "http://user:pass@proxy-ip:port"
}
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
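Since requests[socks] was installed at the beginning, a SOCKS5 proxy works the same way; only the URL scheme changes (socks5h:// also resolves DNS through the proxy). The user:pass@proxy-ip:port values are placeholders, as above:
proxies = {
    "http": "socks5h://user:pass@proxy-ip:port",
    "https": "socks5h://user:pass@proxy-ip:port"
}
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)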
To automatically rotate through multiple proxies:
import random

proxy_list = [
    "http://user:pass@ip1:port",
    "http://user:pass@ip2:port",
    "http://user:pass@ip3:port"
]

# Pick one proxy at random and use it for both HTTP and HTTPS traffic
chosen = random.choice(proxy_list)
proxy = {"http": chosen, "https": chosen}
response = requests.get(url, headers=headers, proxies=proxy)
Choose the right type of proxy
There are various types of proxies, and you need to understand which one is right for your needs.
Datacenter proxies are fast and cost-effective but easily blocked. They are ideal for simple or less protected sites.
Residential proxies are more expensive but appear like real home connections. They are ideal for scraping sensitive or geo-restricted sites.
Mobile proxies (from 4G/5G networks) are almost undetectable. They are mostly used for social media or mobile testing, but they are the most expensive and often slower.
Step 6: Handle errors and anti-bot measures
Use delays between requests to limit the risk of being blocked. If you bombard a site with requests from a single IP, you will quickly get blacklisted as it will look like an attack.
import time
import random
time.sleep(random.uniform(2, 5))
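For example, the pause can be placed between two consecutive requests (reusing the headers defined in step 2):
urls = [
    "https://pythonium.net/blog/how-to-do-web-scraping-with-python?page=1",
    "https://pythonium.net/blog/how-to-do-web-scraping-with-python?page=2",
]
for url in urls:
    response = requests.get(url, headers=headers)
    print(url, "->", response.status_code)
    # Random pause so the traffic pattern does not look automated
    time.sleep(random.uniform(2, 5))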
And handle errors:
try:
    response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
    response.raise_for_status()  # raises an exception for 4xx/5xx status codes
except requests.exceptions.RequestException as e:
    print("Request error:", e)
Going further
More and more websites are now built with dynamic pages generated by JavaScript, which makes scraping with libraries like requests or BeautifulSoup difficult since these tools cannot execute or render JavaScript. To scrape effectively from such sites, it is often necessary to use a browser capable of loading and interpreting JavaScript. In such cases, tools like Selenium are ideal as they automate browser interactions to retrieve dynamic content. However, this can significantly slow down the scraping process.
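As a rough sketch (assuming Selenium 4 and a local Chrome installation, which Selenium Manager can drive without downloading a separate chromedriver), a headless browser can render the page before handing the HTML to BeautifulSoup:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
driver.get("https://pythonium.net/blog/how-to-do-web-scraping-with-python")

# driver.page_source contains the HTML after JavaScript has run
soup = BeautifulSoup(driver.page_source, "lxml")
print("Page title:", soup.title.text)

driver.quit()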
Ethics and legality
Scraping is not forbidden in itself, but collecting certain data, especially personal data, may violate strict regulations like the GDPR (in Europe) or similar laws in the United States (such as the CCPA in California).
Web scraping must always be performed in compliance with the following rules:
- Don’t overload servers: Avoid sending thousands of requests in bursts. Your scraper could be mistaken for a brute-force attack…
- Don’t collect personal data without consent.
- Always read and respect the terms of service of the target site.
Conclusion
Web scraping with Python is easy to set up. With libraries like requests and BeautifulSoup, combined with techniques such as proxy rotation, handling anti-bot measures, and automating pagination, you can build robust scrapers. But remember: a good scraper is discreet, legal, and ethical.