
Web scraping is a method for extracting data from websites, whether for price monitoring, market analysis, or much more. Python is the preferred language for the job thanks to libraries like requests and BeautifulSoup. In this article, we will cover the basics of setting up a scraper, from installation to proxy management.
Installing the necessary libraries
Python has a rich ecosystem of libraries, so there's no need to reinvent the wheel: requests handles HTTP, requests[socks] adds SOCKS proxy support, beautifulsoup4 parses HTML, and lxml provides a fast parser backend.
Let's install them:
pip install requests requests[socks] beautifulsoup4 lxml
Step 1: Analyze the site structure
Start with a manual step. Use your browser (right-click > Inspect) to identify the HTML tags containing the information you want to extract. For example, on an e-commerce site, prices might be found in tags like <span class="price"> or <div class="product-price">.
Also, check the site's robots.txt file (e.g., https://example.com/robots.txt) to know the areas allowed for scraping (failing to comply with this can be considered abusive, especially if scraping sensitive or large-scale data).
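You can also perform this check from Python with the standard library's urllib.robotparser. Here is a minimal sketch against the example page used later in this article:
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://pythonium.net/robots.txt")
robots.read()  # downloads and parses the robots.txt file

url = "https://pythonium.net/blog/how-to-do-web-scraping-with-python"
print("Allowed:", robots.can_fetch("*", url))  # True if the generic "*" user-agent may fetch this URL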
Step 2: Send an HTTP request and parse the page
Here’s a simple example to retrieve the HTML content of a page:
import requests
from bs4 import BeautifulSoup
# Here’s the page I’m using for this example
url = "https://pythonium.net/blog/how-to-do-web-scraping-with-python"
headers = {
    # A realistic User-Agent avoids the default "python-requests" signature
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
print("Page title:", soup.title.text)
Step 3: Extract data
Now, let’s imagine you want to extract the product prices displayed on the page. These could be contained in tags like <span class="price">
:
prices = soup.find_all("span", class_="price")
for price in prices:
    print(price.text.strip())
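If the markup is more deeply nested, BeautifulSoup also accepts CSS selectors through select() and select_one(). The class names below (product-card, product-name, product-price) are invented for the illustration:
# Hypothetical markup: each product sits in a <div class="product-card">
# containing a name and a price
for card in soup.select("div.product-card"):
    name = card.select_one(".product-name")
    price = card.select_one(".product-price")
    if name and price:
        print(name.text.strip(), "->", price.text.strip())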
Step 4: Handle pagination
If the site uses URLs like ?page=2, you can automate the navigation (note: my page doesn’t have pagination, this is just for the example…):
for page in range(1, 6):  # Scrape the first 5 pages
    url = f"https://pythonium.net/blog/how-to-do-web-scraping-with-python?page={page}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    articles = soup.find_all("h2", class_="post-title")
    for article in articles:
        print(article.text.strip())
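If the number of pages is unknown in advance, one common approach is to keep requesting pages until one comes back empty. A sketch under that assumption:
page = 1
while True:
    url = f"https://pythonium.net/blog/how-to-do-web-scraping-with-python?page={page}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    articles = soup.find_all("h2", class_="post-title")
    if not articles:  # no results on this page: we have reached the end
        break
    for article in articles:
        print(article.text.strip())
    page += 1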
Step 5: Use a proxy
To avoid blocks, you can route your requests through an HTTP or SOCKS proxy:
proxies = {
    "http": "http://user:pass@proxy-ip:port",
    "https": "http://user:pass@proxy-ip:port"
}
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
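Since requests[socks] was installed at the beginning, a SOCKS5 proxy works the same way; only the URL scheme changes (socks5h:// also resolves DNS through the proxy). The user:pass@proxy-ip:port values are placeholders, as above:
proxies = {
    "http": "socks5h://user:pass@proxy-ip:port",
    "https": "socks5h://user:pass@proxy-ip:port"
}
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)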
To automatically rotate through multiple proxies:
import random

proxy_list = [
    "http://user:pass@ip1:port",
    "http://user:pass@ip2:port",
    "http://user:pass@ip3:port"
]

# Pick one proxy at random and use it for both HTTP and HTTPS traffic
chosen = random.choice(proxy_list)
proxy = {"http": chosen, "https": chosen}
response = requests.get(url, headers=headers, proxies=proxy)
Choose the right type of proxy
There are various types of proxies, and you need to understand which one is right for your needs.
Datacenter proxies are fast and cost-effective but easily blocked. They are ideal for simple or less protected sites.
Residential proxies are more expensive but appear like real home connections. They are ideal for scraping sensitive or geo-restricted sites.
Mobile proxies (from 4G/5G networks) are almost undetectable. They are mostly used for social media or mobile testing, but they are the most expensive and often slower.
Step 6: Handle errors and anti-bot measures
Use delays between requests to limit the risk of being blocked. If you bombard a site with requests from a single IP, you will quickly get blacklisted as it will look like an attack.
import time
import random
time.sleep(random.uniform(2, 5))
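For example, the pause can be placed between two consecutive requests (reusing the headers defined in step 2):
urls = [
    "https://pythonium.net/blog/how-to-do-web-scraping-with-python?page=1",
    "https://pythonium.net/blog/how-to-do-web-scraping-with-python?page=2",
]
for url in urls:
    response = requests.get(url, headers=headers)
    print(url, "->", response.status_code)
    # Random pause so the traffic pattern does not look automated
    time.sleep(random.uniform(2, 5))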
And handle errors:
try:
    response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
    response.raise_for_status()  # raises an exception for 4xx/5xx status codes
except requests.exceptions.RequestException as e:
    print("Request error:", e)
Going further
More and more websites are now built with dynamic pages generated by JavaScript, which makes scraping with libraries like requests or BeautifulSoup difficult since these tools cannot execute or render JavaScript. To scrape effectively from such sites, it is often necessary to use a browser capable of loading and interpreting JavaScript. In such cases, tools like Selenium are ideal as they automate browser interactions to retrieve dynamic content. However, this can significantly slow down the scraping process.
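As a rough sketch (assuming Selenium 4 and a local Chrome installation, which Selenium Manager can drive without downloading a separate chromedriver), a headless browser can render the page before handing the HTML to BeautifulSoup:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
driver.get("https://pythonium.net/blog/how-to-do-web-scraping-with-python")

# driver.page_source contains the HTML after JavaScript has run
soup = BeautifulSoup(driver.page_source, "lxml")
print("Page title:", soup.title.text)

driver.quit()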
Ethics and legality
Scraping is not forbidden in itself, but collecting certain data, especially personal data, may violate strict regulations like the GDPR (in Europe) or similar laws in the United States (such as the CCPA in California).
Web scraping must always be performed in compliance with the following rules:
- Don’t overload servers: Avoid sending thousands of requests in bursts. Your scraper could be mistaken for a brute-force attack…
- Don’t collect personal data without consent.
- Always read and respect the terms of service of the target site.
Conclusion
Web scraping with Python is easy to set up. With libraries like requests and BeautifulSoup, combined with techniques such as proxy rotation, handling anti-bot measures, and automating pagination, you can build robust scrapers. But remember: a good scraper is discreet, legal, and ethical.