Web Scraping with Python — requests, BeautifulSoup, and Playwright Guide

Learn web scraping with Python: extract data with requests + BeautifulSoup, handle JavaScript with Playwright, and avoid common pitfalls.

Web Scraping Fundamentals

Web scraping is the process of programmatically extracting data from websites. Python has a rich ecosystem for this: requests for HTTP, BeautifulSoup for HTML parsing, and Playwright or Selenium for JavaScript-rendered pages.

Simple Scraping with requests + BeautifulSoup

pip install requests beautifulsoup4 lxml

import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> list[dict]:
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")

    results = []
    # These selectors are illustrative -- adjust them to the target site's markup
    for article in soup.select("article.post"):
        results.append({
            "title": article.select_one("h2").get_text(strip=True),
            "url": article.select_one("a")["href"],  # may be relative; urljoin with the page URL if needed
            "date": article.select_one("time")["datetime"],
        })
    return results
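
A quick usage sketch (the URL is a placeholder; point it at a page whose markup actually matches the article.post selectors above):

posts = scrape_page("https://example.com/blog")
for post in posts:
    print(post["date"], post["title"], post["url"])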

CSS Selectors Reference

soup.select("div.class-name")       # by class
soup.select("#unique-id")           # by ID
soup.select("ul > li")              # direct children
soup.select("a[href^='https']")     # attribute starts with
soup.select_one("h1")               # first match only
el.get_text(strip=True)             # text content
el["href"]                          # attribute value
el.find_all("a", limit=10)          # find with limit

Handling JavaScript-Rendered Pages with Playwright

Many modern sites render content via JavaScript. requests only fetches the raw HTML — it won't execute JS. Use Playwright for these cases.

pip install playwright
playwright install chromium

from playwright.sync_api import sync_playwright

def scrape_spa(url: str) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Block images/fonts to speed up scraping
        page.route("**/*.{png,jpg,gif,woff2}", lambda r: r.abort())

        page.goto(url, wait_until="networkidle")

        # Wait for specific element
        page.wait_for_selector(".product-list", timeout=10000)

        # Get all product titles
        titles = page.eval_on_selector_all(
            ".product-title",
            "els => els.map(el => el.textContent.trim())"
        )

        browser.close()
        return titles

Async Scraping for Speed

import asyncio
import httpx
from bs4 import BeautifulSoup

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    r = await client.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    return {"url": url, "title": soup.title.string}

async def scrape_many(urls: list[str]) -> list[dict]:
    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=15,
        limits=httpx.Limits(max_connections=10)
    ) as client:
        tasks = [fetch(client, url) for url in urls]
        # Failed fetches come back as exception objects instead of dicts
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(scrape_many(url_list))
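
Because return_exceptions=True is passed to asyncio.gather, failed requests come back as exception objects mixed in with the result dicts. One way to separate them afterwards:

ok = [r for r in results if isinstance(r, dict)]
failed = [r for r in results if isinstance(r, Exception)]
print(f"{len(ok)} pages scraped, {len(failed)} failures")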

Handling Pagination

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}

def scrape_all_pages(base_url: str) -> list[dict]:
    all_items = []
    page = 1

    while True:
        url = f"{base_url}?page={page}"
        response = requests.get(url, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        items = soup.select(".item")
        if not items:
            break  # No more pages

        for item in items:
            all_items.append({"title": item.get_text(strip=True)})

        # Check for "next" button
        next_btn = soup.select_one("a.next-page")
        if not next_btn:
            break

        page += 1

    return all_items
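
If the site doesn't expose a predictable ?page=N parameter, a variation is to follow the next link's href instead. This sketch reuses HEADERS and the .item / a.next-page selectors from the example above and assumes the href may be relative:

from urllib.parse import urljoin

def scrape_by_next_link(start_url: str) -> list[dict]:
    all_items = []
    url = start_url

    while url:
        response = requests.get(url, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        for item in soup.select(".item"):
            all_items.append({"title": item.get_text(strip=True)})

        # Follow the "next" link until it disappears
        next_btn = soup.select_one("a.next-page")
        url = urljoin(url, next_btn["href"]) if next_btn else None

    return all_items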

Respecting robots.txt and Rate Limits

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Always add delays between requests
for url in urls:
    if can_scrape(url):
        data = scrape_page(url)
        time.sleep(1.5)  # Be polite: 1-2 seconds between requests
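
The loop above re-downloads robots.txt for every URL. A small cache keyed by domain avoids that; this sketch reuses the RobotFileParser and urlparse setup from can_scrape:

_robots_cache: dict[str, RobotFileParser] = {}

def can_scrape_cached(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    if base not in _robots_cache:
        rp = RobotFileParser()
        rp.set_url(f"{base}/robots.txt")
        rp.read()  # fetches robots.txt once per domain
        _robots_cache[base] = rp
    return _robots_cache[base].can_fetch(user_agent, url)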

Frequently Asked Questions

Is web scraping legal?

It depends on the site's Terms of Service and local laws. Always check robots.txt, respect rate limits, don't scrape personal data, and prefer official APIs when available. Scraping publicly available data for personal use is generally fine; commercial use at scale requires more care.

How do I handle cookies and sessions?

Use requests.Session() to persist cookies across requests. For login-gated pages, log in with the session first, then scrape authenticated pages using the same session object.
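
A minimal sketch of that flow; the login URL and form field names below are made up, so check the actual login form of the site you're working with:

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"})

# Hypothetical login endpoint and field names -- inspect the site's real login form
session.post("https://example.com/login", data={
    "username": "me",
    "password": "secret",
})

# The session keeps the auth cookies, so this request is authenticated
r = session.get("https://example.com/account/orders")
print(r.status_code)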

How do I avoid getting blocked?

Rotate User-Agent strings, add realistic delays, use residential proxies for large-scale scraping, handle CAPTCHAs with services like 2captcha, and avoid scraping too fast. The third-party playwright-stealth package patches common browser fingerprinting checks and can help get past basic bot detection.
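
A small sketch of the first two ideas, rotating User-Agent strings and adding randomized delays (the agent strings are just examples, and urls is the same list as in the earlier politeness loop):

import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

for url in urls:
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=10)
    # A randomized delay looks less mechanical than a fixed interval
    time.sleep(random.uniform(1.0, 3.0))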
