Web Scraping Fundamentals
Web scraping is the process of programmatically extracting data from websites. Python has a rich ecosystem for this: requests for HTTP, BeautifulSoup for HTML parsing, and Playwright or Selenium for JavaScript-rendered pages.
Simple Scraping with requests + BeautifulSoup
```shell
pip install requests beautifulsoup4 lxml
```
```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> list[dict]:
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    results = []
    for article in soup.select("article.post"):
        results.append({
            "title": article.select_one("h2").get_text(strip=True),
            "url": article.select_one("a")["href"],
            "date": article.select_one("time")["datetime"],
        })
    return results
```
CSS Selectors Reference
```python
soup.select("div.class-name")    # by class
soup.select("#unique-id")        # by ID
soup.select("ul > li")           # direct children
soup.select("a[href^='https']")  # attribute starts with
soup.select_one("h1")            # first match only
el.get_text(strip=True)          # text content
el["href"]                       # attribute value
el.find_all("a", limit=10)       # find with limit
```
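As a quick sanity check, the selectors above can be exercised against an inline HTML snippet (the markup here is invented purely for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul id="posts">
  <li class="item"><a href="https://example.com/a">Alpha</a></li>
  <li class="item"><a href="http://example.com/b">Beta</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Direct children of the <ul>
items = soup.select("ul > li")
texts = [li.get_text(strip=True) for li in items]

# Only links whose href starts with "https" (the http:// one is excluded)
secure = soup.select("a[href^='https']")
```

Here `texts` comes out as `["Alpha", "Beta"]` and `secure` matches only the first link, since `http://example.com/b` does not start with `https`.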
Handling JavaScript-Rendered Pages with Playwright
Many modern sites render content via JavaScript. requests only fetches the raw HTML — it won't execute JS. Use Playwright for these cases.
```shell
pip install playwright
playwright install chromium
```
```python
from playwright.sync_api import sync_playwright

def scrape_spa(url: str) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Block images/fonts to speed up scraping
        page.route("**/*.{png,jpg,gif,woff2}", lambda r: r.abort())
        page.goto(url, wait_until="networkidle")
        # Wait for a specific element before reading the DOM
        page.wait_for_selector(".product-list", timeout=10000)
        # Get all product titles
        titles = page.eval_on_selector_all(
            ".product-title",
            "els => els.map(el => el.textContent.trim())"
        )
        browser.close()
        return titles
```
Async Scraping for Speed
```python
import asyncio
import httpx
from bs4 import BeautifulSoup

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    r = await client.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    # Guard against pages with no <title> element
    title = soup.title.string if soup.title else None
    return {"url": url, "title": title}

async def scrape_many(urls: list[str]) -> list:
    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=15,
        limits=httpx.Limits(max_connections=10),
    ) as client:
        tasks = [fetch(client, url) for url in urls]
        # return_exceptions=True keeps one failed URL from cancelling
        # the rest; failures appear as exception objects in the result
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(scrape_many(url_list))  # url_list: your URLs
```
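Because `gather` was called with `return_exceptions=True`, failed fetches come back as exception objects mixed into the results list; filter them out before use. A minimal sketch of that pattern, with stand-in values in place of live responses:

```python
# Simulated gather(..., return_exceptions=True) output: one success,
# one failure represented as an exception object
results = [{"url": "a", "title": "A"}, TimeoutError("b timed out")]

# Separate successes from failures before processing
ok = [r for r in results if not isinstance(r, Exception)]
failed = [r for r in results if isinstance(r, Exception)]
```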
Handling Pagination
```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}

def scrape_all_pages(base_url: str) -> list[dict]:
    all_items = []
    page = 1
    while True:
        url = f"{base_url}?page={page}"
        response = requests.get(url, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        items = soup.select(".item")
        if not items:
            break  # No more pages

        for item in items:
            all_items.append({"title": item.get_text(strip=True)})

        # Check for a "next" button before continuing
        next_btn = soup.select_one("a.next-page")
        if not next_btn:
            break
        page += 1
    return all_items
```
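When page numbers are unreliable, an alternative is to follow the "next" link's `href` directly instead of incrementing a counter. A sketch under assumptions: the `.item` and `a.next-page` selectors are carried over from above, and the page-fetching step is injected as a `fetch(url) -> html` callable so the loop logic stands on its own (in practice you would pass something wrapping `requests.get`):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def scrape_by_next_link(start_url: str, fetch) -> list[dict]:
    """Follow a.next-page links until none remain; fetch(url) returns HTML."""
    items, url = [], start_url
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        for item in soup.select(".item"):
            items.append({"title": item.get_text(strip=True)})
        next_btn = soup.select_one("a.next-page")
        # urljoin resolves relative hrefs against the current page URL
        url = urljoin(url, next_btn["href"]) if next_btn else None
    return items
```

This also handles relative `href` values like `/page/2`, which a naive `?page=N` counter never sees.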
Respecting robots.txt and Rate Limits
```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches robots.txt; may raise on network errors
    return rp.can_fetch(user_agent, url)

# Always add delays between requests
for url in urls:
    if can_scrape(url):
        data = scrape_page(url)
    time.sleep(1.5)  # Be polite: 1-2 seconds between requests
```
Frequently Asked Questions
Is web scraping legal?
It depends on the site's Terms of Service and local laws. Always check robots.txt, respect rate limits, don't scrape personal data, and prefer official APIs when available. Scraping publicly available data for personal use is generally fine; commercial use at scale requires more care.
How do I handle cookies and sessions?
Use requests.Session() to persist cookies across requests. For login-gated pages, log in with the session first, then scrape authenticated pages using the same session object.
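A sketch of that session pattern. The login URL, paths, and form field names below are placeholders, not a real site; adapt them to the target's actual login form:

```python
import requests

def make_session(user_agent: str = "Mozilla/5.0 (compatible; MyBot/1.0)") -> requests.Session:
    """Build a Session that persists cookies and headers across requests."""
    s = requests.Session()
    s.headers.update({"User-Agent": user_agent})
    return s

def login_and_scrape(base_url: str, username: str, password: str) -> str:
    session = make_session()
    # Placeholder path and field names; inspect the real login form first
    resp = session.post(
        f"{base_url}/login",
        data={"username": username, "password": password},
        timeout=10,
    )
    resp.raise_for_status()
    # Cookies set by the login response persist on the session,
    # so this request arrives authenticated
    page = session.get(f"{base_url}/account", timeout=10)
    page.raise_for_status()
    return page.text
```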
How do I avoid getting blocked?
Rotate User-Agent strings, add realistic randomized delays, use residential proxies for large-scale scraping, hand CAPTCHAs off to a solving service such as 2captcha, and avoid scraping too fast. The third-party playwright-stealth package helps bypass basic bot detection when using Playwright.
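A minimal sketch of the first two points, User-Agent rotation and jittered delays (the UA strings below are illustrative examples, not guaranteed to match current browser releases):

```python
import random
import time

# Small pool of User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers() -> dict:
    """Pick a different User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay(base: float = 1.0, jitter: float = 1.0) -> float:
    """Sleep base..base+jitter seconds so timing isn't machine-regular."""
    delay = base + random.random() * jitter
    time.sleep(delay)
    return delay
```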