Chapter 20 Quiz: Web Scraping for Business Intelligence

Instructions: Answer all questions. The answer key with full explanations is at the bottom. Attempt every question before checking answers.


Part A — Multiple Choice (1 point each)

Question 1

What does requests.get(url) return?

a. A string containing the page's HTML
b. A BeautifulSoup object ready for parsing
c. A Response object with status code, headers, and body
d. A dictionary of all data found on the page


Question 2

Which BeautifulSoup method returns the first matching element, or None if nothing matches?

a. soup.find_all()
b. soup.select()
c. soup.find()
d. soup.get()


Question 3

You fetch a web page and response.status_code is 429. What does this mean and what should your script do?

a. The page was not found — log the error and move on
b. The server is telling you to slow down — wait before retrying
c. The request succeeded and the page content is in response.text
d. Your credentials were rejected — check your login


Question 4

What does element.get_text(strip=True) do?

a. Gets the value of the text attribute on the element
b. Extracts all text content from the element, with leading/trailing whitespace removed
c. Returns the element as a plain text string including all HTML tags
d. Searches for text matching the pattern "strip=True"


Question 5

Which of the following CSS selectors correctly finds all <span> elements with the class price that are inside a <div> with the class product-card?

a. div.product-card > span.price
b. div.product-card span.price
c. span.price div.product-card
d. .product-card + .price


Question 6

Why should you use element.get("href") instead of element["href"] in production scraper code?

a. .get() is faster than dictionary-style access
b. element["href"] only works on <a> tags; .get() works on all tags
c. .get() returns None if the attribute is missing, while element["href"] raises a KeyError
d. .get() automatically converts relative URLs to absolute URLs


Question 7

Which of the following is the correct way to convert a relative URL like /products/widget to an absolute URL?

a. base_url + relative_url
b. requests.join(base_url, relative_url)
c. urljoin(base_url, relative_url) from urllib.parse
d. os.path.join(base_url, relative_url)


Question 8

What is robots.txt and what is your obligation as a scraper author?

a. A Python configuration file that requests automatically reads before each request
b. A server-side file specifying which paths bots should not access — you should check and respect it before scraping
c. A legal contract that makes you liable if you access any page it does not explicitly allow
d. A file that lists valid user agent strings you are permitted to use


Question 9

You are scraping a competitor's public catalog. Your script finds no products in the fetched page, yet the products are clearly visible when you open the same URL in a browser. The most likely explanation is:

a. Your BeautifulSoup CSS selectors are wrong
b. The page requires a POST request instead of a GET
c. The products are loaded by JavaScript after the initial page load — requests only gets the initial HTML shell
d. The server requires authentication to view product listings


Question 10

pd.read_html("https://example.com/table-page") returns what?

a. A single DataFrame containing the first table on the page
b. A list of DataFrames, one for each <table> element found on the page
c. A string of HTML from the page
d. A dictionary mapping table IDs to DataFrames


Part B — True or False (1 point each)

Question 11

soup.find_all("div") returns a list, and iterating over it with a for loop is always safe even if no <div> elements exist on the page.

Question 12

Using "html.parser" as the BeautifulSoup parser requires no additional package installation.

Question 13

A Crawl-delay: 5 directive in robots.txt means your scraper may send no more than five requests per second.

Question 14

When using requests.Session(), cookies set by the server in one response are automatically included in subsequent requests made with the same session object.

Question 15

Checking robots.txt and finding that your target path is not disallowed means you have verified the site's full Terms of Service and confirmed scraping is legally permitted.


Part C — Short Answer (2–3 points each)

Question 16

Explain why you should reuse a single requests.Session() object (ideally as a context manager: with requests.Session() as session:) rather than making an individual requests.get() call for each request. What are the technical benefits?


Question 17

You write a scraper that works perfectly today. Three weeks later it crashes with AttributeError: 'NoneType' object has no attribute 'text'. What likely happened and how should you have written the code to prevent this crash?


Question 18

Write Python code that: (a) fetches the HTML from https://books.toscrape.com, (b) creates a BeautifulSoup object, (c) finds all elements with the class price_color, and (d) prints each price. Use the "lxml" parser.


Question 19

A friend says: "I want to scrape a site's API response directly instead of parsing the HTML, because I can see JSON data in the browser's Network tab." Is this a good idea? Explain your reasoning.


Question 20

Describe three specific situations where you should NOT scrape a website, and explain the alternative data access method appropriate for each.


Answer Key

Part A — Multiple Choice

Q1: c — A Response object
requests.get() returns a Response object, not raw HTML or a parsed structure. You access the HTML via response.text, the status code via response.status_code, and headers via response.headers. BeautifulSoup is a separate step: soup = BeautifulSoup(response.text, "lxml").

Q2: c — soup.find()
soup.find() returns the first matching element or None. soup.find_all() always returns a list (possibly empty). soup.select() also returns a list. soup.get() does not exist at the top level — it is a method on individual elements for attribute access.
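A minimal sketch (inline HTML, built-in html.parser) showing the three return types side by side:

```python
from bs4 import BeautifulSoup

html = '<div><p class="x">one</p><p class="x">two</p></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p", class_="x")          # first match: a single Tag
print(first.get_text())                     # one
print(len(soup.find_all("p", class_="x")))  # 2 — always a list
print(len(soup.select("p.x")))              # 2 — also a list
print(soup.find("span"))                    # None — no match, no exception
```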

Q3: b — Rate limited — wait before retrying
A 429 status code means "Too Many Requests." The server is telling you explicitly that you are sending requests too quickly. The appropriate response is to wait — at least 30 seconds, more if this is a repeated 429 — before retrying. Immediately retrying after a 429 will almost certainly get your IP blocked.
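One way to handle this is a small retry helper along the following lines — a sketch, with illustrative delay values; a real scraper would also cap total wait time:

```python
import time

import requests


def get_with_backoff(url, max_retries=3, base_delay=30):
    """Fetch url, backing off and retrying when the server answers 429."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            return response
        # Honor Retry-After if the server sends it; otherwise wait
        # base_delay seconds, doubling on each repeated 429.
        wait = float(response.headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")
```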

Q4: b — Extracts text content with whitespace stripped
get_text() traverses the element and all its descendants, collects all text content, and concatenates it. The strip=True argument removes leading and trailing whitespace from the result. This is different from element.text, which may include extra whitespace from the HTML formatting.

Q5: b — div.product-card span.price
A space between selectors means "anywhere inside." div.product-card span.price finds a <span class="price"> anywhere inside a <div class="product-card">. Option a (>) means "direct child only," which would fail if there are intermediate elements. Option c has the selectors backward.
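A quick check of the descendant vs direct-child distinction, using soup.select() on a made-up fragment with one intermediate wrapper:

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card">
  <div class="inner">
    <span class="price">$9.99</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space): matches even through the intermediate div.
print(len(soup.select("div.product-card span.price")))    # 1
# Child combinator (>): no match, span.price is not a direct child.
print(len(soup.select("div.product-card > span.price")))  # 0
```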

Q6: c — .get() returns None for missing attributes
element["href"] uses Python's dictionary-style access, which raises a KeyError if the attribute does not exist. In real HTML, not every element has every attribute you expect. element.get("href") returns None (or a specified default) if the attribute is absent, preventing crashes.
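The difference in one short sketch, on a fragment where the second anchor is missing its href:

```python
from bs4 import BeautifulSoup

html = '<a href="/page">link</a> <a>anchor with no href</a>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

print(links[0].get("href"))             # /page
print(links[1].get("href"))             # None — no crash
print(links[1].get("href", "MISSING"))  # MISSING — explicit default
try:
    links[1]["href"]                    # dictionary-style access
except KeyError:
    print("KeyError raised")
```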

Q7: c — urljoin(base_url, relative_url) from urllib.parse
urljoin correctly handles all relative URL cases: paths starting with / (absolute path relative to the domain), paths without a leading slash (relative to the current directory), query strings, and fragment identifiers. Simple string concatenation (base_url + relative_url) breaks for most real-world cases.
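The three cases in action, against an example base URL:

```python
from urllib.parse import urljoin

base = "https://example.com/shop/index.html"

print(urljoin(base, "/products/widget"))  # https://example.com/products/widget
print(urljoin(base, "widget"))            # https://example.com/shop/widget
print(urljoin(base, "?page=2"))           # https://example.com/shop/index.html?page=2
```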

Q8: b — A server-side file you should check and respect
robots.txt is a plain text file published by website operators to communicate access preferences for automated bots. It is not enforced by Python or by requests — you must check it yourself and comply deliberately. It is not by itself a legal contract (though ignoring it has been cited as evidence in some legal cases), and it is not a list of permitted user agent strings.

Q9: c — JavaScript rendering
This is the most common surprise in web scraping. Modern single-page applications (SPAs) built with React, Vue, or Angular send a minimal HTML shell initially, then use JavaScript to fetch and render the actual content. requests.get() receives only the shell. You can verify this by looking at response.text — if the products are not there, JavaScript is responsible. Solutions include finding the underlying API or using Selenium/Playwright.

Q10: b — A list of DataFrames
pd.read_html() scans the page for all <table> elements and returns each as a DataFrame in a Python list. If the page has three tables, you get a list of three DataFrames. Select the one you want by index: tables[0] for the first table, tables[1] for the second, and so on.
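A sketch using an inline two-table snippet (fed through StringIO) in place of a live URL — note that read_html needs an HTML parser such as lxml installed:

```python
from io import StringIO

import pandas as pd

html = """
<table><tr><th>product</th><th>price</th></tr>
       <tr><td>widget</td><td>9.99</td></tr></table>
<table><tr><th>region</th></tr><tr><td>EU</td></tr></table>
"""

# One DataFrame per <table> element, returned in a list.
tables = pd.read_html(StringIO(html))
print(len(tables))       # 2
print(tables[0].shape)   # (1, 2) — one data row, two columns
```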

Part B — True or False

Q11: TRUE
find_all() always returns a list, never None. If no <div> elements exist, it returns an empty list []. A for loop over an empty list simply does not execute — no crash, no error. This is the key advantage of find_all() over find() for iteration purposes.

Q12: TRUE
"html.parser" is Python's built-in HTML parser, part of the standard library. No additional installation is needed. However, it is slower than lxml and less forgiving of malformed HTML. In production, "lxml" (requiring pip install lxml) is preferred.

Q13: FALSE
Crawl-delay: 5 means wait at least 5 seconds between requests, not 5 requests per second. Five requests per second would be Crawl-delay: 0.2. A crawl delay of 5 means the scraper may make at most one request every five seconds — a slow, polite rate appropriate for considerate crawling.
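One way to enforce the delay is a small helper that tracks when the last request went out — a sketch, not taken from the chapter:

```python
import time


class PoliteDelay:
    """Enforce a minimum gap (e.g. robots.txt Crawl-delay) between requests."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last_request = float("-inf")  # no request made yet

    def wait(self):
        """Sleep just long enough to honor the delay, then record the time."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request = time.monotonic()


# Usage sketch: pacer = PoliteDelay(5); call pacer.wait() before each requests.get().
```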

Q14: TRUE
This is one of the primary benefits of requests.Session(). The session maintains a cookie jar that is automatically populated with cookies from server responses and automatically included in subsequent requests. This is essential for sites that use cookies for session tracking and is also more efficient because the session reuses underlying TCP connections.
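The cookie jar can be seen without any network traffic by preparing a request against a session that already holds a cookie — here the cookie is seeded by hand, standing in for one a server would set via Set-Cookie:

```python
import requests

with requests.Session() as session:
    # In real use this cookie would arrive in a Set-Cookie response header;
    # seeding it manually shows that the jar travels with the session.
    session.cookies.set("session_id", "abc123")

    request = requests.Request("GET", "https://example.com/next-page")
    prepared = session.prepare_request(request)
    print(prepared.headers["Cookie"])   # session_id=abc123
```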

Q15: FALSE
Checking robots.txt and verifying your target path is not disallowed is only one part of due diligence. robots.txt is a technical specification about bot access paths. Terms of Service is a legal agreement that may impose additional restrictions, including prohibitions on scraping even paths that robots.txt permits. You must check both. Even if both are clear, privacy laws (GDPR, CCPA) may impose further constraints on what data you collect and how you use it.

Part C — Short Answer

Q16: Sessions and connection reuse
A requests.Session object reuses the underlying TCP connection across multiple requests to the same server. HTTP connections have overhead: TCP handshake, TLS negotiation (for HTTPS), and connection setup. Without a session, each requests.get() creates a new connection — slower and more resource-intensive for both your script and the server. A session also maintains cookies automatically, which many sites depend on for consistent behavior across page views. The polite scraper reuses sessions not just for performance but because it reduces the load on the server — fewer connection establishments, fewer resources consumed.

Q17: Page structure changed — defensive coding
The most likely cause is that the website updated its HTML structure. The element you were looking for — <span class="price"> for example — was renamed, nested differently, or removed in a redesign. Your soup.find("span", class_="price") returned None, and then calling .text on None raised the AttributeError.

The correct defensive approach is to check for None before accessing properties:

# Fragile — crashes if element is missing:
price = soup.find("span", class_="price").text

# Safe — handles missing elements gracefully:
price_tag = soup.find("span", class_="price")
price = price_tag.get_text(strip=True) if price_tag is not None else "N/A"

Or using a helper function:

def safe_text(element, default=""):
    return element.get_text(strip=True) if element else default

price = safe_text(soup.find("span", class_="price"))

Write all element access defensively. Log a warning when an expected element is missing — this helps you detect structure changes before they cause complete failures.

Q18: Fetch, parse, extract

import requests
from bs4 import BeautifulSoup

response = requests.get("https://books.toscrape.com", timeout=10)
soup = BeautifulSoup(response.text, "lxml")

price_elements = soup.find_all("p", class_="price_color")
for element in price_elements:
    print(element.get_text(strip=True))

Note: On books.toscrape.com the price is in a <p> tag with class price_color, not a <span>. This illustrates why you always inspect the actual HTML before writing your selectors — assumptions about tag types are often wrong.

Q19: Yes — calling the API directly is almost always better
Scraping the underlying JSON API is superior to parsing HTML in nearly every way:
- Structured data immediately — no parsing, no regex, no extracting values from HTML noise
- More stable — API response schemas change far less often than HTML layouts
- Faster — no overhead of transmitting HTML formatting, CSS, JS references
- Lighter — JSON responses are typically much smaller than full HTML pages
- No parsing errors — HTML malformation cannot break your parser

The only concern: verify that the API endpoint is intended for public use (check the Terms of Service) and apply appropriate rate limiting. Some APIs require authentication, use pagination differently, or implement stricter rate limits than the main website. But if the data is available via a clean JSON endpoint, always use that over HTML parsing.
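A sketch of what the API route can look like — the endpoint, parameters, and JSON shape below are invented for illustration; read the real ones from the browser's Network tab:

```python
import requests

API_URL = "https://example.com/api/products"   # hypothetical endpoint


def fetch_product_names(url=API_URL, page=1):
    """Fetch one page of the (hypothetical) catalog API and pull out names."""
    response = requests.get(url, params={"page": page}, timeout=10)
    response.raise_for_status()
    data = response.json()          # already structured — no HTML parsing
    return [item["name"] for item in data.get("products", [])]
```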

Q20: When NOT to scrape — three situations

Situation 1: An API exists. If the site offers a documented public API (a /api/ endpoint, an API developer portal, an API key registration page), use it. APIs are designed for machine access, explicitly permitted, more stable than HTML, and return already-structured data. Before scraping any commercial site, search for "[site name] developer API." Financial data providers (Quandl, Alpha Vantage), social platforms (Twitter/X, Reddit), and mapping services (Google Maps, OpenStreetMap) all have APIs.

Situation 2: The site prohibits scraping in its Terms of Service. Some sites explicitly forbid automated access in their ToS. Notable examples have included LinkedIn, Instagram, and various e-commerce platforms. Even if the data is publicly visible, scraping in violation of ToS creates legal exposure and has resulted in accounts being banned, IP addresses being blocked, and in a few cases legal action. The alternative is to find the data elsewhere — through an official data partnership, by purchasing a data feed, or by using a competitor's API.

Situation 3: Personal data about individuals is involved. Scraping names, email addresses, phone numbers, or other personal information about private individuals — even from publicly accessible pages — may violate GDPR (in the EU), CCPA (in California), and other privacy regulations. The "it was on a public page" defense is not always legally sufficient, especially when the data is being stored, used for marketing, or combined with other sources. The alternative is to work with a data broker who has proper consent mechanisms, or to use privacy-compliant datasets.


Scoring Guide

Score Assessment
45–50 Excellent — you are ready to build production scrapers
38–44 Proficient — revisit the sections covering your missed questions
28–37 Developing — re-read the chapter and practice with books.toscrape.com
Below 28 Needs review — work through the chapter examples before the exercises