Chapter 13 Quiz: Getting Data from the Web

Instructions: This quiz tests your understanding of Chapter 13. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For short answer questions, aim for 2-4 clear sentences. Total points: 100.


Section 1: Multiple Choice (8 questions, 5 points each)


Question 1. What is the primary purpose of an API (Application Programming Interface) in the context of data collection?

  • (A) To make websites load faster in web browsers
  • (B) To provide a structured, programmatic way to request and receive data from a server
  • (C) To prevent unauthorized users from accessing a website
  • (D) To convert HTML into JSON format
Answer **Correct: (B)**
  • **(A)** is about web performance, not APIs. APIs can be used without a web browser at all.
  • **(B)** captures the essence of an API: it's a defined interface that allows programs to request data in a structured way and receive structured responses (usually JSON). This is the primary reason data scientists use APIs.
  • **(C)** describes authentication and access control, which are features some APIs have, but not their primary purpose.
  • **(D)** describes a format conversion, not what APIs do.

Question 2. Which HTTP status code indicates that you've exceeded the API's rate limit?

  • (A) 200
  • (B) 404
  • (C) 429
  • (D) 500
Answer **Correct: (C)**
  • **200** means success (OK).
  • **404** means the resource was not found.
  • **429** means "Too Many Requests" — you've hit the rate limit and need to slow down. The response often includes a `Retry-After` header indicating how long to wait.
  • **500** means the server experienced an internal error (not related to your request rate).
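As a minimal sketch of how a client might react to a 429, the helper below reads the `Retry-After` header and decides how long to pause. The function name and its 60-second default are illustrative, not from the chapter:

```python
def wait_if_rate_limited(status_code, headers, default_wait=60):
    """Return seconds to pause before retrying, based on the response.

    `headers` is the response's header mapping; `Retry-After`, when present,
    is the server's suggested wait in seconds.
    """
    if status_code == 429:
        return int(headers.get("Retry-After", default_wait))
    return 0  # 200 and other codes: no rate-limit pause needed

# With a real `requests` response you would call:
#   time.sleep(wait_if_rate_limited(response.status_code, response.headers))
print(wait_if_rate_limited(429, {"Retry-After": "30"}))  # → 30
print(wait_if_rate_limited(200, {}))                     # → 0
```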

Question 3. What does response.json() return?

  • (A) A raw string of JSON text
  • (B) A Python dictionary or list, parsed from the JSON response body
  • (C) A pandas DataFrame
  • (D) An HTML page formatted as JSON
Answer **Correct: (B)** `response.json()` parses the JSON response body and returns the corresponding Python object — typically a dictionary or a list. JSON objects become dictionaries, JSON arrays become lists. If the response isn't valid JSON, this method raises a `JSONDecodeError`. To get a DataFrame, you'd pass the result to `pd.DataFrame()` or `pd.json_normalize()` as a separate step.
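The parsing step can be shown without a live server, since `response.json()` behaves like `json.loads()` applied to the response text. The sample JSON body below is made up for illustration:

```python
import json
import pandas as pd

# A JSON response body as raw text; `response.json()` is essentially
# `json.loads(response.text)` done for you.
body = '[{"name": "Ada", "score": 95}, {"name": "Bob", "score": 87}]'

parsed = json.loads(body)        # JSON array → Python list of dicts
print(type(parsed).__name__)     # → list
print(parsed[0]["name"])         # → Ada

# Converting to a DataFrame is a separate, explicit step:
df = pd.DataFrame(parsed)
print(df.shape)                  # → (2, 2)
```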

Question 4. Which of the following is the most important reason to add time.sleep() between API requests?

  • (A) To make your code easier to debug
  • (B) To avoid overwhelming the server and violating rate limits
  • (C) To make the data download faster
  • (D) To ensure the JSON is properly formatted
Answer **Correct: (B)** Sending requests too rapidly can: (1) trigger rate limiting (429 responses), (2) overwhelm the server — especially small ones, (3) get your IP address or API key banned, and (4) be inconsiderate to other users sharing the server. Adding a delay is the simplest way to be a responsible API consumer. It does not affect debugging, download speed, or JSON formatting.
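One way to pace a loop of requests is to derive the delay from the published limit. In this sketch the request function is injected so the pacing logic can run without network access; the 30-per-minute limit is an assumed example:

```python
import time

REQUESTS_PER_MINUTE = 30          # example limit from an API's docs
DELAY = 60 / REQUESTS_PER_MINUTE  # 2.0 seconds between requests

def fetch_all(urls, get, delay=DELAY):
    """Fetch each URL politely, pausing `delay` seconds between requests.

    `get` performs the request (e.g. `requests.get`); it is passed in here
    so the pacing logic can be demonstrated offline.
    """
    results = []
    for url in urls:
        results.append(get(url))
        time.sleep(delay)  # be a responsible API consumer
    return results

# Demo with a stand-in for requests.get and a tiny delay:
fetched = fetch_all(["a", "b", "c"], get=lambda u: u.upper(), delay=0.01)
print(fetched)  # → ['A', 'B', 'C']
```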

Question 5. What does BeautifulSoup's find_all('tr') return?

  • (A) The first <tr> element in the HTML
  • (B) A list of all <tr> elements in the HTML
  • (C) The text content of all <tr> elements
  • (D) A dictionary mapping row indices to <tr> elements
Answer **Correct: (B)** `find_all()` returns a list of all matching elements (Tag objects). To get the text content, you'd call `.text` on each element individually. `find()` (without "all") returns only the *first* match. The elements are returned in document order, not as a dictionary.
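The difference between `find()` and `find_all()` is easy to see on a tiny inline document:

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")     # list of ALL matching Tag objects
first = soup.find("tr")        # only the FIRST match

print(len(rows))               # → 2
print(first.text)              # → a
print([r.text for r in rows])  # → ['a', 'b']
```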

Question 6. You need data from a website. The website has an official API and also displays the data on web pages. Which approach should you try first?

  • (A) Scrape the web pages with BeautifulSoup — it's faster to write
  • (B) Use the API — it's the official, structured, and more reliable method
  • (C) Download the entire website with wget
  • (D) Copy the data manually from the browser
Answer **Correct: (B)** Always prefer the API when one exists. APIs return structured data (usually JSON), are designed for programmatic access, and are more reliable than scraping (which breaks when the website's HTML structure changes). Scraping is a fallback for when no API exists. Copying manually doesn't scale, and downloading an entire website is excessive and likely violates ToS.

Question 7. What is robots.txt?

  • (A) A file that lists all the URLs on a website
  • (B) A configuration file that tells automated programs which parts of a site they are allowed or disallowed from accessing
  • (C) A security certificate that proves a website is legitimate
  • (D) A file that contains the website's API documentation
Answer **Correct: (B)** `robots.txt` is located at the root of a website (e.g., `https://example.com/robots.txt`). It contains directives like `Disallow: /private/` and `Crawl-delay: 10` that specify which paths bots should avoid and how frequently they should make requests. It's a voluntary protocol — not enforced by the web server — but respecting it is the ethical standard in the web community.
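Python's standard library can check these directives for you. Here the robots.txt body is supplied inline for illustration (normally `RobotFileParser` fetches it from the site's root); the example.com paths are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body like the one described above.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/public/page"))   # → True
print(parser.can_fetch("*", "https://example.com/private/page"))  # → False
```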

Question 8. Why should you never hardcode an API key directly in a Python script that you share or commit to version control?

  • (A) API keys expire after being written in code
  • (B) It makes the code harder to read
  • (C) Anyone who sees the code can use (and potentially abuse) your API key, and Git history preserves it even after deletion
  • (D) Python doesn't allow strings that long in source code
Answer **Correct: (C)** If you push an API key to a public GitHub repository, it's immediately visible to anyone. Even if you delete it in a later commit, the key remains in the Git history. Malicious actors actively scan public repositories for leaked API keys. The fix is to store keys in environment variables or configuration files excluded from version control (via `.gitignore`).
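A common pattern is to read the key from an environment variable, so the source file never contains it. The variable name `WEATHER_API_KEY` and the `setdefault` stand-in value are only for this demo:

```python
import os

# The key lives in the environment, not in the source file.  Set it in your
# shell (`export WEATHER_API_KEY=...`) or in a file excluded by .gitignore.
os.environ.setdefault("WEATHER_API_KEY", "demo-key")  # stand-in for the demo

api_key = os.environ["WEATHER_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}      # key never hardcoded
print("WEATHER_API_KEY" in os.environ)  # → True
```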

Section 2: True/False (3 questions, 5 points each)


Question 9. True or False: Web scraping is always illegal because it involves accessing someone else's server without permission.

Answer **False.** Web scraping is not inherently illegal. In the United States, the 2022 *hiQ Labs v. LinkedIn* ruling established that scraping publicly accessible data generally does not violate the Computer Fraud and Abuse Act. However, legality depends on jurisdiction, the website's Terms of Service, the type of data (personal data has additional protections), and how the scraped data is used. "Not always illegal" doesn't mean "always legal" — each case requires judgment.

Question 10. True or False: The requests library is part of Python's standard library and requires no separate installation.

Answer **False.** Despite being the most popular HTTP library in Python, `requests` is *not* part of the standard library. It must be installed separately with `pip install requests`. Python's standard library includes `urllib`, which provides basic HTTP functionality but is significantly less user-friendly than `requests`.

Question 11. True or False: If a website's robots.txt file disallows scraping, your Python script will automatically be blocked from accessing those pages.

Answer **False.** `robots.txt` is a voluntary protocol. The web server does not enforce it — your script can still technically access disallowed pages. `robots.txt` is like a "Please Do Not Disturb" sign: it communicates the website owner's wishes, but it doesn't lock the door. However, ignoring `robots.txt` is considered unethical and may be used as evidence of bad faith in legal proceedings.

Section 3: Short Answer (4 questions, 5 points each)


Question 12. Explain the difference between query parameters and request headers. When would you use each?

Answer **Query parameters** are key-value pairs appended to the URL (after `?`) that specify what data you want. Example: `?city=Portland&year=2023`. They're visible in the URL and define the request's content. **Headers** are metadata sent alongside the request that provide information about the request itself (not the data being requested). Common headers include `Authorization` (for API keys), `User-Agent` (identifying your script), and `Accept` (specifying desired response format). Use parameters for data filtering/selection. Use headers for authentication, identification, and protocol-level information.
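The split is visible in how `requests` builds a request: parameters end up in the URL, headers travel alongside it. Preparing a request does not touch the network, so this runs offline; the endpoint and token are hypothetical:

```python
import requests

req = requests.Request(
    "GET",
    "https://api.example.com/weather",          # hypothetical endpoint
    params={"city": "Portland", "year": 2023},  # what data you want → URL
    headers={"Authorization": "Bearer demo"},   # metadata about the request
).prepare()

print(req.url)
# → https://api.example.com/weather?city=Portland&year=2023
print(req.headers["Authorization"])  # → Bearer demo
```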

Question 13. What is pagination in the context of APIs, and why is it necessary? Write 2-3 sentences.

Answer Pagination is the practice of dividing a large set of results into smaller "pages," each returned by a separate API request. For example, an API with 10,000 records might return 100 records per page across 100 pages. This is necessary because sending all records in a single response would be slow, consume excessive bandwidth, and potentially crash the client's program by exceeding available memory. The client loops through pages, collecting all data incrementally.
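The page-collecting loop can be sketched with a stand-in for the API call; the `fetch_page` helper and its record counts are hypothetical:

```python
def fetch_page(page, per_page=100, total=250):
    """Stand-in for one API call returning one page of records."""
    start = (page - 1) * per_page
    return list(range(start, min(start + per_page, total)))

records, page = [], 1
while True:
    batch = fetch_page(page)
    if not batch:           # empty page → no more data
        break
    records.extend(batch)
    page += 1

print(len(records))  # → 250
```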

Question 14. You're considering scraping a news website to collect article headlines for a sentiment analysis project. Walk through the ethical checklist from the chapter: what questions should you ask, and what would reasonable answers look like?

Answer
  (1) **Is there an API?** Check the site's developer documentation. Many news sites offer APIs (e.g., NYT, Guardian).
  (2) **What does robots.txt say?** Check if article pages are disallowed.
  (3) **What do the ToS say?** Many news sites prohibit scraping for commercial use.
  (4) **Could scraping cause harm?** Rate-limit requests to avoid overloading the server.
  (5) **What about the data?** Headlines are published content — less sensitive than personal data, but copyright applies.
Reasonable approach: use the API if available; if scraping, respect robots.txt, rate-limit heavily, use the data for academic purposes only, and credit the source.

Question 15. Explain why data visible in your web browser might not appear in the HTML returned by requests.get(). What would you do in this situation?

Answer Modern websites often load content dynamically using JavaScript. When you visit a page in your browser, the browser executes the JavaScript, which may make additional API calls to fetch data and insert it into the page. `requests.get()` only retrieves the initial HTML — it does not execute JavaScript. So the data loaded by JavaScript won't be in the response. Solutions: (1) Use the browser's Developer Tools (Network tab) to find the actual API endpoint that the JavaScript calls — you can often call this endpoint directly with `requests`; (2) Use a tool like Selenium that controls a real browser and can execute JavaScript; (3) Look for an alternative data source.

Section 4: Applied Scenarios (3 questions, 5 points each)


Question 16. You're building a script to collect weather data from an API for 50 cities. The API allows 30 requests per minute and returns JSON. Describe your approach, including how you'd handle rate limiting, errors, and data storage.

Answer **Approach:**
  1. Store cities and coordinates in a list or dictionary
  2. Loop through cities, making one API request per city
  3. Add `time.sleep(2)` between requests (30 requests/min = one every 2 seconds)
  4. Check `response.status_code` before parsing; handle 429 with a longer wait
  5. Parse JSON with `response.json()` and append to a list of records
  6. After the loop, create a DataFrame and save to CSV
  7. Optionally cache intermediate results to a file so you can resume if the script crashes
The total collection time would be about 50 cities × 2 seconds ≈ 100 seconds — well within reasonable bounds.
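The skeleton of such a script might look like the sketch below. The `get_weather` stub stands in for the real API call so the control flow runs offline; its response shape, the city list, and the output filename are all assumptions (in production the final `time.sleep` would be 2 seconds, not 0.01):

```python
import time
import pandas as pd

def get_weather(city):
    """Stand-in for the real API call (hypothetical endpoint and schema)."""
    return {"status": 200, "json": {"city": city, "temp_c": 20.0}}

cities = ["Portland", "Austin", "Boston"]   # ...up to 50
records = []
for city in cities:
    resp = get_weather(city)
    if resp["status"] == 429:               # rate limited: back off, retry once
        time.sleep(60)
        resp = get_weather(city)
    if resp["status"] == 200:
        records.append(resp["json"])
    time.sleep(0.01)  # 2 s in production: 30 requests/min = one every 2 s

df = pd.DataFrame(records)
df.to_csv("weather.csv", index=False)       # persist so a crash loses nothing
print(df.shape)  # → (3, 2)
```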

Question 17. Your manager asks you to scrape a competitor's product catalog from their website nightly. The competitor's robots.txt says Disallow: /products/ and their Terms of Service prohibit automated data collection. How do you respond?

Answer You should advise against this approach and explain why. Both the `robots.txt` and the Terms of Service explicitly prohibit this activity. Even though the data is publicly visible, scraping it violates the stated wishes of the website owner and could expose the company to legal risk. Instead, suggest alternatives: (1) Check if the competitor offers a public API or data feed; (2) Use manual research for a smaller sample; (3) Use third-party services that aggregate pricing data through authorized means; (4) Focus on publicly available data that's intended for reuse (like open datasets or press releases).

Question 18. You receive data from two sources: a REST API that returns JSON with nested structures, and a web page with an HTML table. Both contain information about the same set of universities. Describe the steps to collect data from both sources and merge them into a single DataFrame.

Answer
**Step 1 (API):** Use `requests.get()` to fetch the JSON data. Parse with `response.json()`. Use `pd.json_normalize()` to flatten nested structures into a DataFrame. Identify the key column (e.g., `university_name` or `university_id`).
**Step 2 (Web page):** Use `requests.get()` to download the HTML. Parse with BeautifulSoup. Extract the table data using `find_all('tr')` and `find_all('td')`. Build a DataFrame. Alternatively, use `pd.read_html()` if the table is straightforward.
**Step 3 (Merge):** Standardize the key column in both DataFrames (e.g., strip whitespace, convert to lowercase). Use `pd.merge()` to join on the key column. Inspect the merge result for missing matches using `how='outer'` with `indicator=True`.
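Step 3 is where most merges go wrong, so here is a small sketch of the standardize-then-merge pattern. The two DataFrames stand in for the parsed API and scraped-table results; the university names and numbers are invented:

```python
import pandas as pd

# As it might come from the API (via pd.json_normalize)...
api_df = pd.DataFrame({
    "university_name": ["State University ", "tech institute"],
    "enrollment": [30000, 12000],
})
# ...and from the scraped HTML table.
web_df = pd.DataFrame({
    "university_name": ["state university", "Tech Institute"],
    "tuition": [11000, 52000],
})

# Standardize the key before merging: strip whitespace, lowercase.
for df in (api_df, web_df):
    df["university_name"] = df["university_name"].str.strip().str.lower()

merged = pd.merge(api_df, web_df, on="university_name",
                  how="outer", indicator=True)
print(merged["_merge"].value_counts()["both"])  # → 2 (every row matched)
```

Without the standardization step, `"State University "` and `"state university"` would fail to match and the outer merge would show `left_only`/`right_only` rows instead.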

Section 5: Code Analysis (2 questions, 5 points each)


Question 19. What will this code print, and what potential problem does it have?

import requests

for i in range(1000):
    response = requests.get(
        f'https://api.example.com/record/{i}'
    )
    data = response.json()
    print(data['name'])
Answer This code makes 1,000 GET requests in a tight loop with no delay, error handling, or timeout. It will attempt to print the `name` field from each response. **Problems:**
  1. **No rate limiting** — 1,000 rapid requests will likely trigger a 429 (Too Many Requests) response, and after that, `response.json()` might return an error message without a `name` key, causing a `KeyError`.
  2. **No error handling** — if any request fails (network error, timeout, server error), the script crashes.
  3. **No timeout** — if the server is slow, the script could hang indefinitely on any single request.
  4. **No delay** — this hammers the server as fast as possible, which is inconsiderate and possibly harmful.
A responsible version would add `time.sleep(1)`, `timeout=10`, status code checking, and `try/except` for error handling.

Question 20. This code attempts to scrape a table and create a DataFrame, but produces incorrect results. What's wrong?

from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>95</td></tr>
  <tr><td>Bob</td><td>87</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')

data = []
for row in rows:
    cells = row.find_all('td')
    data.append([cell.text for cell in cells])

df = pd.DataFrame(data, columns=['Name', 'Score'])
print(df)
Answer The code includes the header row (`<tr><th>Name</th><th>Score</th></tr>`) in the loop. When it processes that first `<tr>`, `find_all('td')` finds zero `<td>` elements (because the header uses `<th>`, not `<td>`), resulting in an empty list. The DataFrame will have an empty first row:
    Name Score
0   None  None    <-- From the header row (no <td> elements)
1  Alice    95
2    Bob    87
**Fix:** Skip the header row by slicing: `for row in rows[1:]:`. Alternatively, you could check `if cells:` before appending, or use `pd.read_html(html)[0]`, which handles header detection automatically.