Quiz: Chapter 27

DataField.Dev

Working with Geospatial Data: Maps, Spatial Joins, and Location-Based Analysis

Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.

Question 1 (Multiple Choice)

When creating a GeoDataFrame from a pandas DataFrame with latitude and longitude columns, what is the correct order for the Point() constructor?

A) Point(latitude, longitude)
B) Point(longitude, latitude)
C) Point(latitude, longitude, elevation)
D) The order does not matter

Answer: B) Point(longitude, latitude). Shapely and most geospatial libraries follow the mathematical convention of (x, y), which maps to (longitude, latitude): longitude is the x-axis (east-west) and latitude is the y-axis (north-south). This is one of the most common geospatial bugs, because human convention says "lat, lon" but geospatial code expects "lon, lat." Getting the order wrong places your points in the wrong location, and geopandas will not raise an error; it will silently plot customers in the ocean.
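One possible guard against this bug, sketched in plain Python: make callers name each coordinate and range-check it before building the (x, y) pair. The helper name to_xy is hypothetical and is not part of shapely or geopandas.

```python
from typing import Tuple


def to_xy(*, lat: float, lon: float) -> Tuple[float, float]:
    """Return (x, y) = (lon, lat), the order Point() expects.

    Hypothetical helper: keyword-only arguments make the caller name each
    coordinate, so the order cannot be swapped by position.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return (lon, lat)


# New York City: x is the longitude, y is the latitude.
nyc_xy = to_xy(lat=40.7, lon=-74.0)

# A swapped pair for a point in Montana (true longitude is about -110) is
# rejected, because -110 is not a valid latitude.
try:
    to_xy(lat=-110.0, lon=47.0)
    swapped_detected = False
except ValueError:
    swapped_detected = True
```

The tuple can then be unpacked as Point(*nyc_xy). The range check is only a partial guard: a swapped New York pair (lat=-74.0, lon=40.7) stays within both ranges and would pass.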

Question 2 (Multiple Choice)

You compute point_a.distance(point_b) on two points in a GeoDataFrame with CRS EPSG:4326 (WGS84). The result is 0.045. What does this number represent?

A) 0.045 kilometers
B) 0.045 miles
C) 0.045 degrees (approximately 5 km, depending on latitude)
D) 0.045 meters

Answer: C) 0.045 degrees (approximately 5 km, depending on latitude). EPSG:4326 uses degrees as its unit. The .distance() method computes Euclidean distance in the coordinate system's native units. On an unprojected CRS, this produces degrees, which vary in real-world distance depending on latitude. At the equator, 1 degree of longitude is about 111 km. At 60 degrees latitude, it is about 55 km. To get distance in meters or kilometers, either reproject to a local projected CRS or use the haversine formula.
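The arithmetic behind "approximately 5 km, depending on latitude" can be sketched as follows. This assumes a spherical Earth and a constant of about 111.32 km per degree of latitude; the function names are illustrative only.

```python
import math

KM_PER_DEGREE_LATITUDE = 111.32  # approximate; assumes a spherical Earth


def km_per_degree_longitude(latitude_deg: float) -> float:
    """Approximate east-west length of one degree of longitude at a latitude."""
    return KM_PER_DEGREE_LATITUDE * math.cos(math.radians(latitude_deg))


def degrees_to_km_east_west(delta_deg: float, latitude_deg: float) -> float:
    """Convert an east-west offset in degrees to kilometers at a latitude."""
    return delta_deg * km_per_degree_longitude(latitude_deg)


# The same 0.045-degree offset covers about half the ground distance at 60 N.
at_equator = degrees_to_km_east_west(0.045, 0.0)    # about 5.0 km
at_60_north = degrees_to_km_east_west(0.045, 60.0)  # about 2.5 km
```

The conversion applies to east-west offsets only; one degree of latitude stays close to 111 km everywhere, which is why a single degree-based distance has no fixed meaning in kilometers.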

Question 3 (Multiple Choice)

Which CRS should you use to store geospatial data for a web application that receives coordinates from GPS devices?

A) EPSG:5070 (Albers Equal Area Conic)
B) EPSG:3857 (Web Mercator)
C) EPSG:4326 (WGS84)
D) EPSG:32618 (UTM Zone 18N)

Answer: C) EPSG:4326 (WGS84). GPS devices output coordinates in WGS84 (latitude and longitude in degrees). It is the standard for data storage and exchange. EPSG:3857 (B) is used internally by web map tiles (Google Maps, OpenStreetMap) for display, not storage. EPSG:5070 (A) and EPSG:32618 (D) are projected CRS used for analysis and measurement, not for storing raw GPS data.

Question 4 (Multiple Choice)

You call .buffer(50000) on a point in a GeoDataFrame with CRS EPSG:4326. What happens?

A) A circular buffer with a 50 km radius is created correctly
B) A buffer of 50,000 degrees is created, covering most of the planet
C) Geopandas raises a CRS error
D) A buffer of 50,000 meters is created in the projected CRS

Answer: B) A buffer of 50,000 degrees is created, covering most of the planet. The .buffer() method interprets its argument in the CRS's native units. For EPSG:4326, the unit is degrees, so 50,000 degrees creates an absurdly large buffer. Geopandas does not raise an error; recent versions only emit a warning that buffer results in a geographic CRS are likely incorrect, and the wrong geometry is still returned. The correct approach: reproject to a meter-based CRS (e.g., UTM or Albers Equal Area), call .buffer(50000) to get a 50 km buffer in meters, then reproject back to EPSG:4326 if needed.
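Picking the meter-based CRS is the step people tend to skip. As a rough sketch, the WGS84 UTM zone for a location can be derived from its longitude and hemisphere. The function name utm_epsg is hypothetical, and the code ignores the special zone exceptions around Norway and Svalbard.

```python
import math


def utm_epsg(lon: float, lat: float) -> int:
    """Return the EPSG code of the WGS84 UTM zone containing (lon, lat).

    Sketch only: ignores the special zone exceptions around Norway and
    Svalbard, and UTM is not intended for use near the poles.
    """
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    zone = min(max(zone, 1), 60)  # lon = 180 would otherwise give zone 61
    base = 32600 if lat >= 0 else 32700  # northern vs. southern hemisphere
    return base + zone


nyc_code = utm_epsg(-74.0, 40.7)      # New York City, UTM zone 18N
sydney_code = utm_epsg(151.2, -33.9)  # Sydney, UTM zone 56S
```

With geopandas, the result could be passed as gdf.to_crs(epsg=nyc_code) before calling .buffer(50000). Geopandas also provides estimate_utm_crs(), which may be preferable to a hand-rolled helper.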

Question 5 (Multiple Choice)

In a spatial join using gpd.sjoin(points, polygons, predicate='within'), what does the predicate='within' parameter specify?

A) Polygons that are within the bounding box of the points
B) Points that fall entirely inside a polygon
C) Points that are within a specified distance of a polygon boundary
D) Polygons that contain at least one point

Answer: B) Points that fall entirely inside a polygon. The 'within' predicate tests whether the left geometry (points) is completely contained by the right geometry (polygons). For point-in-polygon queries, this is the standard predicate. A point that lies exactly on a polygon's boundary returns False for 'within', because the predicate requires the point to touch the polygon's interior; use 'intersects' if boundary cases matter. Choice D describes 'contains', which tests the inverse relationship.

Question 6 (Multiple Choice)

Why does geopandas use a spatial index (R-tree) for spatial joins?

A) To ensure geometric precision at all zoom levels
B) To reduce the number of geometry comparisons from O(n * m) to approximately O(n * log(m))
C) To automatically reproject geometries to matching CRS
D) To compress geometry data for smaller file sizes

Answer: B) To reduce the number of geometry comparisons from O(n * m) to approximately O(n * log(m)). Without a spatial index, joining n points against m polygons requires checking every point against every polygon: n * m comparisons. The R-tree spatial index organizes geometries by their bounding boxes, allowing the algorithm to quickly eliminate polygons that cannot possibly contain a given point. For a join of 100,000 points against 3,000 polygons, this reduces comparisons from 300 million to roughly hundreds of thousands.
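The filter-then-refine idea can be illustrated without an R-tree. The sketch below substitutes a much simpler uniform grid index and uses axis-aligned rectangles in place of polygon bounding boxes. It is not how geopandas implements its index, and all names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)
Cell = Tuple[int, int]


def build_grid_index(boxes: List[BBox], cell_size: float) -> Dict[Cell, List[int]]:
    """Map each grid cell to the indices of the boxes that overlap it."""
    index: Dict[Cell, List[int]] = defaultdict(list)
    for i, (minx, miny, maxx, maxy) in enumerate(boxes):
        for cx in range(int(minx // cell_size), int(maxx // cell_size) + 1):
            for cy in range(int(miny // cell_size), int(maxy // cell_size) + 1):
                index[(cx, cy)].append(i)
    return index


def candidate_boxes(index: Dict[Cell, List[int]], x: float, y: float,
                    cell_size: float) -> List[int]:
    """Filter step: only boxes registered in the point's grid cell."""
    return index.get((int(x // cell_size), int(y // cell_size)), [])


def box_contains(box: BBox, x: float, y: float) -> bool:
    """Refine step: exact containment test against one box."""
    minx, miny, maxx, maxy = box
    return minx <= x <= maxx and miny <= y <= maxy


# 100 unit squares tiling a 10 x 10 area stand in for polygon bounding boxes.
boxes = [(float(i), float(j), i + 1.0, j + 1.0) for i in range(10) for j in range(10)]
index = build_grid_index(boxes, cell_size=1.0)

x, y = 2.5, 3.5
candidates = candidate_boxes(index, x, y, cell_size=1.0)
matches = [i for i in candidates if box_contains(boxes[i], x, y)]
```

For this point the index returns 4 candidate boxes instead of all 100, and only those 4 get the exact test. An R-tree achieves a similar reduction while adapting to unevenly sized and unevenly distributed geometries, which a fixed grid handles poorly.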

Question 7 (Multiple Choice)

Which is the best feature to feed into a machine learning model for predicting customer satisfaction?

A) Raw latitude
B) Raw longitude
C) Distance to nearest warehouse in kilometers
D) A concatenated string of "lat,lon"

Answer: C) Distance to nearest warehouse in kilometers. Raw latitude (A) and longitude (B) have no direct relationship to satisfaction, so a model might learn spurious splits like "latitude > 40" that do not generalize. The concatenated string (D) is categorical and would create thousands of unique values. Distance to nearest warehouse captures a plausible causal mechanism (delivery speed depends on distance from the fulfillment center), making it interpretable, generalizable, and directly actionable.
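A minimal sketch of how such a feature might be computed in plain Python, using great-circle distance on a spherical Earth. The warehouse and customer coordinates are made-up illustrative values, and the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical-Earth assumption
LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dist_nearest_warehouse_km(customer: LatLon, warehouses: Iterable[LatLon]) -> float:
    """Feature value: distance from one customer to the closest warehouse."""
    lat, lon = customer
    return min(haversine_km(lat, lon, w_lat, w_lon) for w_lat, w_lon in warehouses)


# Illustrative coordinates only (roughly Newark NJ and Reno NV).
warehouses = [(40.7, -74.2), (39.5, -119.8)]
customer_mt = (45.8, -108.5)  # roughly Billings, Montana

feature_value = dist_nearest_warehouse_km(customer_mt, warehouses)
```

The brute-force min() is fine for a handful of warehouses. For large customer tables, a vectorized or indexed nearest-neighbor search would be the more practical choice.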

Question 8 (Multiple Choice)

You are creating a choropleth map of median income by US county. Which color scale is most appropriate?

A) A diverging scale (red-white-blue) centered at the national median
B) A sequential scale (light-to-dark blue) from low to high
C) A qualitative scale (distinct colors for each county)
D) A binary scale (two colors: above and below median)

Answer: B) A sequential scale (light-to-dark blue) from low to high. Median income is a continuous, unipolar variable --- it goes from low to high with no meaningful midpoint. A sequential color scale correctly communicates "more = darker." A diverging scale (A) is appropriate when there is a meaningful center point (e.g., election margin: Democrat vs. Republican). A qualitative scale (C) is for categorical variables. A binary scale (D) discards the continuous information.

Question 9 (Multiple Choice)

You have customer data in EPSG:4326 and state boundaries in EPSG:2163 (US National Atlas Equal Area). You perform gpd.sjoin(customers, states) without reprojecting. What happens?

A) Geopandas automatically reprojects to a common CRS
B) The spatial join runs but produces incorrect results due to CRS mismatch
C) Geopandas raises a CRS mismatch error
D) The spatial join runs correctly because both CRS cover the US

Answer: B) The spatial join runs but produces incorrect results due to CRS mismatch. In current geopandas releases, sjoin checks whether both GeoDataFrames share the same CRS and, if they differ, emits a UserWarning about the mismatch; it neither raises an error nor reprojects for you. The join then compares coordinates in degrees against coordinates in meters, so few or no customers are matched to a state. The fix: reproject one GeoDataFrame to match the other using .to_crs() before the join.

Question 10 (Multiple Choice)

What is the haversine formula used for?

A) Converting between CRS projections
B) Calculating great-circle distance between two points on a sphere
C) Performing spatial joins without a spatial index
D) Creating equal-area map projections

Answer: B) Calculating great-circle distance between two points on a sphere. The haversine formula computes the shortest distance between two points on the surface of a sphere, given their latitude and longitude. It accounts for the Earth's curvature, unlike Euclidean distance on unprojected coordinates. It assumes a perfectly spherical Earth (radius 6,371 km), which introduces a small error (up to 0.5%) compared to the Vincenty formula (which models Earth as an ellipsoid). For data science applications, the haversine formula is sufficiently accurate.
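For reference, the formula can be written as follows, where the latitudes are \varphi_1 and \varphi_2 and the longitudes are \lambda_1 and \lambda_2, all in radians, and R is the Earth's radius (6,371 km, as above). The symbol names are a conventional choice rather than the chapter's own notation.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
  + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
```

As a quick check, for two points on the equator one degree of longitude apart, a = \sin^2(0.5^\circ), so d = R \cdot \pi / 180 \approx 111.19 km, which agrees with the usual figure of about 111 km per degree.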

Question 11 (Short Answer)

Explain the difference between a shapefile, GeoJSON, and GeoPackage. Which would you recommend for a new project, and why?

Answer: A shapefile is a legacy format consisting of 3-6 paired files (.shp, .shx, .dbf, .prj, etc.) --- widely supported but cumbersome to manage. GeoJSON is a single JSON-based file that is human-readable and web-friendly but inefficient for large datasets. GeoPackage is a single SQLite-based file that supports multiple layers, large datasets, and spatial indexing. For a new project, GeoPackage is recommended: it combines the portability of a single file with the performance and feature set needed for serious geospatial work. GeoJSON is the right choice for web APIs and small datasets.

Question 12 (Short Answer)

A colleague uses state_churn_rate (the average churn rate for each customer's state) as a feature in a churn prediction model. They compute this from the full dataset before splitting into train and test sets. Why is this a problem? How would you fix it?

Answer: This is target leakage. The state churn rate is computed from the target variable (churned/not churned), and computing it on the full dataset means the test set's churn information leaks into the feature. A customer's own churn status contributes to their state's churn rate. Fix: compute state churn rates only from the training set and apply them to the test set. For even more protection, use leave-one-out encoding where each customer's state rate is computed from all other customers in that state within the training fold.
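The fix can be sketched in plain Python as follows. State rates are fitted on the training rows only, training rows get a leave-one-out value, and test rows get the training-set rate. The data and function names are illustrative, and falling back to the global training rate for unseen or single-customer states is one assumption among several reasonable choices.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (state, churned) with churned in {0, 1}


def fit_state_counts(train_rows: List[Row]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Sum churn flags and count customers per state, training rows only."""
    sums: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for state, churned in train_rows:
        sums[state] += churned
        counts[state] += 1
    return sums, counts


def encode_train_leave_one_out(train_rows, sums, counts, global_rate) -> List[float]:
    """Each training row's rate excludes that row's own churn label."""
    encoded = []
    for state, churned in train_rows:
        others = counts[state] - 1
        encoded.append((sums[state] - churned) / others if others > 0 else global_rate)
    return encoded


def encode_test(test_states, sums, counts, global_rate) -> List[float]:
    """Test rows use training-set rates; unseen states get the global rate."""
    return [sums[s] / counts[s] if counts.get(s, 0) > 0 else global_rate
            for s in test_states]


train = [("MT", 1), ("MT", 0), ("MT", 1), ("NJ", 0), ("NJ", 0), ("TX", 1)]
global_rate = sum(churned for _, churned in train) / len(train)

sums, counts = fit_state_counts(train)
train_feature = encode_train_leave_one_out(train, sums, counts, global_rate)
test_feature = encode_test(["MT", "NJ", "CA"], sums, counts, global_rate)
```

No test-set label is read at any point, so the feature carries no information about the test targets. In cross-validation, the same fitting step would be repeated inside each training fold.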

Question 13 (Short Answer)

You are analyzing ShopSmart delivery data and notice that customers in Montana have an average delivery time of 6.2 days vs. 2.1 days in New Jersey. List three possible explanations, at least one of which is purely geographic and one that is not.

Answer: Geographic: Montana customers are far from all fulfillment centers, so shipments travel longer distances (the nearest FC may be 800+ km away vs. under 100 km for New Jersey). Non-geographic: Montana may have fewer carrier service options (FedEx/UPS frequency) or rely on ground shipping where New Jersey qualifies for next-day air. Confound: Montana customers may disproportionately order bulky or hazardous items that require ground-only shipping, inflating average delivery times independently of location.

Question 14 (Short Answer)

Explain why MarkerCluster from folium.plugins is necessary when plotting thousands of points on a folium map.

Answer: Rendering thousands of individual markers in a web browser causes severe performance degradation --- the JavaScript layer must track each marker's position, tooltip, and click handler, leading to slow load times and unresponsive maps. MarkerCluster groups nearby markers into a single cluster icon at low zoom levels and progressively expands them as the user zooms in. This reduces the number of DOM elements rendered at any given zoom level from thousands to dozens, keeping the map responsive.

Question 15 (Short Answer)

You build a churn model for StreamFlow with and without spatial features. The model without spatial features achieves AUC 0.72. The model with state_churn_rate and dist_nearest_datacenter_km achieves AUC 0.78. Write two sentences interpreting this result for a business audience.

Answer: Adding geographic information to our churn model raised its AUC from 0.72 to 0.78, meaning the model is noticeably better at ranking customers who will churn above those who will stay. This tells us that where a customer is located (specifically, which state they are in and how far they are from our nearest data center) carries real predictive signal about churn risk that is not captured by account-level features like plan tier and tenure alone.
