Additional Assessments

This document contains a midterm exam covering Chapters 1--15, a comprehensive final exam covering all 35 chapters, and grading rubrics for the capstone project. All questions are designed for take-home or open-book administration unless otherwise noted.

Midterm Exam: Chapters 1--15

Instructions: Answer all 20 questions. For multiple-choice questions, select the single best answer. For short-answer questions, respond in 2--4 sentences. For code questions, write Python code that would produce the described output. Total points: 100.

Section A: Perception and Design (Questions 1--10, 5 points each)

Question 1. Name three pre-attentive visual features and, for each, give one example of how it could be used to encode data in a chart.

Expected answer: Color hue (encode categories), length/size (encode magnitude), orientation (encode direction or category). Other valid features include position, shape, motion, and enclosure. Each feature should be paired with a concrete encoding use.

Question 2. A colleague sends you a scatter plot using a rainbow (jet) color map to represent temperature. Identify two perceptual problems with this color map and recommend a specific alternative.

Expected answer: (1) Rainbow color maps are perceptually non-uniform --- equal data differences produce unequal perceived color changes, creating artificial banding. (2) They are hard to read for viewers with color vision deficiency, because distinct hues such as red and green can become indistinguishable. Recommended alternative: viridis, inferno, or cividis (any perceptually uniform sequential map is acceptable).

Question 3 (Multiple Choice). Which Gestalt principle explains why points that share the same color in a scatter plot are perceived as belonging to the same group?

(a) Proximity (b) Similarity (c) Continuity (d) Closure

Answer: (b) Similarity.

Question 4. Describe two ways a chart can be technically accurate but still misleading. For each, explain what the viewer is likely to misinterpret.

Expected answer: (1) Truncated y-axis --- the viewer overestimates the magnitude of changes because the baseline is hidden. (2) Cherry-picked date range --- the viewer misses the broader trend because only a favorable or unfavorable period is shown. Other valid answers: dual axes with mismatched scales, area-based encoding where radius is used instead of area, and non-zero-based bar charts.
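
The distortion from a truncated axis in (1) can be checked numerically. The following is a minimal sketch, not part of the expected answer: the two values and the baseline of 90 are made up, and apparent_ratio is a hypothetical helper introduced only for illustration.

```python
def apparent_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of drawn bar heights for values a and b when the axis starts at `baseline`."""
    if min(a, b) <= baseline:
        raise ValueError("baseline must be below both values")
    return (b - baseline) / (a - baseline)


# Made-up values: 100 and 105, a true difference of 5%.
honest = apparent_ratio(100, 105)                  # axis starts at 0
truncated = apparent_ratio(100, 105, baseline=90)  # axis starts at 90

print(f"zero baseline:  second bar is {honest:.2f}x the first")
print(f"baseline at 90: second bar is {truncated:.2f}x the first")
```

With the axis starting at 90, the second bar is drawn 1.5 times as tall as the first, so a 5% difference reads as a 50% difference.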

Question 5. A dataset contains monthly revenue for 8 product categories over 3 years. You need to show how each category's share of total revenue has changed over time. Which chart type is most appropriate and why?

Expected answer: A 100% stacked area chart (a stacked area chart normalized so that each month sums to 100%) is most appropriate because it shows both part-to-whole composition and change over time. An unnormalized stacked area chart emphasizes totals, which makes changes in share harder to read. A line chart would show trends but not composition. A pie chart would require one pie per month, making temporal comparison difficult.

Question 6. Define "data-ink ratio" and give one example of a chart element that would increase it and one that would decrease it.

Expected answer: Data-ink ratio is the proportion of a chart's ink (or pixels) that represents actual data. Removing a decorative border or background fill increases the ratio. Adding a 3D effect, gradient fill, or clip art decreases the ratio.

Question 7 (Multiple Choice). In the annotation hierarchy for a chart, which element should communicate the key takeaway?

(a) Axis labels (b) Title (c) Legend (d) Source attribution

Answer: (b) Title. The title should state the conclusion or key takeaway, not merely label the variable.

Question 8. Explain why small multiples are often more effective than animation for comparing how a variable changes across categories or time periods.

Expected answer: Small multiples display all comparisons simultaneously, allowing the viewer to scan and compare without relying on memory. Animation requires the viewer to remember earlier frames while watching later ones, which is cognitively demanding and error-prone. Small multiples also allow the viewer to focus on any panel at their own pace.

Question 9. What is the difference between a sequential and a diverging color palette? Give an example dataset appropriate for each.

Expected answer: Sequential palettes map a single direction of magnitude (low to high) using one hue ramp. Example: population density, rainfall amount. Diverging palettes map values that deviate in two directions from a meaningful midpoint using two hue ramps. Example: temperature anomaly (above/below average), profit/loss, election margins.
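
The difference between the two palette types shows up in how data values are normalized before a color is looked up. The sketch below uses matplotlib's Normalize and TwoSlopeNorm classes on made-up values; the rainfall and anomaly numbers are illustrative assumptions, not course data.

```python
from matplotlib.colors import Normalize, TwoSlopeNorm

# Made-up example values, used only to show how each mapping behaves.
rainfall_mm = [0, 50, 200]    # sequential: magnitude in one direction
anomaly_c = [-1.0, 0.0, 3.0]  # diverging: deviations around a midpoint of 0

# Sequential: values are spread linearly from the low end to the high end.
seq_norm = Normalize(vmin=0, vmax=200)

# Diverging: the midpoint is pinned to the center of the color map,
# even though the data range is not symmetric around it.
div_norm = TwoSlopeNorm(vmin=-1.0, vcenter=0.0, vmax=3.0)

seq_positions = [float(seq_norm(v)) for v in rainfall_mm]
div_positions = [float(div_norm(v)) for v in anomaly_c]

print(seq_positions)
print(div_positions)
```

In a plotting call, the sequential case would typically pair seq_norm with a map such as viridis, and the diverging case would pair div_norm with a map such as RdBu_r, so that an anomaly of 0 lands on the neutral center color.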

Question 10. A chart title reads "Quarterly Revenue by Region." Rewrite it as a takeaway-driven title, assuming the data shows that the Western region grew 35% while all others declined.

Expected answer: "Western Region Revenue Grew 35% While All Other Regions Declined" or similar phrasing that states the key finding. The answer must convey the specific insight, not just label the data.

Section B: matplotlib (Questions 11--20, 5 points each)

Question 11. Explain the difference between the pyplot interface and the object-oriented interface in matplotlib. When is the OO interface preferable?

Expected answer: The pyplot interface (plt.plot(), plt.title()) maintains implicit global state and operates on the "current" figure and axes. The OO interface (fig, ax = plt.subplots(); ax.plot()) provides explicit references to Figure and Axes objects. The OO interface is preferable for multi-panel figures, programmatic chart generation, embedding in applications, and any situation where you need to modify a specific Axes independently.

Question 12. Write Python code to create a 2x2 grid of subplots using matplotlib's object-oriented API. The top two panels should share a y-axis. Set the overall figure size to 10x8 inches.

Expected answer:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharey='row')
# or equivalently:
# fig = plt.figure(figsize=(10, 8))
# axes = fig.subplots(2, 2, sharey='row')

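Full credit for any correct approach using the OO API that produces a 2x2 grid with shared y-axis on the top row and a 10x8 figure size. Note that sharey='row' also shares the y-axis within the bottom row; an answer that shares only the top row (for example, by calling axes[0, 1].sharey(axes[0, 0]) on a grid created without sharey) is equally acceptable.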

Question 13 (Multiple Choice). In matplotlib's architecture, which object contains Axes objects?

(a) Artist (b) Figure (c) Renderer (d) Backend

Answer: (b) Figure.

Question 14. Describe the purpose of GridSpec and explain one scenario where plt.subplots() is insufficient and GridSpec is necessary.

Expected answer: GridSpec defines a grid of rows and columns that can be sliced to create axes of varying sizes and spans. It is necessary when you need panels of different sizes in the same figure --- for example, a large main panel on the left with two smaller panels stacked on the right, or a panel that spans two columns.

Question 15. Write Python code that takes an existing matplotlib Axes object ax and applies the following customizations: remove the top and right spines, set the x-axis label to "Year" in 12-point Arial, and set the title to "Temperature Anomalies" in 14-point bold.

Expected answer:

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlabel('Year', fontsize=12, fontfamily='Arial')
ax.set_title('Temperature Anomalies', fontsize=14, fontweight='bold')

Accept minor variations in parameter names (e.g., fontname vs. fontfamily).

Question 16 (Multiple Choice). Which matplotlib function saves a figure to a file at 300 DPI?

(a) plt.export('chart.png', dpi=300) (b) fig.savefig('chart.png', dpi=300) (c) ax.save('chart.png', dpi=300) (d) fig.render('chart.png', dpi=300)

Answer: (b) fig.savefig('chart.png', dpi=300).

Question 17. Explain what rcParams are and write a code snippet that changes the default font size to 14 and the default figure size to (10, 6).

Expected answer: rcParams is matplotlib's global configuration dictionary that controls default values for virtually all visual properties. Changes persist for the session.

import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (10, 6)
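
Because direct rcParams assignments persist for the session, a scoped change is sometimes preferable. The sketch below is an optional illustration, not a required part of the answer; it uses plt.rc_context to apply the same two settings only inside a with block.

```python
import matplotlib.pyplot as plt

default_size = plt.rcParams['font.size']

# Settings apply only while the with block is active.
with plt.rc_context({'font.size': 14, 'figure.figsize': (10, 6)}):
    fig, ax = plt.subplots()
    inside = plt.rcParams['font.size']

# On exit, rcParams return to their previous values.
after = plt.rcParams['font.size']
print(inside, after == default_size)
```

The figure created inside the block keeps its 10x6 size, while figures created afterwards use the earlier defaults again.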

Question 18. A student produces a scatter plot of 50,000 points. All points overlap into a solid blue blob. Suggest two techniques to reveal the underlying distribution without switching to a different chart type.

Expected answer: (1) Reduce alpha (transparency) so overlapping regions appear darker: ax.scatter(x, y, alpha=0.05). (2) Reduce marker size so individual points are smaller and overlap less: ax.scatter(x, y, s=1). Adding jitter is also valid when the overlap comes from discrete or rounded values. A hexbin plot reveals density as well, but it is a different chart type and so does not satisfy the constraint in the question.
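
The two techniques can be compared side by side. This is a minimal sketch on synthetic data: the normally distributed x and y arrays stand in for the student's 50,000 points and are an assumption for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data standing in for the student's 50,000 points.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=50_000)
y = x + rng.normal(scale=0.5, size=50_000)

fig, (ax_alpha, ax_small) = plt.subplots(
    1, 2, figsize=(10, 4), sharex=True, sharey=True
)

# Technique 1: low alpha, so dense regions accumulate into darker areas.
ax_alpha.scatter(x, y, alpha=0.05, edgecolors='none')
ax_alpha.set_title('alpha=0.05')

# Technique 2: tiny markers, so individual points overlap less.
ax_small.scatter(x, y, s=1, edgecolors='none')
ax_small.set_title('s=1')
```

Call plt.show() or fig.savefig(...) to view the result. The two settings can also be combined in a single scatter call.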

Question 19. Describe the difference between fig.tight_layout() and fig.set_layout_engine('constrained'). Which is recommended for new code and why?

Expected answer: tight_layout() adjusts subplot parameters to minimize overlap after the figure is created, but it does not account for elements added later (colorbars, legends outside the axes). constrained_layout (or layout='constrained' in plt.subplots()) continuously adjusts spacing as elements are added, producing more reliable results for complex figures. Constrained layout is recommended for new code because it handles colorbars, suptitles, and legends more robustly.
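
A short example makes the difference concrete. The sketch below assumes a matplotlib version that accepts layout='constrained' in plt.subplots() (3.5 or later); the 2x2 image data is a placeholder.

```python
import matplotlib.pyplot as plt

# layout='constrained' is assumed to be available (matplotlib 3.5 or later).
fig, axes = plt.subplots(2, 2, figsize=(8, 6), layout='constrained')

for ax in axes.flat:
    image = ax.imshow([[0, 1], [2, 3]], cmap='viridis')  # placeholder data
    ax.set_title('Panel')

# These are added after the axes exist; constrained layout makes room
# for them when the figure is drawn.
fig.colorbar(image, ax=axes, label='Value')
fig.suptitle('Constrained layout with a shared colorbar')
```

With tight_layout(), the call would have to come after the colorbar and suptitle are added, and the shared colorbar may still not be positioned well.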

Question 20. Write Python code to create a simple animation using FuncAnimation that moves a single point across a scatter plot from x=0 to x=10 over 100 frames.

Expected answer:

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
import numpy as np

fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(-1, 1)
point, = ax.plot([], [], 'o', markersize=10)

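def update(x):
    # Each frame value is the point's x position.
    point.set_data([x], [0])
    return point,

# 100 evenly spaced positions from x=0 to x=10, inclusive.
anim = FuncAnimation(fig, update, frames=np.linspace(0, 10, 100), interval=50, blit=True)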
plt.show()

Accept any correct implementation that creates a FuncAnimation moving a point across the specified range.

Final Exam: All Chapters

Instructions: Answer all 30 questions. Total points: 150. Questions are grouped by part but draw on knowledge from across the course.

Section A: Perception, Design, and Ethics (Questions 1--8, 5 points each)

Question 1. Explain the concept of pre-attentive processing and describe how it should influence the design of a dashboard intended for a busy operations manager.

Expected answer: Pre-attentive processing is the visual system's ability to detect certain features (color, size, orientation, motion) in under 250ms without conscious effort. A dashboard for a busy manager should use pre-attentive features to draw attention to the most important metrics: red/green for status, size for magnitude, position for ranking. Avoid using pre-attentive features for decoration, as they will compete with the signal.

Question 2. A marketing team presents a report with 15 charts, each using a different color palette. What design principle does this violate, and what would you recommend?

Expected answer: This violates the principle of consistency (and undermines the viewer's ability to build a coherent mental model). Recommend establishing a single color palette for the report and applying it consistently across all charts. Categories that appear in multiple charts should use the same color everywhere. This connects to Chapter 32's theming and style guide principles.

Question 3. Compare and contrast storytelling with data (Chapter 9) and lies with data (Chapter 4). Where is the ethical boundary?

Expected answer: Storytelling structures true data for clarity and impact using narrative techniques (context, conflict, resolution). Lying distorts data to produce a false impression (truncated axes, cherry-picked ranges, misleading encodings). The ethical boundary is honesty: a story is ethical when the audience could access the full data and reach the same conclusion. It becomes manipulation when the narrative depends on hiding or distorting evidence.

Question 4 (Multiple Choice). Which of the following is the most effective encoding channel for quantitative comparison?

(a) Color saturation (b) Area (c) Position along a common scale (d) Angle

Answer: (c) Position along a common scale. This is the most accurately perceived quantitative channel according to Cleveland and McGill's research.

Question 5. A chart shows three groups using colors that are distinguishable to you but not to a viewer with deuteranopia. Name two strategies to make the chart accessible without changing the colors.

Expected answer: (1) Add direct labels to each group so color is not the only identifier. (2) Use different line styles (solid, dashed, dotted) or marker shapes (circle, square, triangle) to provide redundant encoding. Other valid answers: add pattern fills (hatching) to bars. Increasing lightness contrast between the colors also helps, but it alters the colors and so does not meet the constraint in the question.
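
Both strategies can be combined in a few lines of matplotlib. This is a minimal sketch with made-up series; the group names and values are placeholders.

```python
import matplotlib.pyplot as plt

# Made-up series for three groups; the values carry no meaning.
x = [0, 1, 2, 3, 4]
groups = {
    'Group A': [1, 2, 3, 4, 5],
    'Group B': [2, 2, 3, 3, 4],
    'Group C': [5, 4, 3, 2, 1],
}
styles = [('-', 'o'), ('--', 's'), (':', '^')]  # (line style, marker) per group

fig, ax = plt.subplots()
for (name, values), (linestyle, marker) in zip(groups.items(), styles):
    (line,) = ax.plot(x, values, linestyle=linestyle, marker=marker)
    # Direct label at the end of each line, so the group can be
    # identified without matching a legend entry by color.
    ax.annotate(name, xy=(x[-1], values[-1]), xytext=(5, 0),
                textcoords='offset points', va='center',
                color=line.get_color())
```

The colors are left at their defaults; line style, marker shape, and the direct label each identify the group independently of color.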

Question 6. Describe the "data-ink ratio" concept and explain one situation where strictly maximizing it would produce a worse chart.

Expected answer: Data-ink ratio is the fraction of a chart's visual elements that represent data. Strictly maximizing it might remove gridlines that readers need for precise value reading in a chart where exact numbers matter (e.g., a line chart tracking stock prices where users need to read specific values). Some non-data ink serves a useful purpose.

Question 7. A colleague produces a dashboard with a real-time animation that continuously cycles through 12 monthly views. Explain why this design choice is problematic and propose an alternative.

Expected answer: Animation requires viewers to remember earlier frames while watching later ones, imposing high cognitive load. Viewers cannot compare specific months because they are never visible simultaneously. Alternative: use small multiples (12 panels, one per month) arranged chronologically, or a single chart with a user-controlled slider to select the month.
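
The small-multiples alternative could be laid out as follows. This is a sketch on synthetic data: the random daily values are placeholders, and layout='constrained' and fig.supxlabel assume a reasonably recent matplotlib (3.5 or later).

```python
import calendar

import numpy as np
import matplotlib.pyplot as plt

# Synthetic daily values for each of 12 months, standing in for dashboard data.
rng = np.random.default_rng(seed=1)
days = np.arange(1, 29)

# Shared axes keep every panel on the same scale, so panels are comparable.
fig, axes = plt.subplots(3, 4, figsize=(12, 7), sharex=True, sharey=True,
                         layout='constrained')
for month, ax in enumerate(axes.flat, start=1):
    ax.plot(days, rng.normal(loc=month, scale=1.0, size=days.size))
    ax.set_title(calendar.month_abbr[month], fontsize=10)

fig.supxlabel('Day of month')
fig.supylabel('Value')
```

All 12 months are visible at once, so any two can be compared directly instead of from memory.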

Question 8. What is the role of a chart's source attribution, and when is it acceptable to omit it?

Expected answer: Source attribution credits the data origin, enabling verification and building trust. It is acceptable to omit only when the audience already knows the source (internal dashboards where the data source is universally understood) or when the chart is exploratory and not intended for external distribution. For any published or shared chart, source attribution is mandatory.

Section B: matplotlib and seaborn (Questions 9--18, 5 points each)

Question 9. Write Python code to create a matplotlib figure with three panels using GridSpec: a large panel spanning the left two-thirds of the figure and two smaller panels stacked on the right third.

Expected answer:

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(12, 6))
gs = GridSpec(2, 3, figure=fig)
ax_main = fig.add_subplot(gs[:, :2])
ax_top_right = fig.add_subplot(gs[0, 2])
ax_bottom_right = fig.add_subplot(gs[1, 2])

Question 10. Explain the difference between seaborn figure-level functions and axes-level functions. Give one example of each and explain when you would choose one over the other.

Expected answer: Figure-level functions (e.g., sns.displot(), sns.catplot(), sns.relplot()) create their own Figure and FacetGrid, supporting row/col faceting natively. Axes-level functions (e.g., sns.histplot(), sns.barplot(), sns.scatterplot()) draw onto an existing matplotlib Axes, allowing integration with custom subplot layouts. Use figure-level for faceted exploratory analysis; use axes-level when you need to place seaborn plots into a custom GridSpec layout.

Question 11 (Multiple Choice). Which seaborn function is appropriate for visualizing the joint distribution of two continuous variables along with their marginal distributions?

(a) sns.pairplot() (b) sns.jointplot() (c) sns.heatmap() (d) sns.catplot()

Answer: (b) sns.jointplot().

Question 12. A student creates a violin plot of exam scores for 4 sections. The KDE extends below 0 and above 100, which are the minimum and maximum possible scores. What caused this artifact and how should it be fixed?

Expected answer: Kernel density estimation smooths past the observed minimum and maximum by default, so the violin extends into impossible values. Fix by passing cut=0 to sns.violinplot(), which truncates each violin at the observed data range (and therefore within 0--100), or by switching to a box plot or strip plot that does not use KDE.

Question 13. Write Python code to create a matplotlib style sheet (as a dictionary applied via rcParams) that sets: sans-serif font, 12pt default font size, no top/right spines, light gray gridlines, and a white background.

Expected answer:

import matplotlib.pyplot as plt

custom_style = {
    'font.family': 'sans-serif',
    'font.size': 12,
    'axes.spines.top': False,
    'axes.spines.right': False,
    'axes.grid': True,
    'grid.color': '#cccccc',
    'grid.linewidth': 0.5,
    'axes.facecolor': 'white',
    'figure.facecolor': 'white',
}
plt.rcParams.update(custom_style)

Question 14. Describe three differences between a histogram and a KDE plot. When would you prefer one over the other?

Expected answer: (1) Histograms use discrete bins; KDE produces a smooth continuous curve. (2) Histograms are sensitive to bin width choice; KDE is sensitive to bandwidth choice. (3) Histograms show actual counts per bin; KDE estimates a probability density function. Prefer histograms when exact counts matter, for small sample sizes, or when the audience is non-technical. Prefer KDE for comparing multiple distributions on the same axes or when a smooth representation is more informative.

Question 15. Write Python code to create a seaborn pair plot of a DataFrame df using only the columns ['temp', 'co2', 'sea_level'], colored by a column called 'decade', with the diagonal showing KDE plots.

Expected answer:

import seaborn as sns

sns.pairplot(df, vars=['temp', 'co2', 'sea_level'], hue='decade', diag_kind='kde')

Question 16. Explain what constrained_layout does in matplotlib and why it is preferred over manually adjusting hspace and wspace.

Expected answer: constrained_layout automatically calculates spacing between subplots to prevent overlapping labels, titles, colorbars, and legends. It responds dynamically to figure resizing and element addition. Manual hspace/wspace values are static, require trial and error, and break when the figure content changes. Constrained layout automates a tedious and error-prone process.

Question 17 (Multiple Choice). What does ax.set_aspect('equal') do?

(a) Makes the figure square (b) Ensures one unit on the x-axis has the same screen length as one unit on the y-axis (c) Sets the axis limits to be identical (d) Removes padding around the data

Answer: (b) Ensures one unit on the x-axis has the same screen length as one unit on the y-axis.

Question 18. A student creates a seaborn heatmap of a correlation matrix but the annotations are too small to read and the color scale is hard to interpret. Write code that fixes both issues.

Expected answer:

import seaborn as sns

sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    annot_kws={'size': 12},
    cmap='RdBu_r',
    center=0,
    vmin=-1,
    vmax=1,
    linewidths=0.5,
)

Key fixes: annot_kws={'size': 12} increases annotation font size; center=0 with a diverging colormap (RdBu_r) makes the color scale interpretable for correlations.

Section C: Interactive, Specialized, and Production (Questions 19--30, 5 points each)

Question 19. Compare Plotly Express and Altair on three dimensions: API philosophy, interactivity model, and handling of large datasets.

Expected answer: API: Plotly Express is imperative (function calls with keyword arguments); Altair is declarative (mark + encoding specification). Interactivity: Plotly provides zoom, pan, and hover by default; Altair supports linked selections and conditional encoding. Large data: Plotly Express has no built-in row limit and can use WebGL rendering for large scatter plots, although every point is still sent to the browser; Altair has a default 5,000-row limit requiring pre-aggregation or the vegafusion transformer.

Question 20. Write Plotly Express code to create an animated scatter plot of a DataFrame df with x='year', y='temperature', color='region', and animation_frame='decade'.

Expected answer:

import plotly.express as px

fig = px.scatter(df, x='year', y='temperature', color='region',
                 animation_frame='decade')
fig.show()

Question 21. Explain two risks of using a choropleth map to display population data and suggest an alternative visualization.

Expected answer: (1) Large geographic areas (e.g., Siberia, Alaska) dominate the visual impression regardless of their population. (2) Variation within regions is hidden; a single color represents the entire area. Alternative: a dot density map, a cartogram that sizes regions by population, or a bubble map placed at region centroids.

Question 22 (Multiple Choice). In a Streamlit dashboard, which function would you use to prevent a data-loading function from re-executing on every user interaction?

(a) st.session_state (b) st.cache_data (c) st.experimental_memo (d) st.rerun

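Answer: (b) st.cache_data. Option (c), st.experimental_memo, is the deprecated predecessor of st.cache_data and is not the recommended choice in current Streamlit.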

Question 23. Describe the callback model in Dash. How does it differ from Streamlit's execution model, and what advantage does it provide?

Expected answer: Dash uses explicit callbacks: Python functions decorated with @app.callback that specify Input, Output, and State components. When an input changes, only the relevant callback fires. Streamlit reruns the entire script on every interaction. Dash's model provides finer-grained control over what updates when, making it more efficient for complex applications with many interdependent components.

Question 24. A time-series chart of stock prices shows 20 years of daily data. The chart is a solid mass of lines with no visible trend. Suggest three visualization strategies to reveal the underlying patterns.

Expected answer: (1) Apply a rolling average (e.g., 200-day moving average) to smooth daily noise and reveal the trend. (2) Resample to weekly or monthly data to reduce the number of points. (3) Use a seasonal decomposition to separate the trend, seasonal, and residual components into separate panels. Other valid answers: horizon charts, log scale to show relative changes, or small multiples by year.
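
Strategies (1) and (2) take only a few lines in pandas. The sketch below uses a synthetic random-walk series in place of real prices; the date range, seed, and volatility are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk "prices" on business days; not real market data.
rng = np.random.default_rng(seed=2)
dates = pd.bdate_range('2004-01-01', '2023-12-31')
steps = rng.normal(scale=0.01, size=len(dates))
prices = pd.Series(100 * np.exp(steps.cumsum()), index=dates)

# Strategy 1: a 200-day rolling mean smooths daily noise.
# The first 199 values are NaN because the window is not yet full.
trend = prices.rolling(window=200).mean()

# Strategy 2: reduce to one value per month (here, the last price of each month).
monthly = prices.groupby(prices.index.to_period('M')).last()

print(len(prices), len(monthly))
```

Plotting trend over a faint line of prices, or plotting monthly on its own, replaces thousands of daily points with a shape the eye can follow.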

Question 25. Write Python code for a minimal Streamlit app that loads a CSV file, lets the user select a column from a dropdown, and displays a histogram of the selected column.

Expected answer:

import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
column = st.selectbox('Select a column', df.select_dtypes('number').columns)
fig, ax = plt.subplots()
ax.hist(df[column], bins=30, edgecolor='black')
ax.set_xlabel(column)
ax.set_ylabel('Count')
st.pyplot(fig)

Question 26. Explain the purpose of a visualization style guide and list four elements it should contain.

Expected answer: A visualization style guide ensures consistency across all charts produced by a team or organization, building brand recognition and reducing decision fatigue. It should contain: (1) the official color palette with hex codes, (2) approved font families and size hierarchy, (3) chart templates for common use cases (bar, line, scatter), and (4) rules for logo placement, source attribution, and annotation style. Other valid elements: spacing and margin standards, approved chart types, accessibility requirements.

Question 27 (Multiple Choice). Which technique is most appropriate for visualizing 1 million points in a scatter plot?

(a) Increase the figure size to 30x30 inches (b) Use datashader or density-based aggregation (c) Randomly sample 1 million points down to 100 (d) Switch to a 3D scatter plot

Answer: (b) Use datashader or density-based aggregation.

Question 28. Describe the complete visualization workflow from Chapter 33 (all eight steps). For each step, write one sentence explaining its purpose.

Expected answer: (1) Question: define the specific question the visualization will answer. (2) Data: acquire, clean, and structure the data for the intended chart type. (3) Sketch: draw the chart by hand to explore layout, encoding, and annotation before writing code. (4) Encode: map data variables to visual channels (position, color, size, shape) based on perception science. (5) Build: implement the sketch in Python code using the appropriate library. (6) Refine: adjust styling, typography, color, spacing, and annotation for clarity and aesthetics. (7) Critique: review the chart against design principles, test with a sample audience, and check accessibility. (8) Publish: export in the appropriate format and resolution for the target medium.

Question 29. A student's capstone project uses matplotlib for static figures, Plotly for interactive charts, and Streamlit for the dashboard. Each tool uses its own default styling, producing a visually inconsistent project. How should the student fix this?

Expected answer: Create a unified style definition: a color palette (list of hex codes), a font specification, and margin/spacing standards. Apply these to matplotlib via rcParams or a style sheet, to Plotly via a template or layout dictionary, and to Streamlit via custom CSS or theme configuration. The colors, fonts, and overall visual tone should be identical across all three tools.
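
One way to keep the definition in a single place is sketched below. The palette, font family, and font size are hypothetical values, and PLOTLY_LAYOUT is an illustrative name; the Plotly settings are built as a plain dictionary so that the sketch runs without Plotly installed.

```python
import matplotlib.pyplot as plt
from cycler import cycler

# Hypothetical project palette and font; replace with the project's own choices.
PALETTE = ['#1b9e77', '#d95f02', '#7570b3', '#e7298a']
FONT_FAMILY = 'sans-serif'
FONT_SIZE = 12

# matplotlib: apply the shared values through rcParams.
plt.rcParams.update({
    'axes.prop_cycle': cycler(color=PALETTE),
    'font.family': FONT_FAMILY,
    'font.size': FONT_SIZE,
})

# Plotly: the same values as a layout dictionary, which could be applied
# with fig.update_layout(**PLOTLY_LAYOUT) on any Plotly figure.
PLOTLY_LAYOUT = {
    'colorway': PALETTE,
    'font': {'family': FONT_FAMILY, 'size': FONT_SIZE},
}
```

Streamlit's theme is configured separately, so the same colors and font would need to be entered there by hand.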

Question 30. Reflect on the course. Choose one visualization you created early in the semester and one you created near the end. Describe three specific improvements in the later chart that demonstrate what you learned.

Expected answer: This is an open-ended reflection question. Grade based on the specificity and accuracy of the principles cited. Strong answers reference specific concepts by name (pre-attentive processing, data-ink ratio, annotation hierarchy, perceptually uniform color maps, etc.) and connect them to concrete changes in the chart. Weak answers use vague language ("it looks better") without reference to principles.

Capstone Project Rubric

The capstone project requires students to tell a complete data story: from raw data through exploration, design, implementation, and presentation. Grade each category on a 5-point scale (1--5); the point value in each category heading is that category's weight in the 100-point total.

Category 1: Question and Data (20 points)

Score	Criteria
5	Clear, specific, answerable question. Data is appropriate, well-documented, and properly cited.
4	Good question with minor vagueness. Data is appropriate and cited.
3	Question is broad or generic. Data is adequate but documentation is thin.
2	Question is unclear. Data choice is questionable or poorly documented.
1	No clear question. Data source is unattributed or inappropriate.

Category 2: Design and Perception (25 points)

Score	Criteria
5	Chart types are well-chosen and justified. Color palettes are perceptually sound and accessible. Data-ink ratio is optimized. Design choices are grounded in perception science principles.
4	Good chart type selection. Colors are accessible. Minor design issues (slight clutter, one suboptimal encoding).
3	Adequate chart types but no explicit justification. Some color or encoding issues. Design principles are applied inconsistently.
2	Poor chart type selection for the data. Color palette is inaccessible or misleading. Significant clutter.
1	Chart types are inappropriate. No evidence of design thinking.

Category 3: Technical Implementation (20 points)

Score	Criteria
5	Code is clean, well-organized, and uses appropriate libraries. Object-oriented matplotlib API is used correctly. Figures are properly sized and exported at appropriate resolution.
4	Code is functional and mostly organized. Minor style issues. Figures are properly exported.
3	Code works but is disorganized. Mixed pyplot/OO usage. Some export issues.
2	Code has errors or produces incorrect output. Poor organization.
1	Code does not run or produces fundamentally wrong visualizations.

Category 4: Annotation and Storytelling (20 points)

Score	Criteria
5	Every chart has a takeaway-driven title, proper axis labels with units, source attribution, and targeted annotations. Charts are sequenced into a clear narrative arc.
4	Good annotation on most charts. Narrative structure is present but could be tightened.
3	Some charts lack titles or labels. Narrative is present but weak. Annotations are generic.
2	Most charts lack proper annotation. No clear narrative structure.
1	Charts are unlabeled. No narrative structure.

Category 5: Consistency and Polish (15 points)

Score	Criteria
5	Consistent color palette, typography, and styling across all charts and outputs. Professional presentation quality. All outputs (static, interactive, dashboard) share a unified visual identity.
4	Mostly consistent. One or two charts deviate from the established style.
3	Inconsistent styling across charts. Some professional quality.
2	Each chart looks different. Little attention to consistency.
1	No consistent styling. Outputs use default styling throughout.

Grading Scale

Total Points	Grade
90--100	A
80--89	B
70--79	C
60--69	D
Below 60	F
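
The rubric gives 1--5 scores and category point values but does not state how to convert between them. The sketch below assumes linear scaling (score divided by 5, times the category's points); that conversion, the function names, and the example scores are all assumptions for illustration.

```python
# Category weights from the rubric headings (they sum to 100).
WEIGHTS = {
    'Question and Data': 20,
    'Design and Perception': 25,
    'Technical Implementation': 20,
    'Annotation and Storytelling': 20,
    'Consistency and Polish': 15,
}


def capstone_total(scores: dict[str, int]) -> float:
    """Convert 1-5 rubric scores to points, assuming linear scaling."""
    if set(scores) != set(WEIGHTS):
        raise ValueError('scores must cover exactly the five rubric categories')
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError('each score must be between 1 and 5')
    return sum(scores[name] * weight / 5 for name, weight in WEIGHTS.items())


def letter_grade(total: float) -> str:
    """Map a point total to the letter grade in the grading scale."""
    for cutoff, grade in [(90, 'A'), (80, 'B'), (70, 'C'), (60, 'D')]:
        if total >= cutoff:
            return grade
    return 'F'


# Hypothetical student scores.
example = {
    'Question and Data': 5,
    'Design and Perception': 4,
    'Technical Implementation': 4,
    'Annotation and Storytelling': 3,
    'Consistency and Polish': 4,
}
total = capstone_total(example)
print(total, letter_grade(total))
```

For these example scores the total is 80 points, which falls in the B range. Under this scaling the lowest possible total is 20, since every category scores at least 1.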

Submission Requirements

Students must submit:

A Jupyter notebook or Python script containing all code, organized into clearly labeled sections.
Exported static figures in PNG format at 300 DPI.
A written narrative (800--1,200 words) explaining the data story, design choices, and tools used.
If applicable, a link to a deployed Streamlit or Dash dashboard.
A one-paragraph reflection on which course principles most influenced the project.