Chapter 27 Quiz: Logistic Regression and Classification — Predicting Categories

Instructions: This quiz tests your understanding of Chapter 27. Answer all questions before checking the solutions. For multiple choice, select the best answer. For short answer questions, aim for 2-4 clear sentences. Total points: 95.


Section 1: Multiple Choice (10 questions, 4 points each)


Question 1. The sigmoid function transforms its input into a value between:

  • (A) -1 and 1
  • (B) 0 and 1
  • (C) -infinity and infinity
  • (D) 0 and 100
Answer **Correct: (B)** The sigmoid function maps any real number to the range (0, 1). This is precisely what makes it useful for classification — the output can be interpreted as a probability. (A) describes the range of hyperbolic tangent. (C) describes the input range, not the output. (D) would require multiplying by 100.
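A one-line sketch of the sigmoid makes the (0, 1) range concrete (NumPy version; the function name here is ours, not from the chapter):

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5, the midpoint used as the default decision boundary
print(sigmoid(10.0))   # approaches 1 for large positive inputs
print(sigmoid(-10.0))  # approaches 0 for large negative inputs
```

No matter how extreme the input, the output never reaches 0 or 1 exactly.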

Question 2. Despite its name, logistic regression is used for:

  • (A) Predicting continuous numerical values
  • (B) Classification — predicting categorical outcomes
  • (C) Clustering data into groups
  • (D) Reducing the number of features
Answer **Correct: (B)** Logistic regression is a classification algorithm that predicts the probability of an observation belonging to a category. The "regression" in its name is historical — it uses a regression framework (weighted sum of features) internally, but the sigmoid function transforms the output into a probability for classification.

Question 3. In a confusion matrix, a false positive occurs when:

  • (A) The model correctly predicts the positive class
  • (B) The model predicts positive, but the actual class is negative
  • (C) The model predicts negative, but the actual class is positive
  • (D) The model correctly predicts the negative class
Answer **Correct: (B)** A false positive (also called a Type I error or false alarm) occurs when the model predicts positive but the observation is actually negative. For example, a spam filter marks a legitimate email as spam. (A) describes a true positive. (C) describes a false negative. (D) describes a true negative.

Question 4. Recall measures:

  • (A) Of all positive predictions, how many are correct
  • (B) Of all actual positives, how many are correctly identified
  • (C) The overall percentage of correct predictions
  • (D) The probability threshold used for classification
Answer **Correct: (B)** Recall (sensitivity) = TP / (TP + FN). It answers: "Of all the cases that are actually positive, what fraction did the model catch?" (A) describes precision. (C) describes accuracy. (D) is unrelated.
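To keep the formulas straight, here is a quick check using small made-up confusion-matrix counts (not taken from the chapter):

```python
tp, fp, fn, tn = 8, 2, 4, 86  # hypothetical counts

precision = tp / (tp + fp)  # of all positive predictions, how many are correct
recall = tp / (tp + fn)     # of all actual positives, how many were caught
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
# precision=0.80 recall=0.67 accuracy=0.94
```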

Question 5. The predict_proba method in scikit-learn returns:

  • (A) Binary predictions (0 or 1)
  • (B) The model's coefficients
  • (C) Probability estimates for each class
  • (D) The confusion matrix
Answer **Correct: (C)** `predict_proba` returns an array with probability estimates for each class. For binary classification, it returns two columns: the probability of class 0 and the probability of class 1 (which sum to 1). (A) describes `predict`. (B) is accessed via `model.coef_`. (D) is computed using `confusion_matrix`.
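A tiny synthetic example (data invented here purely for illustration) shows the shape of what `predict_proba` returns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X[:5])

print(proba.shape)        # (5, 2): one column per class
print(proba.sum(axis=1))  # each row sums to 1
```

For binary problems, column 1 holds the probability of the positive class, and `predict` is equivalent to thresholding that column at 0.5.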

Question 6. A model predicting a rare disease achieves 99% accuracy, but the disease occurs in only 1% of the population. This means:

  • (A) The model is excellent and should be deployed
  • (B) A model that always predicts "no disease" would also achieve 99% accuracy
  • (C) The model must have very high recall
  • (D) Class imbalance is not a concern here
Answer **Correct: (B)** With 1% disease prevalence, a model that always predicts "no disease" achieves 99% accuracy — no learning required. The 99% accuracy is therefore not evidence that the model has learned anything useful. This is the class imbalance problem: accuracy is misleading when classes are highly unequal. You need to examine precision, recall, and the confusion matrix to evaluate the model properly.
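The majority-class baseline is easy to demonstrate with invented numbers at 1% prevalence:

```python
import numpy as np

y_true = np.array([0] * 990 + [1] * 10)  # 990 healthy, 10 sick (illustrative)
y_pred = np.zeros_like(y_true)           # always predict "no disease"

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()      # fraction of sick patients caught

print(f"accuracy={accuracy:.1%} recall={recall:.0%}")  # accuracy=99.0% recall=0%
```

High accuracy, zero recall: the "model" never catches a single sick patient.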

Question 7. Lowering the classification threshold from 0.5 to 0.3 will generally:

  • (A) Increase both precision and recall
  • (B) Decrease both precision and recall
  • (C) Increase recall but decrease precision
  • (D) Increase precision but decrease recall
Answer **Correct: (C)** Lowering the threshold means more observations are classified as positive (it's easier to cross the threshold). This catches more actual positives (higher recall — fewer false negatives) but also includes more false positives (lower precision — more false alarms). This is the precision-recall tradeoff.
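The tradeoff can be seen directly by sweeping the threshold over some hypothetical probabilities (numbers made up for illustration):

```python
import numpy as np

proba = np.array([0.90, 0.80, 0.60, 0.55, 0.45, 0.40, 0.35, 0.32])
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])

for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    print(f"threshold={threshold}: precision={tp / (tp + fp):.2f}, "
          f"recall={tp / (tp + fn):.2f}")
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.3: precision=0.50, recall=1.00
```

Dropping the threshold catches every positive (recall 1.0) at the cost of more false alarms (precision falls).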

Question 8. In logistic regression, a positive coefficient for a feature means:

  • (A) The feature increases the target value by that many units
  • (B) Higher values of that feature increase the probability of the positive class
  • (C) The feature is more important than features with negative coefficients
  • (D) The feature is positively correlated with all other features
Answer **Correct: (B)** A positive coefficient means that as the feature value increases, the log-odds of the positive class increase, which means the probability of the positive class increases. (A) confuses logistic with linear regression — the effect on probability is not linear. (C) is wrong — importance depends on both the coefficient magnitude and the feature's scale. (D) is unrelated.

Question 9. class_weight='balanced' in scikit-learn's LogisticRegression:

  • (A) Makes all features equally important
  • (B) Equalizes the number of samples in each class
  • (C) Adjusts the loss function to give more weight to the minority class
  • (D) Removes outliers from the dataset
Answer **Correct: (C)** The `class_weight='balanced'` parameter adjusts the importance of each class inversely proportional to its frequency. This means the model pays more attention to the minority class during training, effectively treating misclassifying a rare case as a bigger error than misclassifying a common case. It doesn't change the data — it changes how the model learns from it.
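The "balanced" weights follow scikit-learn's documented formula, n_samples / (n_classes * bincount(y)), which we can reproduce and check against its helper (the 95/5 split below is illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # 95/5 imbalance (made up)

manual = len(y) / (2 * np.bincount(y))
sk = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

print(manual)  # [~0.53, 10.0]: the minority class is weighted ~19x more heavily
print(sk)
```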

Question 10. The F1-score is:

  • (A) The average of accuracy and precision
  • (B) The harmonic mean of precision and recall
  • (C) Another name for R-squared in classification
  • (D) The sum of true positives and true negatives
Answer **Correct: (B)** F1 = 2 * (precision * recall) / (precision + recall). The harmonic mean gives more weight to the lower value, so F1 is only high when both precision and recall are high. This makes it a useful single metric when you want to balance precision and recall.
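A two-line function shows why the harmonic mean is stricter than the arithmetic mean:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f"{f1(0.9, 0.1):.2f}")  # 0.18 (the arithmetic mean would be 0.50)
print(f"{f1(0.5, 0.5):.2f}")  # 0.50
```

A lopsided model cannot hide its weak metric behind its strong one.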

Section 2: True/False (4 questions, 5 points each)


Question 11. True or False: Logistic regression can only be used for binary classification (two classes).

Answer **False.** While logistic regression is most commonly used for binary classification, it can be extended to handle multiple classes (multinomial logistic regression). scikit-learn's LogisticRegression supports multi-class classification through the `multi_class` parameter. However, in this chapter we focused on the binary case.

Question 12. True or False: A classification model's accuracy is always a reliable measure of its performance.

Answer **False.** Accuracy is misleading when classes are imbalanced. A model that always predicts the majority class achieves high accuracy without learning anything. For imbalanced datasets, precision, recall, F1-score, and the confusion matrix provide much more meaningful evaluation.

Question 13. True or False: The default classification threshold in scikit-learn is 0.5, and this is always the optimal threshold.

Answer **False.** The default threshold of 0.5 is a convenient starting point but is rarely optimal. The best threshold depends on the relative costs of false positives and false negatives in the specific application. In cancer screening (where missing cancer is worse than a false alarm), a lower threshold is better. In spam filtering (where losing real email is worse than seeing spam), a higher threshold might be better.

Question 14. True or False: If a logistic regression coefficient for "income" is -0.003, then higher income decreases the probability of the positive class.

Answer **True.** A negative coefficient means that as the feature increases, the log-odds of the positive class decrease, which means the probability of the positive class decreases. So higher income is associated with a lower probability of whatever the positive class is (e.g., loan default, if default is coded as positive).
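One way to make such a small coefficient tangible: exponentiating it gives the multiplicative change in the odds per unit of the feature (a standard logistic-regression identity; the value below is the hypothetical coefficient from the question):

```python
import numpy as np

coef = -0.003              # hypothetical coefficient for "income"
odds_ratio = np.exp(coef)  # odds multiplier per one-unit income increase

print(f"{odds_ratio:.6f}")  # 0.997004: each unit of income shrinks the odds ~0.3%
```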

Section 3: Short Answer (3 questions, 5 points each)


Question 15. Explain the difference between predict and predict_proba in scikit-learn, and give one scenario where predict_proba is more useful.

Answer `predict` returns binary class labels (0 or 1) based on the default threshold of 0.5. `predict_proba` returns the underlying probability estimates for each class. `predict_proba` is more useful when you need to rank observations or make nuanced decisions. For example, in customer churn prediction, knowing that Customer A has a 0.92 probability of churning and Customer B has a 0.55 probability allows you to prioritize retention efforts — sending the most expensive offers to the highest-risk customers. With `predict`, both would simply be labeled "will churn," losing the distinction.

Question 16. Why is the precision-recall tradeoff important? Give a specific example where prioritizing one over the other is clearly the right choice.

Answer The precision-recall tradeoff means you generally cannot maximize both simultaneously — increasing one decreases the other through the threshold. This is important because different applications have different error costs. Example: In airport security screening, recall should be prioritized. Missing an actual threat (false negative) has catastrophic consequences, while a false alarm (false positive) only inconveniences a passenger. The TSA accepts low precision (many false alarms) to achieve very high recall (almost never missing a real threat). Setting the threshold low ensures almost all threats are caught, even at the cost of additional screening for many innocent travelers.

Question 17. What is class imbalance, and why does it make accuracy a poor metric? Suggest two alternative metrics.

Answer Class imbalance occurs when one class is much more common than the other (e.g., 99% negative, 1% positive). Accuracy becomes misleading because a model can achieve very high accuracy by simply always predicting the majority class — without learning any real patterns or detecting any positive cases. Two better metrics: (1) **Recall** — measures what fraction of actual positives the model catches, which is critical when positive cases are rare and important (fraud, disease). (2) **F1-score** — the harmonic mean of precision and recall, providing a balanced measure that is only high when both precision and recall are reasonable. Both metrics specifically evaluate performance on the positive class rather than rewarding the model for correctly predicting the easy majority.

Section 4: Applied Scenarios (2 questions, 7.5 points each)


Question 18. A bank builds a logistic regression model to predict loan defaults. Results on the test set:

Confusion Matrix:
                      Predicted Default   Predicted No Default
Actually Default             60                    40
Actually No Default          20                   880
  1. Calculate accuracy, precision, and recall for detecting defaults.
  2. The bank's current policy is to deny loans to anyone the model flags as "default." What percentage of denied applicants would actually have defaulted?
  3. What percentage of actual defaults does the model miss?
  4. The bank considers lowering the threshold. What would happen to precision and recall? Is this a good idea?
Answer 1. **Accuracy** = (60 + 880) / 1000 = 94.0%. **Precision** = 60 / (60 + 20) = 75.0%. **Recall** = 60 / (60 + 40) = 60.0%. 2. Precision = 75%. So 75% of denied applicants would actually have defaulted. 25% of denials are "false alarms" — people who would have repaid but are denied a loan. 3. The model misses 40 out of 100 actual defaults — a 40% miss rate. 40 people who will default are given loans. 4. Lowering the threshold would increase recall (catch more of the 40 missed defaults) but decrease precision (deny loans to more people who would have repaid). Whether this is good depends on the relative cost: is it worse to give a loan to someone who defaults (cost of bad debt) or to deny a loan to someone who would repay (lost revenue and harm to the applicant)? The bank likely loses more money per default than per missed good customer, so a slightly lower threshold might be justified — but the 25% of wrongly denied applicants also matters, especially if denials disproportionately affect certain demographic groups.
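The arithmetic in part 1 can be double-checked by reconstructing labels from the confusion-matrix counts and letting scikit-learn compute the metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 60 TP, 40 FN, 20 FP, 880 TN, with default coded as the positive class (1)
y_true = np.array([1] * 60 + [1] * 40 + [0] * 20 + [0] * 880)
y_pred = np.array([1] * 60 + [0] * 40 + [1] * 20 + [0] * 880)

print(accuracy_score(y_true, y_pred))   # 0.94
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.6
```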

Question 19. A model classifies countries as "high vaccination" or "low vaccination." Results:

Metric                   Value
Accuracy                   83%
Precision (high vacc)      88%
Recall (high vacc)         85%
Precision (low vacc)       75%
Recall (low vacc)          80%
  1. Is the model better at identifying high-vaccination or low-vaccination countries? Justify with specific metrics.
  2. The purpose of the model is to identify countries that need vaccination support (low vaccination). Which metric matters most for this purpose?
  3. If 20% of the actually-low countries are classified as "high" (false negatives for the low class), what is the real-world consequence?
  4. Suggest how you would use probability outputs to improve decision-making for a public health organization.
Answer 1. The model is better at identifying **high-vaccination** countries (precision 88%, recall 85%) than low-vaccination countries (precision 75%, recall 80%). Both precision and recall are higher for the high class. 2. **Recall for the low-vaccination class** (80%) matters most. The goal is to find countries that need help — missing a country that needs support (false negative) is worse than incorrectly flagging a country that's doing fine (false positive). 80% recall means 20% of countries needing help are missed. 3. These countries won't receive the vaccination support they need. Resources will be directed elsewhere because the model classified them as "high." This is the most costly error for a public health mission — the countries most in need may be invisible to the system. 4. Instead of binary classification, rank all countries by their probability of being "low vaccination." Share the probability list with the organization so they can: (a) prioritize the countries with the highest probability of being low-vaccination, (b) set their own threshold based on available resources and risk tolerance, (c) identify borderline countries (probability near 0.5) that warrant further investigation rather than automatic classification.
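A sketch of the ranking idea from part 4, with invented probabilities for five hypothetical countries:

```python
import numpy as np

countries = ["A", "B", "C", "D", "E"]
p_low = np.array([0.15, 0.92, 0.48, 0.77, 0.05])  # P(low vaccination), made up

order = np.argsort(p_low)[::-1]  # most to least likely to need support
for i in order:
    print(countries[i], p_low[i])
# B first (0.92), then D, C, A, E
```

The organization can then work down the list until its budget runs out, rather than relying on a single fixed cutoff.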

Section 5: Code Analysis (1 question, 5 points)


Question 20. The following code has a subtle error that leads to misleading results. Identify the error and explain its impact.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Dataset with 95% negative class, 5% positive class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Model accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("Model is performing well!")
Answer **The error:** The code evaluates a highly imbalanced dataset (95% negative, 5% positive) using *only* accuracy. With 95% negative class prevalence, a model that always predicts "negative" would achieve 95% accuracy. The code concludes the model is "performing well" based solely on accuracy, without checking whether the model can detect the minority class at all. **The impact:** The model might have zero recall for the positive class — it might not catch a single positive case — and the code would still report high accuracy and declare success. This is especially dangerous if the positive class is the class of interest (fraud, disease, etc.). **The fix:** Add evaluation metrics that assess performance on the minority class:
from sklearn.metrics import (classification_report,
    confusion_matrix, recall_score)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f"Minority-class recall: {recall_score(y_test, y_pred):.3f}")

# Also consider: was the baseline beaten?
baseline = (y_test == 0).mean()  # Always predict majority
print(f"Baseline accuracy: {baseline:.3f}")