Appendix H: Frequently Asked Questions

Direct answers to the questions data science students and early-career practitioners ask most often. These are opinionated answers — informed by experience, not hedged for diplomacy.


Learning and Career

Should I learn R or Python?

Python. Not because R is bad — R has excellent statistical libraries and ggplot2 is genuinely beautiful. But the industry has converged on Python. Job postings overwhelmingly list Python. Production ML systems are built in Python. The scikit-learn, pandas, PyTorch, and FastAPI ecosystem is where the tooling investment is happening.

If you already know R, you are not at a disadvantage — the concepts transfer directly. But if you are starting from scratch or choosing where to invest your next 500 hours, choose Python.

The one exception: if you are going into biostatistics or academic research in certain social sciences, R remains dominant in those communities. Follow the tools your collaborators use.

How much math do I really need?

More than zero, less than a math degree.

For applied data science — the kind this book teaches — you need:

  • Linear algebra: Understand vectors, matrices, dot products, and matrix multiplication. You do not need to prove theorems. You need to know why a feature matrix has shape (n_samples, n_features) and what a linear combination of features means.
  • Calculus: Understand what a derivative is (rate of change), what a gradient is (direction of steepest ascent), and why gradient descent works. You do not need to differentiate functions by hand.
  • Probability and statistics: Understand distributions, conditional probability, Bayes' theorem, hypothesis testing, confidence intervals, and p-values. This is probably the most important math for data science.
  • Optimization: Understand that training a model means minimizing a loss function, and that different algorithms do this differently.
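A minimal sketch of the first three items together, using hypothetical toy data: a feature matrix with shape (n_samples, n_features), a linear combination of features, and one gradient descent step on a mean-squared-error loss.

```python
import numpy as np

# Hypothetical toy data: a feature matrix of shape (n_samples, n_features)
X = np.random.default_rng(0).normal(size=(100, 3))
w = np.array([0.5, -1.0, 2.0])          # one weight per feature

preds = X @ w                           # linear combination of features, shape (100,)

# One gradient descent step on mean squared error toward targets y
y = np.zeros(100)
grad = 2 * X.T @ (preds - y) / len(y)   # gradient: direction of steepest ascent
w_new = w - 0.1 * grad                  # step the opposite way to reduce the loss
```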

Chapter 4 covers the math you need. If you want to go deeper, see Appendix C for the reference.

The honest truth: you can ship production ML models without deriving the gradient of cross-entropy loss by hand. But the practitioners who understand the math debug faster, choose algorithms more wisely, and recognize when something is going wrong before the metrics tell them.

When should I use deep learning?

When you have at least one of these:

  1. Unstructured data: Images, audio, video, long-form text. Deep learning dominates here. If your input is a photograph or a paragraph of text, use a neural network.
  2. Massive labeled datasets: Millions of examples with labels. Deep learning is data-hungry. With 10,000 rows of tabular data, gradient boosting almost always wins.
  3. A well-established architecture for your problem: Image classification (ResNets, EfficientNet), NLP (transformers, BERT, GPT), speech (Whisper, Wav2Vec). Do not invent a new architecture — use what works.

When you should NOT use deep learning:

  • Tabular data with fewer than 100,000 rows. XGBoost or LightGBM will likely outperform a neural network and train in seconds instead of hours.
  • When interpretability is required. SHAP works on gradient boosting; explaining a 50-layer neural network is harder.
  • When you do not have GPU infrastructure and cannot afford cloud compute.
  • When you need a model in production tomorrow and do not have a deep learning deployment pipeline.

Chapter 36 provides a deeper preview. The rule of thumb: start with gradient boosting on tabular data, always. Reach for deep learning when the data or the problem demands it.

How do I get hired as a data scientist?

The hiring market rewards three things, roughly in this order:

  1. A portfolio of completed projects that demonstrate end-to-end work. Not Kaggle competition leaderboard screenshots — complete projects with problem framing, data cleaning, modeling, evaluation, and a writeup that explains your decisions. The capstone project from Chapter 35 is designed to be portfolio-ready.

  2. The ability to communicate technical results to non-technical people. In interviews, you will be asked to explain your projects. Practice explaining what your model does, why you chose that approach, and what the business impact is, in language a product manager would understand.

  3. Technical skills demonstrated through coding interviews. Most data science interviews include a take-home case study (build a model from a dataset in 4-8 hours) and/or a live coding session (SQL queries, pandas manipulation, algorithm implementation). Practice with real datasets, not LeetCode.

What matters less than people think: a master's degree from a top school (helpful but not necessary), Kaggle rankings (interesting but rarely asked about), knowing every algorithm (depth beats breadth).

What matters more than people think: SQL skills (every company runs on SQL), communication skills (see point 2), and domain knowledge in a specific industry (healthcare, finance, e-commerce).

What is the difference between a data scientist and an ML engineer?

The boundary is blurry and varies by company, but the general distinction:

Data scientist: Focuses on analysis, experimentation, and model development. Spends most of their time in notebooks, working with stakeholders to frame problems, doing EDA, building and evaluating models, and communicating results. Typically reports to a data or analytics team.

ML engineer: Focuses on building the systems that run models in production. Spends most of their time writing production code, building data pipelines, deploying model serving infrastructure, and monitoring performance. Typically reports to an engineering team.

Data analyst: Focuses on descriptive analytics, dashboards, and ad-hoc queries. More SQL and BI tools, less modeling. Often the starting point for a data science career.

MLOps engineer: A specialization of ML engineering focused specifically on the infrastructure and tooling for model deployment, monitoring, and lifecycle management.

In practice, at smaller companies one person does all of this. At larger companies, these are distinct roles with distinct career ladders. This book covers the data scientist role but deliberately includes Chapters 29-32 (software engineering, experiment tracking, deployment, monitoring) because modern data scientists need to understand the full lifecycle even if they do not build every piece themselves.

I finished this book. What should I learn next?

Depends on where you want to go:

  • Deeper into ML: Deep learning (fast.ai course, then the PyTorch documentation). Causal inference (Cunningham's Causal Inference: The Mixtape). Bayesian methods (McElreath's Statistical Rethinking).
  • Toward ML engineering: Software engineering fundamentals (design patterns, testing, CI/CD). Kubernetes and Docker. Distributed systems. Ray or Spark for large-scale ML.
  • Toward specialization: NLP (Hugging Face course, then the transformers paper). Computer vision (Stanford CS231n). Time series (Hyndman's Forecasting: Principles and Practice). Recommender systems (build one and deploy it).
  • Toward leadership: Product management skills. Stakeholder communication. The "business of data science" material from Chapter 34, taken further.

See Appendix I for specific resources.


Technical Questions

Is my model good enough?

That depends entirely on the context, and asking this question correctly is half the battle.

"Good enough" is defined by:

  1. The baseline: Is the model better than the current decision process? If the current process is a human guessing, the bar is low. If the current process is a well-tuned model, the bar is high.
  2. The business impact: What is the cost of a wrong prediction? A recommendation engine that is 60% accurate might be fine. A medical diagnostic model at 60% accuracy is dangerous.
  3. The cost of improvement: Can you get 2% more accuracy by spending another month? Is that month worth it given the business impact?
  4. The theoretical ceiling: Some problems are inherently noisy. If the Bayes error rate is 15%, you will never get below 15% no matter how sophisticated your model is. A churn model with 85% AUC is probably near the ceiling; one at 60% AUC still has room to improve.

Rules of thumb (not rules of law):

  Task                                 Metric          "Good"    "Excellent"
  -----------------------------------  --------------  --------  -----------
  Binary classification (balanced)     ROC AUC         > 0.80    > 0.90
  Binary classification (imbalanced)   PR AUC          > 0.50    > 0.75
  Regression (general)                 R-squared       > 0.70    > 0.85
  Churn prediction                     ROC AUC         > 0.75    > 0.85
  Fraud detection                      Recall@1% FPR   > 0.30    > 0.60

These benchmarks are approximate and domain-dependent. The real answer is: your model is good enough when the expected value of deploying it exceeds the expected value of not deploying it.
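That decision rule can be made concrete with a back-of-the-envelope calculation. Every number here is hypothetical, chosen only to illustrate the arithmetic:

```python
# All numbers hypothetical: a churn model flags 2,000 customers for retention offers
flagged = 2000
precision = 0.30          # fraction of flagged customers who would really churn
value_per_save = 50.0     # revenue retained per churner successfully caught
cost_per_flag = 5.0       # cost of the offer, sent to every flagged customer

expected_value = flagged * precision * value_per_save - flagged * cost_per_flag
# 2000 * 0.30 * 50 - 2000 * 5 = 20,000: deploy if this beats the current process
```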

How do I handle data leakage?

Data leakage means your model has access to information during training that it would not have at prediction time. It is the most common source of suspiciously good results.

Common sources:

  1. Target leakage: A feature that is a direct consequence of the target. Example: including "cancellation_date" as a feature when predicting churn.
  2. Train-test contamination: Fitting a scaler or encoder on the full dataset before splitting. The test set's statistics leak into the training process.
  3. Temporal leakage: Using future data to predict the past. Example: using next month's usage metrics to predict this month's churn.
  4. Group leakage: Patients from the same hospital in both train and test sets, where hospital-specific patterns create false generalization.

How to prevent it:

  • Split your data FIRST, before any analysis or transformation.
  • Use scikit-learn Pipelines so that all transformations are fit on training data only.
  • For time series, always use temporal splits (train on the past, test on the future).
  • Ask yourself: "Would I have this feature available when I need to make a real prediction?"
  • If your model seems too good, it probably is. Investigate.
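The first two rules in code, on hypothetical toy data: split first, then fit any transformer on the training portion only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for a real dataset
X = np.random.default_rng(0).normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Split FIRST, then fit the scaler on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler().fit(X_train)   # test-set statistics never seen
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Anti-pattern: StandardScaler().fit(X) before splitting lets the test
# set's mean and variance leak into the training process
```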

Why does my model perform well in cross-validation but poorly in production?

Common causes, in order of likelihood:

  1. Data drift: The production data distribution has shifted from your training data. Feature distributions change, user behavior changes, the world changes.
  2. Data leakage in your evaluation: Your CV setup leaks information (see above). The model never actually generalized.
  3. Feature pipeline differences: The features computed in production differ subtly from those computed in training. A timestamp parsed differently, a categorical value not seen in training, a join that behaves differently on live data.
  4. Sample bias: Your training data is not representative of the production population. You trained on one customer segment; production serves all segments.
  5. Label definition mismatch: The target variable you trained on does not match the outcome you care about in production.
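Cause 1 is often cheap to check. One sketch, using synthetic feature values in place of real training and production samples: a two-sample Kolmogorov-Smirnov test flags a shifted distribution.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical feature values: training-time vs production (the mean has shifted)
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)
prod_feature = rng.normal(0.5, 1.0, size=5000)

# Two-sample Kolmogorov-Smirnov test: a tiny p-value signals a distribution shift
stat, p_value = ks_2samp(train_feature, prod_feature)
drifted = p_value < 0.01
```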

Chapter 32 covers monitoring strategies that help you diagnose these issues quickly.

Should I use scikit-learn Pipelines?

Yes. Always. Without exception.

Pipelines guarantee that your preprocessing and modeling steps are applied consistently to training and test data. They prevent train-test contamination, make your code reproducible, and are required for proper cross-validation.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42))
])

# The scaler is fit on the training data only; nothing leaks from the test set
pipe.fit(X_train, y_train)
pipe.predict(X_test)

The initial setup cost is slightly higher than writing separate steps. The debugging cost you save over the life of the project is enormous.
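Because the whole pipeline is a single estimator, cross-validation handles the preprocessing correctly on its own. A self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)  # toy data

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])

# Each fold re-fits the scaler on that fold's training portion only,
# so the cross-validation estimate is leakage-free
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
```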

XGBoost or LightGBM?

Both are excellent. For most tabular problems, you will not see a meaningful performance difference between them. Choose based on practical factors:

  • LightGBM trains faster on large datasets (histogram-based splitting by default, efficient memory usage). Better when you have millions of rows.
  • XGBoost has slightly broader community support and more tutorials. Its dart booster can sometimes outperform on smaller datasets.
  • CatBoost handles categorical features natively without preprocessing. Worth trying if you have many categoricals.

Start with LightGBM for speed. Switch to XGBoost if you need specific features (e.g., monotone constraints, GPU training on Windows). Try CatBoost if you have messy categoricals.

The real performance gains come from feature engineering and proper evaluation, not from choosing between these three libraries.

How do I handle categorical features with many levels?

Depends on how many levels:

  • 2-10 levels: One-hot encoding. Simple, interpretable, works everywhere.
  • 10-50 levels: One-hot still works for tree-based models (they handle sparse features). For linear models, consider target encoding with cross-validation (Chapter 7).
  • 50-1,000 levels: Target encoding or embedding. One-hot creates too many sparse columns. Leave-one-out target encoding with proper CV is the standard approach for gradient boosting.
  • 1,000+ levels (e.g., zip codes, product IDs): Aggregate to a higher level (state instead of zip code, product category instead of product ID), or use target encoding, or use embedding layers (if using deep learning). Do not one-hot encode 10,000 zip codes.

CatBoost handles high-cardinality categoricals natively. If you are using XGBoost or LightGBM, target encoding with sklearn.preprocessing.TargetEncoder (scikit-learn >= 1.3) is the recommended approach.

When should I use SQL vs. pandas?

Use SQL when:

  • The data lives in a database and you need to extract/filter/aggregate before bringing it into Python.
  • The dataset is too large to fit in memory. Let the database engine do the heavy lifting.
  • You need to join multiple tables. SQL joins are more readable and often faster than pandas merges.
  • You need windowed aggregations (rolling averages, cumulative sums, rank within groups). SQL window functions are purpose-built for this.

Use pandas when:

  • The data fits in memory and you need iterative, exploratory analysis.
  • You need complex feature engineering with conditional logic.
  • You need integration with scikit-learn pipelines.
  • You need visualization (pandas integrates with matplotlib and seaborn).

In practice, most data science workflows use both: SQL to extract and aggregate, pandas to engineer features and prepare for modeling. Chapter 5 covers this workflow in detail.
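A toy illustration of the overlap: the same within-group running total as a SQL window function (sketched in a comment; column names are hypothetical) and as a pandas groupby.

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "amount": [10, 20, 30, 5, 15],
})

# SQL window-function version (illustrative):
#   SELECT user, SUM(amount) OVER (PARTITION BY user ORDER BY event_time)
# pandas version of the same running total within each group:
df["running_total"] = df.groupby("user")["amount"].cumsum()
```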

My model is overfitting. What should I try?

In this order:

  1. Get more data. This is always the best solution and almost always the hardest.
  2. Reduce model complexity. Fewer trees, shallower depth, stronger regularization. For gradient boosting: lower max_depth, lower n_estimators, higher min_child_weight, higher reg_alpha/reg_lambda.
  3. Remove noisy features. Feature selection (Chapter 9) can reduce overfitting by removing features that fit noise.
  4. Add regularization. L1 (Lasso) for feature selection, L2 (Ridge) for coefficient shrinkage, or both (ElasticNet).
  5. Use cross-validation to tune hyperparameters. Do not tune on a single train/test split. Use 5-fold CV minimum.
  6. Use early stopping for gradient boosting. Train with many estimators but stop when validation performance plateaus. This is free regularization.
  7. Try a simpler model. If your gradient boosting model overfits, see if logistic regression or a single decision tree does better. Sometimes the signal is simple and the complex model is fitting noise.
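Item 6 in code, on synthetic data, using scikit-learn's built-in early stopping for gradient boosting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # toy data

# Ask for up to 500 trees, but hold out 10% of the training data and stop
# once its score has not improved for 20 consecutive iterations
model = GradientBoostingClassifier(
    n_estimators=500,
    validation_fraction=0.1,
    n_iter_no_change=20,
    random_state=42,
)
model.fit(X, y)
n_trees_used = model.n_estimators_   # usually far fewer than 500
```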

How do I deal with a dataset that does not fit in memory?

Several options, from simplest to most complex:

  1. Sample. If you have 100 million rows and 10,000 is enough to learn the patterns, sample first and iterate fast.
  2. Use efficient dtypes. df["col"] = df["col"].astype("category") and df["col"] = pd.to_numeric(df["col"], downcast="integer") can reduce memory by 50-80%.
  3. Use chunked reading. pd.read_csv("file.csv", chunksize=100000) processes chunks in a loop.
  4. Use Polars. A Rust-based DataFrame library that is faster and more memory-efficient than pandas.
  5. Use Dask or Vaex. Lazy evaluation DataFrames that operate on larger-than-memory data.
  6. Use SQL. Aggregate in the database, bring only the result into Python.
  7. Use Spark (PySpark). For truly massive datasets distributed across a cluster.
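Options 2 and 3 combined in one sketch. An in-memory CSV stands in for a file too large to load at once; each chunk is downcast and aggregated, then discarded.

```python
import io
import pandas as pd

# An in-memory CSV stands in for a file too large to load at once
csv_file = io.StringIO("value\n" + "\n".join(str(i) for i in range(1000)))

total = 0
for chunk in pd.read_csv(csv_file, chunksize=250):   # four chunks of 250 rows
    chunk["value"] = pd.to_numeric(chunk["value"], downcast="integer")
    total += int(chunk["value"].sum())               # aggregate per chunk, then discard
```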

Chapter 28 covers large dataset strategies in detail.


Process and Strategy

How many features should my model have?

There is no universal number, but there are guidelines:

  • More features than rows is a red flag. You need regularization or dimensionality reduction.
  • For gradient boosting on tabular data: 20-200 features is typical. Feature importance usually shows that 10-30 features carry most of the signal.
  • For linear models: Fewer is better. Multicollinearity hurts. Feature selection matters more.
  • The curse of dimensionality: As features increase, the data becomes increasingly sparse in high-dimensional space. Models need exponentially more data to maintain performance.

Start with domain knowledge to engineer the features that should matter. Then let feature importance and feature selection (Chapter 9) prune the rest.

Should I always tune hyperparameters?

Yes, but know when the returns diminish.

The biggest performance gains come from:

  1. Better features (huge impact)
  2. Choosing the right algorithm family (large impact)
  3. Fixing data quality issues (large impact)
  4. Hyperparameter tuning (moderate impact)
  5. Ensemble methods (small impact)

Hyperparameter tuning on a bad feature set is rearranging deck chairs on the Titanic. Get your features right first. Then tune.

For gradient boosting, the most impactful hyperparameters are: learning_rate, max_depth, n_estimators, min_child_weight, and subsample. Start with a coarse random search, then refine with Bayesian optimization if the dataset is large enough to warrant it. Chapter 18 covers this in detail.
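One sketch of a coarse random search over those hyperparameters, on synthetic data (the search ranges here are illustrative starting points, not recommendations):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, random_state=0)  # toy data

# Coarse random search over the high-impact hyperparameters
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={
        "learning_rate": uniform(0.01, 0.3),   # samples from [0.01, 0.31]
        "max_depth": randint(2, 8),
        "n_estimators": randint(50, 300),
        "subsample": uniform(0.6, 0.4),        # samples from [0.6, 1.0]
    },
    n_iter=10,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
```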

When should I use ensemble methods?

Almost always, at least implicitly — gradient boosting IS an ensemble method.

Beyond that:

  • Stacking/blending: When you are in a Kaggle competition and need every 0.1% of accuracy. In production, the complexity and maintenance cost of stacking multiple models rarely justifies the marginal gain.
  • Simple averaging: When you have two models with similar performance but different failure modes (e.g., a linear model and a tree-based model). Averaging their predictions often outperforms either one alone.
  • Voting classifiers: Same idea as averaging but for classification. Easy to implement, modest improvement, low risk.
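A sketch of the averaging idea with scikit-learn's VotingClassifier, on synthetic data: soft voting averages the predicted probabilities of a linear model and a tree ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # toy data

# Soft voting averages the predicted probabilities of two models
# that tend to have different failure modes
ensemble = VotingClassifier(
    estimators=[
        ("linear", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(random_state=42)),
    ],
    voting="soft",
)
scores = cross_val_score(ensemble, X, y, cv=3, scoring="roc_auc")
```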

The practical advice: use a single well-tuned gradient boosting model in production. Use ensembles when the business value of 0.5% more accuracy exceeds the engineering cost of maintaining multiple models.