Index

How to use this index: Entries are organized alphabetically. Each reference gives the chapter as Ch.N, followed by a section number or section range where applicable (for example, Ch.7 7.4 or Ch.8 8.1--8.7). Bold page references indicate the primary discussion of a topic.

A

Absolute value, Appendix A
Accuracy (model evaluation), Ch.29 29.2, Ch.30 30.4
Aesthetic mapping, Ch.14 14.2, Ch.15 15.1, Ch.16 16.1
Aggregation, Ch.7 7.6, Ch.9 9.3, Ch.11 11.4
Algorithm, Ch.25 25.1, Ch.28 28.1
Anaconda, installation of, Ch.2 2.2, Appendix C
API (Application Programming Interface), Ch.13 13.3, Appendix D
Assignment operator, Ch.3 3.2, Ch.3 3.5
AUC-ROC, Ch.29 29.4

B

Bar chart, Ch.15 15.3, Ch.18 18.2
Bias, in models, Ch.25 25.4, Ch.29 29.5
Bias, in sampling, Ch.22 22.2, Ch.32 32.3
Bias-variance tradeoff, Ch.25 25.4, Ch.29 29.5, Ch.30 30.3
Binomial distribution, Ch.21 21.3
Boolean indexing, Ch.7 7.4, Ch.8 8.2
Boolean type, Ch.3 3.4, Ch.3 3.6
Box plot, Ch.16 16.3, Ch.19 19.3
Broadcasting (NumPy), Ch.7 7.5

C

Cartesian coordinate system, Ch.14 14.4, Appendix A
Categorical variable, Ch.7 7.2, Ch.14 14.2, Ch.19 19.5
Causal question, Ch.1 1.3, Ch.24 24.4, Ch.32 32.2
Central limit theorem, Ch.21 21.5, Ch.22 22.3
Classification, Ch.27 27.1, Ch.28 28.1, Ch.29 29.2
Cleaning data, Ch.8 8.1--8.7, Ch.10 10.3, Ch.11 11.2
Coefficient, regression, Ch.26 26.3, Ch.27 27.3
Color, in visualization, Ch.14 14.3, Ch.18 18.3
Comparison operators, Ch.3 3.5
Concatenation, of DataFrames, Ch.9 9.2
Conditional probability, Ch.20 20.4
Confidence interval, Ch.22 22.4, Ch.23 23.1, Ch.31 31.3
Confounding variable, Ch.24 24.3, Ch.32 32.2
Confusion matrix, Ch.29 29.2
Control flow, Ch.4 4.1--4.3, Appendix B
Coordinate system, Ch.14 14.4
Correlation, Ch.24 24.1, Ch.24 24.2, Ch.26 26.1
  vs. causation, Ch.24 24.3
  Pearson, Ch.24 24.1
  Spearman, Ch.24 24.2
Cross-validation, Ch.30 30.2, Ch.29 29.5
CSV files, Ch.6 6.3, Ch.12 12.1

D

Dashboard, Ch.17 17.4
Data dictionary, Ch.7 7.1, Ch.33 33.2
Data engineering, Ch.1 1.4, Appendix E
DataFrame, Ch.7 7.2--7.6
Data science lifecycle, Ch.1 1.2, Ch.6 6.1, Ch.35 35.1
Data types, Python, Ch.3 3.3, Appendix B
Data types, pandas, Ch.7 7.2, Ch.8 8.3
datetime, Ch.11 11.1--11.3
Decision tree, Ch.28 28.2, Ch.28 28.3
Debugging, Ch.3 3.8, Ch.4 4.5, Appendix B, Appendix E
Dependent variable, Ch.25 25.2
Descriptive question, Ch.1 1.3
Descriptive statistics, Ch.19 19.1--19.5
Dictionary, Python, Ch.5 5.3, Appendix B
Distribution, probability, Ch.21 21.1--21.5
  normal, Ch.21 21.2
  skewed, Ch.19 19.3, Ch.21 21.3
Domain knowledge, Ch.1 1.5, Ch.25 25.1, Ch.32 32.1
Duplicates, removing, Ch.8 8.4

E

Encoding, text, Ch.12 12.3
Environment, conda, Ch.2 2.3, Appendix C
Ethics, Ch.32 32.1--32.6, Ch.13 13.5, Ch.23 23.6
Excel files, reading, Ch.12 12.2
Exploratory data analysis, Ch.6 6.4, Ch.7 7.6, Ch.19 19.1

F

F1 score, Ch.29 29.3
Faceting, Ch.14 14.5, Ch.16 16.5
False positive / false negative, Ch.29 29.2
Feature, Ch.25 25.2, Ch.26 26.2
Feature engineering, Ch.30 30.1
Figure and axes (matplotlib), Ch.15 15.2
File I/O, Ch.12 12.1--12.4, Appendix B
Filter (boolean indexing), Ch.7 7.4, Ch.8 8.2
Float, Ch.3 3.3
for loop, Ch.4 4.2, Appendix B
f-string, Ch.3 3.7, Appendix B
Function, Python, Ch.4 4.4, Appendix B

G

Git, Ch.33 33.3
GitHub, Ch.33 33.4, Ch.34 34.2
Glossary (this book), Glossary
Google Colab, Appendix C
Grammar of graphics, Ch.14 14.1--14.5
Groupby, Ch.9 9.3, Ch.11 11.4, Ch.19 19.5

H

Heatmap, Ch.16 16.4, Ch.24 24.1
Histogram, Ch.15 15.4, Ch.19 19.2, Ch.21 21.1
HTML, parsing, Ch.13 13.2
Hypothesis testing, Ch.23 23.1--23.6
  null hypothesis, Ch.23 23.2
  p-value, Ch.23 23.3
  significance level, Ch.23 23.2
  Type I and Type II errors, Ch.23 23.5

I

iloc, Ch.7 7.3
Immutability, Ch.3 3.4, Ch.5 5.1
Imputation, Ch.8 8.5
Independent variable, Ch.25 25.2
Index, pandas, Ch.7 7.3, Ch.11 11.3
Inner join, Ch.9 9.1
Interactive visualization, Ch.17 17.1--17.4
Interquartile range (IQR), Ch.19 19.2, Ch.19 19.3
Iteration, Ch.4 4.2--4.3

J

Join, Ch.9 9.1
  inner, Ch.9 9.1
  left, Ch.9 9.1
  outer, Ch.9 9.1
JSON, Ch.12 12.3, Ch.13 13.3
Jupyter notebook, Ch.2 2.4, Appendix C
JupyterLab, Ch.2 2.4, Appendix C

K

Kaggle, Appendix D
Kernel, Jupyter, Ch.2 2.4
Key-value pair, Ch.5 5.3

L

Lambda function, Ch.4 4.4, Appendix B
Left join, Ch.9 9.1
Linear regression, Ch.26 26.1--26.5
  assumptions, Ch.26 26.4
  coefficients, Ch.26 26.3
  residuals, Ch.26 26.4
List, Python, Ch.5 5.1, Appendix B
List comprehension, Ch.4 4.4, Appendix B
loc, Ch.7 7.3
Logarithm, Ch.15 15.5, Ch.26 26.5, Appendix A
Logistic regression, Ch.27 27.1--27.5
Long format, Ch.9 9.4

M

Machine learning workflow, Ch.30 30.1--30.5
Markdown, Ch.2 2.5
matplotlib, Ch.15 15.1--15.6
Mean, Ch.19 19.1
  vs. median, Ch.19 19.1
Median, Ch.19 19.1
Melt (pandas), Ch.9 9.4
Merge (pandas), Ch.9 9.1
Missing values, Ch.7 7.5, Ch.8 8.5
  NaN, Ch.7 7.5
  imputation strategies, Ch.8 8.5
  dropna, Ch.8 8.5
  fillna, Ch.8 8.5
Model, Ch.25 25.1--25.5
  evaluation, Ch.29 29.1--29.5
  overfitting, Ch.25 25.4
  underfitting, Ch.25 25.4
Multicollinearity, Ch.26 26.5

N

NaN, Ch.7 7.5, Ch.8 8.5
Normal distribution, Ch.21 21.2
  68-95-99.7 rule, Ch.21 21.2
  Q-Q plot, Ch.21 21.4
  z-score, Ch.21 21.2
Null hypothesis, Ch.23 23.2
NumPy, Ch.7 7.5

O

Ordinal variable, Ch.19 19.5
Outer join, Ch.9 9.1
Outliers, Ch.19 19.3, Ch.8 8.6
Overfitting, Ch.25 25.4, Ch.28 28.4, Ch.30 30.3

P

p-value, Ch.23 23.3, Ch.23 23.4
pandas, Ch.7 7.1--7.6
  DataFrame, Ch.7 7.2
  Series, Ch.7 7.2
  importing, Ch.7 7.1
Percentile, Ch.19 19.2
Pipeline (scikit-learn), Ch.30 30.4
Pivot (pandas), Ch.9 9.4
Plotly, Ch.17 17.1--17.3
Population, Ch.22 22.1
Portfolio, building, Ch.34 34.1--34.4
Precision, Ch.29 29.3
Predictive question, Ch.1 1.3
Probability, Ch.20 20.1--20.5
  conditional, Ch.20 20.4
  Bayes' theorem, Ch.20 20.5
Proxy variable, Ch.24 24.3

Q

Q-Q plot, Ch.21 21.4

R

R-squared, Ch.26 26.3, Ch.29 29.1
Random forest, Ch.28 28.4
range(), Ch.4 4.2
Recall, Ch.29 29.3
Regression, linear, Ch.26 26.1--26.5
Regression, logistic, Ch.27 27.1--27.5
Regular expression, Ch.10 10.3
Reproducibility, Ch.33 33.1--33.4
Resampling (time series), Ch.11 11.4
Residual, Ch.26 26.4
Rolling window, Ch.11 11.4

S

Sample, Ch.22 22.1
Sampling distribution, Ch.22 22.3
Scale, in grammar of graphics, Ch.14 14.3
Scatter plot, Ch.15 15.3, Ch.24 24.1
scikit-learn, Ch.25 25.3, Ch.30 30.4
seaborn, Ch.16 16.1--16.5
Selection bias, Ch.22 22.2
Series, pandas, Ch.7 7.2
Set, Python, Ch.5 5.4, Appendix B
Significance level, Ch.23 23.2
Skewness, Ch.19 19.3, Ch.21 21.3
Slice, Ch.3 3.6, Ch.5 5.1
Split-apply-combine, Ch.9 9.3
SQL, Ch.12 12.4, Appendix E
Standard deviation, Ch.19 19.2
Standard error, Ch.22 22.3
Statistic (sample), Ch.22 22.1
String methods, Ch.3 3.6, Ch.10 10.1, Appendix B
Structured data, Ch.1 1.4
Supervised learning, Ch.25 25.2

T

t-test, Ch.23 23.4
Test set, Ch.25 25.3, Ch.30 30.2
Tidy data, Ch.9 9.4
Time series, Ch.11 11.1--11.5
Train-test split, Ch.25 25.3
Truthiness, Ch.3 3.4
Tuple, Ch.5 5.2
Type I / Type II error, Ch.23 23.5

U

Underfitting, Ch.25 25.4
Unstructured data, Ch.1 1.4
Unsupervised learning, Ch.25 25.2

V

Variable, Python, Ch.3 3.2
Variable, statistical, Ch.1 1.4
Variance, Ch.19 19.2
Variance, in models, Ch.25 25.4
Vectorized operation, Ch.7 7.5
Version control, Ch.33 33.3
Violin plot, Ch.16 16.3
Visualization, Ch.14--Ch.18
  accessibility, Ch.18 18.4
  choosing chart types, Ch.14 14.6
  color, Ch.18 18.3
  design principles, Ch.18 18.1
  misleading charts, Ch.18 18.5

W

Web scraping, Ch.13 13.1--13.2
  BeautifulSoup, Ch.13 13.2
  ethics of, Ch.13 13.5
  robots.txt, Ch.13 13.5
while loop, Ch.4 4.3
Wide format, Ch.9 9.4

Z

z-score, Ch.21 21.2, Ch.22 22.3
zip(), Ch.4 4.4, Appendix B

This index covers the major concepts, tools, and techniques discussed in all 36 chapters and appendices. For Python-specific syntax, see also Appendix B (Python Quick Reference). For term definitions, see the Glossary.