Glossary of Terms
Python for Business for Beginners: Coding for Every Person
This glossary defines all key technical terms introduced throughout the book. Terms are organized alphabetically. The chapter number in brackets (e.g., [Ch. 3]) indicates where the term was first introduced or most fully explained. Terms marked with a domain tag indicate specialized vocabulary from that field.
A
Absolute Path
A file path that starts from the root of the file system and provides the complete location of a file or directory (e.g., C:\Users\Sandra\data\sales.csv on Windows, or /home/sandra/data/sales.csv on Linux). Absolute paths work regardless of the current working directory. Contrasted with relative paths, which are specified relative to the current location. [Ch. 6]
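The distinction can be checked in code. The following is a minimal sketch using the standard library's pathlib module; the "pure" path classes are used here only so the example behaves the same on any operating system, and the relative path and base directory are illustrative.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths only manipulate text, so nothing here touches the real file system.
linux_abs = PurePosixPath("/home/sandra/data/sales.csv")
windows_abs = PureWindowsPath(r"C:\Users\Sandra\data\sales.csv")
relative = PurePosixPath("data/sales.csv")  # illustrative relative path

print(linux_abs.is_absolute())    # True
print(windows_abs.is_absolute())  # True
print(relative.is_absolute())     # False

# Joining a relative path onto an absolute base directory gives an absolute path.
base = PurePosixPath("/home/sandra")
joined = base / relative
print(joined)  # /home/sandra/data/sales.csv
```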
Accuracy
A classification model evaluation metric that measures the proportion of correct predictions out of all predictions made. Calculated as (True Positives + True Negatives) / Total Predictions. While intuitive, accuracy can be misleading with imbalanced datasets: a model that always predicts the majority class can achieve high accuracy while being practically useless. [Ch. 35]
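The imbalanced-data pitfall is easy to reproduce. The sketch below uses plain Python and made-up labels (95 customers who stayed, 5 who churned); the accuracy helper is a hypothetical illustration, not a library function.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical imbalanced data: 0 = stayed, 1 = churned.
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100  # a "model" that always predicts the majority class

score = accuracy(y_true, always_majority)
churners_found = sum(1 for t, p in zip(y_true, always_majority) if t == 1 and p == 1)

print(score)           # 0.95
print(churners_found)  # 0
```

The score looks strong, yet the model never identifies a single churner, which is the case the metric is supposed to help with.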
Aggregation
The process of computing a summary statistic (such as sum, mean, count, or maximum) over a group of values. In pandas, aggregation is typically performed using .groupby() combined with .agg() or named aggregation functions. In SQL contexts, aggregation functions like SUM and AVG serve the same purpose. [Ch. 14]
Algorithm
A step-by-step procedure or set of rules that a computer follows to solve a specific problem. In machine learning contexts, an algorithm defines how a model learns from training data. In general programming, algorithms describe approaches to tasks like sorting, searching, or data transformation. [Ch. 1]
Alias
A shorthand name given to an imported module using the as keyword. The conventions import pandas as pd and import numpy as np are so universal in data science that they are recognized by virtually all practitioners. Using aliases reduces typing while maintaining code readability. [Ch. 8]
API (Application Programming Interface)
A defined set of rules and protocols that allows one software application to communicate with another. Web APIs, many of which follow the REST style, allow programs to request data over the internet using HTTP. In business contexts, APIs enable Python code to retrieve live data from services like Salesforce, Stripe, or financial data providers. [Ch. 26]
API Key
A unique identifier, typically a string of letters and numbers, used to authenticate a program's requests to an external API. API keys should be treated as passwords — never hardcoded directly in scripts or committed to version control. Best practice is to store them in environment variables or a .env file. [Ch. 26]
Argument
A value passed into a function when it is called. Distinguished from a parameter, which is the variable in the function's definition that receives the value. In print("hello"), the string "hello" is an argument. Arguments can be positional (matched by order) or keyword (matched by name). [Ch. 4]
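A short sketch of the parameter/argument distinction and of positional versus keyword arguments follows. The function name and values are hypothetical.

```python
def describe_sale(region, amount, currency="USD"):
    # region, amount, and currency are parameters (names in the definition).
    return f"{region}: {amount} {currency}"

# "West" and 1200 are arguments, matched to parameters by position.
positional = describe_sale("West", 1200)

# Keyword arguments are matched by name, so their order does not matter.
keyword = describe_sale(amount=1200, region="West")

# Positional and keyword arguments can be mixed; positional ones come first.
mixed = describe_sale("West", 1200, currency="EUR")

print(positional)  # West: 1200 USD
print(keyword)     # West: 1200 USD
print(mixed)       # West: 1200 EUR
```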
Array
An ordered collection of elements, all of the same data type, stored in contiguous memory. NumPy arrays (called ndarray) are the foundation of most numerical computing in Python and are far more efficient than lists for mathematical operations. Pandas DataFrames are built on top of NumPy arrays. [Ch. 9]
Assignment
The operation of binding a value to a variable name using the = operator. Python uses dynamic typing, so a variable's type is determined by the value assigned to it. Multiple assignment (a, b = 1, 2) and augmented assignment (x += 1) are common patterns. [Ch. 2]
Attribute
A variable or property associated with an object, accessed using dot notation. For example, df.shape accesses the shape attribute of a DataFrame. Attributes store the state of an object, while methods define its behavior. [Ch. 5]
Automation
The use of programs to perform repetitive tasks that would otherwise require manual effort. In business contexts, Python automation commonly covers tasks like generating reports, processing files, sending notifications, scraping web data, and scheduling recurring data jobs. [Ch. 19]
B
Bar Chart
A visualization that displays categorical data using rectangular bars whose heights or lengths represent values. Bar charts are ideal for comparing quantities across discrete categories, such as revenue by region or headcount by department. Created in matplotlib with plt.bar() and in pandas with df.plot(kind='bar'). [Ch. 16]
Base Class
In object-oriented programming, the class from which another class inherits. Also called a parent class or superclass. The derived (child) class gains all the attributes and methods of the base class and can add or override them. In Python, every class implicitly inherits from the built-in object base class. [Ch. 22]
Boolean
A data type with only two possible values: True or False. In Python, bool is a subclass of int, where True equals 1 and False equals 0. Booleans are produced by comparison operators (==, !=, <, >, <=, >=) and by not. The logical operators and and or return one of their operands, which is a boolean whenever both operands are booleans. [Ch. 2]
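Because bool is a subclass of int, booleans can be counted with sum(). The sketch below, with made-up revenue figures, also shows that and and or hand back one of their operands rather than always producing True or False.

```python
revenues = [1200, 950, 1500]  # hypothetical values

# Each comparison produces a boolean.
flags = [revenue > 1000 for revenue in revenues]
print(flags)  # [True, False, True]

# True counts as 1 and False as 0, so sum() counts the True values.
count_above = sum(flags)
print(count_above)  # 2

is_int = isinstance(True, int)
print(is_int)  # True

# and / or return one of their operands, not necessarily a bool.
fallback = 0 or "fallback"
last_truthy = "West" and "East"
print(fallback)     # fallback
print(last_truthy)  # East
```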
Boolean Indexing
A technique for selecting rows or elements from a DataFrame or array using a boolean Series or array as a filter. For example, df[df['revenue'] > 10000] returns only rows where the revenue column exceeds 10,000. This is one of the most common and powerful pandas operations. [Ch. 11]
Break Statement
A Python control flow statement that immediately exits a for or while loop, regardless of whether the loop's natural termination condition has been met. Useful for searching through data and stopping once a target is found, or for exiting an infinite loop when a condition is satisfied. [Ch. 3]
Business Intelligence (BI)
The set of technologies, processes, and practices used to collect, integrate, analyze, and present business data to support decision-making. Python can serve as both a data preparation tool for BI platforms (like Tableau or Power BI) and as a complete BI solution when combined with visualization and reporting libraries. [Ch. 15]
C
Callable
Any Python object that can be called like a function using parentheses. Functions, methods, classes (calling a class creates an instance), and objects whose class defines a __call__ method are all callable. The built-in callable() function returns True if an object is callable. [Ch. 5]
Categorical Data
Data that represents discrete groups or categories rather than continuous numeric values. Examples include department names, product categories, and customer segments. Pandas provides the Categorical dtype for efficient storage and manipulation of categorical data, and it is important for correct grouping and visualization. [Ch. 12]
Chaining
Writing multiple method calls in sequence on a single object, each method returning a modified version of the object. For example: df.dropna().sort_values('date').reset_index(). Pandas is designed to support method chaining, and it makes for concise, readable data pipelines. [Ch. 13]
Class
A blueprint or template that defines the structure and behavior of a type of object. Classes encapsulate related data (attributes) and functions (methods) together. In Python, classes are defined with the class keyword. All Python data types — lists, strings, DataFrames — are themselves classes. [Ch. 22]
CLI (Command Line Interface)
A text-based interface for interacting with a computer by typing commands into a terminal or shell. Python scripts are often run from the CLI using python script.py. Understanding basic CLI navigation is essential for running Python programs, managing files, and working with tools like pip and virtual environments. [Ch. 1]
Column
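In a pandas DataFrame, a single named field containing values of a consistent data type across all rows. Columns correspond to variables or features in a dataset. Accessed by name using bracket notation (df['column_name']) or dot notation (df.column_name). Dot notation works only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. [Ch. 10]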
Comment
Text in source code that is ignored by the Python interpreter, used to explain the purpose of code to human readers. Single-line comments start with #. Multi-line documentation is typically written as docstrings using triple quotes rather than multiple comment lines. [Ch. 2]
Compression
The process of encoding data to reduce its file size. Python's gzip, bz2, and zipfile standard library modules support compressed file formats. Pandas can read and write compressed CSV files directly (e.g., df.to_csv('file.csv.gz')), which is useful when working with large datasets. [Ch. 20]
Concatenation
Joining two or more strings, lists, or DataFrames end-to-end. String concatenation uses the + operator. In pandas, pd.concat() stacks DataFrames vertically (row-wise) or horizontally (column-wise). Concatenation is fundamentally different from merging, which joins on shared key values. [Ch. 13]
Conditional Expression
Also called a ternary expression, a compact way to write an if/else statement in a single line: value_if_true if condition else value_if_false. For example: status = "active" if sales > 0 else "inactive". Useful for simple assignments but can reduce readability when overused. [Ch. 3]
Context Manager
An object that defines setup and teardown behavior using __enter__ and __exit__ methods, invoked with the with statement. The most common use is file handling (with open('file.csv') as f:), which ensures files are always properly closed even if an error occurs. Custom context managers can also be created. [Ch. 6]
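As a rough illustration of the setup/teardown guarantee, the sketch below defines a small custom context manager. The class name ManagedResource and the event log are hypothetical and only record the order in which things happen.

```python
class ManagedResource:
    """Hypothetical resource that records when it is set up and torn down."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("setup")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even when the with-block raises an exception.
        self.log.append("teardown")
        return False  # False means: do not suppress the exception

events = []
try:
    with ManagedResource(events):
        events.append("working")
        raise ValueError("something went wrong")
except ValueError:
    events.append("error handled")

print(events)  # ['setup', 'working', 'teardown', 'error handled']
```

The teardown step runs before the exception reaches the except block, which is the same guarantee that closes a file opened with with open(...).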
Continue Statement
A Python control flow statement that skips the remainder of the current loop iteration and proceeds to the next one. Unlike break, which exits the loop entirely, continue just skips the current pass through the loop body. [Ch. 3]
Correlation
A statistical measure of the linear relationship between two numeric variables, ranging from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear relationship. In pandas, df.corr() computes a correlation matrix. Correlation does not imply causation. [Ch. 15]
Cross-Validation
A model evaluation technique that divides the training dataset into multiple subsets (folds) and trains/tests the model multiple times, each time using a different fold as the test set. K-fold cross-validation is the most common variant. It provides a more reliable estimate of model performance than a single train/test split. [Ch. 36]
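The fold rotation can be sketched without any machine learning library. The helper below is a simplified, hypothetical illustration that only produces row indices and does not shuffle; in practice a library routine such as scikit-learn's KFold would normally be used.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    splits = []
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra sample.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        splits.append((train_idx, test_idx))
        start = stop
    return splits

splits = k_fold_indices(6, 3)
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Every row appears in exactly one test fold, so each observation is used for evaluation once and for training k - 1 times.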
CSV (Comma-Separated Values)
A plain text file format where values are separated by commas (or sometimes other delimiters) and rows are separated by newlines. CSV is one of the most universally supported formats for tabular data exchange. Pandas reads CSV files with pd.read_csv() and writes them with df.to_csv(). [Ch. 6]
CRUD
An acronym for the four basic operations on persistent data: Create, Read, Update, and Delete. These operations correspond to SQL's INSERT, SELECT, UPDATE, and DELETE. Understanding CRUD is foundational for working with databases, APIs, and web applications. [Ch. 27]
D
Dashboard
An interactive visual display that consolidates key metrics and data visualizations in one place. In Python, dashboards can be built using libraries like Dash (by Plotly), Streamlit, or Panel. Business dashboards allow stakeholders to monitor KPIs without running code themselves. [Ch. 32]
Data Cleaning
The process of detecting and correcting (or removing) errors, inconsistencies, and missing values in a dataset to improve its quality for analysis. Common tasks include handling missing values, fixing data types, removing duplicates, standardizing formats, and filtering outliers. Data cleaning is often the most time-consuming step in any analysis project. [Ch. 12]
Data Frame / DataFrame
The primary two-dimensional, labeled data structure in pandas. A DataFrame is essentially a table with rows and columns, similar to a spreadsheet or a SQL table. Each column is a Series. DataFrames provide hundreds of methods for data manipulation, aggregation, and analysis. [Ch. 10]
Data Pipeline
A sequence of data processing steps where the output of each step becomes the input of the next. Pipelines automate the flow of data from raw sources through transformation to final output (reports, models, dashboards). Scikit-learn's Pipeline class formally implements this concept for machine learning workflows. [Ch. 38]
Data Type (dtype)
The classification of data that tells the computer how to store and process it. In Python, built-in types include int, float, str, bool, and complex. Pandas extends this with types like int64, float64, object, bool, datetime64, and category. Checking and converting dtypes is a critical step in data preparation. [Ch. 2]
Data Wrangling
The process of transforming and mapping raw data into a more useful format for analysis. Includes cleaning, reshaping, merging, aggregating, and enriching data. Also called data munging. In pandas, data wrangling operations include .merge(), .pivot(), .melt(), .groupby(), and string/date operations. [Ch. 13]
Decorator
A function that takes another function as an argument, adds some behavior, and returns the modified function. Decorators use the @ syntax. In Flask, @app.route('/path') is a decorator that registers a function as a web route. Python's built-in decorators include @staticmethod, @classmethod, and @property. [Ch. 33]
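A minimal sketch of a hand-written decorator follows. The names count_calls and commission are hypothetical; functools.wraps is the standard library helper that preserves the wrapped function's name and docstring.

```python
import functools

def count_calls(func):
    """Decorator that counts how many times func is called."""
    @functools.wraps(func)  # keep func's name and docstring on the wrapper
    def wrapper(*args, **kwargs):
        wrapper.call_count += 1
        return func(*args, **kwargs)
    wrapper.call_count = 0
    return wrapper

@count_calls  # equivalent to: commission = count_calls(commission)
def commission(sales, rate=0.1):
    return sales * rate

commission(1000)
commission(2000, rate=0.05)

print(commission.call_count)  # 2
print(commission.__name__)    # commission
```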
Deep Learning
A subset of machine learning that uses artificial neural networks with many layers to learn complex patterns from large datasets. Deep learning excels at tasks like image recognition, natural language processing, and speech recognition. Libraries like TensorFlow and PyTorch implement deep learning in Python. Introduced conceptually in this book but not covered in depth. [Ch. 39]
Default Parameter
A function parameter that has a predefined value, used when no argument is provided for that parameter in the function call. For example: def greet(name, greeting="Hello"):. Default parameters must come after required parameters in the function definition. [Ch. 4]
Delimiter
A character used to separate fields in a text file. The comma (,) is the delimiter in CSV files. Other common delimiters include tab (\t), pipe (|), and semicolon (;). Pandas read_csv() accepts a sep parameter to specify any delimiter. [Ch. 6]
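The same idea applies outside pandas. As a sketch, the standard library's csv module (used here instead of pandas so the example has no third-party dependencies) takes a delimiter argument; the pipe-separated sample text is made up.

```python
import csv
import io

# Hypothetical pipe-delimited data held in memory instead of a file.
pipe_text = "region|revenue\nWest|1200\nEast|950\n"

rows = list(csv.reader(io.StringIO(pipe_text), delimiter="|"))
print(rows)  # [['region', 'revenue'], ['West', '1200'], ['East', '950']]
```

Note that the csv module returns every field as a string; converting '1200' to a number is a separate step.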
Dependency
An external package or library that a program requires to function. When you import pandas, pandas is a dependency. Dependencies are specified in a requirements.txt file or pyproject.toml so that others can reproduce your environment. Managing dependencies is a key part of professional Python project setup. [Ch. 7]
Deployment
The process of making a machine learning model or application available for use in a production environment where real users or systems can access it. Deployment options include REST APIs (using Flask or FastAPI), cloud platforms (AWS, Azure, GCP), Docker containers, and serverless functions. [Ch. 40]
Dictionary (dict)
A Python built-in data structure that stores key-value pairs. Keys must be unique and hashable (strings, numbers, tuples); values can be any type. Dictionaries are created with curly braces: {'name': 'Sandra', 'dept': 'Sales'}. They provide O(1) average-case lookup by key. [Ch. 2]
Dtype
See Data Type (dtype). In pandas context specifically, dtype refers to the data type of a column or Series, checked via .dtype (singular, on a Series) or .dtypes (plural, on a DataFrame). [Ch. 10]
Duplicate
A row in a DataFrame that is identical to one or more other rows across some or all columns. Duplicates can distort analysis and should be identified with .duplicated() and removed with .drop_duplicates(). Understanding which columns define uniqueness is crucial before dropping duplicates. [Ch. 12]
E
Encoding
In data contexts, the process of converting categorical values to numeric representations for use in machine learning models. Common encoding methods include label encoding (assigning integers) and one-hot encoding (creating binary indicator columns). Pandas provides pd.get_dummies() for one-hot encoding; scikit-learn provides LabelEncoder and OneHotEncoder. [Ch. 35]
Endpoint
A specific URL in a web API or web application that accepts requests and returns responses. In Flask, each route decorated with @app.route() defines an endpoint. For example, /api/forecast might be an endpoint that accepts sales data and returns predictions. [Ch. 33]
Environment Variable
A variable stored in the operating system's environment, accessible by running processes. In Python, environment variables are accessed with os.environ.get('VARIABLE_NAME'). They are the recommended way to store sensitive configuration values like API keys and database passwords without embedding them in code. [Ch. 7]
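A brief sketch of reading environment variables follows. The variable names are hypothetical, and the value is set inside the script only so the example is self-contained; in real use it would be set outside the program.

```python
import os

# Hypothetical names. Setting the value here stands in for the operating system.
os.environ["DEMO_API_KEY"] = "abc123"
os.environ.pop("DEMO_UNSET_VARIABLE", None)  # make sure this one is absent

api_key = os.environ.get("DEMO_API_KEY")
missing = os.environ.get("DEMO_UNSET_VARIABLE")               # None when not set
fallback = os.environ.get("DEMO_UNSET_VARIABLE", "default")   # explicit default

print(api_key)   # abc123
print(missing)   # None
print(fallback)  # default
```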
ETL (Extract, Transform, Load)
A data integration pattern where data is extracted from source systems, transformed (cleaned, reformatted, enriched), and loaded into a destination (database, data warehouse, or file). Python is widely used to build ETL pipelines, especially with pandas for transformation logic. [Ch. 19]
Exception
An error that occurs during program execution, interrupting normal flow. Python has a hierarchy of built-in exception types (e.g., ValueError, TypeError, FileNotFoundError, KeyError). Exceptions can be caught and handled using try/except blocks, preventing program crashes in predictable error situations. [Ch. 6]
Exception Handling
The practice of anticipating and managing runtime errors using try, except, else, and finally blocks. Proper exception handling makes programs robust and user-friendly. In business applications, exception handling is critical when reading external files, calling APIs, or processing user input where errors are inevitable. [Ch. 6]
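The four blocks run in a fixed order that is easiest to see by recording it. In this sketch the function parse_amount is a hypothetical helper for turning user input into a number.

```python
def parse_amount(text):
    """Convert text to a float, recording which blocks ran."""
    steps = []
    try:
        value = float(text)
    except ValueError:
        steps.append("except")   # runs only when the conversion fails
        value = None
    else:
        steps.append("else")     # runs only when no exception was raised
    finally:
        steps.append("finally")  # always runs
    return value, steps

good = parse_amount("19.99")
bad = parse_amount("n/a")

print(good)  # (19.99, ['else', 'finally'])
print(bad)   # (None, ['except', 'finally'])
```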
Exploratory Data Analysis (EDA)
An approach to analyzing datasets by summarizing their main characteristics, often using visual methods. EDA is the first step in any data project and involves checking distributions, identifying outliers, examining relationships between variables, and validating data quality. EDA informs all subsequent modeling decisions. [Ch. 15]
Expression
A combination of values, variables, and operators that Python evaluates to produce a value. For example, sales * 0.1 is an expression that produces the 10% commission amount. Expressions can be simple (a single value) or complex (involving function calls and multiple operators). [Ch. 2]
F
F-string (Formatted String Literal)
A Python string prefixed with f or F that allows embedding expressions inside curly braces {} directly within the string. Introduced in Python 3.6, f-strings are the preferred way to format strings: f"Revenue: ${revenue:,.2f}". They support arbitrary Python expressions, format specifications, and conversions. [Ch. 2]
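The three capabilities mentioned above (expressions, format specifications, and conversions) can be sketched as follows, using made-up values.

```python
revenue = 1234567.891  # hypothetical value
name = "Sandra"

# Format specification: thousands separator and two decimal places.
formatted = f"Revenue: ${revenue:,.2f}"
print(formatted)  # Revenue: $1,234,567.89

# Arbitrary expression inside the braces.
computed = f"Total: {2 + 3}"
print(computed)  # Total: 5

# Conversion: !r applies repr(), which adds quotes around strings.
converted = f"Name: {name!r}"
print(converted)  # Name: 'Sandra'
```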
Feature
In machine learning, an individual measurable input variable used by a model to make predictions. Also called a predictor, independent variable, or column. Feature engineering (creating new features from existing ones) is often the most impactful way to improve model performance. [Ch. 35]
Feature Engineering
The process of using domain knowledge to create, transform, or select features from raw data to improve machine learning model performance. Examples include creating a "profit margin" column from revenue and cost columns, extracting the day of week from a date, or binning continuous values into categories. [Ch. 37]
Feature Scaling
The process of normalizing or standardizing the range of features in a dataset. Many machine learning algorithms (like k-nearest neighbors and support vector machines) are sensitive to the scale of features. StandardScaler (zero mean, unit variance) and MinMaxScaler (scale to [0,1]) are common scaling methods in scikit-learn. [Ch. 36]
File Handle
The object returned by open() that allows reading from or writing to a file. The file handle provides methods like .read(), .readlines(), .write(), and .close(). Using a context manager (with open() as f:) ensures the handle is automatically closed. [Ch. 6]
Filter
An operation that selects a subset of rows (or elements) based on a condition. In pandas, filtering is done using boolean indexing: df[df['region'] == 'West']. Multiple conditions can be combined with & (and) and | (or), and each condition must be wrapped in parentheses. [Ch. 11]
Flask
A lightweight Python web framework used to build web applications and REST APIs. Flask follows the WSGI standard and uses decorators to define routes. It is called a "micro" framework because it keeps the core small: it bundles Jinja2 templating but leaves choices such as the database layer and form validation to extensions. [Ch. 33]
Float
A Python data type for floating-point (decimal) numbers, such as 3.14 or -0.005. Floats in Python are double-precision (64-bit) by default. Important caveat: floating-point arithmetic is not exact (e.g., 0.1 + 0.2 != 0.3), which matters in financial calculations where Python's decimal module (and its Decimal class) should be used instead. [Ch. 2]
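The rounding caveat can be reproduced in a few lines. This sketch contrasts float arithmetic with the Decimal class from the standard library's decimal module; note that Decimal values are built from strings so that no float rounding sneaks in.

```python
from decimal import Decimal

float_sum = 0.1 + 0.2
float_matches = float_sum == 0.3
print(float_sum)      # 0.30000000000000004
print(float_matches)  # False

# Build Decimals from strings, not floats, to keep the values exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_matches = decimal_sum == Decimal("0.3")
print(decimal_sum)      # 0.3
print(decimal_matches)  # True
```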
For Loop
A control flow construct that iterates over each element of an iterable (such as a list, range, string, or DataFrame column). The standard form is for item in iterable:. For loops are the primary tool for processing collections of items in Python. [Ch. 3]
Foreign Key
In relational databases, a column in one table that references the primary key of another table, establishing a relationship between the tables. When merging DataFrames in pandas, the merge key columns serve the role of foreign keys. Understanding foreign key relationships is essential for correct data joins. [Ch. 14]
Function
A named, reusable block of code that performs a specific task. Functions are defined with the def keyword and may accept parameters and return values. Functions promote code reuse, readability, and testability. In Python, functions are first-class objects and can be passed as arguments to other functions. [Ch. 4]
G
Generator
A type of iterator that yields values one at a time, computing them on demand rather than storing them all in memory. Generators are created with functions that use the yield keyword, or with generator expressions (like list comprehensions but with parentheses). They are memory-efficient for processing large datasets. [Ch. 21]
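Both ways of creating a generator are sketched below. The function running_total and its input values are hypothetical.

```python
def running_total(amounts):
    """Yield the cumulative sum after each amount."""
    total = 0
    for amount in amounts:
        total += amount
        yield total  # pause here until the next value is requested

totals = running_total([100, 250, 50])
first = next(totals)   # computes only the first value
rest = list(totals)    # consumes the remaining values

print(first)  # 100
print(rest)   # [350, 400]

# Generator expression: like a list comprehension, but with parentheses.
squares = (n * n for n in range(4))
square_sum = sum(squares)
print(square_sum)  # 14
```

A generator can be consumed only once; after list(totals) runs, totals is exhausted.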
Git
A distributed version control system that tracks changes to files over time. Git allows teams to collaborate on code, track changes, revert to previous versions, and manage different development branches. Understanding basic git commands (git init, git add, git commit, git push) is an important professional skill. [Ch. 7]
Global Variable
A variable defined at the module level, outside of any function. Global variables can be read inside functions, but rebinding one with an assignment inside a function requires the global keyword. Mutable globals, such as lists, can still be changed in place without it. Overuse of global variables makes code hard to understand and test; prefer passing values as function arguments. [Ch. 4]
GroupBy
An operation that splits a DataFrame into groups based on the values of one or more columns, applies a function to each group, and combines the results. The pandas method df.groupby('category').agg(...) follows the "split-apply-combine" paradigm and is the workhorse of aggregate analysis. [Ch. 14]
H
Heatmap
A visualization where data values are represented as colors in a two-dimensional grid. Correlation matrices are commonly visualized as heatmaps using seaborn's sns.heatmap(). Color intensity indicates the magnitude and direction of relationships. [Ch. 16]
HTTP (HyperText Transfer Protocol)
The foundational protocol for data communication on the web. HTTP defines methods like GET (retrieve data), POST (submit data), PUT (update data), and DELETE (remove data). When Python programs call web APIs with the requests library, they use HTTP to communicate. [Ch. 26]
Hyperparameter
A parameter in a machine learning algorithm that is set before training begins, as opposed to parameters that are learned from the data. Examples include the maximum depth of a decision tree, the number of estimators in a random forest, and the regularization strength in logistic regression. Hyperparameter tuning is the process of finding optimal values. [Ch. 37]
I
IDE (Integrated Development Environment)
A software application that provides a comprehensive set of tools for writing, running, and debugging code. Popular Python IDEs include VS Code, PyCharm, and Spyder. Jupyter Notebook/Lab is a web-based interactive environment commonly used for data analysis. [Ch. 1]
Immutable
A property of objects that cannot be changed after creation. In Python, strings, numbers, tuples, and frozensets are immutable. Lists, dictionaries, and sets are mutable. Understanding mutability is important for understanding how Python passes objects to functions and avoids unintended side effects. [Ch. 2]
Import
The mechanism for making code from one Python module available in another. The import statement loads a module and makes its contents accessible. Common patterns include import module, from module import name, and import module as alias. [Ch. 1]
Index
In pandas, the row labels of a DataFrame or the labels of a Series. The default index is an integer starting at 0, but indexes can be set to any column using .set_index(). The index is used for row selection with .loc[] and for aligning data in operations between DataFrames. [Ch. 10]
Integer (int)
A Python data type for whole numbers without decimal points, such as 42, -7, or 0. Python integers can be arbitrarily large (no overflow). Floor division is performed with //, which rounds toward negative infinity, and the modulo operator % returns the remainder of that division. [Ch. 2]
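A few lines are enough to see how // and % behave, including with a negative operand. The numbers are arbitrary.

```python
quotient = 17 // 5
remainder = 17 % 5
print(quotient, remainder)  # 3 2

# divmod() returns both results at once.
both = divmod(17, 5)
print(both)  # (3, 2)

# With a negative operand, // rounds down (toward negative infinity).
negative_quotient = -7 // 2
print(negative_quotient)  # -4

# Integers grow as large as needed; there is no fixed-width overflow.
big = 2 ** 100
print(big > 2 ** 63)  # True
```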
Iterable
Any Python object that can return its elements one at a time. Lists, strings, tuples, sets, dictionaries, files, ranges, and generators are all iterables. An iterable can be used in a for loop or passed to functions like list(), sum(), and zip(). [Ch. 3]
J
JSON (JavaScript Object Notation)
A lightweight, human-readable data interchange format that uses key-value pairs and lists, similar in structure to Python dictionaries and lists. JSON is the dominant format for web API responses. Python's built-in json module provides json.loads() to parse JSON strings and json.dumps() to convert Python objects to JSON strings. [Ch. 26]
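A short sketch of the two functions named above follows. The JSON text is a made-up stand-in for what a web API might return.

```python
import json

# Hypothetical API response body.
response_text = '{"customer": "Sandra", "orders": [120.5, 80], "active": true}'

# json.loads: JSON string -> Python dict/list/str/float/bool.
data = json.loads(response_text)
first_order = data["orders"][0]
is_active = data["active"]
print(first_order)  # 120.5
print(is_active)    # True

# json.dumps: Python object -> JSON string.
round_trip = json.dumps({"dept": "Sales", "headcount": 12}, sort_keys=True)
print(round_trip)  # {"dept": "Sales", "headcount": 12}
```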
Jupyter Notebook
An interactive, browser-based computing environment that allows mixing executable code cells, text (in Markdown), equations, and visualizations in a single document. Widely used for data exploration, analysis, and sharing results. Notebooks are saved as .ipynb files. JupyterLab is the modern successor to classic Jupyter Notebook. [Ch. 1]
Join
An operation that combines rows from two tables (or DataFrames) based on matching values in specified columns. Join types include inner (only matching rows), left (all rows from the left, matching from right), right (opposite of left), and outer (all rows from both). The pandas .merge() function implements all join types. [Ch. 14]
K
Key-Value Pair
A fundamental data association where a key (unique identifier) is linked to a value (the data). Dictionaries in Python are the primary key-value data structure. JSON and many APIs are structured as nested key-value pairs. The key enables efficient lookup of the associated value. [Ch. 2]
KPI (Key Performance Indicator)
A measurable value that demonstrates how effectively a company is achieving key business objectives. Examples include monthly revenue, customer acquisition cost, churn rate, and inventory turnover. Python dashboards are often designed to track and visualize KPIs for business stakeholders. [Ch. 32]
k-Nearest Neighbors (kNN)
A simple machine learning algorithm that classifies new data points based on the majority label (or average value for regression) of the k closest training examples in the feature space. The choice of k and the distance metric significantly affect model performance. kNN requires feature scaling because it is distance-based. [Ch. 35]
L
Lambda Function
An anonymous, single-expression function defined with the lambda keyword. Syntax: lambda arguments: expression. For example: lambda x: x * 2. Lambda functions are commonly used with map(), filter(), sorted(), and pandas .apply() for concise one-liner transformations. [Ch. 4]
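The built-in functions mentioned above can be sketched with a small, made-up list of (region, revenue) pairs.

```python
sales = [("West", 1200), ("East", 950), ("North", 1500)]  # hypothetical data

# sorted(): the lambda picks the value to sort by.
by_revenue = sorted(sales, key=lambda pair: pair[1], reverse=True)
print(by_revenue)  # [('North', 1500), ('West', 1200), ('East', 950)]

# map(): apply the lambda to every element.
doubled = list(map(lambda x: x * 2, [1, 2, 3]))
print(doubled)  # [2, 4, 6]

# filter(): keep elements for which the lambda returns a truthy value.
large = list(filter(lambda pair: pair[1] > 1000, sales))
print(large)  # [('West', 1200), ('North', 1500)]
```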
Library
A collection of pre-written code (modules, functions, classes) that can be imported and used in programs. Python's standard library comes with Python; third-party libraries (like pandas, matplotlib, scikit-learn) are installed via pip. Libraries extend Python's capabilities for specialized tasks. [Ch. 1]
Linear Regression
A statistical and machine learning technique that models the relationship between a continuous target variable and one or more predictor variables as a linear equation. Simple linear regression has one predictor; multiple linear regression has several. Used to predict quantities like sales, prices, and demand. [Ch. 35]
List
A Python built-in data type that stores an ordered, mutable collection of items. Lists are created with square brackets: [1, 2, 3] or ['apple', 'banana', 'cherry']. Lists can contain mixed types and can be nested. Common operations include .append(), .extend(), .sort(), and slicing. [Ch. 2]
List Comprehension
A concise syntax for creating a list by applying an expression to each element of an iterable, optionally filtered by a condition. Syntax: [expression for item in iterable if condition]. For example: [x**2 for x in range(10) if x % 2 == 0]. More readable and often faster than equivalent for loops. [Ch. 5]
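For comparison, the comprehension from the definition is shown here next to the for loop it replaces; both build the same list.

```python
# Loop version: create an empty list, then append inside a condition.
squares_loop = []
for x in range(10):
    if x % 2 == 0:
        squares_loop.append(x ** 2)

# Comprehension version: expression, loop, and filter on one line.
squares_comp = [x**2 for x in range(10) if x % 2 == 0]

print(squares_comp)                  # [0, 4, 16, 36, 64]
print(squares_loop == squares_comp)  # True
```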
Local Variable
A variable defined inside a function that is only accessible within that function's scope. Local variables exist only for the duration of the function call. This scoping prevents functions from accidentally interfering with each other's data. [Ch. 4]
Logistic Regression
Despite its name, a classification algorithm (not regression) that models the probability of a binary outcome using a logistic (sigmoid) function. Widely used for customer churn prediction, loan default prediction, and any yes/no classification task. Outputs probabilities that can be thresholded to make predictions. [Ch. 35]
M
Machine Learning (ML)
A branch of artificial intelligence where systems learn from data to improve their performance on tasks without being explicitly programmed. The main types are supervised learning (labeled training data), unsupervised learning (no labels), and reinforcement learning (learning through rewards). [Ch. 34]
Matplotlib
Python's foundational plotting library, upon which many other visualization libraries (including pandas plotting and seaborn) are built. Matplotlib provides granular control over every aspect of a figure. The pyplot interface (import matplotlib.pyplot as plt) provides a MATLAB-like, state-based API. [Ch. 16]
Mean
The arithmetic average of a set of numbers: the sum divided by the count. In pandas, computed with .mean(). The mean is sensitive to outliers, so it should be compared with the median for skewed distributions. In business contexts, mean revenue per customer or mean transaction value are common metrics. [Ch. 15]
Median
The middle value of a sorted dataset, representing the 50th percentile. Less sensitive to outliers than the mean. In pandas, computed with .median(). When analyzing income, prices, or any metric with extreme outliers, median is often more representative than mean. [Ch. 15]
Melt
A pandas reshaping operation that converts a wide-format DataFrame (many columns) into a long-format DataFrame (fewer columns, more rows) by "unpivoting" specified columns into rows. The pandas method .melt() is the inverse of .pivot(). Long format is often required for certain visualization and analysis tasks. [Ch. 13]
Merge
The pandas operation for combining two DataFrames based on matching values in one or more key columns, equivalent to SQL JOIN. The pd.merge() function supports inner, left, right, and outer joins. Merging is used to enrich one dataset with information from another. [Ch. 14]
Method
A function that belongs to an object and operates on that object's data. Methods are called using dot notation: my_list.append(5), df.dropna(). Methods can modify the object in place or return a new object, depending on the operation. [Ch. 5]
Missing Value
A data entry that has no recorded value, represented as NaN (Not a Number) in pandas for numeric columns and None for object columns. Missing values must be identified (using .isna() or .isnull()) and handled — either by dropping rows (.dropna()), filling with a constant or statistic (.fillna()), or imputing with a model. [Ch. 12]
Model
In machine learning, a mathematical representation of a pattern or relationship in data, learned during training. The model is then used to make predictions on new data. In pandas/statistics contexts, model can also refer to a statistical model like linear regression. [Ch. 34]
Module
A Python file (.py) containing definitions and statements that can be imported into other Python programs. Python's standard library consists of modules. Third-party packages (installed via pip) provide modules as well. Organizing your own code into modules promotes reusability and organization. [Ch. 4]
Multicollinearity
A condition in regression analysis where two or more predictor variables are highly correlated with each other. Multicollinearity can make coefficient estimates unstable and difficult to interpret. It can be detected using correlation matrices and VIF (Variance Inflation Factor) and addressed by removing or combining correlated predictors. [Ch. 35]
Mutation
Changing the value of a mutable object (like a list or dictionary) in place. This can lead to unexpected bugs when the same object is referenced in multiple places. Understanding when Python copies an object versus references the same object is an important conceptual skill. [Ch. 5]
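A short sketch of the difference between a second reference to the same list and an independent copy. The variable names and values are illustrative only.

```python
original = [100, 200, 300]
alias = original              # a second name for the SAME list object
snapshot = original.copy()    # a new, independent list (shallow copy)

alias.append(400)             # mutates the shared list in place

print(original)   # the change made through alias is visible here
print(snapshot)   # the copy is unaffected
```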
N
NaN (Not a Number)
A special floating-point value used in pandas and NumPy to represent missing or undefined numeric data. Introduced into datasets from missing CSV fields, failed type conversions, or calculations that produce undefined results (like 0/0 on NumPy arrays; in plain Python, 0/0 raises ZeroDivisionError instead). NaN is not equal to itself (NaN != NaN), so an equality test cannot detect it; use .isna() in pandas or math.isnan() for a single float. [Ch. 12]
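The self-inequality of NaN can be checked with the standard library alone; this is a minimal sketch that does not require pandas or NumPy.

```python
import math

missing = float("nan")

equals_itself = missing == missing   # False: NaN never compares equal, even to itself
detected = math.isnan(missing)       # True: the reliable way to test a single float

print(equals_itself, detected)
```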
Normalization
The process of scaling numeric data to a standard range, typically [0, 1], by subtracting the minimum and dividing by the range. Also used loosely to refer to any form of feature scaling. Scikit-learn's MinMaxScaler implements normalization. Distinct from standardization, which scales to zero mean and unit variance. [Ch. 36]
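The min-max formula described above can be sketched by hand in plain Python. This is an illustration of the arithmetic, not scikit-learn's implementation, and it assumes the values are not all identical (otherwise the range would be zero).

```python
# Hypothetical feature values to rescale into [0, 1].
values = [10.0, 20.0, 25.0, 50.0]

lowest, highest = min(values), max(values)
value_range = highest - lowest   # assumed to be non-zero

# Subtract the minimum, then divide by the range.
scaled = [(v - lowest) / value_range for v in values]

print(scaled)
```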
NumPy
The foundational package for numerical computing in Python, providing multi-dimensional arrays (ndarray), mathematical functions, linear algebra routines, random number generation, and more. Pandas and scikit-learn are both built on NumPy. Imported as import numpy as np. [Ch. 9]
O
Object
An instance of a class, containing attributes (data) and methods (behavior). In Python, everything is an object — integers, strings, functions, and DataFrames are all objects. Objects are created by calling a class as if it were a function: df = pd.DataFrame(data). [Ch. 5]
Object-Oriented Programming (OOP)
A programming paradigm that organizes code around objects: instances of classes that combine data and behavior. Core OOP concepts include encapsulation (bundling data with methods), inheritance (classes sharing behavior), and polymorphism (objects of different types responding to the same interface). [Ch. 22]
Outlier
A data point that differs significantly from other observations. Outliers can result from data entry errors, measurement errors, or genuinely unusual events. In data analysis, outliers must be identified (via box plots, z-scores, or IQR methods) and a decision made to keep, cap, or remove them. [Ch. 12]
Overfitting
A condition where a machine learning model learns the training data too well, including its noise and random fluctuations, and consequently performs poorly on new, unseen data. Indicators include high training accuracy but low validation accuracy. Remedies include regularization, simpler models, and more training data. [Ch. 36]
P
Package
A collection of Python modules organized in a directory, usually with an __init__.py file. Packages group related modules together. pip install pandas installs the pandas package, which contains many modules. The terms library, package, and module are often used informally to mean the same thing. [Ch. 7]
Pandas
The primary Python library for data manipulation and analysis, providing the DataFrame and Series data structures. Pandas makes it easy to read, clean, transform, aggregate, and export tabular data. Named after "panel data," an econometrics term. Imported as import pandas as pd. [Ch. 10]
Parameter
A variable in a function definition that acts as a placeholder for the value that will be passed when the function is called. Distinct from an argument, which is the actual value passed. Functions can have positional parameters, keyword parameters, default parameters, *args (variable positional), and **kwargs (variable keyword). [Ch. 4]
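The kinds of parameters listed above can be seen together in one small sketch. The function order_total and its tax_rate option are hypothetical names chosen for illustration.

```python
def order_total(price, quantity=1, *extra_fees, **options):
    """Hypothetical helper: price and quantity are parameters; quantity has a default."""
    # *extra_fees collects any additional positional arguments into a tuple.
    subtotal = price * quantity + sum(extra_fees)
    # **options collects any additional keyword arguments into a dictionary.
    tax_rate = options.get("tax_rate", 0.0)
    return subtotal * (1 + tax_rate)

basic = order_total(10.0)                    # quantity falls back to its default of 1
bulk = order_total(10.0, 3)                  # 10.0 and 3 are the arguments
with_fees = order_total(10.0, 3, 2.0, 1.0)   # 2.0 and 1.0 land in extra_fees
taxed = order_total(10.0, 2, tax_rate=0.5)   # tax_rate lands in options

print(basic, bulk, with_fees, taxed)
```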
Pickle
Python's native serialization format, which converts Python objects into a byte stream that can be saved to a file and later loaded back. Commonly used to save trained machine learning models. The pickle module provides pickle.dump() and pickle.load(). Note: never unpickle data from untrusted sources. [Ch. 38]
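A minimal round trip with pickle.dump() and pickle.load() might look like the following. An in-memory buffer stands in for a file here, and the settings dictionary is a made-up example rather than a trained model.

```python
import io
import pickle

# Hypothetical object to save; in practice this is often a trained model.
settings = {"region": "West", "thresholds": [0.25, 0.5, 0.75]}

buffer = io.BytesIO()            # stands in for a file opened in binary mode
pickle.dump(settings, buffer)    # serialize the object to bytes

buffer.seek(0)                   # rewind before reading back
restored = pickle.load(buffer)   # only ever load data you created or trust

print(restored)
```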
pip
Python's package installer, used to install third-party packages from PyPI (the Python Package Index). Key commands: pip install package_name, pip install -r requirements.txt, pip freeze > requirements.txt. Running pip as python -m pip ensures that packages are installed into the intended Python environment. [Ch. 7]
Pivot Table
A data summarization tool that reorganizes data by aggregating values grouped by one or more row and column keys. Pandas pd.pivot_table() mimics Excel's pivot table functionality. Pivot tables are powerful for exploring multidimensional business data (e.g., revenue by region and product line). [Ch. 13]
Plot
A graphical representation of data. In Python, plots are created using matplotlib, seaborn, plotly, or pandas' built-in plotting (which wraps matplotlib). Effective plots communicate insights clearly and are appropriately labeled with titles, axis labels, and legends. [Ch. 16]
Precision
In classification metrics, the proportion of positive predictions that are actually correct: True Positives / (True Positives + False Positives). High precision means that when the model says "yes," it is usually right. Precision is important when false positives are costly (e.g., flagging good customers as fraud risks). [Ch. 35]
Primary Key
In relational databases, a column (or combination of columns) that uniquely identifies each row in a table. When merging DataFrames, the column(s) you merge on should ideally be unique in at least one DataFrame to avoid unintended row multiplication. [Ch. 14]
Print
The built-in print() function outputs text to the standard output (typically the console or Jupyter cell). In Python 3, print is a function (not a statement as in Python 2). It accepts multiple arguments, a sep parameter for separating values, and an end parameter (default '\n'). [Ch. 1]
Python
A high-level, interpreted, dynamically typed programming language known for its readability and broad applicability. Created by Guido van Rossum and first released in 1991. Python 3 (specifically 3.10+) is used throughout this book. Python's extensive ecosystem of scientific and data libraries makes it the dominant language for data science and analytics. [Ch. 1]
R
Random Forest
An ensemble machine learning algorithm that builds multiple decision trees on random subsets of the training data and features, then aggregates their predictions. Random forests tend to be more accurate and robust to overfitting than individual decision trees. Implemented in scikit-learn as RandomForestClassifier and RandomForestRegressor. [Ch. 37]
Recall
In classification metrics, the proportion of actual positive cases that the model correctly identifies: True Positives / (True Positives + False Negatives). Also called sensitivity or true positive rate. High recall is important when false negatives are costly (e.g., missing actual fraud cases or cancer diagnoses). [Ch. 35]
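The recall formula, alongside the precision formula from the Precision entry, can be worked through with a few counts. The numbers below are invented for a hypothetical fraud-detection model.

```python
# Hypothetical confusion-matrix counts for a fraud classifier.
true_positives = 40    # fraud cases correctly flagged
false_positives = 10   # legitimate cases wrongly flagged
false_negatives = 20   # fraud cases the model missed

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With these counts the model is right most of the time when it flags a case, yet it still misses a third of the actual fraud, which is why the two metrics are usually reported together.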
Regression
A supervised machine learning task where the goal is to predict a continuous numeric output variable. Examples include predicting sales revenue, house prices, or customer lifetime value. Contrasted with classification, where the output is a discrete category. [Ch. 35]
Regular Expression (Regex)
A sequence of characters that defines a search pattern. In Python, regular expressions are implemented via the re module. They are powerful for finding, extracting, or replacing patterns in text data. For example, extracting phone numbers or email addresses from a column of text. [Ch. 25]
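As a small illustration of pattern extraction with the re module, the sketch below pulls email addresses out of a text string. The pattern is deliberately simplified and is an assumption for this example; real-world email validation is considerably more involved.

```python
import re

# Hypothetical free-text note containing two email addresses.
notes = "Contact sandra@example.com or sales-team@example.org for pricing."

# Simplified pattern: name characters, an @ sign, a domain, a dot, a suffix.
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"

emails = re.findall(email_pattern, notes)
print(emails)
```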
Relative Path
A file path specified relative to the current working directory rather than the root of the file system. For example, data/sales.csv means the sales.csv file inside the data folder in the current directory. Relative paths are shorter but can break if code is run from a different directory. [Ch. 6]
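One way to see the relationship between relative and absolute paths is with the standard-library pathlib module. This sketch only builds path objects; it does not require the data folder or the file to exist.

```python
from pathlib import Path

relative = Path("data") / "sales.csv"   # interpreted from the current directory
absolute = Path.cwd() / relative        # anchored at the current working directory

print(relative.is_absolute())   # False
print(absolute.is_absolute())   # True
```

Because the absolute path is derived from Path.cwd(), running the same script from a different directory produces a different absolute path, which is exactly how relative paths can break.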
Requirements File
A text file (typically named requirements.txt) that lists the Python packages and versions required to run a project. Generated with pip freeze > requirements.txt. Collaborators install the same environment with pip install -r requirements.txt. [Ch. 7]
REST API (Representational State Transfer API)
A web service architecture that uses HTTP methods (GET, POST, PUT, DELETE) to perform operations on resources identified by URLs. REST APIs are stateless and return data typically in JSON format. Most modern web services expose REST APIs that Python can consume using the requests library. [Ch. 26]
Return Statement
The statement that exits a function and optionally sends a value back to the caller. Functions without an explicit return statement return None. A function can return multiple values as a tuple: return value1, value2. [Ch. 4]
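Both behaviors described above, returning several values as a tuple and returning None implicitly, appear in this short sketch. The function names are hypothetical.

```python
def sales_range(sales):
    """Return the smallest and largest sale as a tuple."""
    return min(sales), max(sales)

def log_sale(amount):
    """Print a message; there is no return statement, so the result is None."""
    print(f"Recorded sale of {amount}")

lowest, highest = sales_range([120, 450, 300])   # tuple unpacking
nothing = log_sale(120)

print(lowest, highest, nothing)
```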
ROC Curve (Receiver Operating Characteristic)
A graphical plot showing the trade-off between the true positive rate (recall) and false positive rate at various classification thresholds. The area under the ROC curve (AUC-ROC) is a single metric summarizing classifier performance across all thresholds, where 1.0 is perfect and 0.5 is random. [Ch. 35]
Row
A single record or observation in a DataFrame, corresponding to one entity being described (e.g., one sale transaction, one customer, one product). Rows are accessed by position with .iloc[] or by index label with .loc[]. [Ch. 10]
S
Scikit-learn
Python's primary machine learning library, providing a consistent API for classification, regression, clustering, dimensionality reduction, model evaluation, and preprocessing. Scikit-learn's fit(), predict(), and transform() conventions make it easy to swap algorithms. Imported as from sklearn import .... [Ch. 34]
Scope
The region of a program where a variable is accessible. Python uses the LEGB rule: Local, Enclosing, Global, Built-in scopes are searched in order. Understanding scope prevents bugs from accidentally modifying variables in outer scopes and helps design clean function interfaces. [Ch. 4]
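The four LEGB levels can be shown in a single nested function. The names tax_rate, make_pricer, and markup are illustrative assumptions, not code from the book.

```python
tax_rate = 0.10                      # Global: defined at module level

def make_pricer():
    markup = 1.5                     # Enclosing: belongs to the outer function

    def price(cost):
        total = cost * markup        # Local: total exists only inside price()
        # round is found last, in the Built-in scope.
        return round(total * (1 + tax_rate), 2)

    return price

price = make_pricer()
result = price(100)
print(result)
```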
Seaborn
A statistical data visualization library built on matplotlib that provides a higher-level interface for creating attractive, informative visualizations with less code. Seaborn excels at visualizing distributions, relationships between variables, and categorical data. Imported as import seaborn as sns. [Ch. 16]
Series
A one-dimensional labeled array in pandas that can hold data of any type. Every column of a DataFrame is a Series. A Series has both a values array and an index. Created with pd.Series([1, 2, 3]) or extracted from a DataFrame with df['column_name']. [Ch. 10]
Set
A Python built-in data type that stores an unordered collection of unique elements. Sets are created with curly braces {1, 2, 3} or with set(); an empty set must be written set(), because {} creates an empty dictionary. They support fast membership testing and set operations: union (|), intersection (&), difference (-), and symmetric difference (^). [Ch. 2]
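The four set operators can be tried on two small customer groups. The names are invented for illustration.

```python
# Hypothetical customers by sales channel.
online = {"Ana", "Ben", "Cara"}
in_store = {"Ben", "Cara", "Dev"}

all_customers = online | in_store    # union: in either set
both_channels = online & in_store    # intersection: in both sets
online_only = online - in_store      # difference: in online but not in_store
one_channel = online ^ in_store      # symmetric difference: in exactly one set

print("Ben" in online)               # fast membership test
```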
Slice
An operation that extracts a portion of a sequence (list, string, tuple) or DataFrame using the [start:stop:step] syntax. For example, my_list[1:4] returns elements at indices 1, 2, and 3. Pandas .iloc[] supports slicing for row/column selection. [Ch. 2]
Sorting
Arranging elements in a specified order. In Python, sorted() returns a new sorted list; list.sort() sorts in place. In pandas, df.sort_values('column') sorts rows by a column's values (ascending by default; use ascending=False for descending). [Ch. 11]
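The difference between sorted() and list.sort() is easy to miss, so here is a brief sketch using plain Python lists and made-up sales figures.

```python
monthly_sales = [320, 150, 410, 275]

ranked = sorted(monthly_sales, reverse=True)   # returns a NEW list, largest first
original_order = monthly_sales.copy()          # the original list is still unsorted

sort_result = monthly_sales.sort()             # sorts in place and returns None

print(ranked, original_order, monthly_sales, sort_result)
```

A common beginner mistake is writing monthly_sales = monthly_sales.sort(), which replaces the list with None.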
SQL (Structured Query Language)
A domain-specific language for managing and querying relational databases. Key SQL operations — SELECT, WHERE, GROUP BY, JOIN, ORDER BY — have direct pandas equivalents. Python connects to SQL databases via libraries like sqlite3, psycopg2 (PostgreSQL), or pyodbc. [Ch. 28]
Standard Deviation
A measure of the spread or dispersion of a set of values around their mean. Higher standard deviation indicates more variability. In pandas, computed with .std(). Useful for understanding how consistent sales figures, production times, or other business metrics are. [Ch. 15]
Standard Library
The collection of modules that come built into Python and are available without installing additional packages. Includes essential modules like os, sys, datetime, json, csv, re, collections, itertools, pathlib, logging, and many more. [Ch. 6]
Statement
A complete instruction that Python can execute. Statements include assignments (x = 5), function calls (print("hello")), and control flow structures (if, for, while). Unlike expressions (which produce values), statements perform actions. [Ch. 2]
String (str)
A Python data type for text, defined as an immutable sequence of Unicode characters. Strings are created with single quotes, double quotes, or triple quotes. Python provides extensive string methods: .upper(), .lower(), .strip(), .split(), .join(), .replace(), .startswith(), .endswith(), and many more. [Ch. 2]
Supervised Learning
A type of machine learning where the model is trained on labeled data, meaning data that includes both the input features and the correct output (target variable). The model learns to map inputs to outputs and then predicts outputs for new, unseen inputs. Classification and regression are both supervised learning tasks. [Ch. 34]
T
Target Variable
The output variable that a supervised machine learning model is trained to predict. Also called the dependent variable, label, or response variable. In sales forecasting, next month's revenue is the target; in customer churn prediction, whether a customer leaves is the target. [Ch. 35]
Timestamp
A data type representing a specific point in time, including date and time information. Pandas uses pd.Timestamp (similar to Python's datetime.datetime) to represent timestamps. Time-series analysis depends on having correctly typed timestamps, usually created via pd.to_datetime(). [Ch. 18]
Train-Test Split
The practice of dividing a dataset into two subsets: a training set (used to fit the model) and a test set (used to evaluate model performance on unseen data). Scikit-learn's train_test_split() function performs this split. A typical split is 80% training, 20% testing. [Ch. 36]
Tuple
A Python built-in data type for an ordered, immutable collection of items. Tuples are created with parentheses: (1, 2, 3) or ('Sandra', 'Sales', 5). Because they are immutable, tuples are often used for fixed data like coordinates, database records, or function return values. [Ch. 2]
Type Casting
Converting a value from one data type to another using constructor functions: int(), float(), str(), bool(), list(), tuple(), set(). For example, int("42") converts the string "42" to the integer 42. In pandas, column types are changed using .astype(). [Ch. 2]
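A few conversions, including two that often surprise beginners, are sketched below with illustrative values.

```python
quantity = int("42")          # string to integer
price = float("19.99")        # string to float
label = str(42)               # integer to string

truncated = int(3.9)          # int() truncates toward zero; it does not round
flag = bool("False")          # any non-empty string is truthy, even "False"

regions = set(["East", "West", "East"])   # converting to a set drops duplicates

print(quantity, price, label, truncated, flag, regions)
```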
TypeError
A Python exception raised when an operation is applied to an object of an inappropriate type. For example, trying to add a string and an integer ("revenue" + 500) raises a TypeError. Type errors are commonly encountered when reading data where numeric columns are stored as strings. [Ch. 6]
U
Underfitting
A condition where a machine learning model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data. Remedies include using a more complex model, adding more features, or reducing regularization. [Ch. 36]
Unicode
An international character standard that assigns a unique code point to characters from virtually all of the world's writing systems. Python 3 strings are Unicode by default, enabling programs to work correctly with text in any language. This matters when reading CSV files containing non-English characters or special symbols, where the file's encoding (such as UTF-8) must be specified correctly. [Ch. 6]
Unit Test
A type of software test that verifies the behavior of individual functions or components in isolation. In Python, the unittest module and the third-party pytest framework provide testing infrastructure. Unit tests catch bugs early, document expected behavior, and enable safe refactoring. [Ch. 23]
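A unit test with the standard-library unittest module might be sketched as follows. The function apply_discount is a hypothetical example, and the test suite is run programmatically here so the snippet works in a script or a notebook cell.

```python
import unittest

def apply_discount(price, percent):
    """Hypothetical function under test: reduce price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (100 - percent) / 100

class TestApplyDiscount(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200, 25), 150)

    def test_invalid_percent_raises(self):
        with self.assertRaises(ValueError):
            apply_discount(200, 150)

# Load and run the tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestApplyDiscount)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```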
Unsupervised Learning
A type of machine learning where the model is trained on unlabeled data and discovers structure, patterns, or groupings without explicit guidance. Clustering (k-means) and dimensionality reduction (PCA) are common unsupervised learning techniques. [Ch. 39]
V
ValueError
A Python exception raised when a function receives an argument of the correct type but inappropriate value. For example, int("hello") raises a ValueError because the string is not a valid integer representation. Common in data cleaning when converting columns to numeric types. [Ch. 6]
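In data cleaning, a ValueError is often caught so that bad entries can be marked as missing instead of stopping the program. The helper to_int_or_none below is a hypothetical name used for illustration.

```python
def to_int_or_none(text):
    """Convert text to an integer, or return None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None

# Hypothetical raw column values read from a file as strings.
raw_values = ["42", "hello", " 7 ", "3.5"]
cleaned = [to_int_or_none(value) for value in raw_values]

print(cleaned)
```

Note that int() tolerates surrounding whitespace but rejects "3.5", because a decimal string is not a valid integer representation.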
Variable
A named storage location in a program's memory that holds a value. In Python, variables are created by assignment and can hold any type of value. Python is dynamically typed, so the type of a variable can change. Variable names should be descriptive: monthly_revenue is better than x. [Ch. 2]
Version Control
A system that tracks changes to files over time, allowing developers to recall specific versions, compare changes, and collaborate. Git is by far the most common version control system. Using version control for Python projects is a professional standard and enables collaboration and error recovery. [Ch. 7]
Virtual Environment
An isolated Python environment that has its own installed packages, independent of the system Python installation and other virtual environments. Created with python -m venv env_name. Essential for managing project-specific dependencies and preventing conflicts between projects. [Ch. 7]
Visualization
The graphical representation of data to communicate patterns, trends, and insights. Effective visualization is both an art and a science. Python's main visualization libraries include matplotlib (foundational), seaborn (statistical), plotly (interactive), and pandas' built-in plotting. [Ch. 16]
W
Web Scraping
The automated extraction of data from websites by parsing HTML. Python's requests library fetches web pages and BeautifulSoup parses the HTML structure. Web scraping is useful when data is not available via an API, but practitioners must respect websites' terms of service and robots.txt files. [Ch. 27]
While Loop
A control flow construct that repeatedly executes a block of code as long as a condition remains True. Unlike for loops (which iterate a fixed number of times over an iterable), while loops continue indefinitely until the condition becomes False or a break statement is hit. Always ensure a while loop will eventually terminate. [Ch. 3]
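A while loop suits problems where the number of iterations is not known in advance. The sketch below counts how many years a balance needs to reach a target at an assumed 10% annual growth rate; the figures and the safety cap of 100 years are illustrative choices.

```python
balance = 1000.0
target = 1500.0
years = 0

# The second condition is a safety cap that guarantees termination.
while balance < target and years < 100:
    balance *= 1.10   # assumed 10% growth per year
    years += 1

print(f"Target reached after {years} years")
```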
Widget
In interactive notebook contexts (using ipywidgets), a graphical control element such as a slider, dropdown, or checkbox that allows users to interact with visualizations and analyses without writing code. Widgets are useful for building exploratory data analysis tools in Jupyter. [Ch. 32]
X
XML (eXtensible Markup Language)
A markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable. Some legacy business systems export data in XML format. Python's xml.etree.ElementTree module parses XML data. Less common than JSON in modern APIs but still encountered in enterprise systems. [Ch. 26]
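Parsing a small XML export with xml.etree.ElementTree might look like this. The element and attribute names (orders, order, id, customer, total) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML export from a legacy order system.
xml_text = """
<orders>
    <order id="1"><customer>Ana</customer><total>120.50</total></order>
    <order id="2"><customer>Ben</customer><total>80.00</total></order>
</orders>
"""

root = ET.fromstring(xml_text)

# Turn each <order> element into a plain dictionary.
orders = [
    {
        "id": order.get("id"),                    # attribute value (a string)
        "customer": order.findtext("customer"),   # text of a child element
        "total": float(order.findtext("total")),
    }
    for order in root.findall("order")
]

print(orders)
```

A list of dictionaries like this can then be passed to pd.DataFrame() for further analysis.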
Z
Z-Score
A statistical measure expressing how many standard deviations a data point is from the mean of the dataset. Calculated as (value - mean) / std_dev. Z-scores are commonly used to detect outliers (a frequent rule of thumb flags values more than 3 standard deviations from the mean) and are the output of StandardScaler in scikit-learn. [Ch. 36]
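The z-score formula can be applied by hand with the standard-library statistics module. This sketch uses the population standard deviation (pstdev), which matches the convention StandardScaler uses; pandas' .std() defaults to the sample standard deviation and would give slightly different values. The data is made up.

```python
import statistics

# Hypothetical daily order counts.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = statistics.mean(values)
std_dev = statistics.pstdev(values)   # population standard deviation

# (value - mean) / std_dev for every data point
z_scores = [(v - mean_value) / std_dev for v in values]

print(z_scores)
```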
This glossary covers terms introduced across all 40 chapters. For full explanations with context, code examples, and exercises, refer to the chapter where each term was first introduced. Additional terms related to specific libraries can be found in their official documentation, links for which are provided in the Bibliography.