Case Study 1: Setting Up a Data Science Environment for a University Research Lab

Contributors to Introduction to Data Science

Tier 3 — Illustrative/Composite Example: Dr. Ananya Chakraborty and the University of Clearwater Data Science Lab are fictional, but this case study reflects common real-world decisions that instructors, IT administrators, and research lab managers face when setting up data science environments for groups of students. The technical challenges, trade-offs, and solutions described here are composites of widely documented institutional experiences. All names, institutions, and specific configurations are invented for pedagogical purposes.

The Setting

Dr. Ananya Chakraborty is an associate professor of public health at the University of Clearwater. This semester, she's teaching a new course: "Data-Driven Public Health: Tools and Methods." It's the first time the public health department has offered a course that requires programming, and enrollment is capped at 30 students.

Her students are a mix: some are public health majors who have never touched code, a few have taken an introductory programming class, and one or two are double-majoring in computer science. None of them have used Jupyter notebooks before.

Dr. Chakraborty has three weeks before the semester starts, and she needs to answer a question that sounds simple but turns out to be surprisingly complex: How do I set up 30 data science environments so that every student can start coding on Day 1?

This question sits at an intersection that doesn't get much attention in data science textbooks but is critically important in practice: the logistics of teaching data science, not just doing it.

The Requirements

Before choosing tools, Dr. Chakraborty lists what she needs:

Every student must have Python, Jupyter, pandas, matplotlib, and scikit-learn. These are the core tools for the course.
Setup should take no more than 30 minutes of class time. She has 16 weeks of material to cover. She can't spend three classes debugging installation issues.
The environment must be consistent. If one student has pandas 2.x and another has pandas 3.x, code that works on one machine might fail on another. She's seen this before in workshops and knows it's a disaster.
It must work on Windows, macOS, and Linux. Her students use all three.
It should be free. The department has no software budget for this course.
Students should be able to work offline — the campus Wi-Fi is unreliable in some buildings.
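The first requirement can be turned into a quick self-check that students run in a notebook cell. The sketch below is an illustration, not part of Dr. Chakraborty's materials: the function name `check_required_packages` is hypothetical, and the module list is an assumption about how the tools are imported (scikit-learn is imported as `sklearn`, and `jupyter_core` is used here as a stand-in for "Jupyter is installed").

```python
from importlib.util import find_spec

# Import names for the course's core tools (an assumed mapping).
# scikit-learn is imported as "sklearn"; jupyter_core ships with Jupyter.
REQUIRED_MODULES = ["pandas", "matplotlib", "sklearn", "jupyter_core"]


def check_required_packages(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in modules}


# Print a one-line status for each tool.
for name, available in check_required_packages().items():
    print(f"{name:15s} {'OK' if available else 'MISSING'}")
```

A student who sees MISSING next to any name knows before Day 1 that the installation needs attention.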

Decision 1: Anaconda vs. pip vs. Cloud

Dr. Chakraborty considers three approaches.

Option A: Anaconda (Full Distribution)

Pros: One-click installer. Includes everything she needs. Tested and reliable. Works on all three operating systems. Students can work offline.

Cons: The installer is approximately 800 MB, and the full installation uses about 4 GB of disk space. Older student laptops may struggle with this, and Chromebooks may not be able to run the installer at all. The download itself takes 10-15 minutes on campus Wi-Fi.

Option B: Miniconda + Manual Installation

Pros: Miniconda is only about 50 MB. Students install only what they need. Uses less disk space. Still uses conda for package management, so version conflicts are managed.

Cons: Requires students to run command-line installation commands (conda install pandas matplotlib scikit-learn jupyter). For students who have never used a terminal, this adds a layer of complexity and a new class of potential errors.

Option C: Google Colab (Cloud-Based)

Pros: Zero installation. Students open a browser, log into their Google account, and they're ready. Libraries are pre-installed. No disk space issues. No version conflicts.

Cons: Requires an internet connection at all times (a problem with unreliable campus Wi-Fi). Google Colab's interface differs slightly from standard Jupyter, so tutorials and screenshots won't always match. Files are stored on Google Drive, which adds another layer of potential confusion. Students don't learn how to set up a local environment — a skill they'll need in their careers.

Dr. Chakraborty's Decision

She chooses Anaconda as the primary option, with Google Colab as the backup. Here's her reasoning:

"I want students to have a real, local data science environment. When they leave this course and start doing research or working in industry, they'll need to install and manage their own tools. If I hide all of that behind a cloud service, I'm teaching them to drive without ever opening the hood.

"But I also need a safety net. If someone's laptop is too old for Anaconda, or if the installer fails and I can't debug it in 10 minutes, I need them to still be able to participate. Colab is that safety net."

She prepares a one-page installation guide with step-by-step instructions for Windows, macOS, and Linux — essentially a condensed version of what you read in Section 2.2 of this chapter. She emails it to students two weeks before class and asks them to install before Day 1.
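A guide like this could include a disk-space check, since the full Anaconda installation described earlier needs about 4 GB. The sketch below is one possible version. It assumes some Python interpreter is already available, which is not guaranteed on Windows before Anaconda is installed; without one, the same number can be read from the file manager. The helper name `has_enough_space` and the 4 GB threshold constant are illustrative.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 4  # approximate size of a full Anaconda install, per the case study


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the drive containing `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Anaconda installs into the home directory by default, so check that drive.
home = Path.home()
free_gb = shutil.disk_usage(home).free / 1024**3
print(f"Free space on the drive containing {home}: {free_gb:.1f} GB")
if has_enough_space(home):
    print("Enough room for the full Anaconda distribution.")
else:
    print("Not enough room; consider Miniconda or the Colab backup.")
```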

Decision 2: Jupyter Notebook vs. JupyterLab

Both come with Anaconda. Which should she teach?

Jupyter Notebook (Classic) is the original interface. It's simpler, with fewer features and less visual complexity. Most online tutorials, Stack Overflow answers, and course materials use it. The interface has one purpose: editing a notebook.

JupyterLab is the newer interface. It has a file browser sidebar, tabbed editing (like a code editor), a built-in terminal, and support for multiple windows. It's more powerful but also more complex — more things to click, more panels to manage, more potential confusion.

Dr. Chakraborty chooses Classic Jupyter Notebook for the course. "JupyterLab is better for experienced users," she writes in her course notes, "but it's overwhelming for beginners. My students need to focus on learning Python and data analysis, not on navigating a complex IDE. The classic notebook has one big advantage: simplicity. There's one thing on the screen, and that's the notebook you're working on."

She mentions JupyterLab in Week 1 and tells students they're welcome to try it, but all class demos and screenshots will use the classic interface.

Decision 3: Shared Environments

Here's a problem Dr. Chakraborty has seen in other courses: students end up with different package versions, and code that works for one student fails for another. This is especially frustrating when students are helping each other — "It works on my machine!" is not a helpful debugging statement.

Her solution: she creates a conda environment file — a simple text file that specifies exactly which packages and versions to install. She calls it ph-data-env.yml:

name: ph-data
channels:
  - defaults
dependencies:
  - python=3.12
  - jupyter=1.0
  - pandas=2.3
  - matplotlib=3.10
  - scikit-learn=1.6
  - numpy=2.1
  - seaborn=0.13

She posts this file on the course website with a one-line command to install it:

conda env create -f ph-data-env.yml

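This creates a separate environment called ph-data with the specified versions. Students then switch to it with conda activate ph-data before launching Jupyter. Every student who follows these two steps works with the same versions of the listed packages.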
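One way to confirm that a student's environment matches the file is to compare installed versions against the pins. The sketch below is an illustration, not part of the course materials: the `PINS` dictionary is copied by hand from ph-data-env.yml, and the helper names `matches_pin` and `check_pins` are hypothetical. A pin such as 2.3 is treated as matching 2.3 and any 2.3.x release, which is an assumption meant to mirror how conda reads a single `=`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied by hand from ph-data-env.yml (package name -> version prefix).
PINS = {
    "pandas": "2.3",
    "matplotlib": "3.10",
    "scikit-learn": "1.6",
    "numpy": "2.1",
    "seaborn": "0.13",
}


def matches_pin(installed, pin):
    """True if `installed` equals `pin` or is a patch release of it.

    "2.3.1" matches "2.3", but "2.30.0" does not.
    """
    return installed == pin or installed.startswith(pin + ".")


def check_pins(pins=PINS):
    """Return {package: (installed_version_or_None, ok)} for each pinned package."""
    report = {}
    for package, pin in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        ok = installed is not None and matches_pin(installed, pin)
        report[package] = (installed, ok)
    return report


for package, (installed, ok) in check_pins().items():
    status = "OK" if ok else "MISMATCH"
    print(f"{package:14s} pinned {PINS[package]:6s} installed {installed}  {status}")
```

Run inside the activated ph-data environment, every line should end in OK. A MISMATCH usually means the notebook was launched from a different environment.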

We haven't covered environments in depth in this chapter — it's a more advanced topic that becomes important in collaborative work (we'll revisit it in Chapter 33 on Reproducibility). But even mentioning it here gives you a preview of how professionals manage the "it works on my machine" problem.

Day 1: What Actually Happened

Day 1 arrives. Dr. Chakraborty allocated the first 30 minutes for setup verification and troubleshooting. Here's what she encountered:

22 out of 30 students had Anaconda installed and working before class. They followed the pre-class instructions, everything went smoothly, and they were able to launch Jupyter and run a test notebook within the first five minutes.

4 students had installation problems:
- Two Windows students got a "Windows Defender SmartScreen" warning that blocked the installer. Fix: click "More info" then "Run anyway."
- One macOS student on an older MacBook Air ran out of disk space. Fix: switched to Miniconda with only the essential packages.
- One Linux student installed Anaconda but didn't initialize it for their shell. Fix: ran conda init bash and opened a new terminal.
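The Linux student's problem comes down to the shell not being able to find the conda command. A rough way to test for that from any Python prompt is sketched below. The helper name `find_tool` is hypothetical, and the check only looks at the PATH seen by the Python process, so it is a hint rather than a full diagnosis.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is not found."""
    return shutil.which(name)


conda_path = find_tool("conda")
if conda_path is None:
    print("conda was not found on PATH. Try 'conda init' for your shell, "
          "then open a new terminal.")
else:
    print(f"conda found at {conda_path}")
```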

2 students hadn't installed Anaconda at all — they hadn't checked their email before class. Dr. Chakraborty gave them the Colab backup link and they were coding within two minutes. They installed Anaconda after class.

2 students were on Chromebooks, which don't run Anaconda. They used Colab for the entire semester. "It wasn't ideal," Dr. Chakraborty noted later, "but it worked. They learned the same concepts. The only thing they missed was the local installation experience, which I supplemented with a one-on-one walkthrough on a department computer."

Common Student Problems (Weeks 1-3)

Over the first three weeks, Dr. Chakraborty kept a log of the most common problems her students encountered:

Problem 1: "My notebook won't open — it downloads a file instead." When students double-click a .ipynb file, their operating system doesn't know what to do with it. The fix: always open notebooks from within Jupyter (launch Jupyter first, then navigate to the file), not by double-clicking the file.

Problem 2: "I keep getting NameError." Students would define a variable in one cell, close the notebook, reopen it later, and try to use the variable without re-running the cell that defined it. The fix: explain that once the kernel has been shut down or restarted (for example, after quitting Jupyter), everything in memory is gone, even though the old output is still visible on the page. You need to run all the cells again. "Kernel > Restart & Run All" becomes their best friend.
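The behavior behind Problem 2 can be imitated in plain Python. In this sketch each dictionary stands in for one kernel's memory, and `exec` plays the role of running a cell. This is a simplified model for illustration, not how Jupyter is implemented, and the variable names are made up.

```python
# Each dict stands in for one kernel's memory (its namespace).
first_session = {}
exec("cases = [12, 15, 9]", first_session)   # cell 1: define the variable
exec("total = sum(cases)", first_session)    # cell 2: use it
print(first_session["total"])                # 36

# Reopening the notebook after a shutdown starts with a fresh, empty kernel.
new_session = {}
try:
    exec("total = sum(cases)", new_session)  # cell 2 again, without cell 1
    error_message = None
except NameError as err:
    error_message = str(err)
print(error_message)                         # name 'cases' is not defined
```

Re-running cell 1 in the new session defines `cases` again, which is what "Restart & Run All" does for every cell at once.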

Problem 3: "My cells are running out of order." Students would run cells non-sequentially (editing cell 3, then jumping to cell 7, then back to cell 1) and end up with confusing results. The fix: establish the habit of running cells top-to-bottom, and using "Restart & Run All" as a sanity check.
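To see why execution order matters, the sketch below runs the same four "cells" in two different orders. The helper `run_cells` and the cell contents are invented for illustration, and a dictionary again stands in for the kernel's memory.

```python
def run_cells(cells, order):
    """Run notebook-style cells (strings of code) in the given order.

    Returns the namespace left behind, like a kernel's memory.
    """
    namespace = {}
    for index in order:
        exec(cells[index], namespace)
    return namespace


cells = [
    "rate = 0.10",                   # cell 0
    "population = 2000",             # cell 1
    "expected = rate * population",  # cell 2
    "rate = 0.25",                   # cell 3: a later change to the rate
]

top_to_bottom = run_cells(cells, [0, 1, 2, 3])
jumped_around = run_cells(cells, [0, 1, 3, 2])

print(top_to_bottom["expected"])  # computed while rate was 0.10
print(jumped_around["expected"])  # computed while rate was 0.25
```

Both runs end with the same code on screen and the same final value of `rate`, yet `expected` differs. A top-to-bottom rerun removes that ambiguity.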

Problem 4: "I can't find my notebook." Students would create notebooks without paying attention to which directory they were in. The notebooks ended up scattered across random folders. The fix: on Day 2, Dr. Chakraborty spent 10 minutes showing students how to create a dedicated course folder and always navigate there before creating new notebooks.
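The folder habit can also be scripted. The sketch below shows one way to create a course folder and to see where Python is currently working. The function name `make_course_folder` and the folder name `ph-data-course` are hypothetical, not taken from the course.

```python
from pathlib import Path


def make_course_folder(base, name="ph-data-course"):
    """Create base/name with a notebooks subfolder if missing; return the course folder."""
    course = Path(base) / name
    (course / "notebooks").mkdir(parents=True, exist_ok=True)
    return course


# New notebooks are saved relative to this directory, so check it first.
print(f"Current working directory: {Path.cwd()}")
```

Calling `make_course_folder(Path.home())` would create the folder in the student's home directory. Running it a second time is harmless because existing folders are left alone.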

Problem 5: "Jupyter won't start." Two students accidentally closed the terminal window where the notebook server was running, then couldn't figure out why Jupyter stopped working. The fix: explain that the terminal window is the server, and it needs to stay open (but can be minimized).

Lessons Learned

At the end of the semester, Dr. Chakraborty wrote a reflection for the department:

"The setup is the hardest part." She found that once students were past the installation and interface learning curve (about 2 weeks), the technical barriers dropped dramatically. The Python programming itself was hard, but in a productive, educational way. The installation problems were hard in an unproductive, frustrating way. Anything that reduces setup friction is worth doing.

"Have a backup plan." The Google Colab fallback saved two students from falling behind on Day 1. In future semesters, she plans to have a Colab version of every class notebook ready.

"Teach the interface explicitly." She initially assumed students would figure out the Jupyter interface through exploration. They didn't. After the first semester, she added a dedicated 20-minute "Jupyter tour" to Day 1, covering cells, cell types, keyboard shortcuts, and the Restart & Run All workflow. This reduced interface-related questions by about 60%.

"The environment file was worth the effort." Not a single student had a version conflict during the semester. The ph-data-env.yml file took 10 minutes to create and saved dozens of hours of debugging.

"Students who struggled with setup often struggled later." Not because the setup problems were a sign of weakness — they weren't. But because students who fell behind in Week 1 sometimes stayed behind, feeling like they were "already lost." Dr. Chakraborty now sends the installation instructions three weeks before class and offers drop-in office hours for setup help.

Connection to Your Learning

If you're reading this case study, you just finished Section 2.2 — which means you've already done what 22 of Dr. Chakraborty's students did successfully. You installed Anaconda, verified it, and launched Jupyter.

But this case study matters for a reason beyond your own setup. As your data science skills grow, you'll eventually find yourself in Dr. Chakraborty's position: setting up an environment for a team, a class, or a research group. You'll need to make the same decisions she made — Anaconda vs. Miniconda, classic vs. Lab, local vs. cloud — and the right answer will depend on your specific context.

Data science isn't just about analyzing data. It's about creating the conditions under which analysis can happen. The best algorithm in the world is useless if your team can't install the software to run it.

Discussion Questions

If you were setting up a data science environment for a team of 10 colleagues at a company (not a university), would you make the same choices Dr. Chakraborty made? What might be different in a corporate setting?
What are the trade-offs between teaching students to set up their own environments (educational but time-consuming) versus providing a pre-configured cloud environment (fast but less educational)?
Dr. Chakraborty chose classic Jupyter Notebook over JupyterLab for simplicity. Can you think of a scenario where JupyterLab would be the better choice, even for beginners?
The five common student problems listed above are all interface problems, not programming problems. Why do you think interface issues are so common for beginners, and what could tool designers do to reduce them?