Case Study 2: Transfer Learning for Domain-Specific Image Classification
Overview
In this case study, we apply transfer learning to classify images in a domain very different from ImageNet: aerial satellite imagery for land-use classification. We compare three transfer learning strategies (feature extraction, fine-tuning the last layers, and fine-tuning the entire network) against a from-scratch baseline, and demonstrate that transfer learning from ImageNet provides substantial benefits even when the target domain is visually dissimilar.
This case study brings together CNN architectures (Section 14.6), transfer learning techniques (Section 14.8), and training best practices from Chapter 12.
Problem Statement
Our task is to classify aerial images into six land-use categories: agricultural, commercial, industrial, residential, forest, and water. The dataset contains 2,400 images (400 per class), each a 64x64-pixel RGB image. This is small by deep learning standards, which makes transfer learning essential.
Our goals:
1. Compare training from scratch against three transfer learning strategies.
2. Demonstrate discriminative learning rates and gradual unfreezing.
3. Achieve the highest possible accuracy on a held-out test set.
4. Analyze which pretrained features transfer to aerial imagery.
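Before comparing strategies we need a data pipeline. The directory layout and exact splits are not specified in the case study, so the sketch below assumes an ImageFolder-style layout under a hypothetical data/landuse/ directory. For the pretrained backbones, the 64x64 images are resized to 224x224 and normalized with ImageNet statistics; the from-scratch baseline can keep the native 64x64 resolution.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),      # pretrained ResNet-18 expects ImageNet-scale inputs
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),    # aerial scenes have no canonical "up"
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
test_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Hypothetical paths; one subdirectory per land-use class.
train_ds = datasets.ImageFolder("data/landuse/train", transform=train_tf)
test_ds = datasets.ImageFolder("data/landuse/test", transform=test_tf)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4)
test_loader = DataLoader(test_ds, batch_size=64, shuffle=False, num_workers=4)
```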
Strategy 1: Training from Scratch (Baseline)
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models

torch.manual_seed(42)


class SmallCNN(nn.Module):
    """A small CNN designed for 64x64 images."""

    def __init__(self, num_classes: int = 6) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256),
            nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)


# Result: ~62% test accuracy after 100 epochs
```
Training from scratch with only 2,400 images leads to severe overfitting: training accuracy reaches 98% while test accuracy plateaus around 62%.
Strategy 2: Feature Extraction (Frozen Backbone)
We use a pretrained ResNet-18, freeze all convolutional layers, and only train a new classification head:
```python
def create_feature_extractor(num_classes: int = 6) -> nn.Module:
    """ResNet-18 with frozen backbone, new classification head."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Freeze all parameters
    for param in model.parameters():
        param.requires_grad = False
    # Replace the final FC layer (the new layers are unfrozen by default)
    model.fc = nn.Sequential(
        nn.Linear(512, 128),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(128, num_classes),
    )
    return model


model = create_feature_extractor()
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
# Only the head parameters are updated
# Result: ~78% test accuracy after 30 epochs
```
Feature extraction achieves 78% accuracy with very fast training (only the head is trained). The pretrained features capture general visual patterns (edges, textures, shapes) that transfer surprisingly well to aerial imagery.
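One subtlety the snippet above glosses over: even with requires_grad=False, BatchNorm layers still update their running statistics whenever the model is in training mode, so the "frozen" backbone drifts slowly toward the aerial data. Whether that helps or hurts is an empirical question; a common refinement (not part of the case study code) is to keep the frozen BatchNorm layers in eval mode:

```python
def freeze_batchnorm_stats(model: nn.Module) -> None:
    """Put every BatchNorm layer in eval mode so its ImageNet running
    statistics are preserved. Call this after model.train(), which would
    otherwise switch the BatchNorm layers back to training mode."""
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.eval()
```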
Strategy 3: Fine-Tuning Last Layers
We unfreeze the last residual block and the classification head, using discriminative learning rates:
```python
def create_partial_finetune(num_classes: int = 6) -> tuple[nn.Module, optim.Optimizer]:
    """ResNet-18 with last block and head unfrozen."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Freeze everything
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze layer4 (last residual block)
    for param in model.layer4.parameters():
        param.requires_grad = True
    # New classification head
    model.fc = nn.Sequential(
        nn.Linear(512, 128),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(128, num_classes),
    )
    # Discriminative learning rates
    optimizer = optim.AdamW([
        {"params": model.layer4.parameters(), "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-3},
    ], weight_decay=0.01)
    return model, optimizer


# Result: ~84% test accuracy after 50 epochs
```
Fine-tuning the last block allows the model to adapt high-level features to the aerial domain while preserving general low-level features.
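A quick sanity check (not in the original listing) confirms that only the intended parameters will be updated:

```python
model, optimizer = create_partial_finetune()
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
# Every entry should start with "layer4." or "fc."
print(f"{len(trainable)} trainable tensors, e.g. {trainable[0]} ... {trainable[-1]}")
```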
Strategy 4: Full Fine-Tuning with Gradual Unfreezing
The most sophisticated approach gradually unfreezes the network from the head backward:
```python
def gradual_unfreeze_training(
    model: nn.Module,
    train_loader,
    val_loader,
    num_classes: int = 6,
) -> None:
    """Train with gradual unfreezing (train_epochs is sketched below)."""
    # Phase 1: Train only the head (10 epochs)
    for param in model.parameters():
        param.requires_grad = False
    for param in model.fc.parameters():
        param.requires_grad = True
    optimizer = optim.AdamW(model.fc.parameters(), lr=1e-3)
    train_epochs(model, optimizer, train_loader, 10)

    # Phase 2: Unfreeze layer4 (15 epochs)
    for param in model.layer4.parameters():
        param.requires_grad = True
    optimizer = optim.AdamW([
        {"params": model.layer4.parameters(), "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 5e-4},
    ], weight_decay=0.01)
    train_epochs(model, optimizer, train_loader, 15)

    # Phase 3: Unfreeze everything (25 epochs)
    for param in model.parameters():
        param.requires_grad = True
    optimizer = optim.AdamW([
        # Include the stem (conv1 + bn1) so every unfrozen parameter is optimized
        {"params": list(model.conv1.parameters()) + list(model.bn1.parameters()), "lr": 1e-5},
        {"params": model.layer1.parameters(), "lr": 1e-5},
        {"params": model.layer2.parameters(), "lr": 5e-5},
        {"params": model.layer3.parameters(), "lr": 1e-4},
        {"params": model.layer4.parameters(), "lr": 5e-4},
        {"params": model.fc.parameters(), "lr": 1e-3},
    ], weight_decay=0.01)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=25, eta_min=1e-6
    )
    train_epochs(model, optimizer, train_loader, 25, scheduler=scheduler)


# Result: ~88% test accuracy
```
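The phased routine above calls a train_epochs helper that is not shown in the case study. A minimal sketch consistent with how it is called (name, signature, and loop details are assumptions):

```python
def train_epochs(model, optimizer, train_loader, num_epochs, scheduler=None):
    """Plain supervised training loop with cross-entropy loss."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    criterion = nn.CrossEntropyLoss()
    model.to(device)
    model.train()
    for _ in range(num_epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        if scheduler is not None:
            scheduler.step()  # e.g. one cosine-annealing step per epoch
```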
Results Summary
| Strategy | Test Accuracy | Training Epochs | Trainable Params |
|---|---|---|---|
| From scratch | 62% | 100 | ~0.4M (all) |
| Feature extraction | 78% | 30 | ~66K (head only) |
| Fine-tune last block | 84% | 50 | ~8.5M (layer4 + head) |
| Gradual unfreezing | 88% | 50 (phased) | ~11.2M (all, gradually) |
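The trainable-parameter column can be reproduced directly from the models defined above; the helper below is illustrative rather than part of the case study code:

```python
def count_trainable(model: nn.Module) -> int:
    """Number of parameters that currently receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_trainable(create_feature_extractor()))    # ~66K: the new head only
print(count_trainable(create_partial_finetune()[0]))  # ~8.5M: layer4 + head
```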
Analysis
Why Transfer Learning Works for Aerial Imagery
Despite the visual difference between ImageNet (everyday objects) and aerial imagery, transfer learning provides a 26-percentage-point accuracy improvement over training from scratch (62% to 88%). This works because:
- Low-level features are universal: Edge detectors, texture filters, and color blob detectors learned on ImageNet are equally useful for aerial imagery.
- Mid-level features partially transfer: Patterns like corners, junctions, and repetitive textures appear in both domains.
- High-level features need adaptation: Object-level features (faces, cars) do not transfer, which is why fine-tuning the last layers helps most. A simple way to probe this empirically is sketched below.
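One way to make this concrete (not one of the case study's experiments) is layer-wise linear probing: extract globally pooled activations at several depths of the frozen backbone and fit a linear classifier on each. Accuracy typically keeps improving through the early and middle blocks and flattens where transfer breaks down. A sketch of the feature-extraction step, using torchvision's ResNet-18 module names; frozen_resnet and images are hypothetical variables:

```python
def features_at(model: nn.Module, stop_after: str, x: torch.Tensor) -> torch.Tensor:
    """Globally pooled activations after the named ResNet-18 stage.
    Call model.eval() first so BatchNorm uses its ImageNet statistics."""
    stages = ["conv1", "bn1", "relu", "maxpool",
              "layer1", "layer2", "layer3", "layer4"]
    with torch.no_grad():
        for name in stages:
            x = getattr(model, name)(x)
            if name == stop_after:
                break
    return torch.flatten(nn.functional.adaptive_avg_pool2d(x, 1), 1)

# e.g. feats = features_at(frozen_resnet, "layer2", images)
# then fit a linear classifier on feats at each depth and compare accuracies.
```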
Discriminative Learning Rates
Using lower learning rates for earlier layers is critical because:
- Early layers contain general features that should change slowly (or not at all).
- Later layers need to adapt to the new domain and can tolerate larger updates.
- The new head has randomly initialized weights and needs the largest learning rate.
A typical ratio is 10x between adjacent groups: head at 1e-3, last block at 1e-4, earlier blocks at 1e-5.
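The 10x rule is easy to encode once and reuse. A sketch that builds the three groups described above for a torchvision ResNet-18 with a replacement head (the helper name is ours, not from the case study):

```python
def discriminative_param_groups(model: nn.Module, head_lr: float = 1e-3) -> list[dict]:
    """Head at head_lr, last block at head_lr/10, all earlier layers at head_lr/100."""
    head = list(model.fc.parameters())
    last_block = list(model.layer4.parameters())
    covered = {id(p) for p in head + last_block}
    early = [p for p in model.parameters() if id(p) not in covered]
    return [
        {"params": head, "lr": head_lr},
        {"params": last_block, "lr": head_lr / 10},
        {"params": early, "lr": head_lr / 100},
    ]

# model: a ResNet-18 whose fc has already been replaced, as in the strategies above
optimizer = optim.AdamW(discriminative_param_groups(model), weight_decay=0.01)
```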
When to Use Each Strategy
| Scenario | Best Strategy |
|---|---|
| Very small dataset (<500 images) | Feature extraction |
| Small dataset, similar domain | Fine-tune last layers |
| Small dataset, different domain | Fine-tune last layers with augmentation |
| Medium dataset (5K-50K images) | Full fine-tuning with gradual unfreezing |
| Large dataset (>50K images) | Full fine-tuning from the start |
Key Takeaways
- Transfer learning should be the default for any image classification task with fewer than 50,000 images. Training from scratch is almost never the right choice.
- Gradual unfreezing with discriminative learning rates consistently outperforms the other strategies, though it requires more engineering effort.
- ImageNet features transfer broadly, even to domains as different as satellite imagery, medical imaging, and industrial inspection.
- Data augmentation is even more critical with transfer learning on small datasets, as it prevents the fine-tuned layers from overfitting.
- The choice of pretrained model matters less than the choice of strategy. A well fine-tuned ResNet-18 often outperforms a poorly fine-tuned ResNet-50.
Discussion Questions
- Would you expect transfer learning from ImageNet to work for non-RGB images (e.g., infrared, multispectral)? How would you adapt the approach?
- If you had 100,000 unlabeled aerial images and 2,400 labeled ones, how would you modify this approach to leverage the unlabeled data?
- How would you decide between using ResNet-18, ResNet-50, or EfficientNet as the pretrained backbone?
- The gradual unfreezing approach used three phases. Would more or fewer phases change the results significantly?