Case Study 2: Transfer Learning for Domain-Specific Image Classification
Overview
In this case study, we apply transfer learning to classify images in a domain very different from ImageNet: aerial satellite imagery for land-use classification. We compare three transfer learning strategies (feature extraction, fine-tuning the last layers, and fine-tuning the entire network) against a from-scratch baseline, and demonstrate that transfer learning from ImageNet provides substantial benefits even when the target domain is visually dissimilar.
This case study brings together CNN architectures (Section 14.6), transfer learning techniques (Section 14.8), and training best practices from Chapter 12.
Problem Statement
Our task is to classify aerial images into six land-use categories: agricultural, commercial, industrial, residential, forest, and water. The dataset contains 2,400 images (400 per class), each a 64x64-pixel RGB image. This is small by deep learning standards, which makes transfer learning essential.
Our goals:
1. Compare training from scratch against three transfer learning strategies.
2. Demonstrate discriminative learning rates and gradual unfreezing.
3. Achieve the highest possible accuracy on a held-out test set.
4. Analyze which pretrained features transfer to aerial imagery.
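Before comparing strategies we need a data pipeline. The directory layout and exact splits are not specified in the case study, so the sketch below assumes an ImageFolder-style layout under a hypothetical data/landuse/ directory. For the pretrained backbones, the 64x64 images are resized to 224x224 and normalized with ImageNet statistics; the from-scratch baseline can keep the native 64x64 resolution.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),      # pretrained ResNet-18 expects ImageNet-scale inputs
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),    # aerial scenes have no canonical "up"
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
test_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Hypothetical paths; one subdirectory per land-use class.
train_ds = datasets.ImageFolder("data/landuse/train", transform=train_tf)
test_ds = datasets.ImageFolder("data/landuse/test", transform=test_tf)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4)
test_loader = DataLoader(test_ds, batch_size=64, shuffle=False, num_workers=4)
```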
Strategy 1: Training from Scratch (Baseline)
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models

torch.manual_seed(42)


class SmallCNN(nn.Module):
    """A small CNN designed for 64x64 images."""

    def __init__(self, num_classes: int = 6) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256),
            nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)


# Result: ~62% test accuracy after 100 epochs
```
Training from scratch with only 2,400 images leads to severe overfitting: training accuracy reaches 98% while test accuracy plateaus around 62%.
Strategy 2: Feature Extraction (Frozen Backbone)
We use a pretrained ResNet-18, freeze all convolutional layers, and only train a new classification head:
```python
def create_feature_extractor(num_classes: int = 6) -> nn.Module:
    """ResNet-18 with frozen backbone, new classification head."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Freeze all parameters
    for param in model.parameters():
        param.requires_grad = False
    # Replace the final FC layer (the new layers are unfrozen by default)
    model.fc = nn.Sequential(
        nn.Linear(512, 128),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(128, num_classes),
    )
    return model


model = create_feature_extractor()
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
# Only the head parameters are updated
# Result: ~78% test accuracy after 30 epochs
```
Feature extraction achieves 78% accuracy with very fast training (only the head is trained). The pretrained features capture general visual patterns (edges, textures, shapes) that transfer surprisingly well to aerial imagery.
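One subtlety the snippet above glosses over: even with requires_grad=False, BatchNorm layers still update their running statistics whenever the model is in training mode, so the "frozen" backbone drifts slowly toward the aerial data. Whether that helps or hurts is an empirical question; a common refinement (not part of the case study code) is to keep the frozen BatchNorm layers in eval mode:

```python
def freeze_batchnorm_stats(model: nn.Module) -> None:
    """Put every BatchNorm layer in eval mode so its ImageNet running
    statistics are preserved. Call this after model.train(), which would
    otherwise switch the BatchNorm layers back to training mode."""
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.eval()
```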
Strategy 3: Fine-Tuning Last Layers
We unfreeze the last residual block and the classification head, using discriminative learning rates:
```python
def create_partial_finetune(num_classes: int = 6) -> tuple[nn.Module, optim.Optimizer]:
    """ResNet-18 with last block and head unfrozen."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Freeze everything
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze layer4 (last residual block)
    for param in model.layer4.parameters():
        param.requires_grad = True
    # New classification head
    model.fc = nn.Sequential(
        nn.Linear(512, 128),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(128, num_classes),
    )
    # Discriminative learning rates
    optimizer = optim.AdamW([
        {"params": model.layer4.parameters(), "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-3},
    ], weight_decay=0.01)
    return model, optimizer


# Result: ~84% test accuracy after 50 epochs
```
Fine-tuning the last block allows the model to adapt high-level features to the aerial domain while preserving general low-level features.
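A quick sanity check (not in the original listing) confirms that only the intended parameters will be updated:

```python
model, optimizer = create_partial_finetune()
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
# Every entry should start with "layer4." or "fc."
print(f"{len(trainable)} trainable tensors, e.g. {trainable[0]} ... {trainable[-1]}")
```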
Strategy 4: Full Fine-Tuning with Gradual Unfreezing
The most sophisticated approach gradually unfreezes the network from the head backward:
```python
def gradual_unfreeze_training(
    model: nn.Module,
    train_loader,
    val_loader,
    num_classes: int = 6,
) -> None:
    """Train with gradual unfreezing (train_epochs is sketched below)."""
    # Phase 1: Train only the head (10 epochs)
    for param in model.parameters():
        param.requires_grad = False
    for param in model.fc.parameters():
        param.requires_grad = True
    optimizer = optim.AdamW(model.fc.parameters(), lr=1e-3)
    train_epochs(model, optimizer, train_loader, 10)

    # Phase 2: Unfreeze layer4 (15 epochs)
    for param in model.layer4.parameters():
        param.requires_grad = True
    optimizer = optim.AdamW([
        {"params": model.layer4.parameters(), "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 5e-4},
    ], weight_decay=0.01)
    train_epochs(model, optimizer, train_loader, 15)

    # Phase 3: Unfreeze everything (25 epochs)
    for param in model.parameters():
        param.requires_grad = True
    optimizer = optim.AdamW([
        # Include the stem (conv1 + bn1) so every unfrozen parameter is optimized
        {"params": list(model.conv1.parameters()) + list(model.bn1.parameters()), "lr": 1e-5},
        {"params": model.layer1.parameters(), "lr": 1e-5},
        {"params": model.layer2.parameters(), "lr": 5e-5},
        {"params": model.layer3.parameters(), "lr": 1e-4},
        {"params": model.layer4.parameters(), "lr": 5e-4},
        {"params": model.fc.parameters(), "lr": 1e-3},
    ], weight_decay=0.01)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=25, eta_min=1e-6
    )
    train_epochs(model, optimizer, train_loader, 25, scheduler=scheduler)


# Result: ~88% test accuracy
```
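The phased routine above calls a train_epochs helper that is not shown in the case study. A minimal sketch consistent with how it is called (name, signature, and loop details are assumptions):

```python
def train_epochs(model, optimizer, train_loader, num_epochs, scheduler=None):
    """Plain supervised training loop with cross-entropy loss."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    criterion = nn.CrossEntropyLoss()
    model.to(device)
    model.train()
    for _ in range(num_epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        if scheduler is not None:
            scheduler.step()  # e.g. one cosine-annealing step per epoch
```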
Results Summary
| Strategy | Test Accuracy | Training Epochs | Trainable Params |
|---|---|---|---|
| From scratch | 62% | 100 | ~0.4M (all) |
| Feature extraction | 78% | 30 | ~66K (head only) |
| Fine-tune last block | 84% | 50 | ~8.5M (layer4 + head) |
| Gradual unfreezing | 88% | 50 (phased) | ~11.2M (all, gradually) |
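The trainable-parameter column can be reproduced directly from the models defined above; the helper below is illustrative rather than part of the case study code:

```python
def count_trainable(model: nn.Module) -> int:
    """Number of parameters that currently receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_trainable(create_feature_extractor()))    # ~66K: the new head only
print(count_trainable(create_partial_finetune()[0]))  # ~8.5M: layer4 + head
```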
Analysis
Why Transfer Learning Works for Aerial Imagery
Despite the visual difference between ImageNet (everyday objects) and aerial imagery, transfer learning provides a 26-percentage-point accuracy improvement over training from scratch (62% to 88%). This works because:
- Low-level features are universal: Edge detectors, texture filters, and color blob detectors learned on ImageNet are equally useful for aerial imagery.
- Mid-level features partially transfer: Patterns like corners, junctions, and repetitive textures appear in both domains.
- High-level features need adaptation: Object-level features (faces, cars) do not transfer, which is why fine-tuning the last layers helps most. A simple way to probe this empirically is sketched below.
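One way to make this concrete (not one of the case study's experiments) is layer-wise linear probing: extract globally pooled activations at several depths of the frozen backbone and fit a linear classifier on each. Accuracy typically keeps improving through the early and middle blocks and flattens where transfer breaks down. A sketch of the feature-extraction step, using torchvision's ResNet-18 module names; frozen_resnet and images are hypothetical variables:

```python
def features_at(model: nn.Module, stop_after: str, x: torch.Tensor) -> torch.Tensor:
    """Globally pooled activations after the named ResNet-18 stage.
    Call model.eval() first so BatchNorm uses its ImageNet statistics."""
    stages = ["conv1", "bn1", "relu", "maxpool",
              "layer1", "layer2", "layer3", "layer4"]
    with torch.no_grad():
        for name in stages:
            x = getattr(model, name)(x)
            if name == stop_after:
                break
    return torch.flatten(nn.functional.adaptive_avg_pool2d(x, 1), 1)

# e.g. feats = features_at(frozen_resnet, "layer2", images)
# then fit a linear classifier on feats at each depth and compare accuracies.
```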
Discriminative Learning Rates
Using lower learning rates for earlier layers is critical because:
- Early layers contain general features that should change slowly (or not at all).
- Later layers need to adapt to the new domain and can tolerate larger updates.
- The new head has randomly initialized weights and needs the largest learning rate.
A typical ratio is 10x between adjacent groups: head at 1e-3, last block at 1e-4, earlier blocks at 1e-5.
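The 10x rule is easy to encode once and reuse. A sketch that builds the three groups described above for a torchvision ResNet-18 with a replacement head (the helper name is ours, not from the case study):

```python
def discriminative_param_groups(model: nn.Module, head_lr: float = 1e-3) -> list[dict]:
    """Head at head_lr, last block at head_lr/10, all earlier layers at head_lr/100."""
    head = list(model.fc.parameters())
    last_block = list(model.layer4.parameters())
    covered = {id(p) for p in head + last_block}
    early = [p for p in model.parameters() if id(p) not in covered]
    return [
        {"params": head, "lr": head_lr},
        {"params": last_block, "lr": head_lr / 10},
        {"params": early, "lr": head_lr / 100},
    ]

# model: a ResNet-18 whose fc has already been replaced, as in the strategies above
optimizer = optim.AdamW(discriminative_param_groups(model), weight_decay=0.01)
```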
When to Use Each Strategy
| Scenario | Best Strategy |
|---|---|
| Very small dataset (<500 images) | Feature extraction |
| Small dataset, similar domain | Fine-tune last layers |
| Small dataset, different domain | Fine-tune last layers with augmentation |
| Medium dataset (5K-50K images) | Full fine-tuning with gradual unfreezing |
| Large dataset (>50K images) | Full fine-tuning from the start |
Key Takeaways
- Transfer learning should be the default for any image classification task with fewer than 50,000 images. Training from scratch is almost never the right choice.
- Gradual unfreezing with discriminative learning rates consistently outperforms the other strategies, though it requires more engineering effort.
- ImageNet features transfer broadly, even to domains as different as satellite imagery, medical imaging, and industrial inspection.
- Data augmentation is even more critical with transfer learning on small datasets, as it prevents the fine-tuned layers from overfitting.
- The choice of pretrained model matters less than the choice of strategy. A well fine-tuned ResNet-18 often outperforms a poorly fine-tuned ResNet-50.
Discussion Questions
- Would you expect transfer learning from ImageNet to work for non-RGB images (e.g., infrared, multispectral)? How would you adapt the approach?
- If you had 100,000 unlabeled aerial images and 2,400 labeled ones, how would you modify this approach to leverage the unlabeled data?
- How would you decide between using ResNet-18, ResNet-50, or EfficientNet as the pretrained backbone?
- The gradual unfreezing approach used three phases. Would more or fewer phases change the results significantly?