Case Study 1: Sentiment Analysis with LSTMs

Overview

In this case study, we build a complete sentiment analysis pipeline using an LSTM-based classifier. We will classify movie reviews from the IMDB dataset as positive or negative, progressing through data preprocessing, model design, training, evaluation, and interpretation.

Sentiment analysis is one of the most natural applications of RNNs for text: it requires understanding the sequential structure of language, capturing negation ("not good"), sarcasm ("oh, how wonderful... not"), and long-range contextual cues. This many-to-one task illustrates how an LSTM can compress an entire sequence into a meaningful representation for classification.


Problem Definition

Task: Given a movie review (a variable-length sequence of words), classify it as positive (1) or negative (0).

Dataset: IMDB Movie Reviews --- 50,000 reviews split evenly into 25,000 training and 25,000 test reviews, with balanced positive/negative labels.

Evaluation Metric: Accuracy, with additional analysis of precision, recall, and F1 score.

Baseline: A bag-of-words logistic regression model typically achieves ~87% accuracy on this dataset.
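
To make this baseline concrete, the sketch below reproduces it under two assumptions that go beyond the text: the reviews are fetched with the Hugging Face datasets package (any other copy of the standard 25,000/25,000 split works just as well), and scikit-learn supplies the bag-of-words features and the logistic regression. The exact score varies slightly with the feature settings.

"""Sketch: bag-of-words logistic regression baseline for IMDB."""

from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Load the standard 25k/25k IMDB split.
imdb = load_dataset("imdb")
train_texts, train_labels = imdb["train"]["text"], imdb["train"]["label"]
test_texts, test_labels = imdb["test"]["text"], imdb["test"]["label"]

# Bag-of-words features: unigram counts over a capped vocabulary.
vectorizer = CountVectorizer(max_features=25000, lowercase=True)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Logistic regression on the sparse count matrix.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, train_labels)
print(f"Baseline accuracy: {baseline.score(X_test, test_labels):.3f}")  # typically around 0.87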


Data Preprocessing

Tokenization and Vocabulary Building

Text data requires several preprocessing steps before being fed to an RNN:

"""Sentiment analysis data preprocessing pipeline.

Handles tokenization, vocabulary construction, and DataLoader creation
for the IMDB sentiment analysis task.
"""

import re
from collections import Counter

import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(42)

# Special tokens
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"
PAD_IDX = 0
UNK_IDX = 1


def tokenize(text: str) -> list[str]:
    """Simple whitespace tokenizer with basic cleaning.

    Args:
        text: Raw text string.

    Returns:
        List of lowercase tokens.
    """
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)  # Remove HTML breaks
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # Keep only letters and whitespace
    return text.split()


class Vocabulary:
    """Word-to-index mapping with special tokens.

    Args:
        max_size: Maximum vocabulary size (excluding special tokens).
        min_freq: Minimum word frequency for inclusion.
    """

    def __init__(
        self, max_size: int = 25000, min_freq: int = 2
    ) -> None:
        self.max_size = max_size
        self.min_freq = min_freq
        self.word2idx: dict[str, int] = {
            PAD_TOKEN: PAD_IDX,
            UNK_TOKEN: UNK_IDX,
        }
        self.idx2word: dict[int, str] = {
            PAD_IDX: PAD_TOKEN,
            UNK_IDX: UNK_TOKEN,
        }

    def build(self, texts: list[list[str]]) -> None:
        """Build vocabulary from tokenized texts.

        Args:
            texts: List of tokenized documents.
        """
        counter = Counter()
        for tokens in texts:
            counter.update(tokens)

        # Filter by frequency and take top max_size
        common = [
            word for word, count in counter.most_common(self.max_size)
            if count >= self.min_freq
        ]

        for word in common:
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def encode(self, tokens: list[str]) -> list[int]:
        """Convert tokens to indices.

        Args:
            tokens: List of word tokens.

        Returns:
            List of integer indices.
        """
        return [
            self.word2idx.get(token, UNK_IDX) for token in tokens
        ]

    def __len__(self) -> int:
        return len(self.word2idx)


class IMDBDataset(Dataset):
    """PyTorch dataset for IMDB reviews.

    Args:
        encodings: List of encoded (integer) sequences.
        labels: List of integer labels (0 or 1).
        max_len: Maximum sequence length (truncation).
    """

    def __init__(
        self,
        encodings: list[list[int]],
        labels: list[int],
        max_len: int = 300,
    ) -> None:
        self.encodings = encodings
        self.labels = labels
        self.max_len = max_len

    def __len__(self) -> int:
        return len(self.encodings)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, int, int]:
        """Return encoded sequence, label, and original length.

        Args:
            idx: Sample index.

        Returns:
            Tuple of (token_ids, label, length).
        """
        tokens = self.encodings[idx][:self.max_len]
        length = len(tokens)
        return torch.tensor(tokens, dtype=torch.long), self.labels[idx], length


def collate_fn(
    batch: list[tuple[torch.Tensor, int, int]]
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Custom collation function for variable-length sequences.

    Args:
        batch: List of (token_ids, label, length) tuples.

    Returns:
        Tuple of (padded_sequences, labels, lengths).
    """
    sequences, labels, lengths = zip(*batch)
    padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
    return (
        padded,
        torch.tensor(labels, dtype=torch.float),
        torch.tensor(lengths, dtype=torch.long),
    )
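
A short sketch of how these pieces fit together. The names train_texts, train_labels, test_texts, and test_labels are assumed to hold the raw reviews and integer labels (for example from the loading sketch in the Problem Definition), the batch size is illustrative, and a validation loader can be built the same way from a held-out slice of the training reviews.

# Sketch: build the vocabulary on the training split only, then wrap
# everything in DataLoaders with the custom collate function.
train_tokens = [tokenize(text) for text in train_texts]
test_tokens = [tokenize(text) for text in test_texts]

vocab = Vocabulary(max_size=25000, min_freq=2)
vocab.build(train_tokens)  # fit the vocabulary on training data only

train_dataset = IMDBDataset(
    [vocab.encode(tokens) for tokens in train_tokens], train_labels
)
test_dataset = IMDBDataset(
    [vocab.encode(tokens) for tokens in test_tokens], test_labels
)

train_loader = DataLoader(
    train_dataset, batch_size=64, shuffle=True, collate_fn=collate_fn
)
test_loader = DataLoader(
    test_dataset, batch_size=64, shuffle=False, collate_fn=collate_fn
)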

Model Architecture

Our sentiment classifier uses the following architecture:

  1. Embedding layer: Maps word indices to dense vectors
  2. Bidirectional LSTM: Processes the sequence in both directions
  3. Attention pooling: Weighted combination of LSTM outputs (better than using just the final state)
  4. Classification head: Fully connected layers with dropout

"""LSTM-based sentiment analysis model with attention pooling."""

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(42)


class SelfAttention(nn.Module):
    """Simple self-attention mechanism for sequence pooling.

    Args:
        hidden_size: Size of the input features.
    """

    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.Tanh(),
            nn.Linear(hidden_size // 2, 1, bias=False),
        )

    def forward(
        self, lstm_output: torch.Tensor, mask: torch.Tensor
    ) -> torch.Tensor:
        """Compute attention-weighted sum of LSTM outputs.

        Args:
            lstm_output: LSTM outputs, shape (batch, seq_len, hidden).
            mask: Boolean mask, True for padding, shape (batch, seq_len).

        Returns:
            Weighted representation, shape (batch, hidden).
        """
        scores = self.attention(lstm_output).squeeze(-1)  # (batch, seq_len)
        scores = scores.masked_fill(mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)  # (batch, seq_len)
        weighted = torch.bmm(
            weights.unsqueeze(1), lstm_output
        ).squeeze(1)
        return weighted


class SentimentLSTM(nn.Module):
    """Bidirectional LSTM for sentiment analysis.

    Args:
        vocab_size: Size of the vocabulary.
        embed_dim: Word embedding dimensionality.
        hidden_size: LSTM hidden state size.
        num_layers: Number of stacked LSTM layers.
        dropout: Dropout rate for regularization.
        pad_idx: Padding token index.
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 300,
        hidden_size: int = 256,
        num_layers: int = 2,
        dropout: float = 0.5,
        pad_idx: int = 0,
    ) -> None:
        super().__init__()
        self.embedding = nn.Embedding(
            vocab_size, embed_dim, padding_idx=pad_idx
        )
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0.0,
        )
        self.attention = SelfAttention(hidden_size * 2)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),
        )

    def forward(
        self, text: torch.Tensor, lengths: torch.Tensor
    ) -> torch.Tensor:
        """Forward pass for sentiment classification.

        Args:
            text: Padded token indices, shape (batch, seq_len).
            lengths: Actual sequence lengths, shape (batch,).

        Returns:
            Logits for binary classification, shape (batch,).
        """
        embedded = self.embedding(text)

        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        lstm_out, _ = self.lstm(packed)
        lstm_out, _ = pad_packed_sequence(lstm_out, batch_first=True)

        # Create padding mask
        mask = torch.arange(text.size(1), device=text.device).unsqueeze(0)
        mask = mask >= lengths.unsqueeze(1)

        # Attention pooling
        pooled = self.attention(lstm_out, mask)

        logits = self.classifier(pooled).squeeze(-1)
        return logits
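
A quick sanity check of the architecture, assuming the vocabulary built during preprocessing; the dummy batch below is purely illustrative and only verifies output shapes.

# Sketch: instantiate the classifier and check shapes on a dummy batch.
model = SentimentLSTM(vocab_size=len(vocab), embed_dim=300, hidden_size=256)

dummy_text = torch.randint(2, len(vocab), (4, 50))  # (batch=4, seq_len=50), avoids PAD/UNK ids
dummy_lengths = torch.tensor([50, 42, 30, 17])      # true lengths; sorting is not required
logits = model(dummy_text, dummy_lengths)
print(logits.shape)  # torch.Size([4]) -- one logit per review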

Training Pipeline

"""Training loop for sentiment analysis model."""

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader

torch.manual_seed(42)


def train_epoch(
    model: nn.Module,
    dataloader: DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
    clip_norm: float = 5.0,
    device: str = "cuda",
) -> tuple[float, float]:
    """Train model for one epoch.

    Args:
        model: The sentiment classifier.
        dataloader: Training data loader.
        optimizer: The optimizer.
        criterion: Loss function (BCEWithLogitsLoss).
        clip_norm: Maximum gradient norm for clipping.
        device: Device to train on.

    Returns:
        Tuple of (average_loss, accuracy).
    """
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0

    for text, labels, lengths in dataloader:
        text = text.to(device)
        labels = labels.to(device)
        lengths = lengths.to(device)

        optimizer.zero_grad()
        logits = model(text, lengths)
        loss = criterion(logits, labels)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()

        total_loss += loss.item() * text.size(0)
        predictions = (torch.sigmoid(logits) > 0.5).float()
        correct += (predictions == labels).sum().item()
        total += text.size(0)

    return total_loss / total, correct / total


def evaluate(
    model: nn.Module,
    dataloader: DataLoader,
    criterion: nn.Module,
    device: str = "cuda",
) -> tuple[float, float]:
    """Evaluate model on a dataset.

    Args:
        model: The sentiment classifier.
        dataloader: Evaluation data loader.
        criterion: Loss function.
        device: Device for evaluation.

    Returns:
        Tuple of (average_loss, accuracy).
    """
    model.eval()
    total_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for text, labels, lengths in dataloader:
            text = text.to(device)
            labels = labels.to(device)
            lengths = lengths.to(device)

            logits = model(text, lengths)
            loss = criterion(logits, labels)

            total_loss += loss.item() * text.size(0)
            predictions = (torch.sigmoid(logits) > 0.5).float()
            correct += (predictions == labels).sum().item()
            total += text.size(0)

    return total_loss / total, correct / total
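
The outer loop below ties these two functions together with the Adam optimizer and ReduceLROnPlateau scheduler imported above. It is a sketch: train_loader and val_loader are assumed to be DataLoaders over the training reviews and a held-out validation slice, and the checkpoint filename is illustrative.

# Sketch of the outer training loop; loader names and the checkpoint path
# are assumptions, not part of the reference implementation.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentimentLSTM(vocab_size=len(vocab)).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.5, patience=1)

best_val_acc = 0.0
for epoch in range(1, 11):
    train_loss, train_acc = train_epoch(
        model, train_loader, optimizer, criterion, device=device
    )
    val_loss, val_acc = evaluate(model, val_loader, criterion, device=device)
    scheduler.step(val_acc)  # reduce the learning rate when validation accuracy plateaus

    if val_acc > best_val_acc:  # keep the best checkpoint
        best_val_acc = val_acc
        torch.save(model.state_dict(), "sentiment_lstm_best.pt")

    print(
        f"Epoch {epoch}: train loss {train_loss:.4f}, train acc {train_acc:.3f}, "
        f"val loss {val_loss:.4f}, val acc {val_acc:.3f}"
    )

Because the Problem Definition also calls for precision, recall, and F1, a small helper in the same style as evaluate can accumulate the confusion-matrix counts over a loader (again a sketch):

def binary_metrics(
    model: nn.Module, dataloader: DataLoader, device: str = "cuda"
) -> dict[str, float]:
    """Compute precision, recall, and F1 for the positive class."""
    model.eval()
    tp = fp = fn = 0
    with torch.no_grad():
        for text, labels, lengths in dataloader:
            text, labels = text.to(device), labels.to(device)
            preds = (torch.sigmoid(model(text, lengths.to(device))) > 0.5).float()
            tp += ((preds == 1) & (labels == 1)).sum().item()
            fp += ((preds == 1) & (labels == 0)).sum().item()
            fn += ((preds == 0) & (labels == 1)).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}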

Results and Analysis

Training Results

After 10 epochs of training with the configuration above, we observe:

Epoch    Train Loss    Train Acc    Val Loss    Val Acc
  1        0.6421        63.2%       0.5183      75.1%
  2        0.4312        80.5%       0.3876      83.4%
  3        0.3201        86.3%       0.3214      86.9%
  5        0.2105        91.8%       0.2891      88.7%
  8        0.1342        95.1%       0.3102      89.1%
 10        0.0987        96.4%       0.3356      88.9%

Best validation accuracy: 89.1% at epoch 8, surpassing the bag-of-words baseline (87%). Note that validation loss bottoms out around epoch 5 and then climbs while training loss keeps falling, a clear sign of overfitting; checkpointing on validation accuracy (effectively early stopping) is what selects the epoch-8 model.

Attention Analysis

One advantage of using attention pooling is interpretability. By examining the attention weights, we can see which words the model focuses on for its prediction:

Example (Positive review): "This film is absolutely brilliant with stunning performances and a beautiful story." - High attention words: "brilliant" (0.23), "stunning" (0.18), "beautiful" (0.15)

Example (Negative review): "The movie was terrible, with awful acting and a boring plot that went nowhere." - High attention words: "terrible" (0.25), "awful" (0.19), "boring" (0.16)
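
The model's forward method does not return these attention weights, but they can be recovered for inspection by rerunning the encoder steps of a trained model. The helper below is a sketch that reuses the model's own submodules; the example review, and the vocab and model objects it relies on, are illustrative.

"""Sketch: recover per-token attention weights from a trained model."""

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


def attention_weights(
    model: SentimentLSTM, text: torch.Tensor, lengths: torch.Tensor
) -> torch.Tensor:
    """Recompute attention weights, mirroring SentimentLSTM.forward.

    Returns weights of shape (batch, seq_len).
    """
    model.eval()
    with torch.no_grad():
        embedded = model.embedding(text)
        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        lstm_out, _ = model.lstm(packed)
        lstm_out, _ = pad_packed_sequence(lstm_out, batch_first=True)
        mask = torch.arange(text.size(1), device=text.device).unsqueeze(0)
        mask = mask >= lengths.unsqueeze(1)
        scores = model.attention.attention(lstm_out).squeeze(-1)
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=1)


# Illustrative usage on a single review (move the tensors to the model's
# device first if training ran on a GPU).
tokens = tokenize("This film is absolutely brilliant with stunning performances")
ids = torch.tensor([vocab.encode(tokens)], dtype=torch.long)
weights = attention_weights(model, ids, torch.tensor([len(tokens)]))
for token, weight in zip(tokens, weights[0].tolist()):
    print(f"{token:>15s}  {weight:.3f}")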

Error Analysis

Common failure modes include:

  1. Negation: "This movie is not bad" --- the model sometimes misses the negation, focusing on "bad"
  2. Sarcasm: "Oh sure, what a great use of two hours" --- sarcasm is contextual and hard to detect
  3. Mixed sentiment: "Great acting but terrible plot" --- conflicting signals confuse the model
  4. Domain shift: Reviews using unusual vocabulary or writing styles

Ablation Study

We conduct an ablation study to understand the contribution of each component:

Configuration                                  Val Accuracy
Full model (BiLSTM + Attention)                89.1%
Unidirectional LSTM + Attention                87.8%
BiLSTM + Last hidden state (no attention)      88.2%
BiLSTM + Mean pooling                          88.6%
Single-layer BiLSTM + Attention                88.3%
GRU instead of LSTM                            88.8%

Key findings:

  - Bidirectionality adds ~1.3% accuracy.
  - Attention pooling adds ~0.9% over the last hidden state.
  - The GRU variant performs nearly as well with faster training.
  - Depth (2 layers vs. 1) provides a modest improvement.
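
For reference, the mean-pooling ablation replaces the attention module with a masked average over the non-padded positions; a minimal sketch of that pooling step, using the same mask convention as SelfAttention (True at padded positions):

def mean_pool(lstm_output: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked mean over time; drop-in replacement for attention pooling."""
    keep = (~mask).unsqueeze(-1).float()            # (batch, seq_len, 1)
    summed = (lstm_output * keep).sum(dim=1)        # (batch, 2 * hidden)
    return summed / keep.sum(dim=1).clamp(min=1.0)  # divide by the true lengths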


Deployment Considerations

When deploying a sentiment analysis model:

  1. Latency: A 2-layer BiLSTM processes a 300-word review in ~5ms on a GPU, ~50ms on a CPU. For real-time applications, consider a unidirectional single-layer model.
  2. Vocabulary management: New words (out-of-vocabulary) are mapped to UNK. Consider subword tokenization (BPE) for better coverage.
  3. Confidence calibration: Raw sigmoid outputs may not be well-calibrated. Apply temperature scaling on a held-out calibration set (a sketch follows this list).
  4. Model size: The model has ~10M parameters, requiring ~40MB of storage. Quantization can reduce this by 4x.
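
A minimal sketch of temperature scaling, assuming 1-D tensors of calibration logits and labels have already been collected (under torch.no_grad()) from a held-out calibration loader: a single scalar T > 0 is fitted to minimize the binary cross-entropy, and at inference sigmoid(logits / T) replaces sigmoid(logits).

"""Sketch: post-hoc temperature scaling for confidence calibration."""

import torch
import torch.nn as nn


def fit_temperature(
    logits: torch.Tensor, labels: torch.Tensor, steps: int = 200, lr: float = 0.01
) -> float:
    """Fit a single temperature T on calibration logits and float labels."""
    logits, labels = logits.detach().cpu(), labels.detach().cpu()
    log_t = torch.zeros(1, requires_grad=True)  # parameterize T = exp(log_t) > 0
    optimizer = torch.optim.Adam([log_t], lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()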

Key Takeaways

  1. LSTMs effectively capture sentiment-bearing patterns in text, including word order and context.
  2. Bidirectional processing and attention pooling each contribute meaningful improvements.
  3. Proper text preprocessing (tokenization, vocabulary construction, padding/packing) is essential.
  4. Gradient clipping prevents training instability.
  5. Attention weights provide a degree of interpretability that is valuable for debugging and trust.
  6. Error analysis reveals systematic failure modes that guide future improvements (e.g., explicit negation handling, subword tokenization).