Case Study 1: Sentiment Analysis with LSTMs

Overview

In this case study, we build a complete sentiment analysis pipeline using an LSTM-based classifier. We will classify movie reviews from the IMDB dataset as positive or negative, progressing through data preprocessing, model design, training, evaluation, and interpretation.

Sentiment analysis is one of the most natural applications of RNNs for text: it requires understanding the sequential structure of language, capturing negation ("not good"), sarcasm ("oh, how wonderful... not"), and long-range contextual cues. This many-to-one task illustrates how an LSTM can compress an entire sequence into a meaningful representation for classification.


Problem Definition

Task: Given a movie review (a variable-length sequence of words), classify it as positive (1) or negative (0).

Dataset: IMDB Movie Reviews --- 50,000 reviews split evenly into 25,000 training and 25,000 test reviews, with balanced positive/negative labels.

Evaluation Metric: Accuracy, with additional analysis of precision, recall, and F1 score.

Baseline: A bag-of-words logistic regression model typically achieves ~87% accuracy on this dataset.
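
To make this baseline concrete, the sketch below reproduces it under two assumptions that go beyond the text: the reviews are fetched with the Hugging Face datasets package (any other copy of the standard 25,000/25,000 split works just as well), and scikit-learn supplies the bag-of-words features and the logistic regression. The exact score varies slightly with the feature settings.

"""Sketch: bag-of-words logistic regression baseline for IMDB."""

from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Load the standard 25k/25k IMDB split.
imdb = load_dataset("imdb")
train_texts, train_labels = imdb["train"]["text"], imdb["train"]["label"]
test_texts, test_labels = imdb["test"]["text"], imdb["test"]["label"]

# Bag-of-words features: unigram counts over a capped vocabulary.
vectorizer = CountVectorizer(max_features=25000, lowercase=True)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Logistic regression on the sparse count matrix.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, train_labels)
print(f"Baseline accuracy: {baseline.score(X_test, test_labels):.3f}")  # typically around 0.87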


Data Preprocessing

Tokenization and Vocabulary Building

Text data requires several preprocessing steps before being fed to an RNN:

"""Sentiment analysis data preprocessing pipeline.

Handles tokenization, vocabulary construction, and DataLoader creation
for the IMDB sentiment analysis task.
"""

import re
from collections import Counter

import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(42)

# Special tokens
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"
PAD_IDX = 0
UNK_IDX = 1


def tokenize(text: str) -> list[str]:
    """Simple whitespace tokenizer with basic cleaning.

    Args:
        text: Raw text string.

    Returns:
        List of lowercase tokens.
    """
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)  # Remove HTML breaks
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # Keep only letters and whitespace
    return text.split()


class Vocabulary:
    """Word-to-index mapping with special tokens.

    Args:
        max_size: Maximum vocabulary size (excluding special tokens).
        min_freq: Minimum word frequency for inclusion.
    """

    def __init__(
        self, max_size: int = 25000, min_freq: int = 2
    ) -> None:
        self.max_size = max_size
        self.min_freq = min_freq
        self.word2idx: dict[str, int] = {
            PAD_TOKEN: PAD_IDX,
            UNK_TOKEN: UNK_IDX,
        }
        self.idx2word: dict[int, str] = {
            PAD_IDX: PAD_TOKEN,
            UNK_IDX: UNK_TOKEN,
        }

    def build(self, texts: list[list[str]]) -> None:
        """Build vocabulary from tokenized texts.

        Args:
            texts: List of tokenized documents.
        """
        counter = Counter()
        for tokens in texts:
            counter.update(tokens)

        # Filter by frequency and take top max_size
        common = [
            word for word, count in counter.most_common(self.max_size)
            if count >= self.min_freq
        ]

        for word in common:
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def encode(self, tokens: list[str]) -> list[int]:
        """Convert tokens to indices.

        Args:
            tokens: List of word tokens.

        Returns:
            List of integer indices.
        """
        return [
            self.word2idx.get(token, UNK_IDX) for token in tokens
        ]

    def __len__(self) -> int:
        return len(self.word2idx)


class IMDBDataset(Dataset):
    """PyTorch dataset for IMDB reviews.

    Args:
        encodings: List of encoded (integer) sequences.
        labels: List of integer labels (0 or 1).
        max_len: Maximum sequence length (truncation).
    """

    def __init__(
        self,
        encodings: list[list[int]],
        labels: list[int],
        max_len: int = 300,
    ) -> None:
        self.encodings = encodings
        self.labels = labels
        self.max_len = max_len

    def __len__(self) -> int:
        return len(self.encodings)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, int, int]:
        """Return encoded sequence, label, and original length.

        Args:
            idx: Sample index.

        Returns:
            Tuple of (token_ids, label, length).
        """
        tokens = self.encodings[idx][:self.max_len]
        length = len(tokens)
        return torch.tensor(tokens, dtype=torch.long), self.labels[idx], length


def collate_fn(
    batch: list[tuple[torch.Tensor, int, int]]
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Custom collation function for variable-length sequences.

    Args:
        batch: List of (token_ids, label, length) tuples.

    Returns:
        Tuple of (padded_sequences, labels, lengths).
    """
    sequences, labels, lengths = zip(*batch)
    padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
    return (
        padded,
        torch.tensor(labels, dtype=torch.float),
        torch.tensor(lengths, dtype=torch.long),
    )
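
A short sketch of how these pieces fit together. The names train_texts, train_labels, test_texts, and test_labels are assumed to hold the raw reviews and integer labels (for example from the loading sketch in the Problem Definition), the batch size is illustrative, and a validation loader can be built the same way from a held-out slice of the training reviews.

# Sketch: build the vocabulary on the training split only, then wrap
# everything in DataLoaders with the custom collate function.
train_tokens = [tokenize(text) for text in train_texts]
test_tokens = [tokenize(text) for text in test_texts]

vocab = Vocabulary(max_size=25000, min_freq=2)
vocab.build(train_tokens)  # fit the vocabulary on training data only

train_dataset = IMDBDataset(
    [vocab.encode(tokens) for tokens in train_tokens], train_labels
)
test_dataset = IMDBDataset(
    [vocab.encode(tokens) for tokens in test_tokens], test_labels
)

train_loader = DataLoader(
    train_dataset, batch_size=64, shuffle=True, collate_fn=collate_fn
)
test_loader = DataLoader(
    test_dataset, batch_size=64, shuffle=False, collate_fn=collate_fn
)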

Model Architecture

Our sentiment classifier uses the following architecture:

  1. Embedding layer: Maps word indices to dense vectors
  2. Bidirectional LSTM: Processes the sequence in both directions
  3. Attention pooling: Weighted combination of LSTM outputs (better than using just the final state)
  4. Classification head: Fully connected layers with dropout

"""LSTM-based sentiment analysis model with attention pooling."""

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(42)


class SelfAttention(nn.Module):
    """Simple self-attention mechanism for sequence pooling.

    Args:
        hidden_size: Size of the input features.
    """

    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.Tanh(),
            nn.Linear(hidden_size // 2, 1, bias=False),
        )

    def forward(
        self, lstm_output: torch.Tensor, mask: torch.Tensor
    ) -> torch.Tensor:
        """Compute attention-weighted sum of LSTM outputs.

        Args:
            lstm_output: LSTM outputs, shape (batch, seq_len, hidden).
            mask: Boolean mask, True for padding, shape (batch, seq_len).

        Returns:
            Weighted representation, shape (batch, hidden).
        """
        scores = self.attention(lstm_output).squeeze(-1)  # (batch, seq_len)
        scores = scores.masked_fill(mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)  # (batch, seq_len)
        weighted = torch.bmm(
            weights.unsqueeze(1), lstm_output
        ).squeeze(1)
        return weighted


class SentimentLSTM(nn.Module):
    """Bidirectional LSTM for sentiment analysis.

    Args:
        vocab_size: Size of the vocabulary.
        embed_dim: Word embedding dimensionality.
        hidden_size: LSTM hidden state size.
        num_layers: Number of stacked LSTM layers.
        dropout: Dropout rate for regularization.
        pad_idx: Padding token index.
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 300,
        hidden_size: int = 256,
        num_layers: int = 2,
        dropout: float = 0.5,
        pad_idx: int = 0,
    ) -> None:
        super().__init__()
        self.embedding = nn.Embedding(
            vocab_size, embed_dim, padding_idx=pad_idx
        )
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0.0,
        )
        self.attention = SelfAttention(hidden_size * 2)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),
        )

    def forward(
        self, text: torch.Tensor, lengths: torch.Tensor
    ) -> torch.Tensor:
        """Forward pass for sentiment classification.

        Args:
            text: Padded token indices, shape (batch, seq_len).
            lengths: Actual sequence lengths, shape (batch,).

        Returns:
            Logits for binary classification, shape (batch,).
        """
        embedded = self.embedding(text)

        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        lstm_out, _ = self.lstm(packed)
        lstm_out, _ = pad_packed_sequence(lstm_out, batch_first=True)

        # Create padding mask
        mask = torch.arange(text.size(1), device=text.device).unsqueeze(0)
        mask = mask >= lengths.unsqueeze(1)

        # Attention pooling
        pooled = self.attention(lstm_out, mask)

        logits = self.classifier(pooled).squeeze(-1)
        return logits
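
A quick sanity check of the architecture, assuming the vocabulary built during preprocessing; the dummy batch below is purely illustrative and only verifies output shapes.

# Sketch: instantiate the classifier and check shapes on a dummy batch.
model = SentimentLSTM(vocab_size=len(vocab), embed_dim=300, hidden_size=256)

dummy_text = torch.randint(2, len(vocab), (4, 50))  # (batch=4, seq_len=50), avoids PAD/UNK ids
dummy_lengths = torch.tensor([50, 42, 30, 17])      # true lengths; sorting is not required
logits = model(dummy_text, dummy_lengths)
print(logits.shape)  # torch.Size([4]) -- one logit per review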

Training Pipeline

"""Training loop for sentiment analysis model."""

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader

torch.manual_seed(42)


def train_epoch(
    model: nn.Module,
    dataloader: DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
    clip_norm: float = 5.0,
    device: str = "cuda",
) -> tuple[float, float]:
    """Train model for one epoch.

    Args:
        model: The sentiment classifier.
        dataloader: Training data loader.
        optimizer: The optimizer.
        criterion: Loss function (BCEWithLogitsLoss).
        clip_norm: Maximum gradient norm for clipping.
        device: Device to train on.

    Returns:
        Tuple of (average_loss, accuracy).
    """
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0

    for text, labels, lengths in dataloader:
        text = text.to(device)
        labels = labels.to(device)
        lengths = lengths.to(device)

        optimizer.zero_grad()
        logits = model(text, lengths)
        loss = criterion(logits, labels)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()

        total_loss += loss.item() * text.size(0)
        predictions = (torch.sigmoid(logits) > 0.5).float()
        correct += (predictions == labels).sum().item()
        total += text.size(0)

    return total_loss / total, correct / total


def evaluate(
    model: nn.Module,
    dataloader: DataLoader,
    criterion: nn.Module,
    device: str = "cuda",
) -> tuple[float, float]:
    """Evaluate model on a dataset.

    Args:
        model: The sentiment classifier.
        dataloader: Evaluation data loader.
        criterion: Loss function.
        device: Device for evaluation.

    Returns:
        Tuple of (average_loss, accuracy).
    """
    model.eval()
    total_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for text, labels, lengths in dataloader:
            text = text.to(device)
            labels = labels.to(device)
            lengths = lengths.to(device)

            logits = model(text, lengths)
            loss = criterion(logits, labels)

            total_loss += loss.item() * text.size(0)
            predictions = (torch.sigmoid(logits) > 0.5).float()
            correct += (predictions == labels).sum().item()
            total += text.size(0)

    return total_loss / total, correct / total
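
The outer loop below ties these two functions together with the Adam optimizer and ReduceLROnPlateau scheduler imported above. It is a sketch: train_loader and val_loader are assumed to be DataLoaders over the training reviews and a held-out validation slice, and the checkpoint filename is illustrative.

# Sketch of the outer training loop; loader names and the checkpoint path
# are assumptions, not part of the reference implementation.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentimentLSTM(vocab_size=len(vocab)).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.5, patience=1)

best_val_acc = 0.0
for epoch in range(1, 11):
    train_loss, train_acc = train_epoch(
        model, train_loader, optimizer, criterion, device=device
    )
    val_loss, val_acc = evaluate(model, val_loader, criterion, device=device)
    scheduler.step(val_acc)  # reduce the learning rate when validation accuracy plateaus

    if val_acc > best_val_acc:  # keep the best checkpoint
        best_val_acc = val_acc
        torch.save(model.state_dict(), "sentiment_lstm_best.pt")

    print(
        f"Epoch {epoch}: train loss {train_loss:.4f}, train acc {train_acc:.3f}, "
        f"val loss {val_loss:.4f}, val acc {val_acc:.3f}"
    )

Because the Problem Definition also calls for precision, recall, and F1, a small helper in the same style as evaluate can accumulate the confusion-matrix counts over a loader (again a sketch):

def binary_metrics(
    model: nn.Module, dataloader: DataLoader, device: str = "cuda"
) -> dict[str, float]:
    """Compute precision, recall, and F1 for the positive class."""
    model.eval()
    tp = fp = fn = 0
    with torch.no_grad():
        for text, labels, lengths in dataloader:
            text, labels = text.to(device), labels.to(device)
            preds = (torch.sigmoid(model(text, lengths.to(device))) > 0.5).float()
            tp += ((preds == 1) & (labels == 1)).sum().item()
            fp += ((preds == 1) & (labels == 0)).sum().item()
            fn += ((preds == 0) & (labels == 1)).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}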

Results and Analysis

Training Results

After 10 epochs of training with the configuration above, we observe:

Epoch    Train Loss    Train Acc    Val Loss    Val Acc
  1        0.6421        63.2%       0.5183      75.1%
  2        0.4312        80.5%       0.3876      83.4%
  3        0.3201        86.3%       0.3214      86.9%
  5        0.2105        91.8%       0.2891      88.7%
  8        0.1342        95.1%       0.3102      89.1%
 10        0.0987        96.4%       0.3356      88.9%

Best validation accuracy: 89.1% at epoch 8, surpassing the bag-of-words baseline (87%). Note that validation loss bottoms out around epoch 5 and then climbs while training loss keeps falling, a clear sign of overfitting; checkpointing on validation accuracy (effectively early stopping) is what selects the epoch-8 model.

Attention Analysis

One advantage of using attention pooling is interpretability. By examining the attention weights, we can see which words the model focuses on for its prediction:

Example (Positive review): "This film is absolutely brilliant with stunning performances and a beautiful story." - High attention words: "brilliant" (0.23), "stunning" (0.18), "beautiful" (0.15)

Example (Negative review): "The movie was terrible, with awful acting and a boring plot that went nowhere." - High attention words: "terrible" (0.25), "awful" (0.19), "boring" (0.16)
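
The model's forward method does not return these attention weights, but they can be recovered for inspection by rerunning the encoder steps of a trained model. The helper below is a sketch that reuses the model's own submodules; the example review, and the vocab and model objects it relies on, are illustrative.

"""Sketch: recover per-token attention weights from a trained model."""

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


def attention_weights(
    model: SentimentLSTM, text: torch.Tensor, lengths: torch.Tensor
) -> torch.Tensor:
    """Recompute attention weights, mirroring SentimentLSTM.forward.

    Returns weights of shape (batch, seq_len).
    """
    model.eval()
    with torch.no_grad():
        embedded = model.embedding(text)
        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        lstm_out, _ = model.lstm(packed)
        lstm_out, _ = pad_packed_sequence(lstm_out, batch_first=True)
        mask = torch.arange(text.size(1), device=text.device).unsqueeze(0)
        mask = mask >= lengths.unsqueeze(1)
        scores = model.attention.attention(lstm_out).squeeze(-1)
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=1)


# Illustrative usage on a single review (move the tensors to the model's
# device first if training ran on a GPU).
tokens = tokenize("This film is absolutely brilliant with stunning performances")
ids = torch.tensor([vocab.encode(tokens)], dtype=torch.long)
weights = attention_weights(model, ids, torch.tensor([len(tokens)]))
for token, weight in zip(tokens, weights[0].tolist()):
    print(f"{token:>15s}  {weight:.3f}")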

Error Analysis

Common failure modes include:

  1. Negation: "This movie is not bad" --- the model sometimes misses the negation, focusing on "bad"
  2. Sarcasm: "Oh sure, what a great use of two hours" --- sarcasm is contextual and hard to detect
  3. Mixed sentiment: "Great acting but terrible plot" --- conflicting signals confuse the model
  4. Domain shift: Reviews using unusual vocabulary or writing styles

Ablation Study

We conduct an ablation study to understand the contribution of each component:

Configuration                                  Val Accuracy
Full model (BiLSTM + Attention)                89.1%
Unidirectional LSTM + Attention                87.8%
BiLSTM + Last hidden state (no attention)      88.2%
BiLSTM + Mean pooling                          88.6%
Single-layer BiLSTM + Attention                88.3%
GRU instead of LSTM                            88.8%

Key findings:

  - Bidirectionality adds ~1.3% accuracy.
  - Attention pooling adds ~0.9% over the last hidden state.
  - The GRU variant performs nearly as well with faster training.
  - Depth (2 layers vs. 1) provides a modest improvement.
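
For reference, the mean-pooling ablation replaces the attention module with a masked average over the non-padded positions; a minimal sketch of that pooling step, using the same mask convention as SelfAttention (True at padded positions):

def mean_pool(lstm_output: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked mean over time; drop-in replacement for attention pooling."""
    keep = (~mask).unsqueeze(-1).float()            # (batch, seq_len, 1)
    summed = (lstm_output * keep).sum(dim=1)        # (batch, 2 * hidden)
    return summed / keep.sum(dim=1).clamp(min=1.0)  # divide by the true lengths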


Deployment Considerations

When deploying a sentiment analysis model:

  1. Latency: A 2-layer BiLSTM processes a 300-word review in ~5ms on a GPU, ~50ms on a CPU. For real-time applications, consider a unidirectional single-layer model.
  2. Vocabulary management: New words (out-of-vocabulary) are mapped to UNK. Consider subword tokenization (BPE) for better coverage.
  3. Confidence calibration: Raw sigmoid outputs may not be well-calibrated. Apply temperature scaling on a held-out calibration set (a sketch follows this list).
  4. Model size: The model has ~10M parameters, requiring ~40MB of storage. Quantization can reduce this by 4x.
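
A minimal sketch of temperature scaling, assuming 1-D tensors of calibration logits and labels have already been collected (under torch.no_grad()) from a held-out calibration loader: a single scalar T > 0 is fitted to minimize the binary cross-entropy, and at inference sigmoid(logits / T) replaces sigmoid(logits).

"""Sketch: post-hoc temperature scaling for confidence calibration."""

import torch
import torch.nn as nn


def fit_temperature(
    logits: torch.Tensor, labels: torch.Tensor, steps: int = 200, lr: float = 0.01
) -> float:
    """Fit a single temperature T on calibration logits and float labels."""
    logits, labels = logits.detach().cpu(), labels.detach().cpu()
    log_t = torch.zeros(1, requires_grad=True)  # parameterize T = exp(log_t) > 0
    optimizer = torch.optim.Adam([log_t], lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()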

Key Takeaways

  1. LSTMs effectively capture sentiment-bearing patterns in text, including word order and context.
  2. Bidirectional processing and attention pooling each contribute meaningful improvements.
  3. Proper text preprocessing (tokenization, vocabulary construction, padding/packing) is essential.
  4. Gradient clipping prevents training instability.
  5. Attention weights provide a degree of interpretability that is valuable for debugging and trust.
  6. Error analysis reveals systematic failure modes that guide future improvements (e.g., explicit negation handling, subword tokenization).