Input: 80-channel log-mel spectrogram computed from 30-second audio segments (padded or trimmed) - Two 1D convolution layers with GELU activation for initial feature processing - Sinusoidal positional encoding - Standard Transformer encoder blocks with pre-layer normalization