Remove duplicates (exact and near-duplicate, using MinHash or similar).
- Filter for quality using at least two heuristics:
  - Minimum length (e.g., at least 50 words).
  - Language detection (ensure all text is in the target language).
  - Perplexity filtering: use a reference language model to remove te