Case Study 2: Building an AI Engineering Team from Scratch
Background
Meridian Financial Services (a composite based on real industry patterns) is a regional financial services company with approximately 3,000 employees. Founded in 1995, Meridian offers consumer banking, commercial lending, wealth management, and insurance products. The company has a strong technology department of about 250 people, primarily focused on maintaining core banking systems, web and mobile applications, and data warehousing.
In early 2022, Meridian's Chief Technology Officer (CTO) presented the board with a strategic assessment: competitor banks were deploying AI-powered systems for fraud detection, credit risk modeling, customer service automation, and personalized product recommendations. Meridian's existing analytics capabilities --- primarily SQL-based reporting and a small team of two business analysts using Excel and Tableau --- were insufficient to compete.
The board approved a two-year initiative to build an internal AI engineering capability, with a budget of $4.2 million (covering personnel, infrastructure, and tooling). The CTO appointed Sarah Chen, a Senior Director of Engineering with 15 years of experience (including three years at a fintech startup where she had led ML initiatives), to build and lead the AI team.
The Challenge
Sarah faced a multi-dimensional challenge:
1. Talent
   - The local job market (a mid-sized city in the southeastern United States) had limited AI/ML talent.
   - Competing for experienced ML engineers against major tech companies and well-funded startups was difficult.
   - Existing engineering staff had strong software skills but limited ML experience.
2. Infrastructure
   - No GPU compute infrastructure existed.
   - The data warehouse was a traditional SQL Server-based system optimized for reporting, not for ML training workloads.
   - No experiment tracking, model registry, or model serving infrastructure was in place.
3. Organizational Readiness
   - Business stakeholders had vague expectations ("We need AI") without specific, well-defined use cases.
   - Regulatory compliance (banking regulators require model risk management, explainability, and audit trails for models used in lending decisions) imposed constraints that most AI tutorials and blog posts do not address.
   - The existing engineering culture was risk-averse and oriented toward waterfall development processes.
4. Proving Value
   - The board expected tangible ROI within 18 months.
   - Without quick wins, the initiative risked losing organizational support and budget.
The Strategy
Sarah developed a phased strategy that balanced short-term impact with long-term capability building.
Phase 1: Foundation (Months 1--6)
Hiring the Core Team
Sarah's first hire was a Staff ML Engineer named James, recruited from a larger financial institution where he had built fraud detection systems. James brought both ML expertise and financial domain knowledge --- a combination Sarah considered essential for the first hire.
Her hiring strategy prioritized:
- Versatility over specialization: For a small, new team, engineers who could work across the full stack (data processing, model development, deployment) were more valuable than deep specialists.
- Domain affinity: Candidates with financial services experience or strong interest in the domain were preferred, as they could contribute more quickly.
- Teaching ability: Early hires would need to mentor future team members and educate business stakeholders.
Over the first six months, Sarah hired:
- 2 ML Engineers (James plus one junior hire from a local university with a strong ML thesis)
- 1 Data Engineer (an internal transfer from the data warehouse team, eager to expand skills)
- 1 ML Platform Engineer (a remote hire with experience building ML infrastructure at a mid-sized company)
The total team, including Sarah, was five people.
Organizational structure:
```
CTO
└── Sarah Chen (Director of AI Engineering)
    ├── James Park (Staff ML Engineer)
    ├── Lisa Nguyen (ML Engineer)
    ├── Marcus Williams (Data Engineer, internal transfer)
    └── Priya Sharma (ML Platform Engineer, remote)
```
Upskilling Existing Staff
Rather than relying exclusively on new hires, Sarah launched an internal upskilling program:
- AI Literacy Workshops: Monthly two-hour sessions for business stakeholders covering what AI can and cannot do, how to define good AI use cases, and what to expect from AI projects.
- ML Engineering Bootcamp: A 12-week part-time program for interested software engineers covering Python for data science, ML fundamentals, and model deployment. Three engineers from the existing team completed the program and became "AI-adjacent" contributors who could support the core team on data integration and deployment tasks.
- Reading Group: A bi-weekly paper reading group open to the entire engineering department.
Infrastructure Setup
Priya, the ML Platform Engineer, led the infrastructure build-out:
- Compute: Set up a cloud-based ML environment on AWS using SageMaker for managed training and EC2 GPU instances for experimentation. Chose cloud over on-premises to avoid large upfront capital expenditure and to provide flexibility.
- Data access: Built secure data pipelines (using Apache Airflow) that extracted relevant data from the core banking data warehouse, applied PII masking and encryption, and loaded it into a dedicated ML data lake (S3 with Lake Formation for access control). A minimal sketch of such a pipeline appears at the end of this subsection.
- Experiment tracking: Deployed MLflow on an internal server for experiment tracking and model registry.
- Version control: Standardized on Git (GitHub Enterprise) with branching strategies, code review requirements, and CI/CD pipelines.
- Development environment: Standardized JupyterHub for exploration and VS Code for production code development.
Total infrastructure cost: approximately $8,000/month in the initial phase.
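To make the data-access piece concrete, here is a minimal sketch of an Airflow DAG in the spirit of the pipeline described above, assuming Airflow 2.x. The task logic, table names, and schedule are illustrative assumptions rather than Meridian's actual implementation.

```python
# Minimal sketch of a daily extract -> mask -> load pipeline (assumed Airflow 2.x).
# Masking rules and storage targets are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_transactions(**context):
    """Query the banking warehouse for the previous day's transactions (placeholder)."""
    ...


def mask_pii(**context):
    """Hash account numbers and drop direct identifiers before data leaves the warehouse (placeholder)."""
    ...


def load_to_data_lake(**context):
    """Write masked, encrypted Parquet files to the S3-based ML data lake (placeholder)."""
    ...


with DAG(
    dag_id="transactions_to_ml_lake",
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_transactions)
    mask = PythonOperator(task_id="mask_pii", python_callable=mask_pii)
    load = PythonOperator(task_id="load", python_callable=load_to_data_lake)

    extract >> mask >> load
```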
Use Case Selection
Sarah worked with business stakeholders to identify candidate AI use cases. She evaluated each candidate along four dimensions:
| Criterion | Description | Weight |
|---|---|---|
| Business impact | Revenue increase, cost reduction, or risk mitigation | 30% |
| Technical feasibility | Data availability, problem complexity, known approaches | 30% |
| Organizational readiness | Stakeholder engagement, integration complexity | 20% |
| Regulatory risk | Compliance requirements, explainability needs | 20% |
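As an illustration of how a weighted scorecard like this can be applied, the short sketch below computes a composite score per candidate. Only the weights come from the table; the candidate names, the 1-5 scores, and the convention that a higher regulatory score means lower risk are hypothetical.

```python
# Weighted use-case scoring sketch. Weights mirror the table above; the
# candidates and their 1-5 scores are invented for illustration.
WEIGHTS = {
    "business_impact": 0.30,
    "technical_feasibility": 0.30,
    "organizational_readiness": 0.20,
    "regulatory_risk": 0.20,  # scored so that higher = less regulatory risk
}

candidates = {
    "transaction_fraud_detection": {"business_impact": 5, "technical_feasibility": 4,
                                    "organizational_readiness": 5, "regulatory_risk": 3},
    "customer_churn_prediction": {"business_impact": 3, "technical_feasibility": 4,
                                  "organizational_readiness": 4, "regulatory_risk": 4},
    "document_classification": {"business_impact": 3, "technical_feasibility": 4,
                                "organizational_readiness": 3, "regulatory_risk": 5},
}


def weighted_score(scores: dict) -> float:
    """Composite score: sum of criterion score times criterion weight."""
    return sum(WEIGHTS[criterion] * value for criterion, value in scores.items())


ranked = sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)
for name in ranked:
    print(f"{name}: {weighted_score(candidates[name]):.2f}")
```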
From an initial list of 14 candidate use cases, three were selected for the first year:
- Transaction fraud detection (high impact, proven approaches, high stakeholder urgency)
- Customer churn prediction (moderate impact, well-understood problem, good data availability)
- Document classification for loan processing (moderate impact, clear efficiency gains, lower regulatory risk)
Phase 2: First Deliverables (Months 4--12)
Project 1: Transaction Fraud Detection
Problem: Meridian's existing fraud detection relied on static rules (e.g., flag transactions over $5,000 from unusual locations). This system had a high false-positive rate (3.2% of legitimate transactions flagged) that frustrated customers and cost the company $2.1 million annually in manual review labor.
Approach:
1. Marcus (Data Engineer) built a feature pipeline that computed 87 features from transaction data, including transaction velocity, merchant category patterns, geographic anomalies, time-of-day patterns, and device fingerprint signals.
2. James and Lisa developed and compared multiple models (a comparison in this spirit is sketched after this list):
   - Logistic regression baseline: AUC = 0.89
   - Random Forest: AUC = 0.93
   - XGBoost: AUC = 0.96
   - Neural network (feedforward): AUC = 0.95
3. The team selected XGBoost for initial deployment due to its strong performance, interpretability (feature importance), and lower infrastructure requirements compared to the neural network.
4. Priya deployed the model as a real-time inference service behind an API, integrated with the transaction processing system. Inference latency was kept under 50ms per transaction.
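A comparison in the spirit of step 2 might look like the sketch below. The synthetic data, feature count, and hyperparameters are assumptions; it shows only the shape of the evaluation, not Meridian's actual pipeline (which would also include the neural network baseline and a rigorous holdout strategy).

```python
# Sketch: train several candidate fraud models and compare validation AUC.
# Synthetic, imbalanced data stands in for the real 87-feature transaction set.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=30, weights=[0.99, 0.01],
                           random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  stratify=y, random_state=1)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=1),
    "xgboost": xgb.XGBClassifier(n_estimators=300, max_depth=6, eval_metric="auc"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: validation AUC = {auc:.3f}")
```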
Results (after 6 months in production):
- False-positive rate reduced from 3.2% to 0.8%
- Fraud detection rate improved from 74% to 91%
- Annual savings in manual review costs: approximately $1.4 million
- Customer satisfaction (measured by fraud-related complaint rate) improved by 35%
Key challenges encountered:
- Class imbalance: Fraudulent transactions represented only 0.12% of all transactions. The team used SMOTE oversampling and cost-sensitive learning to address this (both approaches are sketched after this list).
- Feature engineering at scale: Computing real-time features (e.g., "number of transactions in the last hour") required a streaming architecture (Kafka plus a custom aggregation service) that was new to Meridian's infrastructure.
- Model risk management: Banking regulators require documentation, validation, and ongoing monitoring of models used in fraud detection. The team developed a model risk management framework adapted from the Federal Reserve's SR 11-7 guidance.
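The two imbalance strategies mentioned above can be sketched as follows, assuming the xgboost and imbalanced-learn packages. The synthetic data approximates the 0.12% positive rate but is otherwise invented, as are the hyperparameters.

```python
# Sketch: cost-sensitive weighting vs. SMOTE oversampling for a rare-fraud problem.
import xgboost as xgb
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Roughly 0.12% positive class, mirroring the fraud rate described above.
X, y = make_classification(n_samples=50_000, n_features=30,
                           weights=[0.9988, 0.0012], random_state=2)

# Option 1: cost-sensitive learning -- upweight the rare positive class.
pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)
cost_sensitive = xgb.XGBClassifier(n_estimators=300, scale_pos_weight=pos_weight,
                                   eval_metric="aucpr")
cost_sensitive.fit(X, y)

# Option 2: SMOTE -- synthesize minority-class examples before training.
X_resampled, y_resampled = SMOTE(random_state=2).fit_resample(X, y)
oversampled = xgb.XGBClassifier(n_estimators=300, eval_metric="aucpr")
oversampled.fit(X_resampled, y_resampled)
```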
Project 2: Customer Churn Prediction
Problem: Meridian's wealth management division was losing high-value clients to competitors at a rate of 8.3% annually. Relationship managers had no systematic way to identify at-risk clients before they left.
Approach:
1. The team assembled a dataset of client activity over 36 months, including transaction patterns, product holdings, customer service interactions, digital engagement metrics, and demographic data.
2. They trained a gradient-boosted model (LightGBM) to predict the probability of churn within the next 90 days.
3. The model was deployed as a weekly batch scoring job that produced a ranked list of at-risk clients, along with the top contributing factors for each prediction.
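A sketch of what that weekly scoring step might look like, assuming the LightGBM Python package: the synthetic client data, feature names, and the use of per-prediction contributions as "top factors" are illustrative assumptions, not Meridian's implementation.

```python
# Sketch: weekly batch scoring with churn probabilities and a top contributing
# factor per client. Data and feature names are synthetic stand-ins.
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5_000, n_features=12, random_state=3)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
clients = pd.DataFrame(X, columns=feature_names)

model = lgb.LGBMClassifier(n_estimators=200)
model.fit(clients, y)

# Probability of churn within the prediction window for each client.
churn_prob = model.predict_proba(clients)[:, 1]

# Per-prediction feature contributions; the final column is the bias term.
contribs = model.booster_.predict(clients, pred_contrib=True)[:, :-1]
top_factor = [feature_names[i] for i in np.argmax(np.abs(contribs), axis=1)]

ranked = (clients.assign(churn_prob=churn_prob, top_factor=top_factor)
                 .sort_values("churn_prob", ascending=False))
print(ranked[["churn_prob", "top_factor"]].head(10))  # list handed to relationship managers
```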
Results (after 9 months):
- The model identified 73% of eventual churners in the top decile of risk scores.
- Relationship managers, armed with risk scores and contributing factors, initiated proactive outreach.
- Client retention in the targeted group improved by 22%.
- Estimated annual revenue retention: $3.8 million.
Project 3: Document Classification
Problem: Loan applications required manual review of multiple document types (pay stubs, tax returns, bank statements, employment verification letters). Document sorting consumed an estimated 15 minutes per application.
Approach:
1. The team fine-tuned a pre-trained text classification model on 12,000 labeled documents (classified into 8 document types).
2. They also used an image classification model (ResNet-50, fine-tuned) for scanned documents where OCR output was unreliable.
3. An ensemble of both models provided robust classification across document formats.
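The ensemble-plus-review logic in step 3 could be as simple as the sketch below, which averages the class probabilities from the two models and routes low-confidence documents to a human. The class names and the 0.85 threshold are hypothetical, and the actual models are omitted.

```python
# Sketch: combine text-model and image-model probabilities, flag low confidence.
import numpy as np

DOC_TYPES = ["pay_stub", "tax_return", "bank_statement", "employment_letter",
             "id_document", "utility_bill", "w2", "other"]  # 8 hypothetical classes
CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff for routing to human review


def classify_document(text_probs: np.ndarray, image_probs: np.ndarray) -> dict:
    """Average per-class probabilities from the two models and pick the best class."""
    combined = (text_probs + image_probs) / 2.0
    best = int(np.argmax(combined))
    confidence = float(combined[best])
    return {
        "doc_type": DOC_TYPES[best],
        "confidence": confidence,
        "needs_human_review": confidence < CONFIDENCE_THRESHOLD,
    }


# Example: the text model is fairly sure it's a pay stub, the image model less so.
text_probs = np.array([0.70, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03])
image_probs = np.array([0.40, 0.20, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05])
print(classify_document(text_probs, image_probs))
```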
Results:
- Document classification accuracy: 94.7%
- Processing time reduced from 15 minutes to under 30 seconds per application (with human review of low-confidence classifications).
- The loan processing team was able to handle 40% more applications without additional staff.
Phase 3: Scaling and Maturing (Months 12--24)
With three successful projects demonstrating value, the team entered a scaling phase:
Team Growth
Based on demonstrated ROI, the budget was expanded. By month 24, the team grew to 12 people:
```
CTO
└── Sarah Chen (Director of AI Engineering)
    ├── ML Engineering Pod
    │   ├── James Park (Staff ML Engineer, Pod Lead)
    │   ├── Lisa Nguyen (Senior ML Engineer)
    │   ├── New Hire (ML Engineer)
    │   └── New Hire (ML Engineer)
    ├── Data & Platform Pod
    │   ├── Marcus Williams (Senior Data Engineer, Pod Lead)
    │   ├── Priya Sharma (Staff ML Platform Engineer)
    │   ├── New Hire (Data Engineer)
    │   └── New Hire (ML Platform Engineer)
    └── Applied AI Pod
        ├── New Hire (Senior Applied Scientist)
        ├── New Hire (NLP Engineer)
        └── Internal Transfer (ML Engineer, bootcamp graduate)
```
Platform Maturation
The team invested in platforms and processes:
- Feature store: Deployed Feast to manage shared features across projects, reducing duplicate feature engineering effort and ensuring consistency between training and serving.
- Model monitoring: Implemented Evidently AI for data drift detection and model performance monitoring, with automated alerts (a sketch of a drift check appears after this list).
- Standardized ML pipeline: Created a template-based system (using MLflow Projects and custom tooling) that reduced the time to deploy a new model from 6 weeks to 2 weeks.
- Model governance: Established a Model Review Board (including compliance, risk, and business representatives) that approved all models before production deployment.
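A drift check of the kind described above might look like the following, assuming an Evidently release that exposes the Report/DataDriftPreset API (the library's interface has changed across versions). The reference and current data frames are synthetic.

```python
# Sketch: compare a reference window to the current window and write a drift report.
import numpy as np
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

rng = np.random.default_rng(0)
reference = pd.DataFrame({"txn_amount": rng.lognormal(3.0, 1.0, 10_000),
                          "txn_per_hour": rng.poisson(2, 10_000)})
current = pd.DataFrame({"txn_amount": rng.lognormal(3.4, 1.0, 10_000),  # simulated shift
                        "txn_per_hour": rng.poisson(2, 10_000)})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("weekly_drift_report.html")  # in production, results would also feed automated alerts
```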
New Capabilities
The expanded team took on more ambitious projects:
- Conversational AI: An NLP-powered customer service chatbot that handled 40% of routine inquiries without human intervention.
- Credit risk modeling: A next-generation credit scoring model that incorporated alternative data sources, subject to rigorous regulatory review.
- Personalized product recommendations: A recommendation engine for the mobile banking app that suggested relevant financial products based on customer profiles and behavior.
Cost-Benefit Analysis
At the end of year two, Sarah presented the following financial summary to the board:
Investment (over 24 months):
| Category | Cost |
|---|---|
| Personnel (salaries, benefits, recruiting) | $3,200,000 |
| Cloud infrastructure | $480,000 |
| Tooling and software licenses | $180,000 |
| Training and upskilling | $90,000 |
| Total investment | $3,950,000 |
Returns (annualized, year 2):
| Project | Annual Value |
|---|---|
| Fraud detection (cost savings + loss prevention) | $2,800,000 |
| Customer churn reduction (retained revenue) | $3,800,000 |
| Document processing efficiency | $620,000 |
| Chatbot (customer service cost reduction) | $1,100,000 |
| Credit risk (improved loan performance) | $950,000 |
| Product recommendations (incremental revenue) | $420,000 |
| Total annual value | $9,690,000 |
The initiative achieved a positive ROI within 15 months and was generating approximately 2.5x annual return on investment by the end of year two.
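One way to sanity-check the headline multiple against the two tables above:

```python
# Annualized value at the end of year two relative to the full 24-month investment.
total_investment = 3_950_000
annual_value_year_two = 9_690_000

print(f"{annual_value_year_two / total_investment:.2f}x")  # ~2.45x, i.e. roughly 2.5x
```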
Lessons Learned
1. Start with a business problem, not a technology. The most successful projects started with a clearly defined business problem and a measurable success criterion. Projects that started with "let's apply AI to X" without a clear business case struggled to gain traction.
2. The first hire sets the culture. James's combination of ML expertise, financial domain knowledge, and collaborative temperament set the tone for the entire team. He mentored junior engineers, partnered effectively with business stakeholders, and established engineering practices that scaled as the team grew.
3. Internal upskilling creates force multipliers. The bootcamp graduates and AI-literate business stakeholders accelerated every project. Engineers who understood ML concepts could integrate more effectively with the AI team, and business stakeholders who understood AI's capabilities and limitations could define better use cases.
4. Infrastructure investment pays compound returns. The feature store, experiment tracking, and standardized deployment pipelines were expensive to build initially but dramatically accelerated subsequent projects. The time to deploy a model dropped from six weeks to two weeks, enabling the team to take on more projects with the same headcount.
5. Regulatory compliance is a feature, not a bug. In financial services, the discipline required by regulatory compliance (documentation, validation, monitoring, audit trails) actually improved the quality and reliability of the AI systems. Teams in less regulated industries can learn from these practices.
6. Quick wins build organizational trust. The fraud detection project, which delivered clear, measurable value within nine months, was essential for building the organizational support needed for the larger initiative. Without that early success, the team might not have received the budget and executive sponsorship needed to scale.
7. Remote and hybrid teams can work for AI engineering. Priya, the ML Platform Engineer, worked fully remotely and was one of the most effective team members. The key was clear communication norms, strong documentation practices, and regular (but not excessive) synchronous collaboration.
8. The AI engineer role is broader than expected. Sarah found that the most effective AI engineers on her team were those who could move fluidly between data exploration, model development, systems engineering, and stakeholder communication. Pure ML specialists who could not engage with infrastructure or business context were less effective in the small-team environment.
Discussion Questions
1. Sarah chose to build an internal AI team rather than outsource to a consulting firm or use a fully managed AI platform. What are the trade-offs of each approach? Under what circumstances would outsourcing be the better choice?
2. The team selected XGBoost over a neural network for fraud detection, despite the neural network achieving similar performance. What factors beyond raw accuracy should influence model selection in a production environment?
3. How should Meridian handle the transition for employees whose roles are affected by AI automation (e.g., document reviewers, customer service agents)? What is the company's ethical responsibility?
4. The use case selection framework weighted business impact and technical feasibility at 30% each, with organizational readiness and regulatory risk at 20% each. Do you agree with these weights? How would you adjust them, and why?
5. If you were Sarah and could change one decision from the first two years, what would it be? If you were planning year three, what would be your top priority?
6. Compare the team structures at month 6 and month 24. What principles guided the evolution, and how might the team need to change as it scales to 25 or 50 people?