ACTIVE DELIVERY · Planned go-live mid-July 2026 · Fraud analytics

CRDB Bank - ML Fraud Detection (Tanzania).

Lead Data Engineer / ML Fraud Owner
Cairo, Egypt
Source: model audit documentation

I owned the data and model engineering foundation for a fraud anomaly detection platform at CRDB Bank PLC, Tanzania's largest bank. That meant building the Oracle feature pipeline, a governed scoring workflow, score reproducibility controls, and a supervised fraud-learning runway ahead of a planned mid-July 2026 production launch.

Transactions

84M/month

Production transaction-level scale after migration

Customers

9M+

Population coverage

Channels

30+

Digital and traditional fraud surfaces

Model

350

Trees in the current anomaly model build

Threshold

99.5%

Training-derived anomaly cutoff for current scoring

// 02 — CHALLENGE

WHY IT MATTERED

CRDB Bank operated at national scale as Tanzania's largest bank, serving more than 9 million customers and processing tens of millions of transactions every month. The fraud platform had to identify unusual behavior across a very large population without turning into a noisy generic-alert engine.

The risk surface was broad: mobile money, ATM/POS, digital banking, agency banking, SWIFT and TISS wires, TIPS, cheques, trade finance, and account-opening activity each carried different fraud patterns, latency expectations, and data-quality constraints.

A major source migration replaced the original data with a transaction-level schema of roughly 84M records per month whose fields only partially aligned with the old ones. I treated that as an ownership problem, not a mapping exercise: the feature logic, ledger flattening, preprocessing, validation evidence, lineage, and score-time assumptions all had to be rebuilt and governed.

// 03 — APPROACH

HOW I BUILT IT

I led the model-ready data architecture from raw Oracle staging through transaction base tables, ledger flattening, FX normalization, behavioral features, model-prep tables, final model input, and lineage maps. The pipeline produced one analytical row per customer transaction while keeping operational identifiers and labels outside the score-time feature set.
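
The one-row-per-transaction guarantee is the kind of property that ledger-flattening joins can silently break. As a rough illustration (not the project code), the check below compares source transaction keys with the keys in the final analytical table. The function names and the sample keys are hypothetical, and the real pipeline ran in Oracle rather than Python.

```python
from collections import Counter


def check_one_row_per_transaction(source_keys, output_keys):
    """Report keys that break the one-analytical-row-per-transaction rule."""
    source = set(source_keys)
    counts = Counter(output_keys)
    return {
        # Keys that appear more than once, e.g. fanned out by a join.
        "duplicated": sorted(k for k, n in counts.items() if n > 1),
        # Source transactions that never reached the output table.
        "missing": sorted(source - set(counts)),
        # Output rows with no matching source transaction.
        "unexpected": sorted(set(counts) - source),
    }


def is_clean(report):
    """True when no category of breach was found."""
    return not any(report.values())


# Hypothetical keys: T2 was duplicated by a join and T3 was dropped.
report = check_one_row_per_transaction(["T1", "T2", "T3"], ["T1", "T2", "T2"])
print(report, is_clean(report))
```

A check like this is cheap to run after every stage, which makes row-count evidence part of the lineage rather than a one-off test.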

The feature layer captured transaction amount and direction, USD-normalized monetary values, cyclical time signals, posting lag, interarrival timing, balance context, channel and currency behavior, novelty flags, velocity windows, amount deviation, ledger complexity, profile completeness, and data-quality indicators.
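
Three of these signals are easy to show in miniature. The sketch below is an assumption-laden illustration of cyclical time encoding, interarrival timing, and a velocity window for one customer's sorted transaction timestamps. The function names and the one-hour window are hypothetical choices, not the production definitions.

```python
import math
from datetime import datetime, timedelta


def cyclical_hour(ts):
    """Encode time of day as (sin, cos) so 23:59 sits next to 00:00."""
    angle = 2 * math.pi * (ts.hour + ts.minute / 60) / 24
    return math.sin(angle), math.cos(angle)


def interarrival_seconds(timestamps):
    """Seconds since the previous transaction; None for the first one."""
    if not timestamps:
        return []
    gaps = [None]
    for prev, cur in zip(timestamps, timestamps[1:]):
        gaps.append((cur - prev).total_seconds())
    return gaps


def velocity_counts(timestamps, window):
    """Count prior transactions inside `window` before each one (sorted input)."""
    counts, start = [], 0
    for i, ts in enumerate(timestamps):
        # Slide the window start past anything older than ts - window.
        while timestamps[start] < ts - window:
            start += 1
        counts.append(i - start)
    return counts


# Hypothetical timestamps for a single customer, already sorted.
day = datetime(2026, 1, 5)
stamps = [day.replace(hour=9), day.replace(hour=9, minute=10),
          day.replace(hour=9, minute=50), day.replace(hour=11)]
print(interarrival_seconds(stamps))
print(velocity_counts(stamps, timedelta(hours=1)))
```

The window only looks backwards from each transaction, so the feature uses nothing that would be unavailable at score time.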

For the current model, I implemented a governed anomaly-detection pipeline with reproducible preprocessing, persisted median/IQR-style transformation parameters, an ASTORE model artifact, and a training-derived anomaly threshold. The model was intentionally positioned as anomaly detection and alert candidate generation, not a final calibrated fraud probability.
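
To make the idea of a training-derived threshold concrete, here is a minimal sketch. It assumes the cutoff is a nearest-rank percentile of training anomaly scores and that scoring flags anything strictly above it. The percentile method, the strict comparison, and the function names are all assumptions for illustration.

```python
import math


def percentile_cutoff(train_scores, pct=99.5):
    """Nearest-rank percentile of the training scores, computed once at train time."""
    if not train_scores:
        raise ValueError("need at least one training score")
    ordered = sorted(train_scores)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def flag_candidates(scores, cutoff):
    """Mark scores above the stored cutoff as alert candidates.

    The cutoff is passed in, never recomputed from the batch being scored.
    """
    return [score > cutoff for score in scores]
```

Because `flag_candidates` takes the cutoff as an argument, a scoring batch with an unusual score distribution cannot shift the alert rate by redefining what counts as anomalous.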

In parallel, I built the positive-fraud feature path from confirmed fraud records into a schema-compatible FINAL_MDL_INPUT_PF table. That gave the project the supervised-learning bridge needed for the next phase without contaminating the unsupervised model input.
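
Schema compatibility between the two tables can be expressed as a simple column comparison. The sketch below is illustrative only: the column names are hypothetical placeholders, and in practice the column lists would come from the database catalog rather than literals.

```python
def schema_diff(reference_columns, candidate_columns):
    """Compare a candidate table's columns against the reference feature contract."""
    ref, cand = list(reference_columns), list(candidate_columns)
    return {
        "missing": [c for c in ref if c not in cand],
        "extra": [c for c in cand if c not in ref],
        # Same set of columns but in a different order.
        "order_mismatch": sorted(ref) == sorted(cand) and ref != cand,
    }


# Hypothetical column lists for FINAL_MDL_INPUT and FINAL_MDL_INPUT_PF.
main_cols = ["AMT_USD", "HOUR_SIN", "VEL_1H"]
pf_cols = ["AMT_USD", "VEL_1H", "HOUR_SIN"]
print(schema_diff(main_cols, pf_cols))
```

Order is reported separately because some scoring paths bind inputs by position, which is a risk worth surfacing even when the names match.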

// 04 — KEY DECISIONS

WHAT I CHOSE & WHY

Decision · 01

Keep model features separate from labels and lineage

I designed FINAL_MDL_INPUT as the numerical score-time feature contract and kept labels, source identifiers, and investigation lineage in separate mapping tables. That prevented leakage and made the model auditable from raw transaction to scored output.
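
The separation can be pictured as splitting each prepared record in two, joined only by a surrogate key. This is a sketch under assumptions: the `row_id` key, the column names, and the `split_record` helper are hypothetical and stand in for the real tables.

```python
# Hypothetical score-time feature contract.
FEATURE_COLUMNS = ("amount_usd", "hour_sin", "velocity_1h")


def split_record(row_id, record):
    """Split one prepared record into a feature row and a lineage row."""
    features = {"row_id": row_id}
    features.update({col: record[col] for col in FEATURE_COLUMNS})
    # Everything outside the contract (labels, identifiers) goes to lineage.
    lineage = {"row_id": row_id}
    lineage.update({k: v for k, v in record.items() if k not in FEATURE_COLUMNS})
    return features, lineage


record = {"txn_ref": "T1", "customer_id": "C9", "is_fraud": 1,
          "amount_usd": 120.0, "hour_sin": 0.5, "velocity_1h": 3}
features, lineage = split_record(1, record)
print(features)
print(lineage)
```

The model only ever sees `features`, while an auditor can still join `lineage` back on `row_id` to trace a score to its source transaction.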

Decision · 02

Make preprocessing a versioned scoring dependency

Training learned the imputation and robust-scaling parameters once, stored them in FMI_PREPROCESS_PARAMS, and reused them during production scoring. No batch score was allowed to quietly relearn medians, IQRs, or thresholds at score time.
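
The fit-once, reuse-everywhere pattern looks roughly like this. The sketch assumes median imputation and median/IQR scaling, and it uses a JSON string as a stand-in for a row in FMI_PREPROCESS_PARAMS. The function names and the single `amount` column are hypothetical.

```python
import json
import statistics


def fit_params(rows, columns):
    """Learn median and IQR per column from training rows only."""
    params = {}
    for col in columns:
        values = [r[col] for r in rows if r[col] is not None]
        q1, _, q3 = statistics.quantiles(values, n=4)
        # Guard against a zero IQR on near-constant columns.
        params[col] = {"median": statistics.median(values), "iqr": (q3 - q1) or 1.0}
    return params


def apply_params(row, params):
    """Impute and scale one row using stored parameters; nothing is relearned."""
    scaled = {}
    for col, p in params.items():
        value = p["median"] if row.get(col) is None else row[col]
        scaled[col] = (value - p["median"]) / p["iqr"]
    return scaled


train = [{"amount": v} for v in (10.0, 20.0, 30.0, 40.0, 50.0)]
stored = json.dumps(fit_params(train, ["amount"]))  # persisted at train time
params = json.loads(stored)                         # loaded at score time
print(apply_params({"amount": None}, params))
```

Keeping `apply_params` free of any fitting logic is what makes a score reproducible: the same input row and the same stored parameters always give the same scaled value.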

Decision · 03

Start with a governed anomaly-detection champion

Confirmed fraud labels were still being integrated, so I used a governed anomaly-detection approach as the initial champion for unusual transaction behavior. It produced anomaly scores, candidate alerts, and a reusable signal for the future hybrid fraud model.

Decision · 04

Build supervised learning without polluting anomaly training

I built the positive-fraud pipeline in parallel, producing schema-compatible fraud examples for the supervised phase while keeping confirmed fraud rows out of the unsupervised training table until a formal labeled assembly was ready.

// 05 — ARCHITECTURE

HOW IT FITS TOGETHER

The architecture was a transaction-level fraud-scoring platform with a governed handoff between Oracle and SAS Viya. Source transactions were shaped into base and ledger-flattened analytical tables, converted into model-ready feature contracts, preprocessed with persisted training parameters, and scored through an anomaly-model ASTORE artifact. Scored output was then prepared for SAS Fraud Management (SFM) case-management integration, with tiering, reason codes, and supervised challengers planned before the mid-July 2026 launch.

// FIG. SYSTEM DIAGRAM

// 06 — HIGHLIGHTS

KEY TAKEAWAYS

▸ LARGEST-BANK SCALE

Built for CRDB Bank PLC, Tanzania's largest bank, including a migrated transaction-level schema with roughly 84M monthly records.

▸ 30+ CHANNEL FRAUD COVERAGE

Covered mobile money, ATM/POS, digital banking, agency banking, wires, TIPS, cheques, trade finance, and account-opening surfaces.

▸ ORACLE-FIRST FEATURE FACTORY

Implemented base-table creation, ledger flattening, FX conversion, behavior windows, encoding, model-prep, final input, and lineage outputs.

▸ GOVERNED ANOMALY MODEL

Trained the SAS Viya anomaly-detection workflow, persisted the ASTORE model artifact, and stored the threshold separately for production scoring.

▸ SUPERVISED MODEL RUNWAY

Built the positive-fraud feature pipeline so confirmed fraud examples could be appended into a future supervised and hybrid champion build.

// 07 — OUTCOMES

RESULTS AND LESSONS

→ Owned the Oracle-to-SAS model engineering path from source staging through feature creation, lineage, reproducible preprocessing, model training, and batch scoring scripts.
→ Delivered the current unsupervised anomaly detection foundation as an analyst triage and alert-candidate generator, with production launch planned for mid-July 2026 after validation and integration hardening.
→ Separated model features, labels, lineage, and positive-fraud examples so the architecture could pass model-risk review and mature into supervised learning without leakage.
→ Implemented reusable controls around schema parity, feature existence, row-count validation, train-time preprocessing parameters, ASTORE reuse, and score-time thresholding.
→ Defined the next production architecture: reason codes, tiering, false-positive suppression, SFM payload mapping, monitoring dashboards, and a hybrid supervised-plus-unsupervised champion strategy.

// 08 — STACK

THE TOOLS

Data

Oracle SQL · Ledger flattening · FINAL_MDL_INPUT · Lineage maps · Schema validation

Features

Velocity windows · Amount deviation · Ledger complexity · FX diagnostics · Data-quality flags

Models

SAS Viya · Unsupervised anomaly detection · ASTORE · Percentile thresholding

Delivery

Batch scoring · SAS Fraud Management plan · Reason-code roadmap · Tiering roadmap · Model monitoring

// 09 — LINKS

SOURCE TRAIL

LinkedIn profile · Old portfolio