AMR.ALFAYOUMY
// CHALLENGE & SOLUTION · FRAUD RISK MODEL

When the same fraud model gave two different answers.

A production ML lesson about reproducibility, hardware, and not trusting the obvious explanation.

In fraud detection, the worst failures are not always the loud ones.

Sometimes the model runs. The pipeline completes. The data loads. The code version looks correct. The output table is populated exactly where everyone expects it to be.

And still, the result is wrong.

This is a story about one of those failures: a fraud risk model that behaved normally in development, then produced a dramatically different volume of alerts in production. The first instinct in that situation is usually to blame the data, the threshold, the model artifact, or some hidden change in package versions.

In this case, none of those were the root cause.

The issue was lower than the model. Lower than Python. Lower than the feature code. It came from the CPU.

More specifically, it came from different low-level numeric execution paths being selected on two machines that were supposed to produce the same fraud scores.

That sounds esoteric, but the business impact was very concrete: the development environment detected 55 cases, while production detected 20,354. High-risk cases moved from 12 to 1,430.

For a fraud team, that is not a rounding error. That is the difference between a usable investigative queue and an operational fire.

The Symptom

The project involved a fraud risk model for a client environment where consistency mattered. The model was not being used as a casual experiment; it was part of a controlled scoring process where investigators, risk owners, and technical teams needed confidence that the same inputs would produce the same outputs.

The surprise appeared during parity testing between development and production.

Both environments were scoring what should have been the same data with the same model. But the alert volumes were wildly different:

Metric            Development   Production
Detected cases    55            20,354
High-risk cases   12            1,430

At that point, the easy explanation would have been: production must be receiving different input data.

That is the first thing I checked.

It was also the first thing the diagnostics ruled out.

What I Checked First

When a model behaves differently across environments, I do not start by guessing. I try to narrow the problem boundary until there is only one layer left that can explain the difference.

The parity diagnostics compared development and production across the raw scoring data, the installed software, and the runtime configuration:

  • raw schema
  • raw values
  • data types
  • null counts
  • feature means
  • package versions
  • model artifacts
  • single-thread runtime settings

All of those matched.
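A minimal sketch of what such a parity fingerprint could look like, not the diagnostics used on the project. It assumes the scoring data can be represented as a list of dictionaries; the function names (`fingerprint`, `_canon`) and the sample rows are hypothetical. The unordered value hash is built by hashing each row and sorting the row hashes, so row order does not affect the result but a one-bit change in any float does.

```python
import hashlib
import math


def _canon(value):
    """Canonical text for one cell; floats use repr() so every bit counts."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "<null>"
    return f"{type(value).__name__}:{value!r}"


def fingerprint(rows, columns):
    """Summarise a table as a schema hash, null counts, and an unordered value hash."""
    schema = hashlib.sha256("|".join(columns).encode()).hexdigest()
    null_counts = {
        col: sum(_canon(row.get(col)) == "<null>" for row in rows) for col in columns
    }
    # Hash each row, then sort the row hashes so row order does not matter.
    row_hashes = sorted(
        hashlib.sha256(
            "|".join(_canon(row.get(col)) for col in columns).encode()
        ).hexdigest()
        for row in rows
    )
    value_hash_unordered = hashlib.sha256("".join(row_hashes).encode()).hexdigest()
    return {
        "schema": schema,
        "null_counts": null_counts,
        "value_hash_unordered": value_hash_unordered,
    }


# Hypothetical sample rows: same content, different row order.
columns = ["amount", "country"]
dev = [{"amount": 10.5, "country": "EG"}, {"amount": None, "country": "US"}]
prod = list(reversed(dev))
print(fingerprint(dev, columns) == fingerprint(prod, columns))  # True

# A single float nudged by the smallest representable step changes the value hash.
drifted = [{"amount": math.nextafter(10.5, 11.0), "country": "EG"}, dev[1]]
print(
    fingerprint(drifted, columns)["value_hash_unordered"]
    == fingerprint(dev, columns)["value_hash_unordered"]
)  # False
```

Using `repr()` for floats is a deliberate choice in this sketch: rounding before hashing would hide exactly the kind of drift the comparison is meant to catch.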

That was important. It meant the problem was not caused by a missing column, a different source extract, a bad join, or a model file being out of sync.

The first real divergence appeared later, at the transformed feature layer:

hashes.features_transformed.value_hash_unordered

That detail mattered more than the alert counts.

The raw data was still identical when loaded. The difference only appeared after preprocessing transformed the raw input into the numeric feature matrix used by the model.

In plain English: both environments started with the same facts, but they did not turn those facts into model features in exactly the same way.

Once the transformed features drifted, the rest was predictable. Probability hashes diverged. Prediction hashes diverged. Risk bands diverged. The final case volumes became impossible to trust.
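The idea of finding the first divergent layer can be sketched as a walk through the pipeline stages in order. This is an illustration only: apart from `features_transformed`, the stage names and all hash values below are hypothetical placeholders, and `first_divergence` is not a function from the project.

```python
def first_divergence(dev_hashes, prod_hashes, stage_order):
    """Return the first stage whose hashes differ, or None if every stage matches."""
    for stage in stage_order:
        if dev_hashes.get(stage) != prod_hashes.get(stage):
            return stage
    return None


# Hypothetical stage order, from raw input to final output.
STAGES = ["raw", "features_transformed", "probabilities", "predictions", "risk_bands"]

# Placeholder hash strings, not real diagnostics output.
dev_hashes = {"raw": "aa", "features_transformed": "b1", "probabilities": "c1",
              "predictions": "d1", "risk_bands": "e1"}
prod_hashes = {"raw": "aa", "features_transformed": "b2", "probabilities": "c2",
               "predictions": "d2", "risk_bands": "e2"}

print(first_divergence(dev_hashes, prod_hashes, STAGES))  # features_transformed
```

Everything downstream of the first mismatch is expected to differ, so only the earliest divergent stage is worth investigating.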

Why This Was Strange

Most production ML issues live in familiar places.

The data changed. A scheduled job picked up a different date range. A feature table was refreshed in one environment and stale in another. A dependency upgraded quietly. A threshold was copied incorrectly. A model artifact was replaced without the matching preprocessing artifact.

This case was frustrating because the usual suspects kept clearing themselves.

The Python versions matched. The versions of NumPy, pandas, scikit-learn, and the other machine learning libraries matched. The scoring code matched. The model artifact matched. The input data matched. Runtime threading had already been constrained to reduce nondeterminism.

But the transformed features still differed.

That pushed the investigation below the application layer and into the numeric runtime.

The Root Cause

The development server and production server did not expose the same CPU capabilities.

The development machine had AVX512-capable execution paths available. Production did not.
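One way to confirm a gap like this on Linux x86 hosts is to compare the CPU flags each machine reports in `/proc/cpuinfo`. The sketch below is an assumption about how that check could be done, not the tooling used on the project; the function names and the shortened sample flag lines are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def read_host_flags(path="/proc/cpuinfo"):
    """Read the flags of the current host (Linux only; x86 uses the 'flags' key)."""
    with open(path, encoding="utf-8") as handle:
        return cpu_flags(handle.read())


def flag_gap(dev_flags, prod_flags):
    """List the flags that only one of the two hosts exposes."""
    return {
        "dev_only": sorted(dev_flags - prod_flags),
        "prod_only": sorted(prod_flags - dev_flags),
    }


# Shortened, hypothetical samples standing in for text captured on each server.
dev_sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 fma avx512f avx512bw\n"
prod_sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 fma\n"

print(flag_gap(cpu_flags(dev_sample), cpu_flags(prod_sample)))
# {'dev_only': ['avx512bw', 'avx512f'], 'prod_only': []}
```

The hardware flags only show what the CPU offers. To see what the numeric stack actually detected, NumPy's `numpy.show_runtime()` prints the SIMD extensions it found on the current host.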

AVX512 is a set of CPU instructions that allows processors to perform certain numeric operations in wider, faster batches. For most business users, the exact name does not matter. The useful analogy is this:

Two kitchens can follow the same recipe with the same ingredients, but use different industrial mixers. The cake is still "the same cake" in theory, but tiny differences in how the mixing happens can change the texture.

In machine learning pipelines, those tiny differences are floating-point differences. Usually they are harmless. Sometimes, especially after preprocessing, scaling, model thresholds, and risk banding, they can be amplified into very visible business differences.
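To make the floating-point point concrete, here is a deliberately exaggerated sketch. It does not reproduce what NumPy or OpenBLAS do internally; `lane_sum` is a hypothetical helper that mimics one effect of wider SIMD registers: the same numbers are accumulated in a different grouping, and because floating-point addition is not associative, the total can change.

```python
def lane_sum(values, lanes):
    """Sum values using `lanes` interleaved partial sums, then combine the partials."""
    partials = [0.0] * lanes
    for index, value in enumerate(values):
        partials[index % lanes] += value
    total = 0.0
    for partial in partials:
        total += partial
    return total


# Extreme magnitudes chosen so the effect is visible; real drift is far smaller.
values = [1e16, 1.0, -1e16, 1.0]

narrow = lane_sum(values, 1)  # strictly left to right
wide = lane_sum(values, 2)    # two interleaved accumulators

print(narrow)  # 1.0
print(wide)    # 2.0
```

With one accumulator, the first `1.0` is absorbed when it is added to `1e16` and is lost. With two accumulators, the large values cancel in one lane and the small values add up in the other. Both results come from the same inputs and the same arithmetic rules; only the grouping differs.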

In this project, the NumPy/OpenBLAS stack on development selected a SkylakeX execution path, with X86_V4 / AVX512_ICL available. On production, the same stack selected a Haswell path, with only X86_V3 available.

So even with the same Python code and the same library versions, the numeric engine underneath the code was not doing exactly the same work in exactly the same way.

That was the actual gap.

Not the data.

Not the model.

Not the threshold.

The CPU-specific numeric path changed the transformed features, and the feature drift propagated into the fraud scores.

Why Small Numeric Differences Can Become Big Risk Differences

It is reasonable to ask: how can a tiny floating-point difference create thousands of extra alerts?

The answer is that fraud models are often full of boundary decisions.

A transaction is not just scored once and forgotten. It passes through transformations, scaling, model inference, probability or anomaly-score generation, thresholds, risk bands, and sometimes queue-routing rules.

If many transactions sit near a decision boundary, a small numerical shift can push a large population from one side of the line to the other.
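A small sketch of that amplification, using invented numbers. The threshold, the score population, and the size of the shift are all hypothetical and are not taken from the project; the point is only that a shift far too small to mean anything as a risk difference can still flip every score that sits just under the line.

```python
THRESHOLD = 0.5  # hypothetical alert threshold


def alert_count(scores, threshold=THRESHOLD):
    """Count scores at or above the alert threshold."""
    return sum(score >= threshold for score in scores)


# Hypothetical population: 1,000 scores just under the threshold, 10 clearly above.
scores_dev = [THRESHOLD - 1e-12] * 1000 + [0.9] * 10

# The same scores after a uniform shift of two trillionths.
scores_prod = [score + 2e-12 for score in scores_dev]

print(alert_count(scores_dev))   # 10
print(alert_count(scores_prod))  # 1010
```

How many cases flip depends on how densely the population is packed around the boundary, which is why two models with the same accuracy can have very different sensitivity to numeric drift.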

That is especially dangerous in fraud because alert capacity is finite. A model that generates too few cases misses risk. A model that generates too many cases overwhelms investigators and makes the signal harder to use.

So the real requirement is not just model accuracy. It is operational reproducibility.

The organization needs to know that when the model says "high risk," that label came from controlled data, controlled features, controlled scoring logic, and a controlled runtime environment.

The Fix

The fix was to make the scoring environment deterministic at the level where the problem actually lived.

I containerized the fraud pipeline and forced both training and scoring to run under the same effective CPU instruction-set profile.

The important part was not merely "use Docker." Containerization alone does not automatically solve CPU-level numeric differences, because containers still run on the host CPU. If the numeric libraries are allowed to dispatch to whatever CPU instructions are available, two containers can still behave differently on two machines.

So the workaround had to cap the numeric execution path itself.

The Docker image was built from a fixed base:

python:3.12.13-slim-bookworm

Then I pinned the runtime controls that commonly affect reproducibility:

ENV OMP_NUM_THREADS=1
ENV OPENBLAS_NUM_THREADS=1
ENV MKL_NUM_THREADS=1
ENV NUMEXPR_NUM_THREADS=1
ENV PYTHONHASHSEED=0
ENV TZ=UTC
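Settings like these are easy to lose when a job is launched outside the image. A minimal sketch of a startup guard, assuming the scoring entry point can run a check before it imports any numeric library; `EXPECTED_ENV` mirrors the values above, while the function names are hypothetical.

```python
import os

# Mirrors the ENV lines above; adjust if the image changes.
EXPECTED_ENV = {
    "OMP_NUM_THREADS": "1",
    "OPENBLAS_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
    "NUMEXPR_NUM_THREADS": "1",
    "PYTHONHASHSEED": "0",
    "TZ": "UTC",
}


def env_mismatches(environ, expected=EXPECTED_ENV):
    """Return {name: (expected, actual)} for every setting that is missing or different."""
    return {
        name: (want, environ.get(name))
        for name, want in expected.items()
        if environ.get(name) != want
    }


def require_pinned_runtime(environ=None):
    """Fail fast if the process was started without the pinned runtime controls."""
    problems = env_mismatches(os.environ if environ is None else environ)
    if problems:
        raise RuntimeError(f"Runtime controls not pinned: {problems}")


# Example with an explicit mapping instead of the real process environment.
print(env_mismatches({**EXPECTED_ENV, "TZ": "Africa/Cairo"}))
# {'TZ': ('UTC', 'Africa/Cairo')}
```

The thread-count variables are read when the numeric libraries load, so a guard like this is only useful if it runs before those imports.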

The bigger move was excluding AVX512-related NumPy CPU features in two places. First at runtime, through an environment variable:

ENV NPY_DISABLE_CPU_FEATURES="AVX512F,X86_V4,..."

And then at build time, by compiling NumPy from source with a capped CPU dispatch profile:

python -m pip install --no-binary=numpy \
  --config-settings=setup-args=-Dcpu-baseline=min \
  --config-settings=setup-args=-Dcpu-dispatch=max,-X86_V4,-AVX512F \
  numpy==2.4.4

That matters because AVX512 was excluded at build time, not only hidden at runtime.

In practical terms, I stopped the numeric stack from choosing a faster but different execution path on the development server. Both environments now had to operate under the same CPU dispatch ceiling.

Training, diagnostics, and production scoring were then run inside the same Docker image, with the same package versions, same threading model, same timezone behavior, same hash behavior, and same numeric instruction-set ceiling.

Why This Workaround Was Robust

The workaround was not just a patch for one machine.

It changed the deployment contract.

Before the fix, the contract already covered the normal things a production ML owner should control:

  • the same scoring code
  • the same input data
  • the same preprocessing logic
  • the same persisted preprocessing parameters
  • the same model artifact
  • the same package versions
  • constrained runtime threading

That was why the issue was so hard to diagnose. The obvious ML reproducibility controls were already in place. What was still missing was control over the numeric execution path underneath those controls.

After the fix, the contract became stronger:

"Run the same controlled scoring image, with the same data contract, preprocessing artifacts, model artifact, threading model, numeric stack, and CPU dispatch ceiling."

That last part was the difference. The model was no longer only reproducible at the Python and artifact level; it was reproducible at the effective math-engine level too.
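One way to make a contract like this checkable is to record it as a manifest and compare manifest digests between environments. This is a sketch under assumptions, not the project's implementation: the function names, the manifest fields, and the `"X86_V3"` ceiling label are hypothetical choices for illustration.

```python
import hashlib
import json
import os
import platform


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(artifact_paths, env_names, cpu_feature_ceiling):
    """Collect the parts of the runtime contract that can be observed from Python."""
    return {
        "python": platform.python_version(),
        "machine": platform.machine(),
        "artifacts": {path: sha256_of_file(path) for path in artifact_paths},
        "env": {name: os.environ.get(name) for name in env_names},
        # A label agreed by the team (hypothetical), not something detected here.
        "cpu_feature_ceiling": cpu_feature_ceiling,
    }


def manifest_digest(manifest):
    """Stable digest of a manifest; sorted keys keep the JSON text deterministic."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


# Example without artifact files; real use would pass model and preprocessing paths.
manifest = build_manifest([], ["OMP_NUM_THREADS", "TZ"], "X86_V3")
print(manifest_digest(manifest))
```

Storing the digest next to every scoring run gives a quick first check: if development and production report different digests, the environments differ before any score is compared.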

It also removed a dangerous hidden dependency: the assumption that two servers with the same Python packages will produce identical numeric behavior. In most enterprise ML environments, that assumption is not strong enough.

The final setup made the fraud model reproducible across a development server with AVX512 support and a production server without it.

The model no longer depended on whatever optimized SIMD path the host machine happened to expose.

The Leadership Lesson

The technical lesson is about CPU dispatch and floating-point determinism.

The delivery lesson is broader: production machine learning needs evidence, not confidence.

When a scoring pipeline fails, a senior engineer should be able to show where the systems match, where they first diverge, and why that layer is the correct one to fix. Without that evidence chain, teams tend to argue from intuition:

  • "It must be the data."
  • "It must be the model."
  • "It must be the threshold."
  • "It must be a package version."

Any of those could have been true. In this case, none of them were.

The useful move was to keep shrinking the problem:

  1. Prove the raw inputs match.
  2. Prove the schema and distributions match.
  3. Prove the model artifacts match.
  4. Locate the first divergent hash.
  5. Investigate the runtime below the feature transformation layer.
  6. Fix the actual source of nondeterminism.
  7. Make the fix repeatable, not tribal knowledge.

That is the difference between debugging and production ownership.

Debugging finds the bug. Ownership changes the system so the same class of issue is less likely to return.

The Plain English Summary

The simplest explanation is this:

The model was not "changing its mind." The two environments were doing the model's math differently because their processors supported different low-level shortcuts.

The inputs were the same. The model was the same. But the machine-level math path was not the same, and that was enough to change the transformed features and risk scores.

The fix was to put the model inside a controlled production image and prevent the numeric libraries from using CPU-specific shortcuts that were available in one environment but not the other.

After that, development and production used the same effective math engine.

For a fraud program, that is the part that matters: the risk team can trust that a score means the same thing wherever it is produced.

Takeaways

The main lesson is simple: reproducibility is not only about code.

For production ML, reproducibility includes:

  • the source data
  • the feature transformation logic
  • the preprocessing artifacts
  • the model artifact
  • dependency versions
  • threading behavior
  • timezone and hash behavior
  • CPU instruction dispatch
  • the container or runtime contract

Most teams check the first five.

The painful cases often live in the last four.

This issue was a reminder that mature ML delivery is not just about building a good model. It is about making sure the model can be trusted under real production constraints, on real infrastructure, with real operational consequences.

That is where engineering discipline becomes part of the model.