AMR.ALFAYOUMY
// CHALLENGE & SOLUTION · SOCIAL NETWORK ANALYSIS

From 300M monthly calls to customer communities and influence maps.

A production analytics field note on turning 300M+ monthly calls into a controllable open-source graph model with stronger cohesion and half the SLA.

Process Overview
From call events to trusted business outputs

  01. Raw calls: 300M+ monthly voice interactions.
  02. Nodes & weighted links: Customers become nodes; duration becomes relationship strength.
  03. Global graph: Build the full 30-35M node network.
  04. Influencer detection: Score who drives communication.
  05. Community detection: Find natural customer groups with PLM.
  06. Recursive split: Refine large groups while protecting dense structures.
  07. Insights: Validate cohesion, stability, influencers, and community behavior.
  08. Model feed: Send SNA outputs into downstream customer and Customer Value Management (CVM) models.

Replacing a proprietary analytics procedure is rarely just a technology migration.

The difficult part is not proving that open-source tools can run an algorithm. The difficult part is proving that a new system can carry the same business meaning as the old one, at the same production seriousness, while materially improving the result.

That was the core challenge behind this social network analysis model.

The existing workflow used SAS PROC NETWORK to identify customer influencers and communities from local voice interactions. It was already automated and already useful. Those outputs supported customer engagement, retention, and segmentation decisions. In other words, the model was not an academic graph exercise. It was part of how the business understood who influences whom.

That is why the replacement bar was high. This was not a case of taking a manual process and making it automated. SAS had already solved a real business problem. My job was to re-implement the model in open source, preserve trust in the outputs, and improve the production characteristics enough to justify the move.

The assignment was not "rewrite SAS in Python." It was "rebuild a trusted automated graph model at telecom scale, then make it faster and better."

The Problem

The organization was modernizing its AI and analytics platform, and the SNA model needed to move from SAS into a Dataiku-based production environment.

On paper, that sounds straightforward: find an open-source graph library, rebuild the model, compare the outputs, and deploy it.

In practice, social network analysis makes that much harder.

The monthly graph was massive: more than 300 million voice calls had to be reduced into a network of roughly 30-35 million customer nodes and the weighted links between them. Each link represented real communication behavior, not a synthetic relationship. At that size, small modeling choices become production decisions. A small difference in how calls are aggregated, how weights are interpreted, or how large communities are handled can change the final segmentation.

For business users, those changes show up as different influencers, different communities, and different retention actions.

So the migration had two goals that had to be held together:

  • preserve enough fidelity to the SAS baseline that the business could trust the new outputs
  • improve the production result, especially runtime and community quality

Matching the old model alone would not have been enough. The new system had to respect the historical baseline, but it also had to earn its place by cutting the monthly SLA from roughly 3 hours to roughly 1.5 hours and producing communities with stronger internal cohesion.

The Business Meaning Of The Model

The SNA model answers two practical questions.

First: who are the people who drive conversation?

These are customers whose outbound communication patterns make them influential within the network. They matter because their behavior can shape groups around them. Losing the right customer may affect more than one account; it may affect a local cluster of relationships.

Second: which customers naturally belong together?

Communities are groups of customers whose interactions are stronger with each other than with the outside network. For commercial teams, that can support segmentation, campaign design, retention planning, and customer value management.

The model therefore had to be technically sound, but also explainable at the level of customer relationships. A community cannot be treated as a random algorithmic label. It has to behave like a meaningful group.

The Approach

I started by treating SAS as the behavioral baseline, not as the implementation blueprint.

That distinction mattered. The objective was to reproduce the business behavior of PROC NETWORK using open-source components, not to recreate every proprietary internal decision SAS makes behind the scenes.

I compared several graph analytics options across the criteria that mattered for this workload: Dataiku orchestration, Teradata integration, directed weighted relationships, influencer metrics, community detection, scalability, and the ability to support a monthly production SLA.

NetworKit became the best fit because it gave the project a strong graph analytics engine without locking the model back into a proprietary analytics stack. It could support weighted graph processing, large-scale community detection, and production execution inside the new platform.

But the library choice was only one part of the design. The real work was deciding where each kind of computation belonged.

I designed the model as a staged graph system. The database handled the large relational reduction first: turning 300M+ raw monthly calls into clean, weighted customer relationships. The graph engine then handled the network-native work: influence, community structure, and community quality. The orchestration layer handled repeatability, failure boundaries, scheduling, and production promotion.

That separation was deliberate. If the graph engine receives raw telecom volume, the model wastes time doing work the database can do faster. If the database is forced to act like a graph engine, the community logic becomes awkward and hard to evolve. The production design worked because each layer did the job it was naturally good at.
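
A minimal sketch of that relational reduction, written in Python only to show the logic. In the design described here this step ran in the database, and the record layout `(caller, callee, duration_seconds)`, the function name, and the filtering rule are assumptions for illustration.

```python
from collections import defaultdict

def aggregate_calls(call_records):
    """Collapse raw call events into directed weighted links.

    call_records: iterable of (caller, callee, duration_seconds).
    Returns {(caller, callee): total_duration}.
    """
    links = defaultdict(float)
    for caller, callee, duration in call_records:
        # Assumption: self-calls and zero-length events carry no relationship signal.
        if caller == callee or duration <= 0:
            continue
        links[(caller, callee)] += duration
    return dict(links)

calls = [
    ("A", "B", 60), ("A", "B", 30), ("B", "A", 45),
    ("A", "C", 10), ("C", "C", 5),
]
links = aggregate_calls(calls)
```

Five raw events collapse into three weighted links, which is the shape of input the graph stage is meant to receive.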

How I Designed The Model

The design had four major principles.

First, reduce the problem before graph processing. The raw monthly call volume was too large to treat casually, so the model converted 300M+ call records into a weighted relationship network before the graph algorithms ran. That protected runtime and made the graph stage focus on actual customer relationships rather than raw event noise.

Second, preserve direction where it mattered and simplify where it helped. Influencer detection depends on directional behavior: who initiates communication and how strong that outbound behavior is. Community detection, on the other hand, is about mutual structure, so the model used an undirected view of relationship strength for that stage. This was not a technical convenience; it matched the meaning of the two business questions.

Third, treat large communities as a modeling problem, not a nuisance. Large groups in a telecom graph can be real: dense stars, cliques, and highly connected local structures. Forcing every group under a size limit can create artificial fragments. The new model applied increasing resolution only when the network structure supported a meaningful split, then rejected any split that produced one-person fragments. If a dense structure could not be split cleanly, the model preserved it instead of destroying it.

Fourth, validate the model as a business system. I did not rely on one parity number. I compared influencer overlap, community size profiles, largest-community behavior, internal versus external communication, stickiness, triangle density, migration match, and month-over-month stability. That made the migration evidence-based rather than opinion-based.
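
The second principle, directed for influence and undirected for communities, can be sketched as two views over the same links. This is an illustration under assumptions: outbound strength (summed outgoing duration) stands in for the influencer score, and the undirected weight is taken as the sum of both directions. The source does not state either formula.

```python
from collections import defaultdict

def outbound_strength(directed_links):
    """Sum outgoing weight per caller (a hypothetical influence proxy)."""
    strength = defaultdict(float)
    for (caller, _callee), weight in directed_links.items():
        strength[caller] += weight
    return dict(strength)

def to_undirected(directed_links):
    """Merge A->B and B->A into a single undirected edge weight."""
    undirected = defaultdict(float)
    for (u, v), weight in directed_links.items():
        key = (u, v) if u <= v else (v, u)  # canonical ordering of endpoints
        undirected[key] += weight
    return dict(undirected)

directed = {("A", "B"): 90.0, ("B", "A"): 45.0, ("A", "C"): 10.0}
influence = outbound_strength(directed)
mutual = to_undirected(directed)
```

The directed view keeps who initiated the communication, while the undirected view keeps only how strongly two customers are tied together.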

Solving The Community Split Problem

The hardest part of the model was not finding communities. It was deciding when not to split them.

The SAS baseline used a maximum community size rule, so the open-source model needed to respect that behavior. But a strict size rule can create bad business output if it breaks a naturally dense group into artificial fragments.

To solve that, I designed the split logic around validity, not just size.

When a community was too large, the model tried to find a stronger internal partition. But before accepting the result, it inspected the proposed groups. If the split produced a singleton, the model rejected it. That forced the algorithm to either find a better split or keep the original community intact.

For dense stars and cliques, that choice was important. In a star, many people are genuinely organized around a central person. In a clique, the group is tightly connected across many members. Breaking those structures just to satisfy a size threshold would make the output look cleaner while making it less true.

For the business, that would be worse than leaving the community oversized. Retention and segmentation teams do not need mathematically convenient fragments. They need groups that reflect real customer behavior. Preserving an unsplittable dense community is a better business decision than manufacturing smaller groups that no longer mean anything.
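
The split policy above can be sketched as a small recursive function. It is a sketch under stated assumptions, not the production implementation: `partition_fn` is a hypothetical stand-in for a real community detector called at a given resolution, the resolution schedule is invented, and the size limit of 100 is read from the "100+" label in the policy summary.

```python
def refine_community(members, partition_fn, max_size=100,
                     resolutions=(1.0, 2.0, 4.0, 8.0)):
    """Split an oversized community only when a valid split exists.

    partition_fn(members, resolution) -> list of disjoint sets covering members.
    Returns a list of communities; the original is kept if no split is valid.
    """
    members = set(members)
    if len(members) <= max_size:
        return [members]
    for resolution in resolutions:
        parts = [set(p) for p in partition_fn(members, resolution)]
        if len(parts) < 2:
            continue  # nothing was split at this resolution
        if any(len(p) < 2 for p in parts):
            continue  # reject: the split produced a singleton
        # Accept the split, then refine any part that is still oversized.
        result = []
        for part in parts:
            result.extend(refine_community(part, partition_fn, max_size, resolutions))
        return result
    return [members]  # no clean split found: preserve the dense structure

def toy_partition(members, resolution):
    """Toy detector: low resolution peels off one node, higher ones halve the group."""
    ordered = sorted(members)
    if resolution < 2.0:
        return [set(ordered[:1]), set(ordered[1:])]
    half = len(ordered) // 2
    return [set(ordered[:half]), set(ordered[half:])]

def star_partition(members, resolution):
    """Toy detector for a dense star that never splits."""
    return [set(members)]

groups = refine_community(range(240), toy_partition)
kept = refine_community(range(240), star_partition)
```

With the toy detector, the singleton split at the first resolution is rejected and the group ends up as four communities of 60. The unsplittable star comes back as one community of 240 rather than being cut to fit the limit.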

Community Split Policy
Size target with signal protection

  • Oversized community (100+): Try to find a meaningful internal partition instead of blindly cutting the group.
  • Reject bad split (1x): If any proposed subgroup becomes a singleton, reject the split and search for a better one.
  • Preserve dense signal (OK): If a star or clique cannot be split cleanly, keep it intact rather than inventing weak groups.

The target was not smaller communities at any cost. The target was useful customer groups, which meant preventing one-person communities and preserving dense structures when they represented real behavior.

Where The New Model Improved On SAS

The open-source version was built to respect the SAS baseline, but it was not limited to being a replica.

The first major improvement was SLA.

The monthly run time moved from roughly 3 hours to roughly 1.5 hours. That matters in production because SNA is usually one part of a larger customer value workflow. A shorter SLA gives downstream teams more time to validate, activate, and use the output before the business window closes.

The SLA improvement came from three design choices working together.

First, the model reduced raw telecom volume before graph processing, so the graph engine worked on weighted relationships instead of raw call events. Second, it used the Parallel Louvain Method (PLM) for the heavy community detection step, which made the core graph algorithm fit the scale of the monthly network. Third, the expensive stages ran on controlled Kubernetes resources, and the large-community refinement work was partitioned so that multiple workers could process difficult communities in parallel instead of waiting serially on one long tail.

The result was not just faster code. It was a production execution plan: database reduction, graph-native parallelism, K8s resource isolation, and parallel refinement of the hard cases.
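
One way to partition refinement work so that no single worker inherits the long tail is a greedy largest-first assignment. The sketch below is a hypothetical scheduling heuristic, not the scheme used in production. The source says only that the work was partitioned across workers, and the community IDs and sizes are invented.

```python
import heapq

def assign_to_workers(community_sizes, n_workers):
    """Assign communities, largest first, to the currently least-loaded worker.

    community_sizes: {community_id: node_count}. Returns (batches, loads).
    """
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker index)
    heapq.heapify(heap)
    batches = [[] for _ in range(n_workers)]
    ordered = sorted(community_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for cid, size in ordered:
        load, worker = heapq.heappop(heap)
        batches[worker].append(cid)
        heapq.heappush(heap, (load + size, worker))
    loads = [0] * n_workers
    for load, worker in heap:
        loads[worker] = load
    return batches, loads

sizes = {"c1": 900, "c2": 500, "c3": 400, "c4": 300, "c5": 100}
batches, loads = assign_to_workers(sizes, 2)
```

Node count is used here as a rough proxy for refinement cost. A real scheduler might weight by edge count or by observed runtime instead.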

Production Execution
Parallelism and resource control

  • In-DB preprocessing: Aggregate raw calls into weighted links before graph processing.
  • K8s graph, PLM community detection: Run the graph-native heavy stage with controlled CPU and memory.
  • Parallel large-community refinement: Process difficult oversized groups across workers instead of one serial tail.

SAS baseline: ~3h
Open-source model: ~1.5h

The runtime gain came from placing each workload in the right execution layer: relational reduction in the database, graph-native processing on controlled Kubernetes resources, and parallel refinement for the hard communities.

The second major improvement was community cohesion.

Stickiness measures how much communication stays inside the communities the model creates. Higher stickiness means the groups are not just mathematically tidy; they are better aligned with actual customer interaction. Across the validation months, the new implementation produced higher stickiness than SAS, meaning more total communication duration stayed inside the assigned communities.
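
As a concrete reading of that definition, stickiness can be computed as the share of total link weight whose two endpoints fall in the same community. This is a minimal sketch with invented sample data; the function and variable names are illustrative.

```python
def stickiness(undirected_links, community_of):
    """Share of total link weight whose two endpoints share a community."""
    total = 0.0
    internal = 0.0
    for (u, v), weight in undirected_links.items():
        total += weight
        if community_of[u] == community_of[v]:
            internal += weight
    return internal / total if total else 0.0

links = {("A", "B"): 135.0, ("A", "C"): 10.0, ("C", "D"): 55.0}
community_of = {"A": 1, "B": 1, "C": 2, "D": 2}
score = stickiness(links, community_of)
```

In this toy graph, 190 of the 200 weighted units stay inside a community, so the score is 0.95.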

Triangle density improved too. A triangle is a closed communication loop: A talks to B, B talks to C, and C talks back to A. More triangles mean the community is not only centered around one strong link; it has a richer internal network. In the validation rows compared here, the new model identified about 14.1% more of these closed loops, which suggested stronger local cohesion rather than looser grouping.
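
A minimal sketch of counting closed loops inside communities, assuming a triangle counts only when all three members share a community label. The source does not give the exact density formula, so this shows the raw count only, on invented data.

```python
from collections import defaultdict
from itertools import combinations

def count_internal_triangles(edges, community_of):
    """Count triangles whose three members sit in the same community."""
    neighbors = defaultdict(set)
    for u, v in edges:
        # Keep only edges that stay inside one community.
        if u != v and community_of[u] == community_of[v]:
            neighbors[u].add(v)
            neighbors[v].add(u)
    triangles = 0
    for node, adjacent in neighbors.items():
        # Count each triangle once, from its smallest node.
        larger = sorted(n for n in adjacent if n > node)
        for a, b in combinations(larger, 2):
            if b in neighbors[a]:
                triangles += 1
    return triangles

edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E"), ("C", "E")]
community_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
n_triangles = count_internal_triangles(edges, community_of)
```

The loop A, B, C is counted. The loop C, D, E is not, because C sits in a different community from D and E.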

The third improvement was model control.

With the new implementation, the logic around large communities became explicit, testable, and adjustable. The model could keep splitting oversized groups when the network supported it, but it also had guardrails against creating meaningless one-person communities just to satisfy a size target.

That matters because business needs change. If the client later wants tighter communities for campaign targeting, looser communities for retention coverage, a different tolerance for oversized groups, or a different policy for preserving dense structures, the model now has visible controls that can be tuned deliberately. The client is no longer limited to the behavior exposed by a proprietary procedure; they can decide how the model should behave as the commercial use case evolves.

The modular design also makes future rework less risky. If the client later needs to replace one component, such as the community detection method, the influencer scoring logic, the validation layer, or the export path, that component can be changed and integrated back into the rest of the pipeline without rewriting the whole model. That is an important advantage for a production analytics system whose business use cases will keep evolving.

The guardrail against one-person communities may sound like a small modeling choice, but it has real business value. A community of one is usually not a community in the customer-management sense. It may satisfy a technical constraint, but it weakens the output for downstream teams.

The fourth improvement was platform ownership.

The original SAS model was already automated, so the win was not simply "now it runs on a schedule." The win was that the automation became part of the modern AI platform: controlled compute, clear promotion from UAT to production, measurable resource usage, and a workflow that could be operated by the broader platform team.

The fifth improvement was observability.

Instead of treating the model as a black box that emits final tables, the new workflow measured the health of the graph across more than 20 validation metrics: total nodes, total relationship weight, community counts, size distribution, oversized groups, largest-community behavior, internal versus external communication, stickiness, triangle density, migration match, and month-over-month stability.

That gave the team a stronger answer to the most important migration question: "Are we getting the same kind of business signal?"

Why The Validation Had To Be Broad

For a migration like this, one metric can be dangerously comforting.

A model can match influencer counts while damaging communities. It can match average community size while changing the largest structures. It can improve runtime while quietly producing unstable month-to-month assignments. None of those would be acceptable for a production SNA model.

That is why the validation covered more than 20 metrics. The most important ones answered different risk questions:

  • Influencer parity checked that the model still found the same high-impact customers.
  • Community stickiness checked that more communication stayed inside the assigned groups.
  • Triangle density checked that the groups had stronger internal relationship loops.
  • Completeness showed that most SAS community structures remained intact inside the new model.
  • Homogeneity showed how cleanly new communities mapped back to the old SAS groupings.
  • Month-over-month stability showed the model was responsive without becoming volatile.
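
Completeness and homogeneity in the list above can be made concrete with the standard entropy-based definitions, the same ones behind scikit-learn's `homogeneity_score` and `completeness_score`. Whether the validation used exactly these formulas is an assumption, and the labels below are invented.

```python
import math
from collections import Counter

def _entropy(counts, n):
    """Shannon entropy of a label distribution given as counts."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def _conditional_entropy(joint, given_counts, n):
    """H(X | Y) from joint counts {(x, y): count} and the counts of Y."""
    return -sum(
        (count / n) * math.log(count / given_counts[y])
        for (_x, y), count in joint.items()
    )

def homogeneity_completeness(baseline_labels, new_labels):
    """Entropy-based homogeneity and completeness of new vs baseline labels."""
    n = len(baseline_labels)
    base = Counter(baseline_labels)
    new = Counter(new_labels)
    joint_bn = Counter(zip(baseline_labels, new_labels))  # (baseline, new)
    joint_nb = Counter(zip(new_labels, baseline_labels))  # (new, baseline)
    h_base, h_new = _entropy(base, n), _entropy(new, n)
    # Homogeneity: each new community draws from a single baseline community.
    homogeneity = 1.0 if h_base == 0 else 1.0 - _conditional_entropy(joint_bn, new, n) / h_base
    # Completeness: each baseline community lands in a single new community.
    completeness = 1.0 if h_new == 0 else 1.0 - _conditional_entropy(joint_nb, base, n) / h_new
    return homogeneity, completeness

sas_labels = ["s1", "s1", "s1", "s1", "s2", "s2"]
new_labels = ["n1", "n1", "n2", "n2", "n3", "n3"]
h, c = homogeneity_completeness(sas_labels, new_labels)
```

In this toy case the new model splits one baseline community in two. Homogeneity stays at 1.0 because every new group maps to one baseline group, while completeness falls below 1.0 because a baseline group no longer stays intact.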

The stability result was especially important. The model kept average community size around 23 nodes, median size around 16-17 nodes, and a tight spread across the validation months. As the graph grew from roughly 32.4 million to 32.8 million nodes, the number of communities scaled predictably instead of swinging unpredictably.

User stability hovered around 40%, while influencer stability was close to 60%. That balance was healthy. It meant the model could react to real monthly behavior changes, while the centers of influence were much more stable than the general population.

The Results

The new model achieved very high parity with the historical baseline for influencer detection.

Across monthly validation windows, it scanned roughly five million influencers and matched the SAS output at better than 99.9%. The remaining differences were small enough to investigate directly, and were consistent with expected numerical precision differences rather than a material business drift.
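
A minimal sketch of an influencer parity check, assuming "match" means the share of baseline influencers that the rebuilt model also flags. The source does not define the calculation, and the customer IDs are invented.

```python
def influencer_match_rate(baseline_ids, new_ids):
    """Fraction of baseline influencers that the new model also flags."""
    baseline, new = set(baseline_ids), set(new_ids)
    if not baseline:
        return 1.0
    return len(baseline & new) / len(baseline)

def unmatched(baseline_ids, new_ids):
    """IDs present on only one side, listed for direct investigation."""
    baseline, new = set(baseline_ids), set(new_ids)
    return sorted(baseline - new), sorted(new - baseline)

sas = ["c01", "c02", "c03", "c04"]
rebuilt = ["c01", "c02", "c03", "c09"]
rate = influencer_match_rate(sas, rebuilt)
only_sas, only_new = unmatched(sas, rebuilt)
```

Returning the unmatched IDs alongside the rate is what makes the residual differences small enough to inspect one by one.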

  • Graph scale: 300M+ monthly calls reduced into a 30-35M node graph
  • Influencer match: 99.9% baseline parity across roughly five million influencers
  • Community stickiness: +0.6pp higher inside-community call share versus SAS
  • Triangle density: +14.1% more closed communication loops in the compared months
  • Singletons: 0% one-person communities in the validated output
  • Runtime: 50% SLA reduction from roughly 3h to 1.5h

The community results were also stable month over month, with the total number of communities tracking the growth of the graph rather than jumping erratically. The largest communities remained recognizable compared with SAS, which was important because those large structures often represent real network behavior rather than random noise.

The new model also produced zero singleton communities in the validated output. That was intentional. It meant the algorithm was not breaking customer groups into isolated labels just to make a technical constraint look cleaner.

In quality terms, the Dataiku implementation showed higher community stickiness: more of the total communication duration happened inside the communities it created. It also identified materially more closed communication loops than the SAS baseline, suggesting that the new implementation was capturing cohesive local structures more effectively.

Why This Was Not Just A Migration

A migration moves a workload from one platform to another.

This project changed the delivery contract.

Before, the business depended on a proprietary automated procedure to produce a familiar output. After the rebuild, the business had an open-source graph model with comparable influencer results, stronger community cohesion, half the SLA, clearer validation metrics, and a production workflow that fit the modern AI platform.

That matters because analytical trust has two sides.

One side is statistical trust: do the outputs behave like the old trusted baseline?

The other side is operational trust: can the system run every month, under controlled resources, with clear failure boundaries and measurable output quality?

The new implementation had to satisfy both. A model that is accurate but operationally painful is not finished. A pipeline that is elegant but changes the business meaning of the output is not acceptable either.

The Plain English Summary

The simplest version is this:

The old SAS model found important customers and customer groups by looking at who called whom, how strongly they were connected, and which groups naturally formed in the network. It was automated, trusted, and already valuable.

I rebuilt that capability using open-source graph analytics inside the modern Dataiku platform.

The new model produced nearly the same influencer results as SAS, kept community behavior stable across months, improved stickiness and triangle density, avoided meaningless one-person communities, preserved dense stars and cliques when splitting would damage the signal, and cut the monthly SLA from about 3 hours to about 1.5 hours.

So the business did not just get a different technical stack. It got a model that preserved the trusted signal while becoming easier to run, validate, and evolve. More importantly, it gained finer control over the model's behavior: the ability to tune community size, split strictness, oversized-community handling, validation thresholds, and individual model components as business needs change.

Takeaways

The main lesson is that replacing proprietary analytics is not only a code conversion exercise.

For a serious production model, the migration plan has to cover:

  • the business meaning of each output
  • the historical baseline users already trust
  • the tolerance for expected differences
  • the operating SLA
  • the validation metrics that prove continuity
  • the new controls that let the client tune future model behavior
  • the modular boundaries that make future component replacement possible

In this case, the open-source implementation did not win because it was open source. It won because it preserved the right behavior, exposed the right controls, and made the production system easier to own.

That is the standard I care about in platform modernization: not whether the new stack is fashionable, but whether it gives the business more confidence, more control, and less hidden dependency.