NCC BULGARIA

NCC presenting the success story

NCC-Bulgaria was founded by the Institute of Information and Communication Technologies at
the Bulgarian Academy of Sciences, Sofia University “St. Kliment Ohridski” and the University of
National and World Economy. NCC-Bulgaria focuses on:

  • Creating a roadmap for successful work in the fields of HPC, big data analysis and AI,
  • Analyzing existing competencies and facilitating the use of HPC/HPDA/AI in Bulgaria, and
  • Raising awareness and promoting HPC/HPDA/AI use in companies and the public sector.

Scientific/governmental/private partners involved:

ConceptDigital Ltd is a rapidly expanding Bulgarian firm specialising in digital marketing analytics. It manages multi-platform campaigns with millions of impressions every day. Working in a data-heavy and quickly changing market, the company aimed to enhance the integrity and effectiveness of its advertising services by tackling one of the most persistent threats in the industry: ad fraud.

Technical/scientific Challenge:

The challenges in the domain of digital advertising fraud are multifaceted and increasingly complex, threatening the integrity of campaign analytics and draining advertising budgets. Fraudulent actors employ automation, camouflage, and distributed tactics – such as bots, click farms, domain spoofing, and session hijacking – to generate fake impressions, clicks, and conversions that closely mimic legitimate user behaviour, making detection difficult. Campaigns generate millions of high-volume, high-velocity interactions daily across platforms, devices, and formats, resulting in noisy and heterogeneous datasets that overwhelm traditional detection systems.

Striking a balance between false positives and false negatives is critical: over-filtering can exclude genuine traffic, while under-filtering allows fraud to persist. Most existing tools operate in silos without a unified detection framework, limiting the ability to correlate cross-channel behaviour and identify sophisticated fraud patterns.

Compounding these issues are strict privacy and regulatory constraints, particularly under GDPR, which require anonymization and data protection by design. Addressing these challenges demands scalable, adaptive, and privacy-preserving solutions – leveraging machine learning to automate behaviour analysis and enable real-time fraud detection.

Key technical challenges:

A deeper investigation into the advertising and campaign analytics environment revealed several interconnected technical limitations that hindered robust fraud detection and prevention:

    • Fragmented and Heterogeneous Data Streams: Data was sourced from various ad platforms (Google Ads, Meta, DSPs, affiliate networks), each producing events with distinct formats, inconsistent timestamps, and schema variability. The absence of centralized ingestion and standardization mechanisms impaired data quality, context building, and cross-platform correlation.
    • Dynamic and Obfuscated Fraud Patterns: Fraudulent behaviour was no longer limited to obvious bot traffic. Sophisticated actors emulated legitimate sessions, distributed activities over time, and manipulated user-agent strings and geolocation data. Without historical learning or behavioural baselining, these attacks often went unnoticed.
    • Scalability Constraints on Real-Time Detection: Manual review or batch-model scoring was insufficient for environments receiving tens of thousands of events per second. Without stream processing and scalable ML deployment, detection was either too slow or too coarse.
    • Sparse and Noisy Ground Truth for Supervised Learning: Labelled fraud cases were scarce, imbalanced, or incomplete. While certain campaigns had confirmed fraud events, others lacked annotation. This challenged the training of robust supervised models and increased reliance on semi-supervised or unsupervised learning.
    • Lack of Interpretability and Trust in AI Outputs: Existing detection models operated as “black boxes,” raising concerns among stakeholders regarding reliability and auditability. Human-in-the-loop teams required insight into why specific sessions were flagged, demanding transparent and explainable AI models.
    • Compliance with GDPR and Data Privacy: User sessions often contained IP addresses, device identifiers, user-agents, and location metadata. Ensuring this data was anonymised – via hashing, tokenization, and suppression – was essential before training models or exporting data for analytics, especially under GDPR obligations.

Solution: 

A real-time fraud detection pipeline was developed to process high-volume, heterogeneous data from digital advertising sources. The architecture integrates structured validation, streaming analytics, multi-tier scoring, and privacy-preserving mechanisms. Each layer was selected to address practical limitations in data quality, latency, scalability, interpretability, and compliance with GDPR.

Event data originating from digital marketing platforms – such as Google Ads, Meta, DSPs, and affiliate networks – is ingested through Apache Kafka, which ensures stable message ordering, horizontal scalability, and decoupling of producers and consumers. Incoming events are parsed and normalized at the stream level to support unified processing and feature extraction.
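As a dependency-free illustration of the normalization step, the sketch below maps heterogeneous platform events onto one unified schema. The field names and platform mappings are hypothetical; real ad-network schemas differ, and in the deployed pipeline this logic runs inside the Kafka consumers rather than plain Python.

```python
from datetime import datetime, timezone

# Hypothetical per-platform field mappings; real ad-network schemas differ.
FIELD_MAP = {
    "google_ads": {"ts": "event_time", "uid": "gclid",    "kind": "event_type"},
    "meta":       {"ts": "timestamp",  "uid": "fbclid",   "kind": "action"},
    "affiliate":  {"ts": "click_ts",   "uid": "click_id", "kind": "type"},
}

def normalize(platform: str, raw: dict) -> dict:
    """Map a raw platform event onto one unified schema for downstream features."""
    m = FIELD_MAP[platform]
    ts = raw[m["ts"]]
    # Normalize both epoch seconds and ISO-8601 strings to UTC ISO timestamps.
    if isinstance(ts, (int, float)):
        ts = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    return {
        "platform": platform,
        "timestamp": ts,
        "user_id": raw[m["uid"]],
        "event_type": raw[m["kind"]].lower(),
    }

event = normalize("meta", {"timestamp": 1700000000, "fbclid": "abc123", "action": "Click"})
```

Once every platform emits the same four fields, cross-platform correlation and feature extraction can share a single code path.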

Before reaching the detection layer, all streaming events pass through PyDeequ-based data validation pipelines. These checks enforce constraints such as timestamp alignment, field completeness, and structural integrity. The validation layer prevents malformed or poisoned data from reaching downstream machine learning models or analytics systems, acting as a safeguard for the rest of the pipeline.
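PyDeequ expresses such constraints declaratively over Spark DataFrames; as a dependency-free stand-in for the same idea, the sketch below checks completeness and timestamp alignment on a single event. Field names and the one-day freshness window are illustrative assumptions, not the production rule set.

```python
REQUIRED_FIELDS = ("timestamp", "user_id", "event_type")  # hypothetical schema

def validate(event: dict, now_epoch: float) -> list:
    """Return a list of constraint violations; an empty list means the event passes."""
    errors = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not event.get(field):
            errors.append(f"missing:{field}")
    # Timestamp alignment: reject events from the future or older than one day.
    ts = event.get("timestamp")
    if isinstance(ts, (int, float)) and not (now_epoch - 86400 <= ts <= now_epoch):
        errors.append("timestamp_out_of_range")
    return errors

ok = validate({"timestamp": 1000.0, "user_id": "u1", "event_type": "click"}, now_epoch=1500.0)
bad = validate({"timestamp": 9999.0, "user_id": ""}, now_epoch=1500.0)
```

Events with a non-empty violation list are quarantined rather than scored, which is what keeps malformed or poisoned records away from the models.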

Validated events are processed using Apache Spark Structured Streaming, operating in 500 ms micro-batches. Within this processing layer, a range of real-time features is computed, including:

  • Entropy measures (e.g., IP shifts, user-agent diversity),
  • Rolling click rates and referrer bursts, and
  • Session duration and interaction frequency patterns.
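The entropy features above can be sketched in a few lines. The example below computes Shannon entropy over the user-agent strings seen in a session window; a bot rotating user agents produces markedly higher entropy than a stable human session. The sample values are illustrative only.

```python
import math
from collections import Counter

def shannon_entropy(values) -> float:
    """Shannon entropy (in bits) of a sequence; high values signal erratic diversity."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A single stable user agent yields zero entropy; a bot rotating through
# eight distinct agents yields log2(8) = 3 bits.
stable = ["Mozilla/5.0"] * 8
rotating = ["UA-%d" % i for i in range(8)]
```

The same measure applies to IP shifts within a session, and it needs only a per-window counter rather than full session reconstruction.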

These features are specifically designed to detect subtle anomalies common in fraud behaviour, without the need for full session reconstruction or expensive joins.

The fraud detection logic operates in two tiers, powered by machine learning algorithms specifically selected for their efficiency, interpretability, and accuracy:

  • Tier 1 uses Isolation Forest, an unsupervised anomaly detection algorithm, to flag structural and behavioural outliers in real time. It requires no labelled data and is efficient enough to operate within the streaming pipeline with sub-10 ms inference latency.
  • Tier 2 applies a LightGBM model trained with Focal Loss, a supervised gradient boosting classifier optimized for highly imbalanced data. This model is used to evaluate borderline cases and delivers high precision in detecting fraudulent activity.
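The routing between the two tiers can be sketched as follows. The scorer functions here are simple placeholders standing in for the Isolation Forest and LightGBM models, and the thresholds are illustrative assumptions; the point is the escalation logic, in which only borderline Tier 1 cases pay the cost of the supervised model.

```python
def two_tier_score(features: dict, tier1_score, tier2_score,
                   anomaly_cut: float = 0.6, borderline: float = 0.4) -> str:
    """Route an event through the two detection tiers.

    tier1_score / tier2_score are placeholders standing in for the
    Isolation Forest and LightGBM models; each returns a score in [0, 1].
    """
    s1 = tier1_score(features)
    if s1 >= anomaly_cut:     # clear structural outlier: flag immediately
        return "fraud"
    if s1 >= borderline:      # borderline case: escalate to the supervised model
        return "fraud" if tier2_score(features) >= 0.5 else "legit"
    return "legit"            # well inside normal behaviour

# Illustrative toy scorers (not the production models):
t1 = lambda f: f.get("ua_entropy", 0.0) / 3.0
t2 = lambda f: 0.9 if f.get("click_rate", 0) > 10 else 0.1
```

This tiering is what keeps average inference latency low: the cheap unsupervised check handles the bulk of traffic, and the heavier classifier sees only the ambiguous minority.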

Both models are deployed within the Spark runtime environment, allowing seamless scoring directly on the stream. This architecture enables real-time fraud detection with minimal latency and high throughput.

To ensure consistency and traceability between model training and live inference, features are managed via an integrated Feature Store. This guarantees that the same transformation logic used during model training is applied in real-time inference, minimizing the risk of skew or inconsistent behaviour.
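A minimal sketch of the underlying idea: feature transforms live in one registry that both the offline training job and the streaming scorer call, so the two paths cannot drift apart. The registry mechanism and the `click_rate` feature are illustrative, not the actual Feature Store API.

```python
# One registry of named feature transforms; the training job and the streaming
# scorer both call build_vector, so transformation logic cannot drift apart.
FEATURES = {}

def feature(name):
    """Decorator registering a transform under a stable feature name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register

@feature("click_rate")
def click_rate(event: dict) -> float:
    return event["clicks"] / max(event["duration_s"], 1)

def build_vector(event: dict) -> dict:
    """Apply every registered transform; identical at train and serve time."""
    return {name: fn(event) for name, fn in FEATURES.items()}

vec = build_vector({"clicks": 30, "duration_s": 10})
```

Production feature stores add versioning and point-in-time correctness on top of this, but the single-source-of-truth principle is the same.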

To comply with data protection regulations, all data handling follows privacy-by-design principles. This includes:

  • Hashing and tokenization of session identifiers, IP addresses, and device fingerprints,
  • Application of differential privacy techniques in behavioural aggregations, and
  • Dynamic suppression of high-risk fields in audit reports and exported datasets.

These protections ensure that no personally identifiable information is retained or exposed beyond what is operationally required.
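The hashing and tokenization step can be sketched with the standard library alone. A keyed hash keeps identifiers stable for joins while making them irreversible without the key; the key name and the list of sensitive fields below are illustrative assumptions.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"  # hypothetical; a managed secret in production

def pseudonymize(value: str) -> str:
    """Keyed hash of an identifier: stable for joins, irreversible without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def anonymize_event(event: dict, sensitive=("ip", "device_id", "session_id")) -> dict:
    """Replace sensitive fields with tokens before the event leaves the pipeline."""
    return {k: pseudonymize(v) if k in sensitive else v for k, v in event.items()}

out = anonymize_event({"ip": "203.0.113.7", "device_id": "d-42", "event_type": "click"})
```

Because the hash is deterministic under a given key, the same device still correlates across events, which is what lets fraud features survive anonymization; rotating the key bounds how long that linkage persists.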

Model retraining jobs are executed daily on GPU-accelerated Spark clusters, with automated deployment and rollback managed through MLflow, ensuring reproducibility, auditability, and safe model iteration. Model lineage, parameters, and metrics are stored for version control and compliance audits.

The process is illustrated in Figure 1, which depicts the flow from ingestion through validation, feature extraction, scoring, and anonymization. This architecture emphasizes modularity, interpretability, and sub-second latency. Every component works together to safeguard advertising spend, maximise detection accuracy, and maintain full regulatory compliance while preserving high performance.



Scientific impact:
 

  • Introduced a real-time machine learning pipeline integrating Apache Kafka, Spark Structured Streaming, and feature stores — a scalable architecture for future research in real-time behavioral analytics and fraud detection.
  • The two-tier detection approach, combining Isolation Forest (unsupervised) and LightGBM with Focal Loss (supervised), contributes to scientific work on handling imbalanced, noisy, and sparse datasets in real-world fraud contexts.
  • This project operationalizes interpretable AI in stream-based environments, offering insight into anomaly classification and contributing to explainability research in machine learning deployments.
  • Demonstrates the practical embedding of GDPR-compliant methods (hashing, tokenization, suppression, differential privacy) into real-time pipelines — bridging theoretical privacy frameworks and applied data protection research.
  • Advances reproducibility practices in AI through version-controlled, traceable model lifecycle management in high-frequency fraud detection systems.

Benefits: 

  • The system prevents fraudulent clicks and impressions from consuming campaign spend, preserving ROI and improving campaign credibility.
  • Fine-tuned models detect advanced fraud patterns while minimizing disruption to legitimate traffic — optimizing platform trust and user experience.
  • Cross-platform normalization enables ConceptDigital Ltd to correlate events from disparate sources, enhancing insight and decision-making.
  • Built-in GDPR-compliant practices ensure data handling remains secure and lawful without manual interventions or risk of breach.
  • Automated retraining with MLflow and GPU-accelerated jobs ensures the system adapts to evolving fraud patterns without downtime or manual retraining.

Success story highlights:

  • Real-time fraud detection pipeline deployed using Kafka and Spark, capable of sub-second processing at scale.
  • PyDeequ-powered data validation ensures structural integrity before scoring.
  • Consistent feature logic managed through a Feature Store, minimizing drift.
  • Hybrid ML detection with Isolation Forest (unsupervised) and LightGBM (supervised) for robust pattern recognition.
  • Full GDPR compliance via hashing, tokenization, suppression, and differential privacy.
  • GPU-powered model retraining scheduled daily; audit-ready model management through MLflow.
  • Achieved high detection precision without sacrificing latency or compliance.

Figure 1. Real-time ad fraud detection pipeline integrating Kafka, Spark Streaming, multi-tier ML models (Isolation Forest, LightGBM), and privacy-by-design principles

Contact:

  • Prof. Kamelia Stefanova, kstefanova at unwe.bg,
  • Prof. Valentin Kisimov, vkisimov at unwe.bg,
  • Assist. Dr. Ivona Velkova, ivonavelkova at unwe.bg

University of National and World Economy, Bulgaria