Technical Deep-Dive

How We Built an ML Fraud Detection Model for 1.7 Million Medicare Providers

Supervised learning, 96 million rows, and the difference between anomaly detection and actual fraud labels

February 21, 2026
18 min read
By OpenMedicare Investigative Team

Contents

  1. The Problem
  2. The Data
  3. Training Labels — The Key Innovation
  4. Feature Engineering
  5. Model Selection & Training
  6. Feature Importance
  7. Results
  8. Limitations & Ethics
  9. What's Next
  10. Open Questions

1. The Problem

Medicare pays over $854 billion per year to healthcare providers. The Government Accountability Office estimates that $60–90 billion of that is lost to fraud, waste, and abuse annually — roughly 7–10% of total spending. That's more than the entire budget of the Department of Homeland Security.

CMS has limited auditing resources. The HHS Office of Inspector General has about 1,600 employees overseeing a program that pays 1.7 million providers. That's roughly one investigator per 1,000 providers. They can't look at everyone. So the question becomes: can machine learning help identify where to look?

Most existing fraud detection in healthcare is either rule-based (flag anyone billing over X) or unsupervised anomaly detection (find statistical outliers). Both have problems. Rules are easy to game. Anomaly detection catches weird billing, but weird isn't the same as fraudulent — a rural oncologist treating a cancer cluster will look like an outlier for legitimate reasons.

We wanted to try something different: a supervised model trained on confirmed fraud cases.

2. The Data

Dataset at a Glance

Source: CMS Medicare Physician & Other Practitioners
Time Range: 2014–2023 (10 years)
Total Rows: 96 million
Unique Providers: 1.72 million NPIs
Features per Provider: 30+ engineered
Total Payments: $854.8 billion

The raw data comes from CMS's publicly available Medicare Physician & Other Practitioners dataset, released annually. Each row represents one provider billing one HCPCS code in one year — so a single doctor might have hundreds of rows across codes and years.

We aggregated these 96 million rows into provider-level features: total payments, total services, unique beneficiaries, submitted charges (what they billed), allowed amounts (what Medicare approved), and the actual payment. We also preserved procedure-level detail for feature engineering.

Key raw features include: billing amounts, service volumes, beneficiary counts, markup ratios (submitted charges ÷ Medicare payment), procedure codes (HCPCS), geographic data (state, ZIP), and specialty classification.
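As a sketch of that row-to-provider aggregation step (column names here are hypothetical stand-ins; the real CMS files use longer field names):

```python
import pandas as pd

# Toy version of the row-level data: one row per provider + HCPCS code.
rows = pd.DataFrame({
    "npi":           [1111111111, 1111111111, 2222222222],
    "hcpcs_code":    ["99213", "99214", "99213"],
    "payment":       [7000.0, 4200.0, 17000.0],   # Medicare payment for the row
    "services":      [100, 40, 250],
    "submitted":     [15000.0, 8800.0, 22500.0],  # submitted charges
    "beneficiaries": [80, 35, 200],
})

provider = rows.groupby("npi").agg(
    total_payments=("payment", "sum"),
    total_services=("services", "sum"),
    submitted_charges=("submitted", "sum"),
    # Caveat: summing per-code beneficiary counts overcounts patients seen
    # under multiple codes; a real pipeline would deduplicate differently.
    total_beneficiaries=("beneficiaries", "sum"),
).reset_index()

# Markup ratio: what they billed relative to what Medicare paid.
provider["markup_ratio"] = provider["submitted_charges"] / provider["total_payments"]
```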

3. Training Labels — The Key Innovation

This is what makes our approach different from most Medicare fraud research. Instead of just flagging outliers, we have actual ground truth labels. We know who committed fraud — because they got caught.

Label Sources

  • HHS OIG LEIE (List of Excluded Individuals/Entities): The federal government's database of healthcare providers excluded from federal programs for fraud, patient abuse, licensing violations, etc. Contains 82,714 entries. After NPI matching, 8,301 unique NPIs linked to our Medicare dataset.
  • DOJ Healthcare Fraud Cases: We manually compiled NPIs from Department of Justice press releases on healthcare fraud prosecutions. This added 6 additional confirmed NPIs not in LEIE.
  • Total matched: 8,307 confirmed fraud-associated NPIs. Of these, 2,198 were found in our Medicare billing dataset with sufficient data for modeling.

Why only 2,198 out of 8,307? Many LEIE entries are for providers who were excluded before our data window (2014–2023), who practice in settings not covered by this dataset (hospital employees, home health aides), or who had too few billing records to generate meaningful features.

This is a supervised classification model, not anomaly detection. That's a huge difference. Anomaly detection says "this provider is unusual." Our model says "this provider's billing pattern looks like providers who were confirmed to have committed fraud." The latter is a much stronger signal.
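The labeling step itself is simple once the exclusion lists are matched to NPIs; a minimal sketch with made-up identifiers (the real LEIE match involves name and NPI reconciliation not shown here):

```python
import pandas as pd

# Hypothetical provider table and exclusion lists (NPIs are made up).
features = pd.DataFrame({"npi": [111, 222, 333, 444]})
leie_npis = {222, 555}   # 555 never appears in the billing data, so it drops out
doj_npis = {444}         # compiled manually from DOJ press releases

# Union of both sources becomes the positive label for supervised training.
fraud_npis = leie_npis | doj_npis
features["fraud_label"] = features["npi"].isin(fraud_npis).astype(int)
```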

4. Feature Engineering

We engineered 30+ features from the raw data. They fall into five categories:

Direct Features

Aggregated directly from CMS data:

  • total_payments — sum of Medicare payments across all years
  • total_services — total service count
  • total_beneficiaries — unique beneficiaries served
  • markup_ratio — submitted charges ÷ Medicare payment (how aggressively they bill above what Medicare pays)

Derived Ratios

These capture billing intensity rather than raw volume:

  • services_per_beneficiary — are they seeing each patient unusually often?
  • payment_per_service — are they billing high-value codes?
  • payment_per_beneficiary — how much do they extract per patient?

Specialty-Relative Features (Z-Scores)

Raw billing numbers are misleading across specialties — an ophthalmologist billing $500K is normal; a family doctor billing $500K is unusual. We compute z-scores relative to each provider's specialty median:

z_payment = (provider_payment − specialty_median) / specialty_std

A z-score of 3+ means the provider bills 3 standard deviations above their specialty peers. This normalizes across specialties and is one of our most powerful feature categories.
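The computation above maps directly onto a pandas groupby; a toy sketch with invented numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "specialty": ["Ophthalmology"] * 3 + ["Family Practice"] * 3,
    "total_payments": [480_000, 500_000, 520_000, 90_000, 100_000, 500_000],
})

grp = df.groupby("specialty")["total_payments"]
# Median-centered, as in the formula above; std is the sample standard
# deviation within each specialty (pandas default, ddof=1).
df["z_payment"] = (df["total_payments"] - grp.transform("median")) / grp.transform("std")
```

The $500K ophthalmologist lands at z = 0; the $500K family doctor scores far above its specialty peers, which is exactly the normalization the model needs.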

Procedure Features

  • hhi_concentration — Herfindahl-Hirschman Index of procedure code concentration. High HHI = billing is concentrated in a few codes (potential code abuse)
  • upcoding_ratio — ratio of high-level E&M codes (99214/99215) to low-level (99213). Upcoding is one of the most common fraud types
  • drug_share — fraction of billing from drug administration codes (Part B drugs are a major fraud vector)
  • wound_share — fraction from wound care/skin substitute codes
  • covid_share — fraction from COVID-related codes
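The HHI feature is a direct computation over each provider's payment shares across codes; a minimal sketch (amounts are made up):

```python
import pandas as pd

def hhi(payments_by_code: pd.Series) -> float:
    """Herfindahl-Hirschman Index: sum of squared payment shares across
    procedure codes. 1.0 = all billing in one code; near 0 = widely spread."""
    shares = payments_by_code / payments_by_code.sum()
    return float((shares ** 2).sum())

# One provider billing a single skin-substitute code vs. one spread
# evenly across four common codes.
concentrated = pd.Series({"Q4101": 100_000.0})
diversified = pd.Series({"99213": 25_000.0, "99214": 25_000.0,
                         "93000": 25_000.0, "36415": 25_000.0})
```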

Temporal Features

  • services_per_day — total services ÷ estimated working days. Flags physically impossible volumes
  • beneficiaries_per_day — unique patients per working day
  • years_active — how many years the provider appears in the dataset. Turns out this is the single most important feature

5. Model Selection & Training

The Class Imbalance Problem

Fraud providers: 2,198
Clean providers: 1,717,427
Positive rate: 0.13%
Ratio: 1 : 781

For every confirmed fraudster, there are 781 clean providers. A model that predicts "not fraud" for everyone achieves 99.87% accuracy. Accuracy is meaningless here.

We chose Random Forest for several reasons:

  • Interpretability — feature importance scores tell us why the model flags someone, not just that it does. For a fraud detection tool, explainability matters.
  • Class imbalance handling — with class_weight='balanced', Random Forest automatically upweights the minority class
  • Robustness — handles mixed feature types, doesn't require normalization, resistant to outliers
  • Training speed — fits in under 30 minutes on our dataset
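A minimal training sketch with scikit-learn on synthetic data — the 500-tree count and balanced class weights match what the article reports, but the features and everything else here are stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 2,000 providers, 5 features, ~1% positives
# (the real ratio is closer to 1 : 781).
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) < 0.01).astype(int)
X[y == 1] += 2.0   # inject separability so the demo is non-trivial

clf = RandomForestClassifier(
    n_estimators=500,          # tree count quoted in section 6
    class_weight="balanced",   # upweight the rare positive class
    random_state=0,
    n_jobs=-1,
)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]   # per-provider risk score
```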

Cross-Validation Results

5-fold stratified cross-validation:
Fold 1: AUC 0.84 | Fold 2: AUC 0.81 | Fold 3: AUC 0.83
Fold 4: AUC 0.82 | Fold 5: AUC 0.83
Mean AUC: 0.83 (±0.01)

An AUC of 0.83 means: given a random fraud provider and a random clean provider, the model correctly ranks the fraudster higher 83% of the time. Not perfect, but meaningful — especially given the noise in our labels (LEIE includes non-fraud exclusions like license revocations).

We also tried Gradient Boosting (XGBoost), which took 4+ hours to train and yielded only a marginal improvement of ~1–2% AUC. For a research tool where interpretability and iteration speed matter more than squeezing out the last percentage point, Random Forest was the right call.
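The cross-validation setup can be reproduced in outline (synthetic data again, so the AUCs printed here won't match the real folds):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)
X[y == 1] += 1.5   # inject signal so the demo AUC is meaningful

# Stratified folds keep the rare positive class represented in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
aucs = cross_val_score(
    RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=1),
    X, y, cv=cv, scoring="roc_auc",
)
print(f"Mean AUC: {aucs.mean():.2f} (±{aucs.std():.2f})")
```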

6. Feature Importance

What does the model actually look at? Here are the top 10 features by Gini importance:

  • Years Active: 16.3%
  • Services / Beneficiary: 11.9%
  • Markup Ratio: 8.0%
  • Total Services: 7.2%
  • Payment / Beneficiary: 6.8%
  • Z-Score (Payment): 5.4%
  • Code Concentration (HHI): 4.9%
  • Total Payments: 4.7%
  • Services / Day: 4.1%
  • Upcoding Ratio: 3.5%

Gini importance from Random Forest (500 trees, balanced class weights)
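Scores like these come straight from the fitted model's `feature_importances_` attribute; a toy illustration where the label is driven entirely by one feature (names and data are synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
feature_names = ["years_active", "services_per_beneficiary", "markup_ratio"]
X = rng.normal(size=(500, 3))
y = (X[:, 0] < -1).astype(int)   # label depends only on the first feature

clf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X, y)

# Gini importances are normalized to sum to 1 across features.
importance = pd.Series(clf.feature_importances_, index=feature_names)
print(importance.sort_values(ascending=False))
```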

The top features tell an interesting story:

  • years_active (16.3%) — The single most important feature. Fraudsters tend to have shorter billing histories. They enter the system, bill aggressively, and get caught (or disappear) within a few years. Legitimate providers have decades-long careers.
  • services_per_beneficiary (11.9%) — How many services per patient. Fraud often involves padding encounters — billing for services that didn't happen or weren't medically necessary.
  • markup_ratio (8.0%) — Charge inflation. Fraudulent providers tend to submit charges much higher relative to what Medicare pays, suggesting aggressive overbilling.
  • total_services (7.2%) — Sheer volume. Many fraud schemes are volume plays — doing the same thing thousands of times.
  • payment_per_beneficiary (6.8%) — How much they extract per patient. High values suggest either unnecessary services or high-cost procedure abuse.

7. Results

Key Findings

  • 500 providers scored >86% fraud probability
  • Model correctly flagged providers later charged by DOJ
  • Top states: CA, FL, NY, TX, NJ — mirrors DOJ enforcement geography
  • Internal Medicine (53%) + Family Practice (27%) = 80% of high-risk flags
  • Mean AUC: 0.83 across 5-fold cross-validation

When we scored all 1.72 million providers, 500 scored above our 86% threshold. These aren't random outliers — they're providers whose billing patterns statistically resemble confirmed fraudsters across multiple dimensions simultaneously.

The most compelling validation: we trained the model on LEIE data (providers excluded before or during our data window), then checked it against DOJ prosecutions that came after. The model had already flagged several of these providers as high-risk. Our data predicted fraud before the Department of Justice announced charges. Read the full story →

The geographic distribution is also telling. Our top-flagged states — California, Florida, New York, Texas, New Jersey — are exactly the states where DOJ has historically concentrated healthcare fraud enforcement. The model independently discovered the same geographic patterns.

The specialty concentration is notable: 80% of high-risk flags are Internal Medicine or Family Practice. This makes sense — these are high-volume, office-visit-heavy specialties where billing fraud is easiest to execute and hardest to detect in individual claims.

8. Limitations & Ethics

We want to be extremely clear about what this model is and isn't.

What This Model Is NOT

  • Not an accusation. A high fraud score means billing patterns statistically resemble confirmed fraudsters. There are many legitimate reasons for unusual billing.
  • Not comprehensive. The model is trained on caught fraudsters. By definition, it may miss sophisticated schemes that haven't been detected yet.
  • Not unbiased. If LEIE disproportionately includes certain specialties or regions (it does — enforcement resources aren't evenly distributed), the model inherits that bias.
  • Not a replacement for investigation. Statistical flags are starting points for human review, not conclusions.

Survivorship bias is our biggest known limitation. We can only train on providers who got caught. If there's a class of sophisticated fraud that systematically evades detection, our model won't learn those patterns. We're training on the fraud that looks like caught fraud.

Label noise is another concern. The LEIE includes exclusions for reasons beyond fraud — license revocations, controlled substance violations, patient abuse. These providers may have different billing patterns than financial fraudsters. We treat all LEIE entries as positive labels, which adds noise.

We publish this work as a research and transparency tool, not as accusations. Every provider profile on OpenMedicare includes a disclaimer. We encourage anyone with concerns about a specific provider to report to the OIG rather than draw conclusions from statistical models alone.

9. What's Next

This is v1 of our fraud model. Here's what we're working on:

  • Temporal models — Year-over-year changes in billing patterns. A provider whose billing doubles overnight is more suspicious than one who's always billed at high volume. We have 10 years of data; we should use the time dimension.
  • Network analysis — Provider referral patterns. Fraud rings often involve multiple providers referring to each other. Graph-based features could capture this.
  • Prescription data integration — CMS also publishes Medicare Part D prescriber data. Combining billing patterns with prescribing patterns could surface kickback schemes.
  • Cleaner labels — Filtering LEIE to financial fraud exclusions only, excluding license-based exclusions that may not reflect billing fraud.
  • Deep learning experiments — Sequence models on procedure-level billing history, treating each provider's billing as a time series.

10. Open Questions

We built this in the open because we believe healthcare transparency benefits from community scrutiny. There are questions we haven't answered — and some we probably haven't thought to ask.

  • How should we handle specialty bias in LEIE? Should we train separate models per specialty?
  • Is years_active a leaky feature? (Excluded providers stop billing — does the feature capture exclusion rather than predict it?)
  • What's the right threshold? We used 86% — but the precision/recall tradeoff is a policy decision, not a technical one.
  • How do you validate a fraud model when ground truth is inherently incomplete?
  • What features are we missing that could separate "unusual but legitimate" from "unusual and fraudulent"?
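The threshold question can be made concrete with a small precision/recall sweep over synthetic risk scores (all numbers here are illustrative, not model output):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(4)
y = (rng.random(5000) < 0.02).astype(int)
# Hypothetical risk scores: positives tend to score higher, with overlap.
scores = np.clip(rng.normal(0.2 + 0.5 * y, 0.15), 0.0, 1.0)

for t in (0.50, 0.70, 0.86):
    pred = (scores > t).astype(int)
    print(f"threshold {t:.2f}: precision {precision_score(y, pred):.2f}, "
          f"recall {recall_score(y, pred):.2f}")
```

Raising the threshold trades recall for precision; where to sit on that curve depends on the relative cost of a false flag versus a missed fraudster — a policy judgment, not a modeling one.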

Explore the Data Yourself

We've published the model's highest-risk flags with full billing breakdowns. Look at the numbers, check our work, and tell us what we're getting wrong.

Explore High-Risk Providers → · View Full Watchlist →

Disclaimer: The fraud scores and billing patterns described in this article are statistical outputs from a machine learning model trained on publicly available data. They are not accusations of fraud. Individual cases may have legitimate explanations. Named providers have not been charged with any crime unless otherwise stated. If you suspect fraud, report it to the OIG Fraud Hotline (1-800-HHS-TIPS).

Related

  • 📊 Our Data Predicted Fraud Before the DOJ
  • 🔍 Still Out There: Unflagged Providers
  • 🤖 The Algorithm Knows
  • 🏠 Fraud Analysis Hub

Data Sources

  • Centers for Medicare & Medicaid Services (CMS) — Medicare Physician & Other Practitioners Data (2014–2023)
  • HHS Office of Inspector General — List of Excluded Individuals/Entities (LEIE)
  • Department of Justice — Healthcare Fraud Prosecution Records
  • Government Accountability Office — Medicare Improper Payment Estimates

Last Updated: February 2026

Note: All data is from publicly available Medicare records. OpenMedicare is an independent journalism project not affiliated with CMS.