How We Built an ML Fraud Detection Model for 1.7 Million Medicare Providers
Supervised learning, 96 million rows, and the difference between anomaly detection and actual fraud labels
1. The Problem
Medicare pays over $854 billion per year to healthcare providers. The Government Accountability Office estimates that $60–90 billion of that is lost to fraud, waste, and abuse annually — roughly 7–10% of total spending. That's more than the entire budget of the Department of Homeland Security.
CMS has limited auditing resources. The HHS Office of Inspector General has about 1,600 employees overseeing a program that pays 1.7 million providers. That's roughly one investigator per 1,000 providers. They can't look at everyone. So the question becomes: can machine learning help identify where to look?
Most existing fraud detection in healthcare is either rule-based (flag anyone billing over X) or unsupervised anomaly detection (find statistical outliers). Both have problems. Rules are easy to game. Anomaly detection catches weird billing, but weird isn't the same as fraudulent — a rural oncologist treating a cancer cluster will look like an outlier for legitimate reasons.
We wanted to try something different: a supervised model trained on confirmed fraud cases.
2. The Data
Dataset at a Glance
The raw data comes from CMS's publicly available Medicare Physician & Other Practitioners dataset, released annually. Each row represents one provider billing one HCPCS code in one year — so a single doctor might have hundreds of rows across codes and years.
We aggregated these 96 million rows into provider-level features: total payments, total services, unique beneficiaries, submitted charges (what they billed), allowed amounts (what Medicare approved), and the actual payment. We also preserved procedure-level detail for feature engineering.
Key raw features include: billing amounts, service volumes, beneficiary counts, markup ratios (submitted charges ÷ Medicare payment), procedure codes (HCPCS), geographic data (state, ZIP), and specialty classification.
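The aggregation step can be sketched with pandas. The column names below are illustrative, not the actual CMS field names, and the toy frame stands in for the 96 million real rows:

```python
import pandas as pd

# Toy rows in the shape described above: one row per (provider, HCPCS code, year).
rows = pd.DataFrame({
    "npi":            ["1001", "1001", "1002", "1002"],
    "hcpcs_code":     ["99213", "99214", "99213", "J0885"],
    "year":           [2022, 2022, 2022, 2023],
    "services":       [300, 120, 80, 40],
    "beneficiaries":  [150, 90, 60, 25],
    "submitted_chrg": [45000.0, 30000.0, 9000.0, 20000.0],
    "payment":        [21000.0, 15000.0, 4500.0, 8000.0],
})

# Roll procedure-level rows up to one row per provider.
# Note: summing beneficiary counts across codes overcounts repeat patients;
# the real pipeline would deduplicate unique beneficiaries.
provider = rows.groupby("npi").agg(
    total_payments=("payment", "sum"),
    total_services=("services", "sum"),
    total_beneficiaries=("beneficiaries", "sum"),
    submitted=("submitted_chrg", "sum"),
).reset_index()

# markup_ratio: submitted charges ÷ Medicare payment
provider["markup_ratio"] = provider["submitted"] / provider["total_payments"]
```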
3. Training Labels — The Key Innovation
This is what makes our approach different from most Medicare fraud research. Instead of just flagging outliers, we have actual ground truth labels. We know who committed fraud — because they got caught.
Label Sources
- HHS OIG LEIE (List of Excluded Individuals/Entities): The federal government's database of healthcare providers excluded from federal programs for fraud, patient abuse, licensing violations, etc. Contains 82,714 entries. After NPI matching, 8,301 unique NPIs linked to our Medicare dataset.
- DOJ Healthcare Fraud Cases: We manually compiled NPIs from Department of Justice press releases on healthcare fraud prosecutions. This added 6 additional confirmed NPIs not in LEIE.
- Total matched: 8,307 confirmed fraud-associated NPIs. Of these, 2,198 were found in our Medicare billing dataset with sufficient data for modeling.
Why only 2,198 out of 8,307? Many LEIE entries are for providers who were excluded before our data window (2014–2023), who practice in settings not covered by this dataset (hospital employees, home health aides), or who had too few billing records to generate meaningful features.
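The NPI matching itself is a straightforward join. A minimal sketch with placeholder NPIs (the real inputs are the aggregated provider table and the pooled LEIE/DOJ exclusion list):

```python
import pandas as pd

# Hypothetical frames: providers aggregated from billing data, and fraud-associated
# NPIs pooled from the LEIE plus DOJ press releases.
providers = pd.DataFrame({"npi": ["1001", "1002", "1003"]})
fraud_npis = pd.DataFrame({"npi": ["1002", "9999"]})  # "9999" never billed in-window

# A left merge with indicator=True shows which providers matched a fraud label;
# excluded NPIs with no billing rows (like "9999") drop out, as described above.
labeled = providers.merge(fraud_npis, on="npi", how="left", indicator=True)
labeled["is_fraud"] = (labeled["_merge"] == "both").astype(int)
```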
This is a supervised classification model, not anomaly detection. That's a huge difference. Anomaly detection says "this provider is unusual." Our model says "this provider's billing pattern looks like providers who were confirmed to have committed fraud." The latter is a much stronger signal.
4. Feature Engineering
We engineered 30+ features from the raw data. They fall into five categories:
Direct Features
Aggregated directly from CMS data:
- total_payments — sum of Medicare payments across all years
- total_services — total service count
- total_beneficiaries — unique beneficiaries served
- markup_ratio — submitted charges ÷ Medicare payment (how aggressively they bill above what Medicare pays)
Derived Ratios
These capture billing intensity rather than raw volume:
- services_per_beneficiary — are they seeing each patient unusually often?
- payment_per_service — are they billing high-value codes?
- payment_per_beneficiary — how much do they extract per patient?
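These ratios are simple columnwise divisions over the provider table; a minimal sketch with made-up provider rows:

```python
import pandas as pd

# Two hypothetical providers with identical patient panels but very different intensity.
df = pd.DataFrame({
    "total_payments": [120000.0, 60000.0],
    "total_services": [2400, 600],
    "total_beneficiaries": [300, 300],
})
df["services_per_beneficiary"] = df["total_services"] / df["total_beneficiaries"]
df["payment_per_service"] = df["total_payments"] / df["total_services"]
df["payment_per_beneficiary"] = df["total_payments"] / df["total_beneficiaries"]
```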
Specialty-Relative Features (Z-Scores)
Raw billing numbers are misleading across specialties — an ophthalmologist billing $500K is normal; a family doctor billing $500K is unusual. We compute z-scores relative to each provider's specialty median:
A z-score of 3+ means the provider bills 3 standard deviations above their specialty peers. This normalizes across specialties and is one of our most powerful feature categories.
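A sketch of the specialty-relative score, centering on the specialty median as described and scaling by the specialty standard deviation (toy numbers; the same family-practice figure that is unremarkable for ophthalmology scores high against its own peers):

```python
import pandas as pd

df = pd.DataFrame({
    "specialty": ["Ophthalmology"] * 3 + ["Family Practice"] * 3,
    "total_payments": [450000.0, 500000.0, 550000.0, 90000.0, 100000.0, 500000.0],
})
grp = df.groupby("specialty")["total_payments"]
# Center on the specialty median; scale by the specialty std.
df["payments_z"] = (df["total_payments"] - grp.transform("median")) / grp.transform("std")
```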
Procedure Features
- hhi_concentration — Herfindahl-Hirschman Index of procedure code concentration. High HHI = billing is concentrated in a few codes (potential code abuse)
- upcoding_ratio — ratio of high-level E&M codes (99214/99215) to low-level (99213). Upcoding is one of the most common fraud types
- drug_share — fraction of billing from drug administration codes (Part B drugs are a major fraud vector)
- wound_share — fraction from wound care/skin substitute codes
- covid_share — fraction from COVID-related codes
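The HHI and upcoding ratio for a single provider can be computed like this (illustrative per-code payment amounts, not real billing data):

```python
import pandas as pd

# Per-code payments for one hypothetical provider.
codes = pd.DataFrame({
    "hcpcs":   ["99213", "99214", "99215"],
    "payment": [1000.0, 3000.0, 6000.0],
})

# HHI: sum of squared payment shares. 1.0 means all billing sits in a single code.
shares = codes["payment"] / codes["payment"].sum()
hhi = (shares ** 2).sum()

# Upcoding ratio: high-level E&M payments over low-level.
high = codes.loc[codes["hcpcs"].isin(["99214", "99215"]), "payment"].sum()
low = codes.loc[codes["hcpcs"] == "99213", "payment"].sum()
upcoding_ratio = high / low
```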
Temporal Features
- services_per_day — total services ÷ estimated working days. Flags physically impossible volumes
- beneficiaries_per_day — unique patients per working day
- years_active — how many years the provider appears in the dataset. Turns out this is the single most important feature
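The velocity check is back-of-envelope arithmetic. The 250-working-day year below is an assumption for illustration, not a figure from the pipeline:

```python
# A hypothetical provider billing 60,000 services over two active years.
total_services = 60000
years_active = 2
working_days = 250 * years_active  # assumed 250 working days per year

# 120 services per working day is implausible for a single clinician.
services_per_day = total_services / working_days
```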
5. Model Selection & Training
The Class Imbalance Problem
For every confirmed fraudster, there are 781 clean providers. A model that predicts "not fraud" for everyone achieves 99.87% accuracy. Accuracy is meaningless here.
We chose Random Forest for several reasons:
- Interpretability — feature importance scores tell us why the model flags someone, not just that it does. For a fraud detection tool, explainability matters.
- Class imbalance handling — with class_weight='balanced', Random Forest automatically upweights the minority class
- Robustness — handles mixed feature types, doesn't require normalization, resistant to outliers
- Training speed — fits in under 30 minutes on our dataset
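A minimal training sketch matching the configuration described (500 trees, balanced class weights). The real features and labels are swapped for a synthetic imbalanced set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the provider features, with a roughly 99:1 class ratio.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.99],
                           random_state=0)

clf = RandomForestClassifier(
    n_estimators=500,
    class_weight="balanced",  # upweights the rare fraud class
    random_state=0,
    n_jobs=-1,
)
clf.fit(X, y)

# Per-provider fraud probability (the score used for ranking and thresholding).
scores = clf.predict_proba(X)[:, 1]
```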
Cross-Validation Results
Fold 1: AUC 0.84 | Fold 2: AUC 0.81 | Fold 3: AUC 0.83
Fold 4: AUC 0.82 | Fold 5: AUC 0.83
Mean AUC: 0.83 (±0.01)
An AUC of 0.83 means: given a random fraud provider and a random clean provider, the model correctly ranks the fraudster higher 83% of the time. Not perfect, but meaningful — especially given the noise in our labels (LEIE includes non-fraud exclusions like license revocations).
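The stratified 5-fold AUC evaluation can be sketched as follows (synthetic data, and fewer trees than the production model for speed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the real provider features.
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.95],
                           random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)

# Stratified folds keep the fraud/clean ratio stable in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"Mean AUC: {aucs.mean():.2f} (±{aucs.std():.2f})")
```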
We also tried Gradient Boosting (XGBoost), which took 4+ hours to train and yielded only a marginal improvement of ~1–2% AUC. For a research tool where interpretability and iteration speed matter more than squeezing out the last percentage point, Random Forest was the right call.
6. Feature Importance
What does the model actually look at? Here are the top 10 features by Gini importance:
Gini importance from Random Forest (500 trees, balanced class weights)
The top features tell an interesting story:
- years_active (16.3%) — The single most important feature. Fraudsters tend to have shorter billing histories. They enter the system, bill aggressively, and get caught (or disappear) within a few years. Legitimate providers have decades-long careers.
- services_per_beneficiary (11.9%) — How many services per patient. Fraud often involves padding encounters — billing for services that didn't happen or weren't medically necessary.
- markup_ratio (8.0%) — Charge inflation. Fraudulent providers tend to submit charges much higher relative to what Medicare pays, suggesting aggressive overbilling.
- total_services (7.2%) — Sheer volume. Many fraud schemes are volume plays — doing the same thing thousands of times.
- payment_per_beneficiary (6.8%) — How much they extract per patient. High values suggest either unnecessary services or high-cost procedure abuse.
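Gini importances like those above come straight from the fitted model; a sketch with generic feature names standing in for the real ones:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
names = [f"feat_{i}" for i in range(8)]  # placeholders for years_active etc.

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X, y)

# feature_importances_ holds the Gini importances, which sum to 1.
imp = pd.Series(clf.feature_importances_, index=names).sort_values(ascending=False)
print(imp.head(10))
```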
7. Results
Key Findings
- 500 providers scored >86% fraud probability
- Model correctly flagged providers later charged by DOJ
- Top states: CA, FL, NY, TX, NJ — mirrors DOJ enforcement geography
- Internal Medicine (53%) + Family Practice (27%) = 80% of high-risk flags
- Mean AUC: 0.83 across 5-fold cross-validation
When we scored all 1.72 million providers, 500 scored above our 86% threshold. These aren't random outliers — they're providers whose billing patterns statistically resemble confirmed fraudsters across multiple dimensions simultaneously.
The most compelling validation: we trained the model on LEIE data (providers excluded before or during our data window), then checked it against DOJ prosecutions that came after. The model had already flagged several of these providers as high-risk. Our data predicted fraud before the Department of Justice announced charges.
The geographic distribution is also telling. Our top-flagged states — California, Florida, New York, Texas, New Jersey — are exactly the states where DOJ has historically concentrated healthcare fraud enforcement. The model independently discovered the same geographic patterns.
The specialty concentration is notable: 80% of high-risk flags are Internal Medicine or Family Practice. This makes sense — these are high-volume, office-visit-heavy specialties where billing fraud is easiest to execute and hardest to detect in individual claims.
8. Limitations & Ethics
We want to be extremely clear about what this model is and isn't.
What This Model Is NOT
- Not an accusation. A high fraud score means billing patterns statistically resemble confirmed fraudsters. There are many legitimate reasons for unusual billing.
- Not comprehensive. The model is trained on caught fraudsters. By definition, it may miss sophisticated schemes that haven't been detected yet.
- Not unbiased. If LEIE disproportionately includes certain specialties or regions (it does — enforcement resources aren't evenly distributed), the model inherits that bias.
- Not a replacement for investigation. Statistical flags are starting points for human review, not conclusions.
Survivorship bias is our biggest known limitation. We can only train on providers who got caught. If there's a class of sophisticated fraud that systematically evades detection, our model won't learn those patterns. We're training on the fraud that looks like caught fraud.
Label noise is another concern. The LEIE includes exclusions for reasons beyond fraud — license revocations, controlled substance violations, patient abuse. These providers may have different billing patterns than financial fraudsters. We treat all LEIE entries as positive labels, which adds noise.
We publish this work as a research and transparency tool, not as accusations. Every provider profile on OpenMedicare includes a disclaimer. We encourage anyone with concerns about a specific provider to report to the OIG rather than draw conclusions from statistical models alone.
9. What's Next
This is v1 of our fraud model. Here's what we're working on:
- Temporal models — Year-over-year changes in billing patterns. A provider whose billing doubles overnight is more suspicious than one who's always billed at high volume. We have 10 years of data; we should use the time dimension.
- Network analysis — Provider referral patterns. Fraud rings often involve multiple providers referring to each other. Graph-based features could capture this.
- Prescription data integration — CMS also publishes Medicare Part D prescriber data. Combining billing patterns with prescribing patterns could surface kickback schemes.
- Cleaner labels — Filtering LEIE to financial fraud exclusions only, excluding license-based exclusions that may not reflect billing fraud.
- Deep learning experiments — Sequence models on procedure-level billing history, treating each provider's billing as a time series.
10. Open Questions
We built this in the open because we believe healthcare transparency benefits from community scrutiny. There are questions we haven't answered — and some we probably haven't thought to ask.
- How should we handle specialty bias in LEIE? Should we train separate models per specialty?
- Is years_active a leaky feature? (Excluded providers stop billing — does the feature capture exclusion rather than predict it?)
- What's the right threshold? We used 86% — but the precision/recall tradeoff is a policy decision, not a technical one.
- How do you validate a fraud model when ground truth is inherently incomplete?
- What features are we missing that could separate "unusual but legitimate" from "unusual and fraudulent"?
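On the threshold question, scikit-learn's precision_recall_curve makes the tradeoff explicit so the policy choice can at least be made with the numbers in view (toy labels and scores below; in practice these come from held-out predictions):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy ground truth and model scores for ten hypothetical providers.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.7, 0.8, 0.85, 0.9])

# Each candidate threshold trades precision against recall.
prec, rec, thr = precision_recall_curve(y_true, y_score)
for p, r, t in zip(prec, rec, list(thr) + [None]):
    print(f"threshold={t}  precision={p:.2f}  recall={r:.2f}")
```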
Explore the Data Yourself
We've published the model's highest-risk flags with full billing breakdowns. Look at the numbers, check our work, and tell us what we're getting wrong.
Disclaimer: The fraud scores and billing patterns described in this article are statistical outputs from a machine learning model trained on publicly available data. They are not accusations of fraud. Individual cases may have legitimate explanations. Named providers have not been charged with any crime unless otherwise stated. If you suspect fraud, report it to the OIG Fraud Hotline (1-800-HHS-TIPS).
Data Sources
- Centers for Medicare & Medicaid Services (CMS) — Medicare Physician & Other Practitioners Data (2014–2023)
- HHS Office of Inspector General — List of Excluded Individuals/Entities (LEIE)
- Department of Justice — Healthcare Fraud Prosecution Records
- Government Accountability Office — Medicare Improper Payment Estimates
Last Updated: February 2026
Note: All data is from publicly available Medicare records. OpenMedicare is an independent journalism project not affiliated with CMS.