How We Built an ML Fraud Detection Model for 1.7 Million Medicare Providers
Supervised learning, 96 million rows, and the difference between anomaly detection and actual fraud labels
1. The Problem
Medicare pays over $854 billion per year to healthcare providers. The Government Accountability Office estimates that $60–90 billion of that is lost to fraud, waste, and abuse annually — roughly 7–10% of total spending. That's more than the entire budget of the Department of Homeland Security.
CMS has limited auditing resources. The HHS Office of Inspector General has about 1,600 employees overseeing a program that pays 1.7 million providers. That's roughly one investigator per 1,000 providers. They can't look at everyone. So the question becomes: can machine learning help identify where to look?
Most existing fraud detection in healthcare is either rule-based (flag anyone billing over X) or unsupervised anomaly detection (find statistical outliers). Both have problems. Rules are easy to game. Anomaly detection catches weird billing, but weird isn't the same as fraudulent — a rural oncologist treating a cancer cluster will look like an outlier for legitimate reasons.
We wanted to try something different: a supervised model trained on confirmed fraud cases.
2. The Data
Dataset at a Glance
The raw data comes from CMS's publicly available Medicare Physician & Other Practitioners dataset, released annually. Each row represents one provider billing one HCPCS code in one year — so a single doctor might have hundreds of rows across codes and years.
We aggregated these 96 million rows into provider-level features: total payments, total services, unique beneficiaries, submitted charges (what they billed), allowed amounts (what Medicare approved), and the actual payment. We also preserved procedure-level detail for feature engineering.
Key raw features include: billing amounts, service volumes, beneficiary counts, markup ratios (submitted charges ÷ Medicare payment), procedure codes (HCPCS), geographic data (state, ZIP), and specialty classification.
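The aggregation step can be sketched with pandas. The column names below are illustrative, not the actual CMS field names, and the toy frame stands in for the 96 million real rows:

```python
import pandas as pd

# Toy rows in the shape described above: one row per (provider, HCPCS code, year).
rows = pd.DataFrame({
    "npi":            ["1001", "1001", "1002", "1002"],
    "hcpcs_code":     ["99213", "99214", "99213", "J0885"],
    "year":           [2022, 2022, 2022, 2023],
    "services":       [300, 120, 80, 40],
    "beneficiaries":  [150, 90, 60, 25],
    "submitted_chrg": [45000.0, 30000.0, 9000.0, 20000.0],
    "payment":        [21000.0, 15000.0, 4500.0, 8000.0],
})

# Roll procedure-level rows up to one row per provider.
# Note: summing beneficiary counts across codes overcounts repeat patients;
# the real pipeline would deduplicate unique beneficiaries.
provider = rows.groupby("npi").agg(
    total_payments=("payment", "sum"),
    total_services=("services", "sum"),
    total_beneficiaries=("beneficiaries", "sum"),
    submitted=("submitted_chrg", "sum"),
).reset_index()

# markup_ratio: submitted charges ÷ Medicare payment
provider["markup_ratio"] = provider["submitted"] / provider["total_payments"]
```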
3. Training Labels — The Key Innovation
This is what makes our approach different from most Medicare fraud research. Instead of just flagging outliers, we have actual ground truth labels. We know who committed fraud — because they got caught.
Label Sources
- HHS OIG LEIE (List of Excluded Individuals/Entities): The federal government's database of healthcare providers excluded from federal programs for fraud, patient abuse, licensing violations, etc. Contains 82,714 entries. After NPI matching, 8,301 unique NPIs linked to our Medicare dataset.
- DOJ Healthcare Fraud Cases: We manually compiled NPIs from Department of Justice press releases on healthcare fraud prosecutions. This added 6 additional confirmed NPIs not in LEIE.
- Total matched: 8,307 confirmed fraud-associated NPIs. Of these, 2,198 were found in our Medicare billing dataset with sufficient data for modeling.
Why only 2,198 out of 8,307? Many LEIE entries are for providers who were excluded before our data window (2014–2023), who practice in settings not covered by this dataset (hospital employees, home health aides), or who had too few billing records to generate meaningful features.
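The NPI matching itself is a straightforward join. A minimal sketch with placeholder NPIs (the real inputs are the aggregated provider table and the pooled LEIE/DOJ exclusion list):

```python
import pandas as pd

# Hypothetical frames: providers aggregated from billing data, and fraud-associated
# NPIs pooled from the LEIE plus DOJ press releases.
providers = pd.DataFrame({"npi": ["1001", "1002", "1003"]})
fraud_npis = pd.DataFrame({"npi": ["1002", "9999"]})  # "9999" never billed in-window

# A left merge with indicator=True shows which providers matched a fraud label;
# excluded NPIs with no billing rows (like "9999") drop out, as described above.
labeled = providers.merge(fraud_npis, on="npi", how="left", indicator=True)
labeled["is_fraud"] = (labeled["_merge"] == "both").astype(int)
```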
This is a supervised classification model, not anomaly detection. That's a huge difference. Anomaly detection says "this provider is unusual." Our model says "this provider's billing pattern looks like providers who were confirmed to have committed fraud." The latter is a much stronger signal.
4. Feature Engineering
We engineered 30+ features from the raw data. They fall into five categories:
Direct Features
Aggregated directly from CMS data:
- total_payments — sum of Medicare payments across all years
- total_services — total service count
- total_beneficiaries — unique beneficiaries served
- markup_ratio — submitted charges ÷ Medicare payment (how aggressively they bill above what Medicare pays)
Derived Ratios
These capture billing intensity rather than raw volume:
- services_per_beneficiary — are they seeing each patient unusually often?
- payment_per_service — are they billing high-value codes?
- payment_per_beneficiary — how much do they extract per patient?
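These ratios are simple columnwise divisions over the provider table; a minimal sketch with made-up provider rows:

```python
import pandas as pd

# Two hypothetical providers with identical patient panels but very different intensity.
df = pd.DataFrame({
    "total_payments": [120000.0, 60000.0],
    "total_services": [2400, 600],
    "total_beneficiaries": [300, 300],
})
df["services_per_beneficiary"] = df["total_services"] / df["total_beneficiaries"]
df["payment_per_service"] = df["total_payments"] / df["total_services"]
df["payment_per_beneficiary"] = df["total_payments"] / df["total_beneficiaries"]
```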
Specialty-Relative Features (Z-Scores)
Raw billing numbers are misleading across specialties — an ophthalmologist billing $500K is normal; a family doctor billing $500K is unusual. We compute z-scores relative to each provider's specialty median:
A z-score of 3+ means the provider bills 3 standard deviations above their specialty peers. This normalizes across specialties and is one of our most powerful feature categories.
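A sketch of the specialty-relative score, centering on the specialty median as described and scaling by the specialty standard deviation (toy numbers; the same family-practice figure that is unremarkable for ophthalmology scores high against its own peers):

```python
import pandas as pd

df = pd.DataFrame({
    "specialty": ["Ophthalmology"] * 3 + ["Family Practice"] * 3,
    "total_payments": [450000.0, 500000.0, 550000.0, 90000.0, 100000.0, 500000.0],
})
grp = df.groupby("specialty")["total_payments"]
# Center on the specialty median; scale by the specialty std.
df["payments_z"] = (df["total_payments"] - grp.transform("median")) / grp.transform("std")
```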
Procedure Features
- hhi_concentration — Herfindahl-Hirschman Index of procedure code concentration. High HHI = billing is concentrated in a few codes (potential code abuse)
- upcoding_ratio — ratio of high-level E&M codes (99214/99215) to low-level (99213). Upcoding is one of the most common fraud types
- drug_share — fraction of billing from drug administration codes (Part B drugs are a major fraud vector)
- wound_share — fraction from wound care/skin substitute codes
- covid_share — fraction from COVID-related codes
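The HHI and upcoding ratio for a single provider can be computed like this (illustrative per-code payment amounts, not real billing data):

```python
import pandas as pd

# Per-code payments for one hypothetical provider.
codes = pd.DataFrame({
    "hcpcs":   ["99213", "99214", "99215"],
    "payment": [1000.0, 3000.0, 6000.0],
})

# HHI: sum of squared payment shares. 1.0 means all billing sits in a single code.
shares = codes["payment"] / codes["payment"].sum()
hhi = (shares ** 2).sum()

# Upcoding ratio: high-level E&M payments over low-level.
high = codes.loc[codes["hcpcs"].isin(["99214", "99215"]), "payment"].sum()
low = codes.loc[codes["hcpcs"] == "99213", "payment"].sum()
upcoding_ratio = high / low
```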
Temporal Features
- services_per_day — total services ÷ estimated working days. Flags physically impossible volumes
- beneficiaries_per_day — unique patients per working day
- years_active — how many years the provider appears in the dataset. Turns out this is the single most important feature
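The velocity check is back-of-envelope arithmetic. The 250-working-day year below is an assumption for illustration, not a figure from the pipeline:

```python
# A hypothetical provider billing 60,000 services over two active years.
total_services = 60000
years_active = 2
working_days = 250 * years_active  # assumed 250 working days per year

# 120 services per working day is implausible for a single clinician.
services_per_day = total_services / working_days
```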
5. Model Selection & Training
The Class Imbalance Problem
For every confirmed fraudster, there are 781 clean providers. A model that predicts "not fraud" for everyone achieves 99.87% accuracy. Accuracy is meaningless here.
We chose Random Forest for several reasons:
- Interpretability — feature importance scores tell us why the model flags someone, not just that it does. For a fraud detection tool, explainability matters.
- Class imbalance handling — with class_weight='balanced', Random Forest automatically upweights the minority class
- Robustness — handles mixed feature types, doesn't require normalization, resistant to outliers
- Training speed — fits in under 30 minutes on our dataset
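A minimal training sketch matching the configuration described (500 trees, balanced class weights). The real features and labels are swapped for a synthetic imbalanced set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the provider features, with a roughly 99:1 class ratio.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.99],
                           random_state=0)

clf = RandomForestClassifier(
    n_estimators=500,
    class_weight="balanced",  # upweights the rare fraud class
    random_state=0,
    n_jobs=-1,
)
clf.fit(X, y)

# Per-provider fraud probability (the score used for ranking and thresholding).
scores = clf.predict_proba(X)[:, 1]
```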
Cross-Validation Results
Fold 1: AUC 0.84 | Fold 2: AUC 0.81 | Fold 3: AUC 0.83
Fold 4: AUC 0.82 | Fold 5: AUC 0.83
Mean AUC: 0.83 (±0.01)
An AUC of 0.83 means: given a random fraud provider and a random clean provider, the model correctly ranks the fraudster higher 83% of the time. Not perfect, but meaningful — especially given the noise in our labels (LEIE includes non-fraud exclusions like license revocations).
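The stratified 5-fold AUC evaluation can be sketched as follows (synthetic data, and fewer trees than the production model for speed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the real provider features.
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.95],
                           random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)

# Stratified folds keep the fraud/clean ratio stable in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"Mean AUC: {aucs.mean():.2f} (±{aucs.std():.2f})")
```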
We also tried Gradient Boosting (XGBoost), which took 4+ hours to train and yielded only a marginal improvement of ~1–2% AUC. For a research tool where interpretability and iteration speed matter more than squeezing out the last percentage point, Random Forest was the right call.
6. Feature Importance
What does the model actually look at? Here are the top 10 features by Gini importance:
Gini importance from Random Forest (500 trees, balanced class weights)
The top features tell an interesting story:
- years_active (16.3%) — The single most important feature. Fraudsters tend to have shorter billing histories. They enter the system, bill aggressively, and get caught (or disappear) within a few years. Legitimate providers have decades-long careers.
- services_per_beneficiary (11.9%) — How many services per patient. Fraud often involves padding encounters — billing for services that didn't happen or weren't medically necessary.
- markup_ratio (8.0%) — Charge inflation. Fraudulent providers tend to submit charges much higher relative to what Medicare pays, suggesting aggressive overbilling.
- total_services (7.2%) — Sheer volume. Many fraud schemes are volume plays — doing the same thing thousands of times.
- payment_per_beneficiary (6.8%) — How much they extract per patient. High values suggest either unnecessary services or high-cost procedure abuse.
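Gini importances like those above come straight from the fitted model; a sketch with generic feature names standing in for the real ones:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
names = [f"feat_{i}" for i in range(8)]  # placeholders for years_active etc.

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X, y)

# feature_importances_ holds the Gini importances, which sum to 1.
imp = pd.Series(clf.feature_importances_, index=names).sort_values(ascending=False)
print(imp.head(10))
```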
7. Results
Key Findings
- 500 providers scored >86% fraud probability
- Model correctly flagged providers later charged by DOJ
- Top states: CA, FL, NY, TX, NJ — mirrors DOJ enforcement geography
- Internal Medicine (53%) + Family Practice (27%) = 80% of high-risk flags
- Mean AUC: 0.83 across 5-fold cross-validation
When we scored all 1.72 million providers, 500 scored above our 86% threshold. These aren't random outliers — they're providers whose billing patterns statistically resemble confirmed fraudsters across multiple dimensions simultaneously.
The most compelling validation: we trained the model on LEIE data (providers excluded before or during our data window), then checked it against DOJ prosecutions that came after. The model had already flagged several of these providers as high-risk. Our data predicted fraud before the Department of Justice announced charges.
The geographic distribution is also telling. Our top-flagged states — California, Florida, New York, Texas, New Jersey — are exactly the states where DOJ has historically concentrated healthcare fraud enforcement. The model independently discovered the same geographic patterns.
The specialty concentration is notable: 80% of high-risk flags are Internal Medicine or Family Practice. This makes sense — these are high-volume, office-visit-heavy specialties where billing fraud is easiest to execute and hardest to detect in individual claims.
8. Limitations & Ethics
We want to be extremely clear about what this model is and isn't.
What This Model Is NOT
- Not an accusation. A high fraud score means billing patterns statistically resemble confirmed fraudsters. There are many legitimate reasons for unusual billing.
- Not comprehensive. The model is trained on caught fraudsters. By definition, it may miss sophisticated schemes that haven't been detected yet.
- Not unbiased. If LEIE disproportionately includes certain specialties or regions (it does — enforcement resources aren't evenly distributed), the model inherits that bias.
- Not a replacement for investigation. Statistical flags are starting points for human review, not conclusions.
Survivorship bias is our biggest known limitation. We can only train on providers who got caught. If there's a class of sophisticated fraud that systematically evades detection, our model won't learn those patterns. We're training on the fraud that looks like caught fraud.
Label noise is another concern. The LEIE includes exclusions for reasons beyond fraud — license revocations, controlled substance violations, patient abuse. These providers may have different billing patterns than financial fraudsters. We treat all LEIE entries as positive labels, which adds noise.
We publish this work as a research and transparency tool, not as accusations. Every provider profile on OpenMedicare includes a disclaimer. We encourage anyone with concerns about a specific provider to report to the OIG rather than draw conclusions from statistical models alone.
9. What's Next
This is v1 of our fraud model. Here's what we're working on:
- Temporal models — Year-over-year changes in billing patterns. A provider whose billing doubles overnight is more suspicious than one who's always billed at high volume. We have 10 years of data; we should use the time dimension.
- Network analysis — Provider referral patterns. Fraud rings often involve multiple providers referring to each other. Graph-based features could capture this.
- Prescription data integration — CMS also publishes Medicare Part D prescriber data. Combining billing patterns with prescribing patterns could surface kickback schemes.
- Cleaner labels — Filtering LEIE to financial fraud exclusions only, excluding license-based exclusions that may not reflect billing fraud.
- Deep learning experiments — Sequence models on procedure-level billing history, treating each provider's billing as a time series.
10. Open Questions
We built this in the open because we believe healthcare transparency benefits from community scrutiny. There are questions we haven't answered — and some we probably haven't thought to ask.
- How should we handle specialty bias in LEIE? Should we train separate models per specialty?
- Is years_active a leaky feature? (Excluded providers stop billing — does the feature capture exclusion rather than predict it?)
- What's the right threshold? We used 86% — but the precision/recall tradeoff is a policy decision, not a technical one.
- How do you validate a fraud model when ground truth is inherently incomplete?
- What features are we missing that could separate "unusual but legitimate" from "unusual and fraudulent"?
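On the threshold question, scikit-learn's precision_recall_curve makes the tradeoff explicit so the policy choice can at least be made with the numbers in view (toy labels and scores below; in practice these come from held-out predictions):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy ground truth and model scores for ten hypothetical providers.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.7, 0.8, 0.85, 0.9])

# Each candidate threshold trades precision against recall.
prec, rec, thr = precision_recall_curve(y_true, y_score)
for p, r, t in zip(prec, rec, list(thr) + [None]):
    print(f"threshold={t}  precision={p:.2f}  recall={r:.2f}")
```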
Explore the Data Yourself
We've published the model's highest-risk flags with full billing breakdowns. Look at the numbers, check our work, and tell us what we're getting wrong.
Disclaimer: The fraud scores and billing patterns described in this article are statistical outputs from a machine learning model trained on publicly available data. They are not accusations of fraud. Individual cases may have legitimate explanations. Named providers have not been charged with any crime unless otherwise stated. If you suspect fraud, report it to the OIG Fraud Hotline (1-800-HHS-TIPS).
Data Sources
- Centers for Medicare & Medicaid Services (CMS) — Medicare Physician & Other Practitioners Data (2014–2023)
- HHS Office of Inspector General — List of Excluded Individuals/Entities (LEIE)
- Department of Justice — Healthcare Fraud Prosecution Records
- Government Accountability Office — Medicare Improper Payment Estimates
Last Updated: February 2026
Note: All data is from publicly available Medicare records. OpenMedicare is an independent journalism project not affiliated with CMS.