Raising the bar: How Brellium beat industry benchmarks for E/M Coding reliability

How Brellium's AI-powered coding software outperformed published benchmarks for human-to-human E/M coding agreement, and what that means for your revenue cycle.

Up to 92% agreement with expert in-house coders
30+ percentage points above industry benchmarks for reliability

When behavioral health leaders hear about Coding Module, they all respond the same way: That sounds great, but coding requires clinical judgment. How can I trust AI to get this right?

It's a fair question. Medical coding translates complex, unstructured clinical notes into the billing codes that justify revenue, track patient progress, and surface practice utilization data. Get it wrong and the consequences compound quickly: overcoding creates repayment liability, undercoding leaves legitimate revenue on the table, and claim denials mean hours of unbillable rework.

To demonstrate the power of Coding Module directly, we put Brellium head-to-head against two clients' own in-house expert coders, each reviewing 200 real behavioral health notes.

Brellium agreed with the coders between 85% and 92% of the time, well above published benchmarks for human-to-human coding agreement, which range from 50% to 70%.

The challenge: coding decisions can be subjective

Academic studies of E/M coding accuracy have consistently found middling agreement on coding decisions made by humans, no matter their expertise level.

For example, the landmark study on E/M coding agreement found that even trained coding specialists assigned the same E/M level only 57% of the time — and as low as 50% on individual cases.

After the 2021 MDM guideline overhaul, ENT providers without structured feedback achieved accurate coding only 40% of the time, and training raised accuracy only to 70%.

These aren't outliers — a systematic review of 18 outpatient billing studies found agreement rates consistently in the moderate range across specialties.

The reason is that E/M coding requires interpretation and judgment calls.

Providers can assign an E/M code based on either the total time spent on the encounter or the complexity of the medical decision making (MDM) involved in care. MDM is determined by three elements: the number and complexity of problems addressed, the amount and complexity of data reviewed, and the risk of complications or morbidity. The overall MDM level is the highest level that at least two of the three elements meet or exceed.

Consider how this could look in your practice. Your team must code the following simplified note:

"Patient returns for follow-up of generalized anxiety disorder. PHQ-9 reviewed; score improved from 14 to 9. Continuing current SSRI at same dose. No medication changes today."

Coder A assigns CPT Code 99213 (low complexity). Only GAD is being actively addressed; the patient's comorbid MDD is not documented as managed at this visit. This coder sees one stable chronic condition, one data point reviewed, and a prescription continuing unchanged. Two of three MDM elements are low, so they rate the overall level as low.

Coder B assigns CPT Code 99214 (moderate complexity). In their read, a patient on an SSRI for comorbid GAD and MDD is never truly being managed for just one condition: the medication choice reflects both diagnoses, even if only one was the visit's focus. They see two chronic conditions under management, which tips problem complexity to moderate. With risk already at moderate, two of three elements now meet the moderate threshold.

Who is correct?

The answer comes down to whether an undocumented but clinically present diagnosis counts as a condition being "addressed" — a judgment call the CPT guidelines don't resolve.
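The disagreement can be sketched in a few lines of code. Both coders apply the same rule, under which the overall MDM level is the highest level that at least two of the three elements reach; they differ only on one input. This is a simplified illustration, not Brellium's implementation: the function name `mdm_level`, the level labels, and the element ratings are assumptions taken from the example above, and the code mapping covers established-patient office visits only.

```python
# Ordered MDM levels, lowest to highest (simplified labels for this sketch).
LEVELS = ["straightforward", "low", "moderate", "high"]

# Established-patient office visit codes by MDM level (illustrative mapping).
CODE_BY_LEVEL = {
    "straightforward": "99212",
    "low": "99213",
    "moderate": "99214",
    "high": "99215",
}


def mdm_level(problems, data, risk):
    """Return the highest level met or exceeded by at least two of three elements."""
    ranks = sorted(LEVELS.index(level) for level in (problems, data, risk))
    # With three sorted ranks, the middle one is the highest level
    # that at least two elements reach.
    return LEVELS[ranks[1]]


# Coder A: one stable chronic condition, so problem complexity is low.
coder_a = mdm_level(problems="low", data="low", risk="moderate")

# Coder B: two chronic conditions under management, so problems are moderate.
coder_b = mdm_level(problems="moderate", data="low", risk="moderate")

print(CODE_BY_LEVEL[coder_a], CODE_BY_LEVEL[coder_b])  # 99213 99214
```

Changing a single element rating is enough to move the visit from one code to the next, which is why the question of what counts as "addressed" matters so much.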

The solution: software that codes like you

We designed Coding Module to be customizable, to navigate nuance, and ultimately to feel like another member of your team.

Coding Module analyzes every clinical note against AMA MDM guidelines and outputs a recommended E/M code with a clear rationale. Our recommendations draw from more than 9 million behavioral health encounters and sharpen over time. As you review Brellium's selections and flag your preferences, the system learns your organization's risk tolerance, documentation standards, and coding conventions.

For organizations that want to go further, Managed Coding removes the operational work entirely. Rather than flagging misaligned codes for your team to resolve, Brellium's AAPC-certified coders handle the review, correction, and provider communication on your behalf, so accurate claims go out without adding work to your billing team or clinicians.

Both Coding Module and Managed Coding give you the coverage and confidence to know every encounter is coded correctly — with less manual overhead.

How we stack up against the other options

To validate this workflow, we directly compared Brellium’s coding decisions against those made by our clients’ in-house coders to see how often the two methods came to the same decision.

We partnered with two of our behavioral health clients: one is a comprehensive outpatient provider that offers services across multiple states, and the other is a telehealth provider that operates nationwide.

Brellium agreed with the coders between 85% and 92% of the time — far surpassing the published industry benchmarks for human-human coding agreement.

By comparison, most studies testing human-to-human coding agreement find that raters align just 55% to 60% of the time, a gap of 25 to 36 percentage points.
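For readers who want to run the same check on their own notes, agreement of this kind can be computed as the share of notes on which two raters chose the same code. The sketch below assumes simple exact-match agreement; the function name `percent_agreement` and the sample codes are hypothetical, not data from the comparison described here.

```python
def percent_agreement(codes_a, codes_b):
    """Share of notes (0 to 100) on which two raters assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both raters must code the same non-empty set of notes")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100 * matches / len(codes_a)


# Hypothetical codes for five notes, for illustration only.
in_house = ["99213", "99214", "99214", "99213", "99215"]
software = ["99213", "99214", "99213", "99213", "99215"]

print(percent_agreement(in_house, software))  # 80.0
```

Exact-match agreement is easy to interpret, but it does not adjust for matches that would occur by chance, so it is best compared against benchmarks measured the same way.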

Brellium is the first in the industry to publish performance data supporting its AI-powered coding product. Other teams might demonstrate the time or cost savings of their coding tools, but for any tool to provide lasting ROI, it must generate results customers can reliably trust.

When we compare Brellium to the other options, Brellium outperforms them. For example:

  • Outsourced coding teams rarely publish data on how often their coders agree with yours. Based on the medical and coding literature, even the best-trained coders are unlikely to achieve more than moderate agreement with your team.
  • A 2024 analysis of GPT, Gemini, and Llama Pro found standard LLMs failed to hit a 50% exact-match rate when asked to generate medical codes.
  • In-house manual review will generate codes that are more aligned to your organization, but it is too time- and resource-intensive to scale.

To learn more about Brellium or see Coding Module in action, book a personalized demo with our team.
