Amazon — Nathan Ghabour

The Challenge

Two questions no one had answered yet

When Amazon set out to build Halo's body composition feature, which estimates body fat percentage from smartphone photos, two questions were unanswered. How accurate did the technology need to be for someone to trust it with data about their own body? And with multiple competing technical approaches in development, who decides which one is better?

The first question required getting in a room with real people. The second required an evaluation process no team could game. I worked on both.

Amazon Halo — Body Composition

3D body model generated from smartphone photos — no hardware, no lab visit

The core problem

"How accurate does this need to be before a person looks at a model of their own body and believes it?"

That's not an engineering question. It's a research question. Nobody had run the research.

The Question I Set Out To Answer

Consumer accuracy thresholds for body data had never been systematically studied

The Approach — Part 1

Finding the accuracy threshold consumers actually needed

I designed and ran a consumer accuracy study with active exercisers, the core demographic for a health wearable. Each person was shown body models of themselves generated by four different technical approaches, from a simple height-and-weight estimate to a high-resolution laser scan. We asked one question: which of these is you?

The answers were immediate and unanimous in ways we hadn't anticipated.

2× accuracy improvement over leading smart scales at Halo launch
4 accuracy levels tested, from weight estimate to laser scan
100% immediately identified which models were higher vs. lower accuracy
<10% identified with anything below a high-accuracy scan. The bar was much higher than expected.
76% preferred an ambitious fitness goal state over a moderate one
100% said achieving their goal body was very possible. No one found it intimidating.

What People Actually Used to Decide "Is This Me?"

Participants didn't evaluate accuracy abstractly. They looked for specific body details: collarbones, belly buttons, how clothing interacted with their shape. These were the signals that said "this is my body" versus "this is a body like mine."
People who preferred photos to avatars gave the same reason: the avatar didn't have enough personal detail. The bar wasn't proportional correctness. It was recognizable specificity.
Facial features mattered. Clothing interaction mattered. Generic body shapes, even proportionally accurate ones, didn't pass the self-identification test.
This changed the evaluation criteria entirely: accuracy had to be measured at the level of consumer self-identification, not deviation from a reference scan.

"Is this you?" — Results by model accuracy level

Height + weight estimate: <10%
Photo + height/weight: ~30%
Averaged scan + photo: ~60%
High-resolution scan: 90%+

This set the accuracy target every technical approach had to meet
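
To make the threshold logic concrete, here is a minimal sketch of how "is this you?" answers could be tallied per accuracy level and checked against a target. It is an illustration, not the study's analysis code: the response counts are invented placeholders chosen only to resemble the approximate rates above, and the names `RESPONSES`, `self_id_rate`, `lowest_level_meeting`, and the 90% target are assumptions.

```python
from typing import Optional

# Hypothetical tallies per accuracy level: (answered "yes, this is me", total shown).
# Placeholder counts for illustration only, not the study's data.
RESPONSES = {
    "height_weight": (1, 20),
    "photo_height_weight": (6, 20),
    "averaged_scan_photo": (12, 20),
    "high_res_scan": (19, 20),
}

# Levels ordered from least to most accurate (dicts preserve insertion order).
LEVELS = list(RESPONSES)


def self_id_rate(yes: int, total: int) -> float:
    """Share of participants who recognized themselves in the model."""
    if total <= 0:
        raise ValueError("total must be positive")
    return yes / total


def lowest_level_meeting(target: float) -> Optional[str]:
    """Return the least accurate level whose self-identification rate meets the target."""
    for level in LEVELS:
        yes, total = RESPONSES[level]
        if self_id_rate(yes, total) >= target:
            return level
    return None


for level in LEVELS:
    print(f"{level}: {self_id_rate(*RESPONSES[level]):.0%}")
print("lowest level meeting a 90% target:", lowest_level_meeting(0.90))
```

The point of the sketch is the direction of the check: the target is a self-identification rate set by participants, and each technical approach is tested against it.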

Key insight: "Accurate enough" is not a number your engineers can define. It's a threshold your users set. The question isn't how close your model is to a clinical reference. It's whether a real person, looking at a model of their own body, recognizes themselves in it. Related, but not the same standard. Only one determines whether your product gets trusted.

The Approach — Part 2

Running an objective comparison between competing technical approaches

With a consumer accuracy bar defined by research, the next challenge was building an evaluation process rigorous enough to make a defensible call. When multiple technical teams each believe their approach is better, you need a shared framework that no one can game.

What Made the Evaluation Rigorous

Representative test sets: Evaluation datasets reflected the actual diversity of consumer bodies, not the controlled lab population that engineering teams tend to over-optimize for. A model that works on fit, athletic bodies and fails on everyone else is not a consumer product.
Consumer-grounded metrics: Accuracy was measured against the body regions participants used to self-identify, not aggregate deviation from a reference scan. The user research directly informed what "better" meant.
Separation of training and evaluation: I managed the dataset pipelines across both teams to ensure neither was shaping its model toward the evaluation set, a subtle failure mode when teams have visibility into each other's work.
Consistent test conditions: Every approach was tested on real-world captures with varied lighting, different phone models, and home environments, not the controlled studio setup that makes every approach look better than it performs in practice.
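
The separation of training and evaluation data can be checked mechanically. The sketch below is one plausible way to do it, assuming each capture is tagged with a subject ID. The function names and the IDs are hypothetical, and this is not the pipeline used on Halo.

```python
from typing import Dict, Iterable, Set


def leaked_subjects(train_ids: Iterable[str], eval_ids: Iterable[str]) -> Set[str]:
    """Return subject IDs that appear in both a training set and the evaluation set."""
    return set(train_ids) & set(eval_ids)


def check_holdout(
    team_train_ids: Dict[str, Iterable[str]], eval_ids: Iterable[str]
) -> Dict[str, Set[str]]:
    """Map each team to the evaluation subjects found in its training data."""
    eval_set = set(eval_ids)
    return {team: leaked_subjects(ids, eval_set) for team, ids in team_train_ids.items()}


# Hypothetical subject IDs, for illustration only.
train = {"team_1": ["s01", "s02", "s03"], "team_2": ["s04", "s05", "s10"]}
held_out = ["s10", "s11", "s12"]

report = check_holdout(train, held_out)
for team, overlap in report.items():
    status = "clean" if not overlap else f"leak: {sorted(overlap)}"
    print(f"{team}: {status}")
```

Checking at the subject level rather than the image level matters here: two different photos of the same person in training and evaluation would still let a model fit the evaluation set.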

Evaluation Framework

Technical Approach A (Team 1) + Technical Approach B (Team 2)
→ Shared Evaluation Dataset: consumer-representative, real-world conditions, consumer accuracy metrics
→ Defensible algorithmic decision: one approach wins on consumer metrics
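
As a rough illustration of why consumer-grounded metrics can change the outcome, the sketch below scores two approaches on the same shared samples, once with a plain average error and once with an error weighted toward the regions people use to recognize themselves. Everything in it is assumed for illustration: the region names, the weights, the error values, and the function names. It is not the metric used on Halo.

```python
from statistics import mean
from typing import Callable, Dict, List

RegionErrors = Dict[str, float]

# Hypothetical weights favoring regions participants used to self-identify.
REGION_WEIGHTS = {"collarbone": 0.4, "navel": 0.4, "torso": 0.2}


def weighted_error(errors: RegionErrors) -> float:
    """Per-sample error weighted toward self-identification regions."""
    total = sum(REGION_WEIGHTS.values())
    return sum(w * errors[r] for r, w in REGION_WEIGHTS.items()) / total


def unweighted_error(errors: RegionErrors) -> float:
    """Per-sample plain average across regions (aggregate deviation)."""
    return mean(errors.values())


def pick_winner(
    a: List[RegionErrors],
    b: List[RegionErrors],
    metric: Callable[[RegionErrors], float],
) -> str:
    """Compare two approaches on the same samples; lower mean error wins."""
    if len(a) != len(b):
        raise ValueError("both approaches must be scored on the same shared samples")
    score_a = mean(metric(e) for e in a)
    score_b = mean(metric(e) for e in b)
    if score_a == score_b:
        return "tie"
    return "A" if score_a < score_b else "B"


# Invented per-region errors (arbitrary units) on two shared samples.
approach_a = [
    {"collarbone": 1.0, "navel": 2.0, "torso": 5.0},
    {"collarbone": 2.0, "navel": 1.0, "torso": 5.0},
]
approach_b = [
    {"collarbone": 3.0, "navel": 3.0, "torso": 1.0},
    {"collarbone": 3.0, "navel": 3.0, "torso": 1.0},
]

print("aggregate metric picks:", pick_winner(approach_a, approach_b, unweighted_error))
print("consumer-weighted metric picks:", pick_winner(approach_a, approach_b, weighted_error))
```

With these made-up numbers the two metrics disagree, which is the scenario the framework is meant to settle: the definition of "better" is fixed by the consumer research before either team is scored.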

Understanding How People Wanted to Use Body Data

The same research surfaced how consumers wanted to engage with body data: not just to measure themselves, but to see where they were going.
76% preferred an ambitious fitness goal over a moderate one. 100% rated the ambitious version as achievable. No one found it intimidating.
The research pointed to a clear product direction: body fat percentage as a standalone number doesn't motivate change. People needed to see where they were going, not just where they were.
That insight raised the accuracy bar, not lowered it. If the model doesn't look like you, you won't trust your progress. The avatar has to earn belief before the number matters.

What This Enabled Downstream

A clear basis for selecting the algorithm that shipped in Halo, backed by consumer research rather than internal benchmarks
The 2× improvement over smart scales wasn't just a technical comparison. It was calibrated to the threshold at which consumers actually recognized themselves in the result.
Product direction for fitness visualization: how to frame goal states, what level of detail consumers needed, and which input types were trustworthy enough to build on
A reusable evaluation methodology. The same framework applies to any body data product making accuracy claims: consumer-representative test sets, real-world conditions, metrics grounded in how people actually perceive their own bodies

The Outcome

Halo launched with 2× the accuracy of leading smart scales.

Amazon Halo launched as one of the most accurate consumer body composition products available. The user research established the bar. The evaluation process determined which approach could clear it. Together, they backed a claim that held up at launch.

Why It Matters

What I can do for teams building body data products

Most teams making accuracy claims about body data do it one of two ways: they take whatever the engineering team says is accurate, or they compare against a clinical reference metric no consumer has ever heard of. Neither approach answers the question a customer is actually asking: "Can I trust this with data about my own body?"

This methodology is transferable. If you're building a product that makes claims about people's bodies, I can help you define the consumer accuracy bar, build the evaluation infrastructure to measure against it, and translate technical performance into decisions your team can act on.

The accuracy bar for body data products has to be set by the people who will use them. That's research work before it's engineering work.

Building something in this space?

I help teams figure out what "accurate enough" actually means for their consumers, and build the evaluation process to get there.


Body Data Is Only Useful If People Can See Themselves In It