Why our model can't count.
And why that's exactly the point.
A few months ago, one of our sellers described their fit model to us like this:
"5'8" tall. 44-inch bust. Three accessories — a thin chain, a slim watch, simple hoops. Five fingers per hand, clearly visible. Two pockets on the jacket."
It's a precise description. It's how a stylist would brief a real photographer. It's how a real photographer would brief a real model. It's exactly the kind of instruction you'd expect a sophisticated generation system to handle gracefully.
It's also a recipe for the worst output you'll ever see.
If you've worked with generative image systems for any length of time, you already know the punchline. The five-fingers thing? Asking for it makes the output worse, not better. The 44-inch bust? The system can't measure. The "two pockets"? It might give you three. Or one. Or a pocket-shaped smudge.
This is the part of generative AI that engineers learn the hard way. Not from a paper. Not from a docs page. From a hundred outputs that almost worked, and one moment where you stop and ask: why does this get worse the more specific I am?
The illusion of precision
There's a deeply ingrained assumption among engineers — myself included — that goes something like: more precision in, more precision out. It's how compilers work. It's how databases work. It's how nearly every system we build works.
Generative image systems don't work like that.
They don't read your number. They don't measure. They have no concept of dimension. When you write "5'8"," they don't picture five-foot-eight. They pattern-match across an enormous space of images that may or may not have ever been tagged with anything close to that height. The number, to them, is a token in a sentence — like the word "blue" or "warm." It influences the output. It doesn't constrain the output.
The same is true for counts. Asking for "five fingers per hand" is one of the worst things you can do — it draws the model's attention to a thing it's already trying to get right, and the extra attention tends to make it worse, not better. The model already knows hands have five fingers. You don't need to tell it. Telling it makes it worried.
It's like asking someone not to think of a polar bear.
What we did about it
When we started building roopafy, every seller in our beta described their model the way that seller described theirs — with numbers. Measurements. Counts. Heights in feet and inches. Bust in inches. Hip in centimetres.
We had to figure out how to honour that intent without ever passing those numbers to the part of our pipeline that actually makes the pictures.
So we fine-tuned our models to think in language, not numbers.
A 5'8" height becomes "tall." A specific bust measurement becomes a figure type — curvy hourglass, slim athletic, soft pear. Three accessories becomes "minimal accessory styling — a thin chain, a slim watch, simple earrings." Five fingers per hand becomes silence — we trust the model's own anatomical prior, because the model is better at hands when nobody is asking it to be.
The seller still sees the form they filled out. They still see the numbers they typed. Those numbers are stored, they're real, they belong to that model in our system. But by the time the image generator receives the instruction, every number has been translated into the dialect the model actually understands.
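To make that translation step concrete, here is a minimal sketch in Python. It is an illustration, not roopafy's actual pipeline: the names (FitModelSpec, height_phrase, figure_phrase, accessory_phrase, to_prompt) and every threshold are assumptions invented for this example. The stored spec keeps the seller's numbers; the prompt that comes out contains none.

```python
from dataclasses import dataclass, field


@dataclass
class FitModelSpec:
    """What the seller typed into the form. Stored as-is, never sent to the generator."""
    height_in: float
    bust_in: float
    waist_in: float
    hip_in: float
    accessories: list[str] = field(default_factory=list)
    # Deliberately no finger-count field: that instruction becomes silence.


def height_phrase(height_in: float) -> str:
    # Cut-offs are illustrative assumptions, not tested values.
    if height_in >= 67:
        return "tall"
    if height_in >= 63:
        return "average height"
    return "petite"


def figure_phrase(bust_in: float, waist_in: float, hip_in: float) -> str:
    # Hypothetical rule of thumb for picking a figure type from measurements.
    if hip_in - bust_in >= 3:
        return "soft pear"
    if min(bust_in, hip_in) - waist_in >= 9:
        return "curvy hourglass"
    return "slim athletic"


def accessory_phrase(items: list[str]) -> str:
    # The count turns into a styling word; the items themselves stay as words.
    if not items:
        return "no accessories"
    style = "minimal" if len(items) <= 3 else "layered"
    return f"{style} accessory styling: " + ", ".join(items)


def to_prompt(spec: FitModelSpec) -> str:
    """Build the number-free description that the image generator receives."""
    parts = [
        height_phrase(spec.height_in),
        figure_phrase(spec.bust_in, spec.waist_in, spec.hip_in) + " figure",
        accessory_phrase(spec.accessories),
    ]
    return ", ".join(parts)


# The waist and hip values below are made up for the example.
spec = FitModelSpec(
    height_in=68, bust_in=44, waist_in=34, hip_in=46,
    accessories=["a thin chain", "a slim watch", "simple earrings"],
)
print(to_prompt(spec))
```

The point of the split is that the spec object and the prompt string are two different things: one is the record the seller owns, the other is what the generator is told.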
This is the work
It's tempting to look at a product like ours and call it "just a wrapper." The phrase is everywhere. Just a wrapper on AI. Just a wrapper on an API. Just a wrapper.
I'll tell you what's not in the wrapper.
A wrapper doesn't know that a numeric bust measurement is a worse prompt than a figure type. A wrapper doesn't know that "five fingers" makes hands worse and "well-formed hands" makes hands better. A wrapper doesn't know that camera-angle words work, and degree numbers don't. A wrapper doesn't have a translation layer for height, or weight, or accessory count, or pose specificity. A wrapper sends what you typed.
We send what works.
That's not a small distinction. That's months of testing every kind of instruction, recording what produced consistent output and what produced uncanny output, and codifying the difference into a layer the user never sees. It's craft. It's invisible. And it's the difference between a system that mostly works and a system you can build a business on.
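One way to picture "codifying the difference" is as small rules plus a guard that runs before anything reaches the generator. The sketch below is hypothetical: the function names, the pitch convention (positive means the camera looks down at the subject), and the 15-degree cut-offs are all assumptions made for illustration, not values from our testing.

```python
import re


def camera_angle_phrase(pitch_degrees: float) -> str:
    """Turn a camera pitch in degrees into a camera-angle word.

    Assumed convention: positive pitch looks down at the subject,
    negative pitch looks up. The cut-offs are illustrative only.
    """
    if pitch_degrees >= 15:
        return "high-angle shot"
    if pitch_degrees <= -15:
        return "low-angle shot"
    return "eye-level shot"


_DIGIT = re.compile(r"\d")


def assert_number_free(prompt: str) -> str:
    """Guard: refuse to pass a prompt that still contains a digit."""
    if _DIGIT.search(prompt):
        raise ValueError(f"untranslated number in prompt: {prompt!r}")
    return prompt


# A degree value goes in, a word comes out, and the guard lets it through.
prompt = assert_number_free(f"tall, {camera_angle_phrase(30)}")
print(prompt)
```

A guard like this is crude, since it only catches digits and not spelled-out counts such as "five fingers", but it shows where a check of this kind would sit.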
What this taught us
The most useful engineering lesson I've taken from building roopafy isn't a pattern or a framework. It's a posture.
When you're working with a system that has its own grammar — and generative AI absolutely has its own grammar — your job isn't to force it to speak yours. Your job is to learn how it actually wants to be spoken to, and then translate everything your users say into that dialect.
Numbers don't make pictures.
Words do.
This is the first post from inside roopafy engineering. We'll be sharing more: the work nobody sees, the lessons that don't make the changelog, and the parts of building a generative product that aren't on the marketing site.
Mason, Chief Architect, roopafy