15,000 images later: what industrial CV taught me about data work
The model was never the problem
When the people-counting system at Tata Steel underperformed, my instinct — like everyone’s — was to reach for a bigger model. The pretrained checkpoint scored fine on benchmark footage and badly on plant footage, and the gap had nothing to do with capacity. Steel plants are a domain-shift machine: backlit catwalks, heat shimmer, reflective PPE, occlusion from machinery, camera angles no public dataset contains.
Almost every point of improvement came from the dataset. The loop looked like this:
- Train on what we had.
- Run on held-out camera feeds.
- Harvest the failures — frames where detection missed, flickered, or double-counted.
- Annotate those frames specifically.
- Repeat.
Fifteen thousand images later, the distribution of the training set looked like the plant, not like COCO — and the metrics followed.
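The harvesting step can be sketched in a few lines. This is only an illustration, not the pipeline we ran: it assumes you already have a per-frame person count for a held-out feed, and it flags frames whose count disagrees with the local median as flicker candidates for annotation. The function name, the window size, and the sample counts are all made up for the example.

```python
from statistics import median


def harvest_flicker_frames(counts, window=5):
    """Return indices of frames whose count disagrees with the local median.

    counts: per-frame person counts from one camera feed (hypothetical input).
    window: number of frames in the neighbourhood, centred on each frame.
    """
    half = window // 2
    flagged = []
    for i, c in enumerate(counts):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        # A one-frame dip or spike stands out against its neighbours.
        if c != median(counts[lo:hi]):
            flagged.append(i)
    return flagged


# Made-up counts: a missed detection at frame 3, a double count at frame 6.
counts = [3, 3, 3, 2, 3, 3, 4, 3, 3]
print(harvest_flicker_frames(counts))  # [3, 6]
```

A median filter only catches short glitches. Sustained misses need ground truth or a second signal, so treat this as one harvesting heuristic among several.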
Hard frames are worth more than easy frames
The first dataset iteration was sampled uniformly from footage, which means it was dominated by easy frames: well-lit, unoccluded, medium distance. The model aced those and failed where it mattered. Rebalancing toward hard cases — distant figures, partial occlusion, glare — was worth more than doubling the dataset size.
The uncomfortable version of this lesson: annotation time is the scarcest resource in applied CV, and spending it uniformly is spending it badly. Triage before you annotate.
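One way to make triage concrete is to split the annotation budget by difficulty bucket instead of sampling uniformly. The sketch below assumes frames have already been tagged into buckets by some cheap heuristic. The bucket names, weights, and frame counts are illustrative, not our real numbers.

```python
def allocate_annotation_budget(available, weights, budget):
    """Split an annotation budget across buckets in proportion to weight.

    available: bucket name -> number of unlabeled frames in that bucket.
    weights:   bucket name -> relative priority (higher = annotate more).
    budget:    total number of frames we can afford to annotate.
    Each quota is capped by what the bucket actually contains; any
    leftover budget is not redistributed in this simple version.
    """
    total_weight = sum(weights[b] for b in available)
    quota = {}
    for bucket, n_frames in available.items():
        share = round(budget * weights[bucket] / total_weight)
        quota[bucket] = min(n_frames, share)
    return quota


# Hypothetical pool: uniform sampling would draw about 90% easy frames.
available = {"easy": 9000, "distant": 600, "occluded": 300, "glare": 100}
weights = {"easy": 1, "distant": 3, "occluded": 3, "glare": 3}
print(allocate_annotation_budget(available, weights, budget=1000))
# {'easy': 100, 'distant': 300, 'occluded': 300, 'glare': 100}
```

Note how the glare bucket runs dry at 100 frames: a cap like that is a signal to go collect more footage of that condition rather than pad the set with easy frames.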
Night shift is a different dataset
IR night footage isn’t “the same images, darker.” It’s nearly grayscale, highlights bloom, and contrast behaves differently. Treating it as the same distribution quietly tanked night-time accuracy — the failure was invisible in aggregate metrics until we sliced by time of day.
Slice your metrics by every operational condition you can name. Aggregates hide exactly the failures your users will find.
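A minimal sketch of slicing, assuming each evaluation clip carries a condition tag plus the number of people expected and the number found. The record layout and the numbers are invented to show the effect: a healthy aggregate sitting on top of a broken slice.

```python
from collections import defaultdict


def recall_by_slice(records, key):
    """Compute recall separately for each value of records[i][key]."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [found, expected]
    for r in records:
        totals[r[key]][0] += r["found"]
        totals[r[key]][1] += r["expected"]
    return {
        name: found / expected
        for name, (found, expected) in totals.items()
        if expected  # skip slices with no ground-truth people
    }


# Made-up evaluation records: lots of day footage, a little night footage.
records = [
    {"shift": "day", "expected": 60, "found": 55},
    {"shift": "day", "expected": 40, "found": 35},
    {"shift": "night", "expected": 10, "found": 5},
]

overall = sum(r["found"] for r in records) / sum(r["expected"] for r in records)
print(round(overall, 2))                  # 0.86
print(recall_by_slice(records, "shift"))  # {'day': 0.9, 'night': 0.5}
```

The same function works for any tag you record per clip (camera, weather, lens type), which is the argument for logging those tags at capture time.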
Trackers fix counting, not detection
Per-frame counts flicker: a person occluded for five frames drops out of the count and then reappears as a new detection, so one person gets counted as two. ByteTrack on top of YOLOv10 fixed the counting problem by preserving identity through occlusion. It’s worth being precise, though: tracking compensates for missing detections; it doesn’t improve them. If detection is weak, the tracker launders the weakness into smoother-looking wrong numbers.
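The distinction is easy to see in a toy model. The sketch below is not ByteTrack and uses none of its API. It follows a single person as a sequence of per-frame hit or miss flags and bridges gaps of up to max_gap frames, which is the identity-preserving idea reduced to its simplest form. All names and sequences are hypothetical.

```python
def count_appearances(detected, max_gap=0):
    """Count distinct appearances in a per-frame hit/miss sequence.

    detected: list of booleans, True where the detector fired.
    max_gap:  misses of up to this many frames are bridged, i.e. treated
              as the same person continuing (0 = no tracking at all).
    """
    count = 0
    gap = None  # frames since the last hit; None means nobody seen yet
    for hit in detected:
        if hit:
            if gap is None or gap > max_gap:
                count += 1  # first sighting, or gap too long to bridge
            gap = 0
        elif gap is not None:
            gap += 1
    return count


# One person, occluded for five frames in the middle of the clip.
occluded = [True] * 5 + [False] * 5 + [True] * 5
print(count_appearances(occluded, max_gap=0))  # 2: counted twice
print(count_appearances(occluded, max_gap=5))  # 1: identity preserved

# One person the detector never finds: no gap setting can recover them.
missed = [False] * 15
print(count_appearances(missed, max_gap=5))    # 0
```

The last case is the point: bridging can only connect detections that exist, so a person the detector never sees stays uncounted.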
What I’d tell past me
- Build the failure-harvesting pipeline before training run two, not after run ten.
- Tune thresholds per camera. Cameras are individuals.
- The dataset is the model. The architecture is a detail you pick once.
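For the per-camera threshold point, here is a rough sketch of what tuning can look like: for each camera, pick the confidence threshold that maximizes F1 on that camera's own validation detections. The input format, candidate thresholds, and scores are assumptions made up for the example. In practice you might optimize count error instead of F1.

```python
def best_threshold(scored, n_truth, candidates):
    """Pick the candidate threshold with the highest F1 for one camera.

    scored:     list of (confidence, is_correct) pairs for detections.
    n_truth:    number of ground-truth people in the validation frames.
    candidates: thresholds to try; ties keep the earliest candidate.
    """
    best_t, best_f1 = None, -1.0
    for t in candidates:
        kept = [ok for score, ok in scored if score >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_truth - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t


# Hypothetical validation detections for two cameras, 3 real people each.
cameras = {
    "well_lit": [(0.9, True), (0.8, True), (0.6, True), (0.4, False)],
    "backlit": [(0.45, True), (0.40, True), (0.35, True), (0.2, False)],
}
thresholds = {
    name: best_threshold(dets, n_truth=3, candidates=[0.3, 0.5, 0.7])
    for name, dets in cameras.items()
}
print(thresholds)  # {'well_lit': 0.5, 'backlit': 0.3}
```

A single global threshold of 0.5 would be ideal for the first camera and would drop every correct detection on the second.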