Tata Steel People Counting
Real-time people detection and counting on industrial CCTV, built on YOLOv10 and a 15,000-image dataset curated by hand.
The problem
Industrial safety compliance at a steel plant depends on knowing how many people are inside specific zones at any moment — and CCTV operators can’t watch every feed. The task: detect and count people on live camera streams, in an environment that breaks most pretrained models. Steel plants have harsh backlighting, heat shimmer, reflective PPE, occlusion from machinery, and camera angles nothing in COCO prepares you for.
Why the dataset was the real project
An off-the-shelf YOLO checkpoint scored poorly on plant footage — most failures traced back to domain gap, not model capacity. So the bulk of the work became data:
- Pulled frames from plant CCTV across shifts, seasonal lighting conditions, and camera positions.
- Curated and annotated a 15,000-image dataset, balancing hard cases (partial occlusion, distant figures, glare) instead of letting easy frames dominate.
- Iterated with error analysis: train, run on held-out feeds, harvest the failure frames, annotate those, repeat.
That loop — not architecture tweaks — is what moved the metrics.
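The "harvest the failure frames" step was done by hand in this project. As a hedged illustration of how part of it could be automated, the sketch below ranks held-out frames by two label-free signals: detections in an uncertain confidence band, and a jump in the confident person count between consecutive frames. The `FrameResult` type, the `triage_score` and `harvest` helpers, and the band limits 0.25 and 0.6 are all assumptions made for this example, not part of the original pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FrameResult:
    frame_id: str
    confidences: List[float]  # one score per person detection in the frame


def triage_score(prev: Optional[FrameResult], cur: FrameResult,
                 low: float = 0.25, high: float = 0.6) -> int:
    """Higher score = more likely worth annotating (hypothetical heuristic)."""
    # Detections the model is unsure about.
    uncertain = sum(1 for c in cur.confidences if low <= c < high)
    # Change in the confident count versus the previous frame ("flicker").
    confident_now = sum(1 for c in cur.confidences if c >= high)
    if prev is None:
        confident_prev = confident_now
    else:
        confident_prev = sum(1 for c in prev.confidences if c >= high)
    return uncertain + abs(confident_now - confident_prev)


def harvest(results: List[FrameResult], top_k: int) -> List[str]:
    """Return up to top_k frame ids with a non-zero score, highest first."""
    scored = []
    prev = None
    for cur in results:
        scored.append((triage_score(prev, cur), cur.frame_id))
        prev = cur
    # sorted() is stable, so ties keep their original frame order.
    scored = sorted(scored, key=lambda item: -item[0])
    return [frame_id for score, frame_id in scored[:top_k] if score > 0]
```

A heuristic like this only narrows the candidate pool. A person still has to confirm which frames are real failures before they go to annotation.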
Architecture
    # the core loop, simplified
    for frame in stream:
        detections = yolo(frame)                    # YOLOv10, person class
        tracks = bytetrack.update(detections)
        zone_counts = count_by_zone(tracks, zones)
        publish(zone_counts)                        # dashboard + alerting
- YOLOv10 for detection, chosen for its NMS-free design and latency profile on the available hardware.
- ByteTrack for tracking, so a person occluded for a few frames keeps their identity instead of being double-counted.
- Zone logic on top of track positions, so counts are per-area rather than per-frame totals.
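To make the zone step concrete, here is a minimal sketch of what `count_by_zone` could look like. It assumes tracks arrive as a dict of track id to `(x1, y1, x2, y2)` box and zones as named polygons in pixel coordinates. Those data shapes and the `point_in_polygon` helper are assumptions for illustration; the real implementation is not shown in this write-up.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_by_zone(tracks, zones):
    """Count tracks per zone using each track's box centroid."""
    counts = {name: 0 for name in zones}
    for track_id, (x1, y1, x2, y2) in tracks.items():
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        for name, polygon in zones.items():
            if point_in_polygon(cx, cy, polygon):
                counts[name] += 1
    return counts
```

With two adjacent rectangular zones and three tracks, one of them outside both zones, this returns one person per zone. Using the centroid rather than box overlap means a person straddling a boundary is assigned to a single zone.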
Decisions & tradeoffs
- Tracking over per-frame counting. Raw per-frame counts flicker badly under occlusion. Track-based counting smooths this at the cost of a tracker to tune.
- Smaller model on better data over a larger model on dirty data. The live-feed requirement fixed the latency budget, so accuracy had to come from the data rather than from model size.
- Confidence thresholds tuned per camera, not globally. Plant cameras differ too much for one threshold to be honest.
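Per-camera thresholds can be as simple as a lookup with a fallback. The sketch below is one way to express that. The camera ids, the threshold values, and the detection dict shape with a `conf` key are all hypothetical, chosen only to show the mechanism.

```python
# Hypothetical values; real thresholds would be tuned on each camera's footage.
DEFAULT_THRESHOLD = 0.5
CAMERA_THRESHOLDS = {
    "cam_furnace_01": 0.35,  # e.g. a glare-heavy view where scores run low
    "cam_gate_02": 0.6,      # e.g. a clean view where a stricter cut is safe
}


def filter_detections(camera_id, detections,
                      thresholds=CAMERA_THRESHOLDS,
                      default=DEFAULT_THRESHOLD):
    """Keep detections at or above the threshold configured for this camera."""
    threshold = thresholds.get(camera_id, default)
    return [d for d in detections if d["conf"] >= threshold]
```

Keeping the thresholds in data rather than in code means a camera can be retuned without redeploying the detector.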
What broke and what I’d change
- Early versions double-counted people at zone boundaries; fixed by counting on track centroids with hysteresis rather than bounding-box overlap.
- Night-shift IR footage needed its own augmentation strategy — grayscale-ish, bloomed highlights. The first dataset iteration underrepresented it.
- If I rebuilt it: invest in an automated annotation-triage pipeline from day one. Manually picking hard frames was the bottleneck of every iteration.
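The zone-boundary fix in the first bullet above relies on hysteresis. The write-up does not say which form was used, so the sketch below assumes a temporal one: a track's zone membership only flips after its centroid has disagreed with the confirmed state for `confirm_frames` consecutive frames. The `ZoneMembership` class and its parameter are illustrative names, not the project's code.

```python
class ZoneMembership:
    """Debounced inside/outside state per track for a single zone."""

    def __init__(self, confirm_frames=5):
        self.confirm_frames = confirm_frames
        self._state = {}   # track_id -> confirmed inside (True) or outside (False)
        self._streak = {}  # track_id -> consecutive frames contradicting the state

    def update(self, track_id, inside_now):
        """Feed one raw centroid-in-zone observation; return the confirmed state."""
        if track_id not in self._state:
            # First sighting: accept the observation as the initial state.
            self._state[track_id] = inside_now
            self._streak[track_id] = 0
            return inside_now
        if inside_now == self._state[track_id]:
            self._streak[track_id] = 0
        else:
            self._streak[track_id] += 1
            if self._streak[track_id] >= self.confirm_frames:
                self._state[track_id] = inside_now
                self._streak[track_id] = 0
        return self._state[track_id]

    def forget(self, track_id):
        """Drop a track the tracker has retired so it stops being counted."""
        self._state.pop(track_id, None)
        self._streak.pop(track_id, None)

    def count(self):
        """Number of tracks currently confirmed inside the zone."""
        return sum(1 for inside in self._state.values() if inside)
```

A centroid jittering across the line for a frame or two no longer changes the count. The cost is a delay of `confirm_frames` frames before a genuine entry or exit registers, which has to be weighed against the alerting latency the plant needs.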