NLP & Computer Vision That Works on Your Data

Pre-built APIs hit 93% on clean demos and 45% on your actual handwriting. Cloud vision models look great in the lab and fall apart on your factory floor's real lighting. We build NLP and computer vision systems that close that gap: trained on your data, calibrated to your environment, with the rules and exception handling that production needs.

Pixelfield is for CTOs, VPs of Engineering and operations leads who've already tried the API, hit the wall and need production accuracy on real-world documents, images and text. The seniors who scope the work also write the code. 50+ AI features in production, fastest deployment two months from kickoff to live, and a reputation for telling you when Vision API is enough.

  • Trained on your data, not benchmarks
  • Hybrid: model + rules + exception handling
  • Environmental calibration for CV
  • Honest about when pre-built wins
Veolia, Universal Studios, Mercedes, Vienna Insurance Group, Raiffeisen Bank, Geometry, Wagestream, Cinestar, WMC | GREY, NOAH, Ogilvy, Ameli
4.9/5 on Google
4.8/5 on Trustpilot
5.0/5 on Clutch

We've worked with startups and big brands across the UK, Europe and the US since 2013.

A London-based team that collaborates with you to build something that holds up in production.

/ Deliverables
What We Build (and the Real-World Problem It Solves)
01
Document Processing and OCR (Including Handwriting)
Print? Cloud APIs do 93-95%, fine. Handwriting, mixed-layout forms, multi-page reports, technical drawings? Cloud APIs collapse to 45-50%, and VLMs start hallucinating past page three. We build specialised OCR pipelines that hold 95%+ on real handwriting, structured fields and narrative text, with rules and exception handling around the model.
Reference workflows in industry have processed 150,000+ handwritten pages over twelve months at 95% accuracy where APIs were doing 45-50% with massive correction overhead.
OCR
HANDWRITING RECOGNITION
DOCUMENT EXTRACTION
FORMS PROCESSING
02
Text Classification, Extraction and Search
Named entity recognition, intent classification, ticket routing, sentiment on real customer text. Pre-trained models hit 80-90% on clean text and drop fast on your domain jargon, slang and edge cases. We fine-tune on your labelled data, ship with confidence thresholds and human review for low-confidence cases, and tell you up front whether your data needs labelling work first.
For dedicated chat or generative work, see our LLM Development and Chatbot pages.
NER
TEXT CLASSIFICATION
ENTITY EXTRACTION
SEMANTIC SEARCH
03
Visual Inspection and Object Detection
Defect detection on production lines, package counting in warehouses, retail shelf monitoring, safety violations on construction sites. Lab accuracy is not factory-floor accuracy. We do site-specific calibration: lighting, camera mounting, motion budget, hard-negative mining for background leakage, drift checks against new conditions.
Pilot lab does 98%. Factory floor with windows does 38% until you fix the lighting. We address the physical environment first, the model second.
VISUAL INSPECTION
OBJECT DETECTION
RETAIL CV
SAFETY MONITORING
04
Edge and Real-Time Vision
Camera feeds at 30fps are where cloud APIs stop being viable. One reference Rekognition deployment hit around $2,280 per camera per month. We deploy on-device or edge: ONNX Runtime, TensorRT Lite, TFLite or Rust wrappers, with hardware-validated accuracy (the same INT8 model can drift between 71% and 93% across different chips). Privacy, latency and predictable cost.
On-prem and edge deployments for medical, manufacturing and security use cases where data residency or sub-50ms latency rules out the API.
EDGE INFERENCE
ONNX
REAL-TIME CV
ON-PREM DEPLOYMENT

Why Most NLP and CV Projects Fail in Production (and What We Do About It)

The Benchmark-to-Production Gap

Models score great on benchmarks and fall apart on real data. Your first production runs typically land at 25-50% of what you saw in the lab and need ten times the effort to recover. We design the gap into the scope from week one: paid PoC on your real data, accuracy reported by data type (printed vs handwritten, lab vs floor) and explicit failure-mode analysis.

The Handwriting and Domain-Language Wall

Cloud OCR APIs sell 95% on clean print. On real handwriting they hit 45-50%. NLP APIs hit 80-90% on standard text and drop fast on your industry jargon, slang and multilingual edge cases. We close those gaps with specialised models, fine-tuning on your labelled data, and the post-processing rules every production system needs.

Confident and Wrong Is Worse Than Unreliable and Wrong

A model that returns perfectly formatted JSON with the wrong values is the worst kind of failure. We design confidence calibration, exception handling and human review into the workflow: low-confidence outputs route to review, layout shifts trigger alerts, drift on a sliced metric pages an engineer. Silent failures are the ones that destroy trust.
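
As a rough illustration of the routing idea above, here is a minimal Python sketch. The `Prediction` type, the `route` and `split_batch` helpers and the 0.85 threshold are all hypothetical names and values chosen for the example; a real threshold would be tuned per field on held-out data, and the scores are assumed to be calibrated.

```python
from dataclasses import dataclass

# Hypothetical cut-off for illustration; tune per field on a held-out set.
REVIEW_THRESHOLD = 0.85


@dataclass
class Prediction:
    name: str          # which field the model extracted
    value: str         # the extracted value
    confidence: float  # model score in [0, 1], assumed calibrated


def route(pred: Prediction, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return 'auto' to accept the value, or 'review' to queue it for a human."""
    return "auto" if pred.confidence >= threshold else "review"


def split_batch(preds):
    """Partition a batch into auto-accepted and human-review lists."""
    auto = [p for p in preds if route(p) == "auto"]
    review = [p for p in preds if route(p) == "review"]
    return auto, review
```

The point of keeping the rule this small is that the threshold becomes a reviewable business decision rather than something buried inside the model.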

Hybrid Systems, Not Just Models

Production NLP and CV is rarely just a model. It's model plus rules engine plus post-processing plus exception handling plus human review plus feedback loops. One real NER pipeline ran with 'a million additional rules on top to fix the results'. We build the whole stack and explain why each layer earns its place.

How We Run an NLP or Computer Vision Engagement

01

Real-Data Audit and Feasibility (weeks 1-2)

We test the obvious pre-built option (Textract, Vision API, Rekognition, spaCy, HuggingFace) on your actual data first. Real handwriting samples, real factory floor footage, real customer text in your jargon. We measure where it holds, where it breaks, and what the failure modes look like.

Output: feasibility report with real accuracy by data slice, pre-built vs custom recommendation, environmental readiness assessment for CV, fixed-scope build quote.

02

Domain Adaptation and Model Development (weeks 3-6+)

For NLP: labelling strategy, fine-tuning on your data, slang and jargon coverage, multilingual handling where needed. For CV: site-specific data collection, hard-negative mining for background leakage, augmentation for lighting and motion, hardware validation across the cameras and chips you'll actually deploy on.

Build runs as a small senior team led by Michal Vavra. Each iteration is gated on a fixed eval set sliced by data type and condition, not aggregate accuracy.
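
To show what gating on slices rather than aggregate accuracy can look like, here is a small Python sketch. The function names, slice labels and targets are assumptions made for the example, not a description of the actual tooling.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct) pairs from a fixed eval set."""
    hits, totals = defaultdict(int), defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    return {s: hits[s] / totals[s] for s in totals}


def gate(records, targets):
    """Return the slices that miss their target; an empty list means the iteration passes."""
    scores = accuracy_by_slice(records)
    # A slice with no eval records counts as a miss, not a pass.
    return sorted(s for s, target in targets.items() if scores.get(s, 0.0) < target)


# Hypothetical eval set: printed text is fine, handwriting is not.
records = (
    [("printed", True)] * 19 + [("printed", False)]
    + [("handwritten", True)] * 6 + [("handwritten", False)] * 4
)
failing = gate(records, {"printed": 0.90, "handwritten": 0.90})
```

In this made-up example the aggregate is 25 correct out of 30, which looks tolerable, while the handwritten slice sits at 60% and blocks the iteration.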

03

Hybrid System and Production Integration (weeks 5-10+)

The model is one component. We wire in rules engines, post-processing, confidence thresholds, exception queues, human-review interfaces and feedback loops. Then we ship behind feature flags with shadow testing on real production traffic before any decision is automated.

For CV deployments we run site-specific validation across every camera, lighting condition and time of day before sign-off.

04

Monitoring, Drift Detection, Iteration

At launch we wire in confidence-distribution monitoring, sliced accuracy by data type, environmental drift alerts, exception-queue tracking and a feedback loop from human review back into the eval set. CV especially needs site-by-site monitoring because protocol or scanner drift can drop sensitivity 10+ percentage points silently.
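
One common way to watch a confidence distribution is the Population Stability Index (PSI). The sketch below is one possible approach, not necessarily the method used on any given engagement; the function name, the ten-bin layout and the 0.2 alert level are assumptions for illustration, and both score lists are assumed to be non-empty with values in [0, 1].

```python
import math

# Hypothetical alert level; a commonly cited rule of thumb, not a fixed standard.
ALERT_THRESHOLD = 0.2


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two lists of confidence scores in [0, 1]."""

    def histogram(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int(s * bins), bins - 1)  # clamp a score of 1.0 into the last bin
            counts[idx] += 1
        total = len(scores)
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / total, eps) for c in counts]

    b, c = histogram(baseline), histogram(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


def drifted(baseline, current, threshold=ALERT_THRESHOLD):
    """True when the current scores have moved far enough to warrant an alert."""
    return psi(baseline, current) > threshold
```

Running this per camera or per document type, rather than on pooled traffic, is what lets a single degraded site show up instead of being averaged away.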

Monthly retainer for monitoring, retraining and iteration, not a lock-in. Take the system in-house whenever you're ready.

NLP AND COMPUTER VISION INVESTMENT

Cost depends on data quality, environmental complexity (for CV), labelling effort and whether the answer is pre-built, fine-tuned or specialised custom. Discovery is fixed-fee from £2K and produces a defensible build quote plus a projected monthly run rate before you commit. Production builds typically start around £10K and run six to twelve weeks to a live system. For high-volume video, edge deployment usually beats per-frame cloud API pricing inside a quarter.
Real-Data Audit and Feasibility
From £2K. Test pre-built on your actual data, accuracy by slice, fixed-scope quote
Site Readiness and Domain Adaptation (CV)
Lighting, camera, motion calibration and on-site data collection where the deployment requires it
Production Build
Typically £10K+, 6-12 weeks to a live system with model, rules, exception handling and monitoring
Monitoring and Iteration Retainer
Monthly engagement for drift detection, retraining and exception-queue review

Frequently Asked Questions

The questions CTOs and operations leads ask us in the first call about NLP, computer vision, accuracy on real data and production cost.

Often dramatically. Cloud OCR APIs hit 93-95% on clean print and 45-50% on real handwriting. Specialised models reach 95% on the same handwriting. CV models trained on COCO collapse on your factory floor due to background leakage and lighting; site-specific fine-tuning on roughly 200 real images usually fixes it. We test the API on your data in discovery so you know exactly which gap you're closing before you commit.

We don't quote accuracy without seeing your data. The discovery PoC on your real samples produces honest accuracy by slice (printed vs handwritten, lab vs floor, polite vs angry user text). Then we set targets the system has to hit before launch. We also report false-positive and false-negative rates separately, because aggregate accuracy hides the failure modes that actually matter.
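
A short Python sketch of why the two error rates are reported separately. The `error_rates` helper and the defect-detection numbers are hypothetical and exist only to make the arithmetic visible.

```python
def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    negatives = sum(1 for t in y_true if not t)
    positives = sum(1 for t in y_true if t)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Hypothetical line: 10 defective parts in 100, and a model that never flags anything.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fpr, fnr = error_rates(y_true, y_pred)
```

Here accuracy comes out at 90% while every single defect is missed, which is exactly the failure an aggregate number hides.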

Use the API when your task is clean printed text, basic labelling, low volume, no domain-specific accuracy requirement and no real-time/edge constraint. Build custom when you have handwriting, domain objects, real-time video (camera-feed costs explode on cloud APIs), privacy/edge requirements, or you need 95%+ accuracy. We'll tell you which side you're on before quoting a build.

Yes, but rarely on cloud APIs. One reference Rekognition deployment hit around $2,280 per camera per month. We deploy on-device using ONNX Runtime, TensorRT Lite, TFLite or Rust wrappers, with hardware-validated accuracy. The same INT8 model can drift between 71% and 93% across different Snapdragon chips, so we test on the silicon you actually plan to ship on.
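
For a sense of how the cloud-versus-edge comparison plays out, here is a back-of-envelope Python sketch. Only the $2,280 per camera per month figure comes from the reference deployment mentioned above; the edge hardware and running costs are hypothetical placeholders, and `payback_months` is a name invented for the example.

```python
import math


def payback_months(cloud_monthly, edge_upfront, edge_monthly):
    """First whole month at which cumulative edge cost is no higher than cumulative cloud cost."""
    saving = cloud_monthly - edge_monthly
    if saving <= 0:
        return None  # edge never pays back under these inputs
    return math.ceil(edge_upfront / saving)


# Cloud figure from the reference deployment; edge figures are made-up placeholders.
months = payback_months(cloud_monthly=2280, edge_upfront=4000, edge_monthly=100)
```

Swap in your own hardware quote and running cost; the calculation is only as good as those two inputs.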

Three layers. Confidence thresholds that route low-confidence outputs to a human review queue. Output validation against typed contracts (layout shifts produce structured output, just wrong, so we check structure plus content). Sliced accuracy monitoring: we track drift on the slices that matter (per-customer, per-camera, per-document-type) so silent degradation surfaces in days, not months.
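
The second layer, checking structure plus content, might look something like the Python sketch below. The invoice fields and the `validate_invoice` helper are hypothetical, and a real contract would carry far more rules than these three.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def validate_invoice(raw: dict) -> list:
    """Return a list of problems; an empty list means the output passes the contract."""
    problems = []
    # Structure: every required key must be present.
    for key in ("invoice_number", "issue_date", "total"):
        if key not in raw:
            problems.append(f"missing field: {key}")
    if problems:
        return problems
    # Content: values must parse and fall inside plausible ranges.
    try:
        issued = date.fromisoformat(raw["issue_date"])
        if issued > date.today():
            problems.append("issue_date is in the future")
    except (TypeError, ValueError):
        problems.append("issue_date is not an ISO date")
    try:
        if Decimal(str(raw["total"])) <= 0:
            problems.append("total must be positive")
    except InvalidOperation:
        problems.append("total is not a number")
    return problems
```

An output such as `{"invoice_number": "A-1", "issue_date": "2O24-01-03", "total": "-5"}` is perfectly well-formed JSON and still fails both content checks, so it would go to the exception queue instead of downstream.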

Discovery is fixed-fee from £2K. Production builds typically start at £10K and run six to twelve weeks. Run rate depends on architecture: cloud inference for low-volume NLP can be pennies per million tokens; edge CV on your hardware is near-zero marginal cost; high-volume video on cloud APIs explodes fast. We quote both numbers in the proposal so the run rate is clear before you commit.

When the pre-built API does the job at acceptable accuracy and cost. When your data is too sparse to fine-tune. When the physical environment isn't fixed (CV pilots fail because of lighting, not the model). When a rules engine or template-based extraction would be more reliable. We'll say so during discovery and point you at the cheaper, more reliable option.