
Expensive ways small ML teams imitate big tech

10 May 2026

ML engineers at small companies routinely spend weeks on optimizations that don't move any user-facing metric. They tune retrieval indexes that serve ten queries per second. They build deduplication pipelines for corpora small enough to fit in a laptop's RAM. They stand up retraining infrastructure for systems that don't get enough daily signal to retrain meaningfully.

This happens because most ML practices we copy come from contexts where they paid off, and those contexts don't transfer. Big-tech ML content is written by engineers solving big-tech problems. The implicit message, "do this, it's serious work," is true for them. What gets left out is the break-even point: the scale below which a practice costs more than it returns.

What follows are four practices small ML teams routinely adopt even though the break-even sits well above their scale. Each section names the practice, the assumption that justifies it at scale, why that assumption fails when you scale down, and what to do instead. The pattern across all four is the same: work that looks like serious ML but isn't doing anything the system actually needs.

Every ML practice has a break-even point: a scale below which the cost of adopting it (engineering time, system complexity, ongoing maintenance) exceeds the value it returns. Above the break-even, the practice pays off. Below it, the practice is actively harmful. It consumes resources that could be spent on work that does move metrics.

Big-tech ML content rarely names the break-even, because at big-tech scale the break-even has already been crossed for everything worth writing about. From inside an organization where the break-even is invisible, every practice looks like it just works. From outside, the break-even is the most important number in ML engineering.

Most "best practices" in the ML literature haven't crossed the break-even for a small team. The discipline isn't whether to optimize. It's whether you've crossed the break-even for this specific optimization.

Vector databases at small scale

Standing up a managed vector database (Pinecone, Weaviate, Qdrant) has become a default first move for teams building anything that involves embeddings. The reasoning, when articulated, runs something like: "we're building retrieval, retrieval needs vectors, vectors need a vector DB."

At scale, vector DBs solve real problems: concurrent reads at high QPS, distributed indexing across machines, durability guarantees, cost-per-query optimization at billions of vectors. These are non-trivial engineering problems, and the managed services are worth what they cost when you actually have those problems.

At small scale, none of these problems exist. A corpus of 50,000 documents with 768-dimensional float32 embeddings fits in roughly 150MB of memory. A brute-force cosine similarity scan over that corpus, in numpy, returns in under 50 milliseconds on a single CPU core. There is no concurrency problem, no indexing problem, no durability problem to solve, and yet the vector DB is sitting there, requiring a deployment, a monitoring story, version pinning, an upgrade path, and a line item on the cloud bill.

The simpler alternative is straightforward: a dictionary mapping document IDs to embeddings, loaded from disk at startup, queried with numpy.dot and argpartition. Twenty lines of code. No infrastructure. Faster than any networked vector DB at this scale, because the comparison isn't with a tuned vector DB versus brute force. It's with brute force versus a network round-trip plus a tuned vector DB.
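
Concretely, a minimal sketch of that alternative, assuming the embeddings were precomputed and saved as a pickled dict of {doc_id: vector}; the file name and storage format here are placeholders:

```python
import pickle
import numpy as np

# Load the precomputed embeddings once at startup.
with open("embeddings.pkl", "rb") as f:
    id_to_vec = pickle.load(f)          # {doc_id: np.ndarray of shape (768,)}

doc_ids = list(id_to_vec)
doc_vecs = np.stack([id_to_vec[i] for i in doc_ids]).astype(np.float32)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)   # normalize once
# 50,000 x 768 float32 vectors is roughly 150 MB -- comfortably in memory.

def search(query_vec, k=10):
    """Brute-force cosine similarity over the whole corpus."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q                      # one matrix-vector product
    top = np.argpartition(scores, -k)[-k:]     # unordered top-k in O(N)
    top = top[np.argsort(scores[top])[::-1]]   # sort only the k winners
    return [(doc_ids[i], float(scores[i])) for i in top]
```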

The break-even sits somewhere between one and ten million vectors, depending on query latency requirements and how much memory you can spend. Below that, in-memory similarity is faster, simpler, and easier to debug. The signal that you've crossed the break-even is concrete: memory pressure, restart latency, or concurrent-write requirements that you've actually measured. Not an architecture diagram that anticipates them.

The question isn't "will I need a vector DB eventually?" It's "do I need one now?" For most teams, the honest answer is no, and the cost of the premature yes is engineering time you could have spent on the product.

Deduplication on small corpora

Deduplication has become a near-default step in document pipelines: hash-based exact matching, MinHash with LSH for near-duplicates, semantic deduplication via embedding similarity. The reasoning runs something like: "duplicates inflate the corpus, skew retrieval, and waste compute, so dedupe."

At massive scale, this is correct. The research motivating these pipelines comes from training LLMs on billion-document datasets, where near-duplicates create systemic distributional problems: models memorize repetitive content, evaluation sets leak into training, and metrics distort in ways that compound. At that scale, the infrastructure pays for itself.

At the "small" scale of one million documents, however, the picture inverts. At this volume, most domains only deal with a manageable amount of noise: the same page scraped twice or two versions of a help article. Exact-match deduplication catches these cases for almost zero cost: hash the normalized text, drop collisions, and you've solved the 80% of the problem that actually matters.

Near-duplicate detection is a different proposition. Calibrating MinHash thresholds takes real time. False positives (documents that look similar by Jaccard similarity but carry different intent) get removed silently, and you only notice when retrieval fails on a query you can't trace. The pipeline becomes a maintenance surface that grows with the corpus and produces marginal gains you can't measure against your eval set.

The break-even for near-duplicate detection isn't a corpus size, it's an evidence threshold. The signal that you need it: you've identified retrieval failures that trace specifically to near-duplicate noise. Not theoretical concerns from a paper, not "best practices" reasoning. Measured failures.
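
One way to gather that evidence, sketched under the assumption that you have normalized document embeddings and query vectors for the failures you've logged; the similarity threshold is an arbitrary starting point, not a recommendation:

```python
import numpy as np

def near_duplicate_rate(failed_query_vecs, doc_vecs, k=10, threshold=0.95):
    """For each failed query, fraction of top-k result pairs that are near-identical.

    A high rate across failures is evidence that near-duplicate noise, rather
    than something else, is behind them. Vectors are assumed L2-normalized.
    """
    rates = []
    for q in failed_query_vecs:
        scores = doc_vecs @ q
        top = np.argpartition(scores, -k)[-k:]
        hits = doc_vecs[top]
        sims = hits @ hits.T                      # pairwise cosine among the results
        pairs = sims[np.triu_indices(k, 1)]       # upper triangle, no diagonal
        rates.append(float((pairs > threshold).mean()))
    return float(np.mean(rates))
```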

The deeper point: research recommendations are calibrated to research contexts. Before importing one, ask whether your context resembles the one the research was done in. For deduplication, the answer at small retrieval scale is almost always no.

Continuous retraining at small signal

Continuous retraining (automated daily or weekly retraining on accumulated production data) has become an MLOps default for production ML systems. The reasoning runs something like: "data shifts, models go stale, so retrain on a schedule."

At scale, this is straightforwardly correct. Recommendation systems at YouTube, Netflix, Meta retrain frequently because the daily signal is rich enough to drive measurable improvement. With hundreds of millions of users and billions of interactions, a day's worth of feedback is enough to reliably update model parameters in directions that hold up on holdout evaluation. The retraining pays for itself, the pipeline is justified, and the literature supports the practice.

At small scale, none of those preconditions hold. With a few thousand daily users and proportionally fewer interactions, a day's worth of signal is dominated by noise. The model "updates" you see between retraining runs are within natural variance. They're not improvements, they're statistical jitter. Worse, you can't tell, because your eval set is too small to distinguish a real 1-2% improvement from random fluctuation.

What you do get, reliably, is the cost: a retraining pipeline that has to be maintained, deployment infrastructure that has to handle versioned models, and a false sense that the system is improving when it's actually oscillating. The team builds dashboards to track metrics that move within noise. Engineering hours go into debugging retraining failures that, even when fixed, don't change anything users experience.

The simpler alternative: train, deploy, leave it alone. Retrain quarterly at most, or in response to specific identified problems: measured data drift, a feature change that requires it, an evaluation regression that traces to model staleness. Not on a schedule.
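
As an example of what "measured data drift" can mean in practice, here is a minimal sketch using a two-sample Kolmogorov-Smirnov test on a single numeric feature; the feature, the data sources, and the significance level are all placeholders:

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, recent_values, alpha=0.01):
    """Check whether recent production values of one numeric feature differ
    from the distribution the model was trained on."""
    result = ks_2samp(np.asarray(train_values), np.asarray(recent_values))
    return result.pvalue < alpha

# Hypothetical usage: review retraining when a feature has measurably drifted,
# not on a calendar schedule.
# if feature_drifted(train_df["session_length"], last_month_df["session_length"]):
#     flag_for_retraining_review()
```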

The break-even is statistical, not temporal. You've crossed it when a single retraining cycle's worth of new data is large enough to produce updates whose improvement is distinguishable from noise on your eval set. If you're not sure whether retraining helped, you haven't crossed it, and the pipeline is costing you without returning anything.
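
One concrete way to check, assuming you score the old and new model on the same eval queries, is a paired bootstrap on the per-query differences; a sketch, not a full significance-testing setup:

```python
import numpy as np

def prob_new_model_better(old_scores, new_scores, n_boot=10_000, seed=0):
    """Paired bootstrap over per-query eval scores for the old and new model.

    Returns the fraction of bootstrap resamples where the new model wins.
    Values near 0.5 mean the difference is indistinguishable from noise.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(new_scores) - np.asarray(old_scores)  # same queries, same order
    n = len(diffs)
    means = np.array([diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return float((means > 0).mean())
```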

Custom embedding fine-tuning at small evaluation scale

Fine-tuning embedding models on domain-specific corpora has become a common move for teams trying to improve retrieval. Sentence-transformers fine-tuning, OpenAI/Cohere fine-tuning APIs, custom contrastive training: the tooling has gotten accessible enough that teams reach for it early. The reasoning runs something like: "off-the-shelf embeddings are general-purpose, my domain is specific, so a fine-tune should help."

At scale, this works. Domain-specific embeddings genuinely outperform general ones when you have the eval infrastructure to validate the gains and the deployment maturity to handle multiple model versions. Real research, real results, real cost-benefit at the scale where it's been validated.

At small scale, the gain disappears into the noise floor. With a few hundred labeled queries (typical for a small ML team's eval set), the difference between a strong off-the-shelf embedding model and a fine-tuned variant is usually within the variance of the eval itself. You can't tell whether the fine-tune helped, hurt, or did nothing. You can't tell whether the gains are consistent across the kinds of queries you actually serve. You especially can't tell whether the fine-tune will hold up as the corpus drifts.

What you can tell, reliably, is the cost. Fine-tuning takes weeks of engineering time. The custom model has to be versioned, retrained when the base model updates, and revalidated when you want to swap embedding providers. You've taken on a maintenance liability that compounds every time you reconsider any related decision.

The simpler alternative: use a strong off-the-shelf model. The current generation of general-purpose embeddings is well past the threshold of usefulness for almost every small-scale retrieval application. The differences between top off-the-shelf models matter less than the cost of running a custom one.
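
In practice that alternative is a few lines, sketched here with sentence-transformers; the model name is just one reasonable general-purpose choice, not a recommendation, and the documents are placeholders:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["how to reset a password", "billing cycle explained"]  # your corpus
doc_embeddings = model.encode(documents, normalize_embeddings=True)   # shape (N, dim)
query_embedding = model.encode("password reset", normalize_embeddings=True)

# Normalized embeddings make the in-memory search from earlier a plain dot product.
scores = doc_embeddings @ query_embedding
```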

The break-even sits at eval set size, not corpus size. You've crossed it when your eval set is large enough to distinguish a 2-3% improvement from natural variance: typically thousands of labeled queries with diverse coverage. Below that, you're optimizing on noise. The fine-tuning research that motivates the practice was done with eval sets orders of magnitude larger than what most small teams have.
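
A back-of-envelope version of that threshold, assuming your retrieval metric is a per-query hit rate, so its standard error follows the binomial approximation sqrt(p(1-p)/n); comparing two models properly needs the paired difference, but the order of magnitude is the point:

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a per-query success-rate metric (binomial approximation)."""
    return math.sqrt(p * (1 - p) / n)

# With 300 labeled queries and a baseline hit rate around 0.7, one standard
# error is ~2.6 points -- about the size of the improvement you hope to detect.
print(standard_error(0.7, 300))    # ~0.026
# With 5,000 queries it drops to ~0.6 points, small enough to resolve a 2-3% gain.
print(standard_error(0.7, 5000))   # ~0.0065
```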

The pattern

All four practices share the same shape. Each one is calibrated to a context where the value it returns clearly outweighs the cost of adoption. Each one fails when applied to a context where the inputs that justify it (scale, signal, evaluation infrastructure) aren't there yet. The shared rationalization is some version of "this is what serious ML teams do." The shared cost is engineering time and system complexity that don't return anything users feel.

The reason this keeps happening is structural. Big-tech ML content is the most visible ML content. It's written by engineers solving big-tech problems for an audience that mostly works at big-tech scale. The implicit message, "do this, it's the right way," is true for the writer's context. It travels poorly to other contexts, but the warning labels don't travel with it.

The discipline that separates strong small ML teams from weak ones isn't technical sophistication. It's the willingness to ask, before each optimization: at what scale does this become net-positive? Have I crossed that scale? If not, what's the simpler version that works for what I actually have?

The strongest small ML systems often look underbuilt by big-tech standards. That's not a bug. Underbuilt is fast to ship, fast to debug, and easy to evolve when the context actually changes. The complexity will come when you need it, and you'll know, because you'll have evidence rather than intuition.

The job at small scale isn't to build the system big-tech would respect. It's to build the system that ships, works, and gets out of the way.
