What Are BM25 and SPLADE? Sparse Learned Retrieval Explained
By Tharindu Gunawardana | SearchMinistry Media
BM25 is a probabilistic sparse retrieval algorithm that scores documents using term frequency saturation, inverse document frequency, and document length normalisation. SPLADE extends sparse retrieval by using BERT's masked language model head to expand document terms across the full vocabulary, mitigating BM25's vocabulary mismatch problem.
BM25 Scoring Components
BM25 computes a relevance score by summing three components for each query term: term frequency with saturation (controlled by parameter k1), inverse document frequency (rare terms score higher than common terms), and document length normalisation (controlled by parameter b). TF saturation means 10 occurrences of a term score only roughly two to three times higher than 1 occurrence (depending on k1), preventing keyword stuffing from dominating scores.
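The three components above can be sketched in a few lines. This is a minimal illustration of the Okapi BM25 formula, not any particular engine's implementation; the function name, the toy statistics, and the default k1 = 1.2 and b = 0.75 are assumptions chosen for the example.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.2, b=0.75):
    """Score one document against a query with the Okapi BM25 formula.

    doc_tf: term -> count in this document
    df:     term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        # Inverse document frequency: rarer terms contribute more.
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Length normalisation (b) shrinks scores for longer-than-average docs;
        # k1 controls how quickly repeated occurrences saturate.
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

# Saturation in action: with default k1, ten occurrences of a term score
# well under twice what one occurrence does, not ten times.
one = bm25_score(["ranking"], {"ranking": 1}, 100, 100, {"ranking": 5}, 1000)
ten = bm25_score(["ranking"], {"ranking": 10}, 100, 100, {"ranking": 5}, 1000)
```

The ratio between `ten` and `one` stays close to 2 rather than 10, which is exactly the saturation behaviour described above.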
SPLADE: Learned Sparse Retrieval
SPLADE uses a BERT model with a masked language model head to produce sparse vectors over the full tokeniser vocabulary. For a document discussing "search engine ranking algorithms", SPLADE assigns non-zero weights to "information retrieval", "relevance scoring", and "web index", enabling matches to queries using this related vocabulary. SPLADE vectors remain sparse and can be used with inverted indexes, preserving retrieval speed.
SEO Implications
BM25 and SPLADE form the sparse retrieval component of hybrid AI search systems. Exact terminology ensures BM25 coverage, while natural semantic variation improves SPLADE's expansion coverage. BM25's TF saturation and length normalisation directly penalise keyword stuffing, rewarding comprehensive documents over repetitive ones. Content that uses precise topic terminology alongside natural semantic variation performs best in sparse retrieval.