# Late Interaction with turbopuffer

## Late Interaction

Most search systems compress every document into a single vector before comparing it to a query. It's fast, but can be imprecise, like comparing two songs by their average pitch. "Cheap flights to Europe" and "budget airline tickets to the EU" mean the same thing, but they don't share a single word. When you squish each one into a single vector, that similarity can get lost.

Late interaction models like [ColBERT](https://arxiv.org/abs/2004.12832) keep one vector per word instead of one per document. At search time, each word in the query finds its best semantic match among the document's words: "cheap" matches "budget," "flights" matches "tickets," "Europe" matches "EU." Those match scores get added up, giving you more precise ranking, especially on longer or more nuanced documents.
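The scoring rule, called MaxSim, is simple enough to sketch with toy vectors (the numbers below are made up for illustration):

```python
import numpy as np

# Toy token embeddings: 3 query words x 2 dims, 4 document words x 2 dims.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
D = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [-1.0, 0.0]])

# Normalize rows so dot products are cosine similarities.
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D /= np.linalg.norm(D, axis=1, keepdims=True)

sim = Q @ D.T             # (query words) x (doc words) similarity matrix
maxsim = sim.max(axis=1)  # each query word's best match among doc words
score = maxsim.sum()      # the document's MaxSim score for this query
```

A document scores highly only if every query word finds some strong match among its tokens, which is what makes the ranking word-aware.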

turbopuffer's dense retrieval is already fast and accurate, but it still compresses each document into a single vector. For short texts like titles or questions, that's fine. For longer documents, contracts, support articles, product catalogs, a single vector can't represent every detail, and the ranking suffers. Adding ColBERT reranking on top of turbopuffer recovers that lost detail. turbopuffer narrows millions of documents to \~100 candidates in milliseconds, then ColBERT re-scores those candidates word-by-word to get the final ranking right.

This guide walks through implementing late interaction retrieval on turbopuffer using the [Quora Duplicate Questions](https://huggingface.co/datasets/sentence-transformers/quora-duplicates) dataset.

## Implementation

This sample code uses two namespaces: one stores a single dense vector per document (for fast ANN retrieval), and the other stores one ColBERT token vector per word (for reranking). At query time, turbopuffer retrieves the top candidates by dense similarity, then ColBERT rescores each candidate word by word. To get started, you'll need:

* A [turbopuffer API key](https://turbopuffer.com/dashboard)
* An [OpenAI API key](https://platform.openai.com/settings/organization/api-keys) for dense embeddings (or swap in Cohere, Voyage, etc.)
* Python 3.9+ with packages: `turbopuffer`, `transformers`, `safetensors`, `torch`, `openai`

#### Step 1: Setup

Set up the turbopuffer client, create two namespaces, and define an embedding helper function.

```python
# $ pip install turbopuffer transformers safetensors torch openai
import turbopuffer
import os
import numpy as np
from colbert_encoder import ColBERTEncoder

tpuf = turbopuffer.Turbopuffer(
    # API tokens are created in the dashboard: https://turbopuffer.com/dashboard
    api_key=os.getenv("TURBOPUFFER_API_KEY"),
    # Pick the right region: https://turbopuffer.com/docs/regions
    region="gcp-us-central1",
)

ns = tpuf.namespace("late-interaction-example")
token_ns = tpuf.namespace("late-interaction-tokens-example")

import random

# Create an embedding with OpenAI, could be {Cohere, Voyage, Mixed Bread, ...}
# Requires OPENAI_API_KEY to be set (https://platform.openai.com/settings/organization/api-keys)
def openai_or_rand_vector(text: str) -> list[float]:
    if not os.getenv("OPENAI_API_KEY"):
        print("OPENAI_API_KEY not set, using random vectors")
        return [random.random() for _ in range(1536)]
    try:
        import openai
    except ImportError:
        print("openai package not installed, using random vectors (`pip install openai`)")
        return [random.random() for _ in range(1536)]
    response = openai.OpenAI().embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return response.data[0].embedding

documents = [
    "How do I make money online?",
    "What are the best ways to earn money from home?",
    "How can I start a successful online business?",
    "What programming languages should I learn first?",
    "How do I invest in the stock market as a beginner?",
    "What are some good side hustles for college students?",
    "How can I improve my credit score quickly?",
    "What is the best way to learn Python?",
    "How do I create a budget and stick to it?",
    "What are the highest paying remote jobs?",
]

# To load a larger dataset instead, use:
# $ pip install datasets
# from datasets import load_dataset
# ds = load_dataset("sentence-transformers/quora-duplicates", "pair-class", split="train")
# seen = set()
# documents = []
# for row in ds:
#     for q in [row["sentence1"], row["sentence2"]]:
#         q = q.strip()
#         if q not in seen and len(q) > 10:
#             seen.add(q)
#             documents.append(q)
#     if len(documents) >= 10_000:
#         break
# documents = documents[:10_000]
```
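The `colbert_encoder` module imported above is a small local helper, not a published package. Below is a minimal sketch of what it might contain, built on the `transformers`, `safetensors`, and `torch` packages from the dependency list. The checkpoint details (the `linear.weight` key for the 128-dim projection, the `[unused0]`/`[unused1]` query/document marker tokens, the 32-token `[MASK]`-padded queries) follow ColBERTv2's conventions but are assumptions here; cross-check against the official `colbert-ai` package if scores look off.

```python
# colbert_encoder.py -- minimal sketch, not the official colbert-ai package
import torch
from transformers import AutoModel, AutoTokenizer

class ColBERTEncoder:
    def __init__(self, model_name: str, dim: int = 128, query_maxlen: int = 32):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name)
        self.query_maxlen = query_maxlen
        # ColBERT adds a linear projection (hidden_size -> 128) that AutoModel
        # doesn't know about; load its weight from the checkpoint directly.
        self.linear = torch.nn.Linear(self.bert.config.hidden_size, dim, bias=False)
        self._load_projection(model_name)

    def _load_projection(self, model_name: str):
        from huggingface_hub import hf_hub_download
        try:
            from safetensors.torch import load_file
            state = load_file(hf_hub_download(model_name, "model.safetensors"))
        except Exception:
            state = torch.load(hf_hub_download(model_name, "pytorch_model.bin"),
                               map_location="cpu")
        self.linear.weight.data = state["linear.weight"]

    @torch.no_grad()
    def _encode(self, texts: list[str], marker: str, **tok_kwargs):
        # ColBERT prepends a marker token: [unused0] for queries, [unused1] for docs
        texts = [f"{marker} {t}" for t in texts]
        enc = self.tokenizer(texts, return_tensors="pt", truncation=True, **tok_kwargs)
        hidden = self.bert(**enc).last_hidden_state
        emb = torch.nn.functional.normalize(self.linear(hidden), dim=-1)
        return emb, enc["attention_mask"]

    def encode_documents(self, docs: list[str]) -> dict[int, torch.Tensor]:
        emb, mask = self._encode(docs, "[unused1]", padding=True)
        # Drop padding positions; keep one 128-dim vector per real token
        return {i: emb[i][mask[i].bool()] for i in range(len(docs))}

    def encode_queries(self, queries: list[str]) -> torch.Tensor:
        # ColBERT pads queries to a fixed length with [MASK] tokens
        # ("query augmentation") and keeps all resulting vectors.
        pad, self.tokenizer.pad_token = self.tokenizer.pad_token, self.tokenizer.mask_token
        emb, _ = self._encode(queries, "[unused0]", padding="max_length",
                              max_length=self.query_maxlen)
        self.tokenizer.pad_token = pad
        return emb  # shape: (num_queries, 32, 128)
```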

#### Step 2: Index dense vectors

Each document gets a single dense embedding (1536-dim from OpenAI). This is the same vector you'd use for standard search on turbopuffer. At query time, these vectors are used to quickly narrow millions of documents down to the top \~100 candidates.

```python
# Write each document as a row with its dense vector and text.
# cosine_distance means turbopuffer will rank by 1 - cosine_similarity.
ns.write(
    upsert_rows=[
        {"id": i, "vector": openai_or_rand_vector(doc), "text": doc}
        for i, doc in enumerate(documents)
    ],
    distance_metric="cosine_distance",
)
print(f"Indexed {len(documents)} documents")

```

#### Step 3: Index ColBERT token vectors

This is what makes late interaction work. Each document is broken into individual tokens, and each token gets its own 128-dim ColBERT vector. These are stored in a separate namespace, with a doc\_id attribute so we can retrieve all tokens for a given document at query time. The ID scheme (doc\_id \* 1000 + tok\_idx) links each token back to its parent document.
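One caveat worth flagging: the `* 1000` multiplier assumes every document has fewer than 1,000 tokens, so bump it up (or switch to a string ID) for longer documents. Unpacking the composite ID is a `divmod`:

```python
# A token row's ID packs (doc_id, tok_idx); divmod recovers both.
# The 1000 multiplier caps documents at 1000 tokens each.
row_id = 7 * 1000 + 42
doc_id, tok_idx = divmod(row_id, 1000)
assert (doc_id, tok_idx) == (7, 42)
```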

```python
# Load ColBERTv2 and encode every document into per-token vectors.
# A 10-word question produces ~15 token vectors (128 dims each).
colbert = ColBERTEncoder("colbert-ir/colbertv2.0")
token_vecs = colbert.encode_documents(documents)

# Flatten into one row per token. Each row has:
#   id:     doc_id * 1000 + tok_idx (links token back to its document)
#   vector: 128-dim ColBERT embedding for this token
#   doc_id: which document this token belongs to (filterable)
buffer = []
for doc_id, embs in token_vecs.items():
    for tok_idx in range(len(embs)):
        buffer.append({
            "id": doc_id * 1000 + tok_idx,
            "vector": embs[tok_idx].tolist(),
            "doc_id": doc_id,
        })

# doc_id must be filterable so we can fetch tokens for specific documents
token_ns.write(
    upsert_rows=buffer,
    distance_metric="cosine_distance",
    schema={"doc_id": {"type": "uint", "filterable": True}},
)
print(f"Indexed {sum(len(v) for v in token_vecs.values())} token vectors")

```

#### Step 4: ColBERT reranking

This replaces third-party rerankers like Cohere or Voyage. Instead of fetching raw token vectors and computing similarity locally, we let turbopuffer do the similarity search server-side. For each of the 32 query token vectors, we run an ANN search over the token namespace, filtered to the candidate doc IDs. turbopuffer returns `$dist` (cosine distance) for each match, which we convert to similarity (`1 - $dist`). For each candidate document, we take the best token match per query word and sum those up; that sum is the ColBERT MaxSim score.

This approach uses just 2 `multi_query` calls (32 query tokens batched 16 at a time) regardless of how many candidates you're reranking, and no raw vectors are transferred over the network.

```python
def colbert_rerank(results, query, top_k=10):
    # Encode the query into 32 token vectors (128 dims each)
    q_emb = colbert.encode_queries([query])[0].numpy()

    # Build candidate list from dense retrieval results
    candidates = [{"id": row.id, "text": row["text"]} for row in results.rows]
    candidate_ids = [c["id"] for c in candidates]
    scores = {cid: 0.0 for cid in candidate_ids}

    # For each query token, find the closest document tokens in turbopuffer.
    # multi_query sends up to 16 ANN searches in one API call, so 32 tokens = 2 calls.
    for chunk_start in range(0, len(q_emb), 16):
        q_chunk = q_emb[chunk_start : chunk_start + 16]
        token_results = token_ns.multi_query(queries=[
            {
                # ANN search: find nearest doc tokens to this query token
                "rank_by": ("vector", "ANN", q_tok.tolist()),
                # 1500 covers 100 candidates × ~15 tokens each.
                # Increase for longer documents (max 10,000).
                "top_k": 1500,
                # Only search tokens belonging to our candidate documents
                "filters": ("doc_id", "In", candidate_ids),
                "include_attributes": ["doc_id"],
            }
            for q_tok in q_chunk
        ])

        # For each query token's results, keep the best match per document
        for sub in token_results.results:
            best = {}
            for row in sub.rows:
                did, sim = row["doc_id"], 1.0 - row["$dist"]
                if did not in best or sim > best[did]:
                    best[did] = sim
            # Add each document's best match to its running score
            for did, sim in best.items():
                scores[did] += sim

    # Attach scores and sort — highest total MaxSim score wins
    for c in candidates:
        c["colbert_score"] = scores[c["id"]]
    candidates.sort(key=lambda x: x["colbert_score"], reverse=True)
    return candidates[:top_k]

```

#### Step 5: Query

First, turbopuffer retrieves the top candidates using dense ANN search. Then, ColBERT reranks those candidates word by word for more precise ranking.

```python
query = "How can I make money online free of cost?"

# Stage 1: Dense ANN retrieves top 100 candidates from the full corpus
result = ns.query(
    rank_by=("vector", "ANN", openai_or_rand_vector(query)),
    top_k=100,
    include_attributes=["text"],
)

# Stage 2: ColBERT rescores those 100 candidates word-by-word
reranked = colbert_rerank(result, query)
for i, r in enumerate(reranked[:5]):
    print(f"  {i+1}. [{r['colbert_score']:.1f}] {r['text']}")

```

## Performance Analysis

We evaluated on 100 known duplicate pairs from the Quora dataset using MRR\@10 (Mean Reciprocal Rank at 10). For each pair, we search with one question and check where its known duplicate lands in the top 10 results. A correct result at position 1 scores 1.0, position 2 scores 0.5, position 3 scores 0.33, and so on. If it's not in the top 10, the score is 0. MRR\@10 is the average of these scores across all queries.
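The per-query scoring rule can be written directly:

```python
def reciprocal_rank_at_10(ranked_ids: list[int], target_id: int) -> float:
    # 1/rank if the target appears in the top 10, else 0.
    for rank, rid in enumerate(ranked_ids[:10], start=1):
        if rid == target_id:
            return 1 / rank
    return 0.0

# Duplicate found at rank 3 for one query, missing entirely for another:
scores = [
    reciprocal_rank_at_10([5, 9, 42, 7], target_id=42),  # 1/3
    reciprocal_rank_at_10([1, 2, 3], target_id=42),      # 0.0
]
mrr_at_10 = sum(scores) / len(scores)
```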

```python
# Note: this evaluation needs the full 10K-question corpus and the `ds` object
# from the commented-out `load_dataset` block in Step 1.
# Map each question to its index so we can check if results contain the right answer
question_to_id = {q: i for i, q in enumerate(documents)}

# Find 100 known duplicate pairs from the dataset (label == 1 means duplicate)
eval_pairs = []
for row in ds:
    if row["label"] == 1:
        q1, q2 = row["sentence1"].strip(), row["sentence2"].strip()
        if q1 in question_to_id and q2 in question_to_id and q1 != q2:
            eval_pairs.append((q1, question_to_id[q1], q2, question_to_id[q2]))
    if len(eval_pairs) >= 100:
        break

# For each pair: search with q1, check where q2 lands in the results
dense_mrr, reranked_mrr = [], []
for query_text, query_id, _, target_id in eval_pairs:
    result = ns.query(
        rank_by=("vector", "ANN", openai_or_rand_vector(query_text)),
        top_k=100,
        include_attributes=["text"],
    )

    # Score dense results: 1/rank if target is in top 10, else 0
    dense_ids = [row.id for row in result.rows if row.id != query_id]
    dense_mrr.append(next((1/(i+1) for i, rid in enumerate(dense_ids[:10]) if rid == target_id), 0))

    # Score reranked results the same way
    reranked = colbert_rerank(result, query_text, top_k=100)
    reranked_ids = [r["id"] for r in reranked if r["id"] != query_id]
    reranked_mrr.append(next((1/(i+1) for i, rid in enumerate(reranked_ids[:10]) if rid == target_id), 0))

print(f"Dense MRR@10:          {np.mean(dense_mrr):.3f}")
print(f"+ ColBERT rerank MRR@10: {np.mean(reranked_mrr):.3f}")
# Dense MRR@10:          0.845
# + ColBERT rerank MRR@10: 0.814
```

| Method                                | MRR\@10 | Dataset                     |
| ------------------------------------- | ------- | --------------------------- |
| Dense (OpenAI text-embedding-3-small) | 0.845   | Quora (10K short questions) |
| Dense + ColBERT rerank                | 0.814   | Quora (10K short questions) |

ColBERT reranking doesn't improve results here. Two things are working against it:

**Short documents.** Quora questions are \~10–15 tokens. A single OpenAI vector captures them fully, so there's no lost detail for ColBERT to recover.

**Model mismatch.** The dense retrieval uses OpenAI's text-embedding-3-small (a 2024 model trained on massive data), while ColBERTv2 uses a 2022 BERT backbone. The model quality gap swamps the retrieval strategy difference. On individual queries, word-level matching still helps. Searching "How can I make money online free of cost?" moves the match "How do I make money online?" from rank 10 to rank 3. But it's not enough to overcome the weaker base model across the board.

Late interaction's advantage shows up when documents are long enough that a single vector can't capture every detail (support articles, contracts, product catalogs) and when queries and documents use different vocabulary for the same concepts.

## When to use late interaction

**Use it when:**

* You need the top 5–10 results to be precisely ordered, not just "close enough"
* Queries and documents use different words for the same concepts (common in Q\&A, support, and legal search)
* Your documents are long enough that compressing to a single vector loses information
* You can tolerate \~200–600ms of added latency per query (\~130ms with cached query embeddings)

**Use standard dense retrieval when:**

* Storage cost is the primary constraint
* Queries and documents share consistent vocabulary
* Latency budget is under 50ms total
* You're building an MVP and want minimal complexity

## Cost/Performance Trade-offs

Late interaction adds both storage and latency. In our 10K-document test, the dense namespace used 62 MB and the token namespace used 83 MB. Quora questions are short (\~15 tokens each), so for longer documents expect the token namespace to be significantly larger.

turbopuffer returns a `performance` object with every query response, which lets us separate server time from network overhead. Here's what we measured on the Quora dataset:

| Component                | Server     | Wall (client)   | Source                               |
| ------------------------ | ---------- | --------------- | ------------------------------------ |
| **Dense only**           | \~15ms     | \~65–70ms       | `server_total_ms`, `client_total_ms` |
| + ColBERT encode         |            | +50–300ms       | client                               |
| + 2x multi\_query rerank | +30ms      | +130ms          | `server_total_ms`, `client_total_ms` |
| **Total with reranking** | **\~45ms** | **\~250–500ms** |                                      |

turbopuffer itself is fast. The dense query and both reranking batches total \~45ms server-side (`cache_temperature: "hot"`). The bottleneck is network latency and local ColBERT encoding, not turbopuffer.

To keep costs down: rerank fewer candidates, cache query embeddings for repeated queries, or only index tokens for high-value documents.
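The query-embedding cache can be as simple as `functools.lru_cache`; in the sketch below, `embed` is a placeholder for whatever embedding call you use (OpenAI, ColBERT, ...), not a real API:

```python
import hashlib
from functools import lru_cache

def embed(text: str) -> list[float]:
    # Placeholder: stand-in for a real embedding API call.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest]

@lru_cache(maxsize=10_000)
def cached_embed(text: str) -> tuple[float, ...]:
    # Tuples are hashable, so lru_cache can memoize repeated queries
    # and skip the embedding round trip entirely.
    return tuple(embed(text))
```

Repeated queries then cost a dictionary lookup instead of an embedding call; the same pattern works for ColBERT query tokens by caching the full (32, 128) array keyed on query text.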

## Bonus (optional)

The current implementation works, but it requires two namespaces, two `multi_query` calls, and a local ColBERT model. Most of that complexity could go away with a few changes to turbopuffer.

The two-namespace pattern exists because turbopuffer stores one vector per row. If it supported multiple vectors per document, dense and token vectors could live together, and you wouldn't need `doc_id` filtering to link them back.

The two `multi_query` calls exist because the `Sum` and `Max` operators only work with BM25 today, not vector search. Extending them to ANN/kNN would let you express ColBERT's MaxSim scoring in a single `rank_by` expression instead of aggregating scores client-side.

The biggest latency cost is running ColBERT locally (\~50–300ms per query). If turbopuffer handled the encoding server-side, similar to how it handles BM25 tokenization, you could send query text and get reranked results back in one call. Total latency would drop close to a standard dense query.
