Under the hood

For Data Nerds

For fellow data nerds: here is the actual retrieval and inference stack behind every Parsley conversation - semantic chunking, dense embeddings, approximate nearest-neighbor search with structural multi-tenancy, and a two-axis intent model layered over the dialog. No black box, no hand-waving.

Indexing (on upload, then periodic recrawl): Upload & crawl docs -> Extract & chunk text -> Gemini embeddings -> Per-tenant vector store

Retrieval (every question): Visitor question -> Embed the query -> Vector search -> Gemini answers from your docs

Intelligence (every conversation): Topic modeling -> Buyer intent (Hot / Warm / Cold) -> MEDDIC scorecard -> Synced to your CRM

The shape of the pipeline. Exact parameters are tuned and not shown.

The system is a fairly classic retrieval-augmented generation (RAG) pipeline with two things we care a lot about bolted on: hard tenant isolation at the storage layer, and a second inference pass that treats each conversation as a labeling problem rather than a chat log. The three stages below are the offline indexing path, the online retrieval and generation path, and the conversation-level intelligence layer. We name the models and the data structures; we hold back the specific hyperparameters (chunk width, overlap, embedding dimensionality, k) because those are tuned and competitively sensitive.

Indexing: from documents to a dense vector space

Ingestion normalizes heterogeneous inputs - PDFs, docs, and crawled web pages - into clean UTF-8 text, then segments each document into overlapping chunks. The overlap is deliberate: a sliding window carries local context across boundaries, so an idea that straddles a chunk edge still appears intact in at least one chunk as long as it fits within the overlap. Passages cut in half at a boundary are the usual cause of retrieval drop-out at chunk edges. We tune the window width for sales material specifically, trading retrieval precision (smaller chunks, tighter matches) against context sufficiency (larger chunks, fewer fragments).
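A minimal sketch of sliding-window chunking, assuming character-based windows for simplicity. The function name, the width, and the overlap here are illustrative only; the real splitter and its tuned parameters are not shown.

```python
def chunk_text(text: str, width: int, overlap: int) -> list[str]:
    """Split text into fixed-width windows that share `overlap` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    if not 0 <= overlap < width:
        raise ValueError("overlap must satisfy 0 <= overlap < width")
    step = width - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + width])
        if start + width >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Illustrative values only: each chunk repeats the last 2 characters of the previous one.
print(chunk_text("abcdefghij", width=4, overlap=2))
```

Any span no longer than the overlap is guaranteed to fall entirely inside at least one window, which is the property the paragraph above relies on.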

Each chunk is encoded into a dense vector with Google’s gemini-embedding-001 model. We use asymmetric encoding - chunks are embedded with the RETRIEVAL_DOCUMENT task type and queries later with RETRIEVAL_QUERY, so the model places passages and questions into a shared space optimized for retrieval rather than symmetric similarity. Vectors land in Firestore at a dimensionality chosen to balance recall against index size and query cost, comfortably inside Firestore’s native vector-index ceiling.
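As a rough illustration of the asymmetric encoding, the sketch below only builds the request arguments and makes no network call. The helper name and the 768-dimension value are assumptions for the example, not the production settings; the task-type strings and model name are the ones named above.

```python
MODEL = "gemini-embedding-001"

# Passages and questions get different task types so they meet in a
# retrieval-optimized space rather than a symmetric-similarity one.
TASK_TYPES = {"document": "RETRIEVAL_DOCUMENT", "query": "RETRIEVAL_QUERY"}


def embed_request(text: str, role: str, dims: int) -> dict:
    """Build keyword arguments for an embedding call (hypothetical helper)."""
    if role not in TASK_TYPES:
        raise ValueError(f"role must be one of {sorted(TASK_TYPES)}")
    return {
        "model": MODEL,
        "contents": text,
        "config": {"task_type": TASK_TYPES[role], "output_dimensionality": dims},
    }


# dims=768 is a placeholder; the real dimensionality is not disclosed.
print(embed_request("What does the Pro plan cost?", "query", dims=768))
```

Assuming the google-genai Python SDK, these arguments would be passed along as `client.models.embed_content(**embed_request(...))`; that call is left out here so the sketch runs without credentials.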

Indexing is idempotent and versioned. Every chunk carries an embedding-version stamp and provenance back to its source document; re-ingesting clears prior vectors for that document, and a reconcile job sweeps for orphans whose parent record no longer exists. The whole index step runs out-of-band after upload, so it never sits on the user’s request path.
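The idempotency and reconcile behavior can be sketched with an in-memory stand-in for the vector store. Class names, field names, and the version string are hypothetical; the point is only the shape: clear-then-write per document, and a sweep for chunks whose parent is gone.

```python
from dataclasses import dataclass

EMBEDDING_VERSION = "v1"  # hypothetical version stamp


@dataclass(frozen=True)
class ChunkRecord:
    doc_id: str             # provenance back to the source document
    position: int
    text: str
    embedding_version: str


class ChunkIndex:
    def __init__(self) -> None:
        self.documents: set[str] = set()
        self.chunks: list[ChunkRecord] = []

    def reindex(self, doc_id: str, chunk_texts: list[str]) -> None:
        """Clear prior chunks for the document, then write the new ones."""
        self.documents.add(doc_id)
        self.chunks = [c for c in self.chunks if c.doc_id != doc_id]
        self.chunks.extend(
            ChunkRecord(doc_id, i, t, EMBEDDING_VERSION)
            for i, t in enumerate(chunk_texts)
        )

    def reconcile(self) -> int:
        """Drop chunks whose parent document no longer exists; return the count."""
        before = len(self.chunks)
        self.chunks = [c for c in self.chunks if c.doc_id in self.documents]
        return before - len(self.chunks)
```

Running `reindex` twice with the same input leaves the index in the same state, which is what makes retries safe.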

For URL-sourced knowledge we recrawl on a schedule so the agent tracks site changes. The recrawl is content-addressed: we hash the freshly extracted text and only re-chunk and re-embed when it actually differs from what we last indexed. An unchanged site costs a lightweight fetch-and-compare and zero embedding calls, so we can keep knowledge fresh without re-encoding pages that did not move.
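A small sketch of the fetch-and-compare step, assuming SHA-256 over the extracted text. The hash function and the helper names are assumptions; the source only says the text is hashed.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """Fingerprint the extracted text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(extracted_text: str, last_hash: Optional[str]) -> tuple[bool, str]:
    """Return (changed, new_hash); re-chunk and re-embed only when changed is True."""
    new_hash = content_hash(extracted_text)
    return new_hash != last_hash, new_hash


changed, stored = needs_reindex("Pricing starts at the Team tier.", last_hash=None)
print(changed)  # first crawl: nothing indexed yet, so this page must be embedded
changed, _ = needs_reindex("Pricing starts at the Team tier.", last_hash=stored)
print(changed)  # unchanged page: zero embedding calls
```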

Sliding-window chunking with overlap to preserve cross-boundary context

Asymmetric document/query encoding for retrieval-optimized embeddings

Idempotent, version-stamped writes with orphan reconciliation

Content-addressed recrawl: unchanged pages skip re-embedding entirely

Retrieval: nearest neighbors, then grounded generation

At query time we embed the prospect’s question into the same space, then run an approximate nearest-neighbor search against that tenant’s vectors using cosine similarity - Firestore’s native find_nearest KNN operator over the vector index. The top-k passages are concatenated into a grounding context and injected into a constrained prompt; Gemini then generates an answer conditioned on your retrieved chunks rather than its parametric memory. Because the answer is anchored to retrieved evidence, the failure mode shifts from confident fabrication toward honest “not in the docs,” which is exactly what you want in front of a buyer.
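To make the ranking and grounding steps concrete, here is a pure-Python stand-in. It does an exact brute-force cosine ranking in memory rather than calling Firestore, and the prompt wording, function names, and toy two-dimensional vectors are all assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Rank (text, vector) pairs by cosine similarity and keep the best k texts."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Concatenate passages into a context block (hypothetical prompt wording)."""
    context = "\n\n".join(passages)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say it is not in the docs.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


tenant_chunks = [
    ("Pricing overview", [1.0, 0.0]),
    ("Security whitepaper", [0.0, 1.0]),
    ("Plan comparison", [0.9, 0.1]),
]
passages = top_k([1.0, 0.0], tenant_chunks, k=2)
print(build_grounded_prompt("How much does it cost?", passages))
```

The instruction to admit when the context lacks an answer is what moves the failure mode away from fabrication.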

We run retrieval ourselves rather than delegating to a managed black-box index. That is a deliberate latency and control decision: per-tenant corpora are small, so ANN over a single tenant’s vectors returns in single-digit milliseconds, and the dominant term in end-to-end latency becomes decoder time at the LLM - not retrieval. It also means there is no opaque per-query ceiling we cannot profile or tune. The result is a presales agent that reads as if it actually ingested your sales collateral, because that is precisely what it did. The same flow is summarized for a general audience in our “how it works” overview.

Cosine-similarity ANN via Firestore find_nearest over the vector index

Semantic matching, so paraphrase and synonymy still surface the right passage

Grounded decoding to suppress hallucination; retrieval is not the latency floor

Isolation

Multi-tenancy that is structural, not a WHERE clause

The most common way to leak data in a shared vector store is a filter-based design: one big index, a tenant-id predicate on every query, and a cross-tenant disclosure that is only ever one forgotten predicate away. We do not do that. Each tenant’s chunks live in a separate Firestore subcollection keyed by a deterministic owner prefix, so a query is scoped by its collection path to a single tenant’s partition. The isolation is a property of where the data sits, not of remembering to add a filter - there is no shared partition for a bad predicate to span.

One canonical owner-prefix derivation is reused across indexing, query, and migration so the write path and read path can never resolve to different partitions. Document deletes cascade to their chunks, audit fields trace every vector to a live parent record, and free and paid tenants get byte-for-byte identical isolation. Details on the broader posture live in our security practices.
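One way to picture the single shared derivation is a pair of pure functions that every code path imports. The hashing scheme, the prefix length, and the `tenants/.../chunks` path layout below are hypothetical; the actual derivation is not described here.

```python
import hashlib


def owner_prefix(tenant_id: str) -> str:
    """Deterministic prefix for a tenant (hypothetical derivation)."""
    return hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()[:16]


def chunk_collection_path(tenant_id: str) -> str:
    """Path of the tenant's own chunk subcollection (assumed layout)."""
    return f"tenants/{owner_prefix(tenant_id)}/chunks"


# Indexing, query, and migration would all call the same function,
# so writes and reads cannot resolve to different partitions.
write_path = chunk_collection_path("acme")
read_path = chunk_collection_path("acme")
print(write_path == read_path)
```

Because the tenant is part of the path itself, a query has no way to name another tenant's chunks by omitting a filter.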

Intelligence: conversations as a labeling problem

Generation answers the prospect; a second inference pass reads the same dialog as structured signal. Crucially, this is passive inference over what was said - we never interrogate the prospect with a qualification script. Three outputs come off every conversation, and they are modeled as separate heads rather than one blended score:

Topic modeling

A multi-label classification over the dialog: pricing, features, integrations, security, and so on. A conversation can carry several topic labels at once, which is what makes the aggregate view - what your whole audience probes for - meaningful.

Buyer intent (Hot / Warm / Cold)

Discrete buying-intent signals inferred from the exchange are reduced to an ordinal temperature - Hot, Warm, or Cold. This is the headline metric reps triage on: who to call first.

MEDDIC qualification scorecard

An independent extraction over the six MEDDIC dimensions - Metrics, Economic Buyer, Decision Criteria, Decision Process, Identify Pain, Champion - scored as an N-of-6 coverage signal: how winnable the deal is, and where the gaps are.

The design decision worth flagging is that intent and qualification are orthogonal axes, and we keep them that way on purpose. Intent estimates how warm someone is; MEDDIC estimates how winnable they are. MEDDIC deliberately does not feed the temperature - collapsing them would let a thorough, well-qualified researcher who is months from buying masquerade as a hot lead, and vice versa. Both heads emit independently and then sync to your CRM - Attio, HubSpot, and others - so the labels land where your team already works.
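A sketch of what keeping the heads separate might look like as a record type. The class and field names are assumptions; only the three temperature levels and the six MEDDIC dimensions come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    COLD = 0
    WARM = 1
    HOT = 2


MEDDIC_DIMENSIONS = (
    "Metrics", "Economic Buyer", "Decision Criteria",
    "Decision Process", "Identify Pain", "Champion",
)


@dataclass
class ConversationLabels:
    topics: set[str]           # multi-label: several topics may apply at once
    intent: Intent             # ordinal temperature, set independently of MEDDIC
    meddic: dict[str, bool]    # which dimensions the dialog covered

    def meddic_coverage(self) -> int:
        """N in the N-of-6 score; never feeds back into intent."""
        return sum(1 for d in MEDDIC_DIMENSIONS if self.meddic.get(d, False))

    def meddic_gaps(self) -> list[str]:
        return [d for d in MEDDIC_DIMENSIONS if not self.meddic.get(d, False)]


# A thorough researcher who is months from buying: well qualified, still cold.
researcher = ConversationLabels(
    topics={"pricing", "security"},
    intent=Intent.COLD,
    meddic={d: d != "Champion" for d in MEDDIC_DIMENSIONS},
)
print(researcher.intent.name, researcher.meddic_coverage(), researcher.meddic_gaps())
```

Since neither field is computed from the other, a high coverage count cannot raise the temperature, and a hot conversation can still show qualification gaps.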