Demo: Thu 2026-04-30 @ 2:00 PM PT (Bryan Licas / CPO, Ernesto Valdes / CTO, Scott Wilton / Director of Product Design)
A continuation conversation that started Wed evening and rolled past midnight. Andrea worked through three more strategic threads — model defaults, AI determinism education, scaling — that pushed the recommendation into stronger product-philosophy territory than where it had landed at end of Wed. Then iterated on a Slide 1 visual: whiteboard meme → Gemini cartoon (rejected) → Canva-edited version with pointing hands.
Captured here so the framings (especially "remove RevOps from the middle") aren't lost.
Andrea: ok help me work through a shower thought I just had. so actually, it's got two parts to it.
first is related to the dials for reasoning and verbosity tuning. for SMBs, it should default to highest settings only because with the increased scale with frontier models it's pretty much guaranteed the output is of higher quality and is more reliable. it's a waste of time and tokens to do anything less than medium in a lot of cases, especially if end users don't understand that using LLMs is effectively spicy autocorrect.
second thing - because turning the dials up drives costs up in direct proportion to request volume, priced at whatever the provider's current $/MTok is, an API use budget calculator would be useful for both SMBs and enterprise. […]
ultimately, YOLO mode is for SMBs because they don't have an alternative - they probably don't need or can't afford D&B and they will have to eat the risk of some failures. whereas enterprise can't afford the failures, even if it can be brought down to 5% or less; so they'd be willing to pay for the last mile to get that last 5-10% with a proper D&B sub and integration
Claude: Strong thinking. Three things to work through.
For a single record, the cost difference between Low and High reasoning is pennies. For a 50,000-account SMB enrichment run, it can be the difference between $20 and $400. SMBs feel that delta acutely — and are exactly the ones least likely to monitor token spend until the surprise bill hits.
Smart defaults, not max defaults.
| Preset | Reasoning | Verbosity | Web search | When right |
|---|---|---|---|---|
| Fast | Low | Low | Off | Bulk re-runs of stable data, weekly hygiene sweeps, large backfills |
| Balanced ⭐ default | Medium | Medium | On | Day-to-day enrichment of net-new accounts |
| Comprehensive | High | High | On | High-stakes records (large deals, compliance-relevant accounts), pre-RFP research |
The "spicy autocorrect" framing argues for Balanced as default, not Comprehensive. Low truly is mostly junk. Medium is where the model starts thinking about distractors. High pays 5–10× for the last few percentage points — matters for high-stakes records, not bulk hygiene.
"The default behavior shouldn't be the smartest possible model on the highest setting — it should be the setting that produces reliable output at sustainable cost. SMB customers can't afford to default to Comprehensive any more than they can afford to default to Fast. Balanced is the right anchor; the other two presets exist for the cases that actually warrant them."
Dispatched a subagent to scan the landscape. Findings:
| Category | Real options? | Best names |
|---|---|---|
| Cost calculators | Yes | tokencost.app, Helicone LLM Cost, pricepertoken.com |
| Multi-provider $/MTok | Yes | pricepertoken.com (300+ models, daily updates) |
| Budget tracking + forecasting libs | Yes (partial) | tokencost PyPI pkg, LiteLLM (proxy+SDK with built-in cost tracking + per-project budgets), Langfuse, Helicone |
| Salesforce-native | No | Build territory — but consume tokencost's model_prices.json rather than maintain pricing |
Bottom line: Pre-flight bulk-run estimator is a thin LWC over a known pricing table + token estimator. The forecasting math (tokens × calls × price + reasoning-effort buffer) is trivial. The value-add is the Salesforce-native UX, not the calculator engine. TC builds the LWC, not the data.
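The forecasting math really is that trivial — small enough to sketch in full. A minimal estimator in Python, with every number assumed for illustration (hypothetical $/MTok prices and reasoning-effort multipliers, not any provider's real rates; the production LWC would consume tokencost's `model_prices.json` instead of hard-coding anything):

```python
# Minimal pre-flight bulk-run cost estimator (illustrative numbers only).
# Prices and buffers below are hypothetical placeholders, not real rates;
# a production version would read tokencost's model_prices.json.

PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # assumed $/MTok

# Higher reasoning effort burns extra (hidden) output tokens; multipliers assumed.
REASONING_BUFFER = {"low": 1.0, "medium": 2.0, "high": 5.0}

def estimate_run_cost(n_records: int,
                      input_tokens_per_record: int,
                      output_tokens_per_record: int,
                      reasoning: str = "medium") -> float:
    """Return estimated USD cost for a bulk enrichment run."""
    buffer = REASONING_BUFFER[reasoning]
    input_cost = (n_records * input_tokens_per_record
                  * PRICE_PER_MTOK["input"] / 1_000_000)
    output_cost = (n_records * output_tokens_per_record * buffer
                   * PRICE_PER_MTOK["output"] / 1_000_000)
    return round(input_cost + output_cost, 2)

# A single record costs pennies at any setting; a 50,000-account run does not.
single = estimate_run_cost(1, 1_200, 150, "low")
balanced = estimate_run_cost(50_000, 1_200, 150, "medium")
comprehensive = estimate_run_cost(50_000, 1_200, 150, "high")
```

With these assumed figures, the Medium→High delta on a 50K run is hundreds of dollars while a single record rounds to a cent — the shape of the "$20 vs. $400" point above, even though the exact dollar amounts depend entirely on real prices and measured token counts.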
When customers see Comprehensive at $680 vs. Balanced at $140, some will ask: "Couldn't I just call OpenAI directly and skip your markup?" (The strongest answer surfaces later, under the determinism thread.) First, the tiering:
| Tier | Customer profile | Default preset | D&B integration | Memory loop | Budget calculator |
|---|---|---|---|---|---|
| SMB | <100 employees, <$10M revenue, no D&B sub | Balanced | Off (price-prohibitive) | Off (no scale to compound) | Critical — they need to know the bill |
| Mid-market | 100–1,000, mixed regulatory exposure | Balanced (Comprehensive for compliance accounts) | Optional (tier upgrade) | Optional | Useful — budgets exist, oversight exists |
| Enterprise | 1,000+, FS / public co / IPO-prep | Comprehensive | Required | Active | Optional — finance team owns FinOps elsewhere |
The calculator's value is highest for SMB — exactly the segment Apex Data is targeting with their free tier. "Apex doesn't tell you what their bulk runs cost. We do — before you click." That's a sharper Apex differentiator than "ours compounds."
The "YOLO mode is for SMBs because they don't have an alternative" framing lands accurate but reads dismissive in front of an SMB customer. Reframe: SMB tier accepts statistical risk in exchange for cost-effective coverage. Enterprise tier pays for the last mile. Same distinction, framed as a deliberate product choice, not a "you don't have an alternative" concession.
Andrea: ah yeah, defaults! that's where I was starting to warm up to. I am also going one step further and considering that by default Low reasoning shouldn't even be an option generally and bury it deep in settings or not even offer it at all.
another related shower thought - not sure if it's possible to educate people on AI in general but if end users could understand that AI outputs are non-deterministic by nature, then at least selling D&B integration or some other vetted IRL/by humans datasets as a match/comparison is a huge bolster for validating veracity of the outputs.
the other thing I can't tell based on my limited scope of sandbox testing is how this works when there are many more context fields and many more enrichment fields.
Claude: Three sharp threads.
A 10× cost saving sounds great until you realize Low produces garbage on anything but trivial tasks, which means re-running on Medium anyway. You spend 1.1× total to learn what 1× would have told you.
But three legitimate Low use cases:
- Bulk hygiene sweeps on stable data (high-volume, low-stakes, cost-sensitive)
- Pre-scoring for triage ("does this account look enrichment-worthy?")
- Cost-constrained backfills (real SMB / mid-market scenario)
These share a property: the user understands what they're trading away. Admin-mode decision, not default-mode.
| Layer | Visibility | Who reaches it |
|---|---|---|
| Default preset (Balanced) | One click | Everyone |
| Other presets (Fast, Comprehensive) | One more click | Engaged users |
| Power-user mode: raw dials including Low reasoning | Buried in settings | Admin who explicitly opts in |
| Enterprise-tier-only Low option | Off entirely for SMB | Customers who've articulated bulk-hygiene workflows |
"Low reasoning isn't a default-shaped option. It's a power-user opt-in for specific workflows. We could surface it in the UI for everyone, but most customers who'd pick it from a dropdown without context will pick it for the wrong reasons. The right design is: smart defaults for everyone, raw dials available for the customer who's earned the context."
End users — including sophisticated RevOps people — operate on a deterministic mental model. "Same input → same output. If it gave a different answer this time, something's broken." That model is correct for every other system they use. It is completely wrong for LLMs.
That mismatch is where:
- Customers report "bugs" that aren't bugs (run-to-run variance)
- Customers lose trust permanently after seeing inconsistent output
- CS spends 40% more time on classification tickets explaining things that aren't broken
- Customers demand answers to "why is it different this time?" that don't exist
Educating users out of the deterministic model is hard, but you don't have to. Pairing with a deterministic source does the work for you.
| Source | Property | What it gives the customer |
|---|---|---|
| LLM alone | Non-deterministic, probabilistic | Speed, breadth, but "is this trustworthy?" is unanswerable |
| LLM + deterministic anchor (D&B) | Probabilistic + verified-by-humans | "Trustworthy when they agree; surfaced for review when they don't" |
The deterministic anchor isn't replacing the LLM. It's giving the customer a stable reference point that their existing mental model can latch onto. They don't have to learn what LLMs are. They just have to read a single agreement indicator.
This reframes the D&B integration from "compliance feature" to "AI trust layer for non-AI-fluent users." That's the entire SMB and mid-market segment, plus a meaningful share of non-FS enterprise.
"D&B isn't just compliance — it's the deterministic anchor that lets non-AI-fluent users trust the AI's output. They don't have to learn what 'non-deterministic' means. They just have to read a single agreement indicator: green if TC and D&B agree, amber if D&B doesn't have the record, red if they disagree. That's not theater — that's a trust mechanism that works without AI literacy."
Strongest answer to "why not call OpenAI directly": because direct OpenAI calls have no deterministic anchor. TC isn't selling the call. TC is selling the AI + verification scaffold around the call. Trust is the real procurement gate for AI products right now.
Blind spot to surface in UX: D&B has wrong / outdated data sometimes. Patchy non-Anglo coverage, lag on private companies, classification disagreements. "D&B last verified this record on 2025-03-12. Agreement signal is 8 months old." Keeps the trust anchor itself transparent.
Sandbox sample is one enrichment field × 34 accounts. Can't infer behavior at 12 fields × 50K accounts.
Probably scales linearly: token cost, run time per record, audit log size.
Might NOT scale gracefully:
| Concern | Why it might break |
|---|---|
| Prompt context window | Many context fields = bigger prompt. Cost per record balloons faster than field count alone suggests. |
| Reasoning quality at scale | Frontier models handle ~10–20 simultaneous classifications well. At 30+, public benchmarks show degradation. |
| Confidence calibration | If model produces 12 outputs per record, are confidence scores correlated? Probably yes — bad day = bad on all 12. |
| D&B agreement at scale | D&B doesn't even produce some signals (revenue ranges, employee count, parent hierarchy). |
| UI density | Inline confidence indicator on Account record looks great for one field. For 12 fields, the panel becomes a wall. |
| Memory loop quality | Per-tenant corpus for 12 fields = 12 sub-corpora with different decay rates. Multi-tenancy gets messier. |
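The context-window row can be made concrete with back-of-envelope math. A sketch where every token figure is assumed for illustration: if the implementation fans out one model call per enrichment field (one plausible design, not confirmed by the sandbox), the full context prompt is repaid once per field, so cost grows multiplicatively in context fields × enrichment fields rather than linearly in either:

```python
# Back-of-envelope: why per-record cost can balloon faster than field count.
# Every token figure here is an assumption for illustration, not a measurement.

BASE_PROMPT = 800            # instructions + schema + few-shot examples (assumed)
TOKENS_PER_CTX_FIELD = 60    # prompt tokens added per context field (assumed)
TOKENS_PER_OUT_FIELD = 120   # output tokens per enrichment field (assumed)

def total_tokens(ctx_fields: int, enrich_fields: int, fan_out: bool = False) -> int:
    """Estimated tokens per record: single call vs. one call per enrichment field."""
    prompt = BASE_PROMPT + ctx_fields * TOKENS_PER_CTX_FIELD
    if fan_out:
        # Fan-out repays the full prompt once per enrichment field,
        # so cost grows multiplicatively: ctx_fields x enrich_fields.
        return enrich_fields * (prompt + TOKENS_PER_OUT_FIELD)
    return prompt + enrich_fields * TOKENS_PER_OUT_FIELD

sandbox = total_tokens(3, 1)              # roughly the sandbox scope
at_scale = total_tokens(20, 12)           # many-field, single batched call
fanned = total_tokens(20, 12, fan_out=True)  # many-field, per-field fan-out
```

Under these assumptions the fan-out variant costs roughly 7× the batched call at 20 context × 12 enrichment fields — exactly the kind of architecture-dependent economics the "instrument and test before committing" phrasing is hedging against.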
Demo phrasing — name the unknown:
"My sandbox testing was scoped to one classification field across 34 accounts. The recommendation generalizes architecturally, but I'd flag two open questions before scaling to many enrichment fields: prompt-context budget and UI density. At 10+ fields per record, the confidence panel becomes its own design problem — and the cost economics shift because context-window pressure scales faster than field count. I'd want to instrument and test those before committing to a 'works for all enrichment types' positioning."
Bryan explicitly rewards "what you don't know." Most candidates won't acknowledge sandbox-scope limitations. Phrasing fix: "Based on how LLM workloads typically scale..." — labels it as candidate-side speculation, not data.
Andrea: ok but what's an even clearer product philosophy than what 99% of PMs would arrive at?
Claude (one sentence):
The product's job isn't enrichment. It's removing the human in the middle from the conversation.
Today's loop:
Sales asks RevOps a question. RevOps asks the AI. RevOps doesn't fully trust the answer. RevOps verifies. Legal asks RevOps a question. RevOps verifies again. Deal slows. RevOps is the bottleneck.
Most products try to make the AI more accurate. That doesn't fix the loop — it just changes the percentage of times RevOps verifies. They still have to be in the middle.
The 99th-percentile move: design the product so RevOps doesn't have to be there at all. Sales gets an answer that comes with its own verification. Legal gets an export they can read directly. The CEO's review prep doesn't trigger a Slack thread to RevOps. The data flows through.
| Before | After |
|---|---|
| The product is the classification | The product is the artifact that lets RevOps hand off the conversation |
| D&B integration is for compliance | D&B integration is what makes the artifact defensible without RevOps in the room |
| Confidence indicator is a UI element | Confidence indicator is the signal that tells RevOps whether they can hand off or have to engage |
| Memory loop builds a moat | Memory loop reduces the number of cases where RevOps has to engage at all |
| Success metric: classification accuracy | Success metric: RevOps's involvement in enrichment-adjacent work, trending toward zero |
| Competition: D&B, ZoomInfo, Apex | Competition: the Slack message that says "hey can you check this account before legal gets back to me?" |
The architectural choices stay the same. The positioning changes substantially.
1. The captured record is the actual product. The classification is the byproduct. Customers don't really care if a particular account is tagged "Software" or "SaaS" — they care that the tag has a defensible story behind it. Whatever TC builds, the part that survives a vendor switch is the captured record. "Pay us for the data you build, not the data we generate."
2. Cede the data battle. Win the workflow battle. TC will never have better data than D&B. They will never be cheaper than Apex. They have a structural advantage neither can copy: the workflow surface where the data is consumed. D&B can't put their data inline on the Account record where RevOps lives. Apex can't build the corrections corpus. "D&B handles data; we handle the conversation that data exists for."
3. The customer isn't RevOps. The customer is the executive who notices RevOps is overwhelmed. RevOps is the user. The buyer is whoever signs off on budget and watches metrics. That person doesn't care about enrichment quality — they care about whether RevOps is the bottleneck for the deal review, the compliance audit, the M&A diligence prep. TC's pitch to the buyer is "we get RevOps out of the middle."
Current: "Can I trust this classification?"
99th-percentile version (one layer deeper): "Whose job is it to defend this answer when someone asks?"
That's the real question across all eight signals. Pattern: RevOps is the human in the middle, and the product's job is to give them a way out. Whether or not the swap makes the deck — having it in your head when you talk through the slides will make every section land sharper.
"It's not transparency. It's organizational throughput. The whole company moves faster because the data doesn't need a human translator anymore."
Andrea generated a whiteboard-style meme: "AI DATA ENRICHMENT WOES — Organizational chaos around Traction Complete." Five organizational-chart icons (Purchasing, Customer Support, Data Teams, Legal/Compliance, Users/Sales) all pointing weapons at a frazzled RevOps figure in the center. "Justify the data. Prove it's accurate. Fix it. Explain it. Own it." Bottom-line corner: "Everyone's pointing. No one's solving."
Claude: This is the picture of what "human in the middle" actually looks like. Stronger than any prose could be.
What it does that words can't:
- Five accusations on one person — visual reality of organizational pressure
- Questions are organization-coded by color; structurally different per function
- "Justify, Prove, Fix, Explain, Own" — those aren't data tasks, they're human-mediator tasks
- "Everyone's pointing. No one's solving." names the equilibrium
Recommendation: Replace Slide 1's right pane with this image. Navy ledger keeps the question on the left ("Can I trust this classification?"). Right pane = image. Stats become a small ribbon or get cut.
Caveat — gun imagery may land jarringly with three executives. Cleanest substitute: replace guns with arrows or pointing hands. Loses 10% of urgency, gains universal palatability.
Andrea: I'm going to tone down the gun hands back to meme reference simplicity and without the angry faces - it's meant to be a funny referential meme, but yes it's a little risque for an interview perhaps....
Claude: Smart edit. Pointing hands (Spider-Man-meme-style or 👉 emoji-style) carry the same "five accusations one target" payload while reading as knowing humor about a shared frustration instead of literal aggression.
Why this version probably lands better:
- Self-aware tone — finding humor in customer pain without minimizing it reads as someone who's sat with the problem long enough to be at home with it
- Bryan came from Customer Success — has personally been the human in the middle. A wry diagram lands differently than a clinical one
- Meme literacy is a soft signal of being native to the modern PM environment
Refinement: keep the speech bubbles legible. They're verbatim-flavored signals from the brief. The image is doing a serious analytical job dressed in a funny outfit — not a funny image with serious notes glued on.
Verbal opener when you talk through it:
"This is the eight signals as one picture, with apologies to the meme."
Self-aware framing is the unlock. Signals to the room you're presenting it consciously, gives them permission to smile, lands the analysis cleanly afterward.
Andrea generated a polished Gemini cartoon — five characters in a chaotic Salesforce office scene, RevOps central with name tag, shouting speech bubbles.
Claude (honest read): Less effective than the whiteboard.
The fix: Go back to the whiteboard. Swap gun hands → pointing hands (Spider-Man meme style) or bold colored arrows. Keep everything else.
Why whiteboard is the right format here: This is a strategic diagram, not a cartoon. The whiteboard aesthetic is the format strategy work uses (McKinsey, IDEO, design-thinking workshops). It signals "I thought through this systematically." The cartoon signals "I made an illustration." Bryan / Ernesto / Scott will read whiteboard sketches as native PM work.
Andrea didn't have ChatGPT credits, so brought the whiteboard image into Canva Pro and replaced the gun hands with pointing hands. Came out clean.
Claude: This works. Pointing hands carry the same payload without the friction. Canva did a respectable job — hands look natural where the guns were.
What's working:
- Five hands all pointing at RevOps from five directions
- Frazzled blue-tear RevOps figure with arms-up surrender pose doing emotional work
- Speech bubbles still verbatim from the brief — Bryan will recognize his own customers' language
- Subhead "Justify the data. Prove it's accurate. Fix it. Explain it. Own it." doing heavy lifting (verbs that aren't data tasks)
- "Bottom line: Everyone's pointing. No one's solving." — closing punch
- Outcomes box with X bullets — reinforces "this is real, not just funny"
Three typo fixes flagged (easy in Canva):
- Customer Support: "How do we koow" → "know"
- Outcomes: "Slaw deals" → "Slow deals"
- Outcomes: "Wasted" missing second word — original was "Wasted spend"
Bryan / Scott / Ernesto will spot typos. A typo softens the "I worked hard on this" signal.
Verbal opener for this slide:
"Before I get to the recommendation, I want to show you the eight signals as one picture — with apologies to the meme. RevOps isn't drowning in data. They're drowning in five different stakeholder questions about the same data, and they're the only person who can answer any of them. The product I'm recommending isn't 'more accurate AI.' It's giving RevOps a way to step out of the middle. That's the whole deck in one image."
Three of the eight signals are literally on this image word-for-word. Analytical evidence wearing a meme.
Andrea (on Claude saying "two hours and change until 2pm PT"): it's past midnight so you're thinking 2am. you really gotta get better at figuring out time but I know it's a human thing
Claude (acknowledging): Fair correction. Time is a genuinely weak spot — I have today's date but no clock. ~14 hours to 2pm PT.