Governance first: workshop uses synthetic or anonymized data only. If you process real data through a cloud LLM, you need a no-training, no-retention DPA. For healthcare/pharma, a HIPAA BAA is also required. Prompts are shown verbatim so you know exactly what gets sent.
Few teams are listening.
Most organizations collect data they never act on. In workshop after workshop, teams discover the insight they need already exists. It is just buried. The Data Mining Sprint surfaces it in under three hours.
Each prompt feeds the next. Clean output becomes Classify input. Classify output becomes Extract input. No prompt works alone. The pipeline is the product.
~8 minutes per step · JSON in, JSON outProduction note: Add checkpointed stages, retry logic, and dead-letter handling in production.
Deterministic first: Use pandas/polars for dedup, date normalization, and null filtering before any LLM call. Batching: Large CSVs exceed context windows. Process in chunks of 500–1000 rows with checkpointed stages. Schema: Enforce JSON output via structured response modes (OpenAI response_format, Anthropic tool_use, or Pydantic). Prompt instructions alone are not enforcement. Determinism: temperature=0 reduces variance but does not guarantee identical outputs across separate API calls. Use code for operations requiring exact reproducibility.
We cluster, rank, and dot-vote. The topic with the most dots becomes the centerpiece of your business case. In many sessions, someone has a moment of recognition: "Wait, 30% of our tickets are about one bug we never prioritized."
Every finding needs math. Not just what we found, but what it costs to ignore it. If we can't fill key variables with a real number from your company, the business case isn't ready. We stop and get the data.
Top insight, business case, and one surprise. No slides required. Whiteboard + voice.
Clarity (1-5) and Conviction (1-5). Was the insight clear? Did the math hold up?
"What would make this wrong?" Teams defend assumptions. The room learns from pushback.
Selection weights: business impact (40%), feasibility (30%), clarity of presentation (30%). The team with the highest composite score nominates the pilot candidate. Not a popularity contest.
These gates apply to every workshop dataset. The 80% threshold is a workshop screening step. Full validation requires a separate protocol.
Under three hours. One CSV. A reusable prototype pipeline, a draft business case, and a team that knows how to mine its own data. I facilitate this for leadership offsites, client workshops, and team intensives.