AI Bookkeeper: Going Back to the Core
Client
A leading fintech in Asia, serving >500k SMB customers.
Challenge
The company had launched several AI-powered features, but engagement and trust were low. Key issues surfaced:
- Redundancy: Some features were replicable by copy-pasting into ChatGPT.
- Low impact: Many “nice-to-haves” failed to address real pain points.
- Neglected core: The bookkeeping automation — used millions of times per month — had plateaued. Over one-third of transactions were misclassified, forcing heavy manual checking. Labels were unreliable, free-text bank descriptions were cryptic and sparse, and the feature had been overshadowed by “shiny” GenAI experiments.
Objective
Reframe the problem: instead of chasing peripheral GenAI demos, double down on AI-driven bookkeeping — the one feature with 100M+ annual uses. Align leadership on a north star of “driving manual checks to zero” (a practical, user-focused goal), with staged milestones:
- +5% accuracy lift in 6 weeks with two ML engineers.
- Another +5% to warrant a partial rollout.
- Deliver 1–2 AI-driven features that feel essential, not optional — making the product indispensable.
Approach
Deep discovery
- Conducted interviews with users, product managers, marketing, C-suite, and ex-employees.
- Ran hands-on product testing to feel the friction directly.
- Analyzed usage logs to see where the existing system fails.
System redesign
- Replaced brittle manual rules with rule-induction algorithms that learned per-user rules (critical since transaction classification varies widely).
- Designed a two-tier AI system:
- White-box, personalized models for high-usage users (transparent, tailored, simple).
- Generalizable NLP-based models for newer/low-frequency users (effective when history is sparse).
Hypothesis-driven iteration
- Generated >10 hypotheses for improving two NLP models (e.g. upgrading to LLM-derived embeddings, replacing MLP-Mixer with attention layers, unifying preprocessing logic).
- Ran >20 lightweight experiments (1–3 day cycles) for the rule learner — from decision tree tuning to mining regex features from LLMs.
- Embedded confidence-based mechanisms: abstain or prompt for review when uncertain, automate only when high-confidence.
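The confidence-gating described above can be sketched as a simple routing function: auto-apply a category only when the classifier’s top probability clears a threshold, otherwise surface it as a suggestion for manual review. The threshold value and function names are assumptions for illustration.

```python
# A minimal sketch of confidence-gated automation: automate only
# high-confidence predictions, abstain to human review otherwise.
# The threshold is an assumed value; in practice it would be tuned.
import numpy as np

AUTO_THRESHOLD = 0.90  # hypothetical; tune per category or per user

def route(probabilities: np.ndarray, categories: list[str]) -> tuple[str, str]:
    """Return (decision, category), where decision is 'auto' or 'review'."""
    top = int(np.argmax(probabilities))
    if probabilities[top] >= AUTO_THRESHOLD:
        return "auto", categories[top]
    # Below threshold: abstain from automation, show as a suggestion only.
    return "review", categories[top]

cats = ["cloud", "travel", "salaries"]
print(route(np.array([0.95, 0.03, 0.02]), cats))  # -> ('auto', 'cloud')
print(route(np.array([0.55, 0.40, 0.05]), cats))  # -> ('review', 'cloud')
```

Gating like this trades a little coverage for trust: the automation only acts when it is likely to be right, and uncertain cases become labeled training data via the review queue.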
Data-driven loop
Manual inspection of misclassified data → hypothesis → evaluation → adjustment. This rapid feedback loop built momentum and sharpened model quality.
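The evaluation half of this loop can be sketched as a small harness: score each candidate model variant on a held-out set and promote whichever beats the current baseline. The stub classifiers and held-out examples below are hypothetical stand-ins for the real models and data.

```python
# A minimal sketch of the hypothesis -> evaluation -> adjustment loop:
# compare candidate classifiers against a baseline on held-out data.
# All classifiers and examples here are hypothetical stubs.
from typing import Callable

def evaluate(predict: Callable[[str], str],
             test_set: list[tuple[str, str]]) -> float:
    """Fraction of held-out transactions classified correctly."""
    correct = sum(predict(desc) == label for desc, label in test_set)
    return correct / len(test_set)

held_out = [
    ("AWS invoice", "cloud"),
    ("Shell fuel", "travel"),
    ("Payroll May", "salaries"),
]

baseline = lambda d: "cloud"  # stand-in for the current model
candidate = lambda d: ("cloud" if "AWS" in d
                       else "travel" if "Shell" in d
                       else "salaries")  # stand-in for one hypothesis

best_fn, best_acc = baseline, evaluate(baseline, held_out)
for fn in [candidate]:  # in practice: one experiment per hypothesis
    acc = evaluate(fn, held_out)
    if acc > best_acc:
        best_fn, best_acc = fn, acc

print(f"best accuracy: {best_acc:.2f}")  # -> best accuracy: 1.00
```

Keeping the harness this cheap is what allows 1–3 day experiment cycles: every hypothesis pays only the cost of writing the candidate, not of building evaluation infrastructure.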
Impact
A one-third reduction in missing or incorrect predictions, which dramatically cut manual corrections while keeping decisions interpretable and boosting customer satisfaction. This was the single biggest leap in the product’s history.
Takeaways
- Focus matters: doubling down on neglected core workflows beats peripheral “AI theatre.”
- Hybrid wins: combining symbolic rules with ML (i.e., a neuro-symbolic approach) delivered accuracy, interpretability, and adaptability.
- Momentum through loops: 30+ small experiments in weeks broke a years-long plateau.
- Prioritize painkillers over vitamins: trust and indispensability come from solving the daily, high-friction problems.