Krungthai Frontier Lab · BASIN Data Strategy

Stop being the pipes. Ingest once, let agents consume.

Today every app builds its own pull from the DWH, datamarts, SAS files and spreadsheets, with the logic living in one person's head and humans doing the keying. BASIN flips it: each source is ingested once into a bronze→silver→gold spine, and agents consume from one place. Here's what we ingest first, why, and the part nobody likes to say out loud: reconciliation and upkeep are the real cost.

Machine source · API / structured (ingest first)
Filings / documents · OCR-VLM (ingest first)
Human-touched · reconciliation-heavy (later)

What we ingest first, and why

Machine sources, where AI gives the biggest clean win and there's no human bottleneck. These cover the listed corporates the CA report targets.
API
SET / SETSMART → PSIMS
Listed-company financials & profiles, straight from SET via API. Authoritative, structured, daily. This is the backbone: start here.
OCR-VLM
One Report / 56-1 & MD&A
Regulatory filings, semi-structured PDFs, machine-fetchable. Carry the qualitative story + audited numbers the memo needs.
OCR-VLM
Audited financial statements
From filings or client uploads. Deterministic extraction → the financial spread. The single highest-value gold product.
FEED
News & market signals
SET news feed + public sources, as streams. Powers EWS / red-flag features and context. Cheap to ingest, high signal.
API
DBD registration data
For the non-listed names. Bulk/API, structured. Needed the moment the book goes beyond SET-listed corporates.

What waits: the human-touched

Valuable, but the source is a person or a gated internal system. AI can assist, but it cannot fix the upstream. Sequence these after the spine is proven.
GATED
Internal exposure & relationship data
Lives in KTB source systems / the DWH. Access-gated and inconsistently defined; needs governance + entity reconciliation before it's trustworthy.
HUMAN
Auditor notes & qualitative judgement
Semi-structured, often tacit. AI can summarise and tag, but the value depends on a human having written it well.
HUMAN
Past CA memos & RM know-how
The knowledge moat, but authored by people, unevenly. Best captured at source: the CA app itself becomes the structured-capture point.
The sequencing rule: ingest machines first, humans last, and where the source is a human, don't try to clean it downstream. Capture structured data at the point of work (the CA app, ViaLink at the BSC) so the human's normal workflow produces clean data as a by-product.
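A minimal sketch of what "capture at the point of work" could look like, assuming the CA app validates each keyed figure before it is saved. The record shape, the field names and the `ALLOWED_CURRENCIES` set are hypothetical illustrations, not BASIN's actual schema.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative only: the real list of accepted currencies is an assumption.
ALLOWED_CURRENCIES = {"THB", "USD"}


@dataclass(frozen=True)
class CapturedFigure:
    """One figure keyed by a human, validated at entry instead of cleaned later."""
    entity_id: str
    metric: str
    amount: float
    currency: str
    fiscal_year_end: date

    def __post_init__(self) -> None:
        # Reject bad input while the person is still in the form.
        if not self.entity_id.strip():
            raise ValueError("entity_id is required")
        if not self.metric.strip():
            raise ValueError("metric is required")
        if self.currency not in ALLOWED_CURRENCIES:
            raise ValueError(f"unknown currency: {self.currency!r}")
        if not isinstance(self.fiscal_year_end, date):
            raise ValueError("fiscal_year_end must be a date")


# A valid entry is stored as-is; an invalid one never reaches the spine.
figure = CapturedFigure("ENT-001", "total_debt", 1250.0, "THB", date(2024, 12, 31))
```

The point of the design is that the rejection happens in the person's normal workflow, so nothing downstream has to guess what a free-text entry meant.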

The part nobody says: ingestion is the easy 20%. Reconciliation & upkeep are the other 80%.

The same company is “CKP” in SET, a different legal entity in DBD, and three exposures internally, with different fiscal years, restatements and currencies. Keeping BASIN trustworthy is an ongoing data-engineering job, not a one-off build: schemas drift, the SET API changes, filings reformat, new corporates appear, quality decays.
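One piece of that upkeep, catching schema drift before it reaches consumers, can be sketched as a check on each incoming record. This is an assumption-laden illustration: `EXPECTED_FIELDS` and its field names are hypothetical and do not describe the real SET API payload.

```python
from typing import Any

# Hypothetical expected schema for one record; field names are illustrative.
EXPECTED_FIELDS: dict[str, type] = {
    "symbol": str,
    "fiscal_year": int,
    "revenue": float,
}


def detect_drift(
    record: dict[str, Any],
    expected: dict[str, type] = EXPECTED_FIELDS,
) -> dict[str, list[str]]:
    """Report fields that are missing, unexpected, or of the wrong type."""
    missing = sorted(expected.keys() - record.keys())
    unexpected = sorted(record.keys() - expected.keys())
    wrong_type = sorted(
        name
        for name in expected.keys() & record.keys()
        if not isinstance(record[name], expected[name])
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


# A renamed field and a stringified year both surface in the report.
report = detect_drift({"symbol": "ABC", "fiscal_year": "2024", "total_revenue": 1.0})
```

A non-empty report would typically block promotion of the batch and notify the standing owner rather than fail silently.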
✓ Where AI genuinely helps

Entity / group resolution (fuzzy-matching names & IDs), schema mapping, extraction from poor-quality PDFs, anomaly & quality flags, and suggested reconciliations a human confirms. This is where ~3–4 engineers + agents replace a small army.
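As a rough illustration of "suggested reconciliations a human confirms", the sketch below scores name similarity with Python's standard-library `difflib` and queues anything above a threshold for review. The company names, the `LEGAL_TOKENS` list and the 0.8 threshold are assumptions; it handles Latin-script names only, and a production matcher would also use registration IDs.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

# Illustrative legal-form words to ignore when comparing names.
LEGAL_TOKENS = {"public", "company", "limited", "co", "ltd", "pcl", "plc"}


@dataclass(frozen=True)
class Suggestion:
    source_name: str
    candidate: str
    score: float
    status: str = "pending_review"  # nothing is accepted without a human


def normalise(name: str) -> str:
    """Lowercase, drop punctuation and legal-form words."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_TOKENS)


def suggest_matches(
    source_name: str, candidates: list[str], threshold: float = 0.8
) -> list[Suggestion]:
    """Return candidate matches above the threshold, best first."""
    target = normalise(source_name)
    suggestions = []
    for candidate in candidates:
        score = SequenceMatcher(None, target, normalise(candidate)).ratio()
        if score >= threshold:
            suggestions.append(Suggestion(source_name, candidate, round(score, 3)))
    return sorted(suggestions, key=lambda s: s.score, reverse=True)


# Made-up names: only the first candidate is close enough to suggest.
queue = suggest_matches(
    "Example Power PCL",
    ["Example Power Public Company Limited", "Sample Foods Co., Ltd."],
)
```

Every suggestion starts as `pending_review`, which keeps the agent in the role of proposing links and leaves the confirmation with a person.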

✗ Where AI can't save you

If a human is typing inconsistent data upstream, AI can flag it but not fix the root cause. The fix is structural: capture at source, not a smarter cleaner. Be clear that BASIN needs a standing owner, not a project that ends.

Krungthai Frontier Lab · BASIN-for-CA, the thin spine first · Confidential, Siametrics Consulting