Krungthai Frontier Lab · BASIN Data Strategy

Stop being the pipes. Ingest once, let agents consume.

Today every app builds its own pull from the DWH, datamarts, SAS files and spreadsheets; the logic lives in one person's head and humans do the keying. BASIN flips it: each source is ingested once into a bronze→silver→gold spine, and agents consume from one place. Here's what we ingest first, why, and the part nobody likes to say out loud: reconciliation and upkeep are the real cost.
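To make the bronze→silver→gold spine concrete, here is a minimal sketch of the three layers. It is an illustration under assumptions, not BASIN's actual design: the record types, the field names (`symbol`, `fy`, `revenue`), the `SET_API` label, the symbol `ABC` and all figures are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BronzeRecord:
    """Bronze: the payload exactly as fetched, plus provenance."""
    source: str        # hypothetical source label, e.g. "SET_API"
    fetched_on: date
    payload: dict      # raw and untouched


@dataclass(frozen=True)
class SilverRecord:
    """Silver: typed, cleaned, one row per company and fiscal year."""
    symbol: str
    fiscal_year: int
    revenue_thb: float
    source: str


def to_silver(rec: BronzeRecord) -> SilverRecord:
    # Normalise types and formatting; keep the source for lineage.
    p = rec.payload
    return SilverRecord(
        symbol=str(p["symbol"]).strip().upper(),
        fiscal_year=int(p["fy"]),
        revenue_thb=float(str(p["revenue"]).replace(",", "")),
        source=rec.source,
    )


def to_gold(rows: list) -> dict:
    """Gold: one revenue series per company, ready for agents to read."""
    gold: dict = {}
    for r in rows:
        gold.setdefault(r.symbol, {})[r.fiscal_year] = r.revenue_thb
    return gold


# Made-up sample data for a hypothetical symbol "ABC".
bronze = [
    BronzeRecord("SET_API", date(2024, 3, 1),
                 {"symbol": " abc ", "fy": "2023", "revenue": "1,000.5"}),
    BronzeRecord("SET_API", date(2024, 3, 1),
                 {"symbol": "ABC", "fy": "2022", "revenue": "900"}),
]
gold = to_gold([to_silver(b) for b in bronze])
```

The point of the split is that bronze is never edited, so any silver or gold table can be rebuilt when cleaning rules change.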


Machine source · API / structured (ingest first)
Filings / documents · OCR-VLM (ingest first)
Human-touched · reconciliation-heavy (later)

What we ingest first, and why

Machine sources, where AI gives the biggest clean win and there's no human bottleneck. These cover the listed corporates the CA report targets.

API

SET / SETSMART → PSIMS

Listed-company financials & profiles, straight from SET via API. Authoritative, structured, daily. The backbone: start here.

OCR-VLM

One Report / 56-1 & MD&A

Regulatory filings: semi-structured PDFs, machine-fetchable. They carry the qualitative story and the audited numbers the memo needs.

OCR-VLM

Audited financial statements

From filings or client uploads. Deterministic extraction → the financial spread. The single highest-value gold product.
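One reason a deterministic spread is valuable is that it can be checked mechanically before it reaches gold. The sketch below shows one such check, the balance-sheet identity (assets = liabilities + equity). The field names and the tolerance are assumptions made for illustration, not a defined BASIN schema.

```python
def check_spread(spread: dict, tolerance: float = 0.5) -> list:
    """Return a list of issues found in an extracted spread (empty if none)."""
    required = ("total_assets", "total_liabilities", "total_equity")
    missing = [k for k in required if k not in spread]
    if missing:
        return [f"missing field: {k}" for k in missing]

    issues = []
    # Accounting identity: assets should equal liabilities plus equity.
    gap = spread["total_assets"] - (
        spread["total_liabilities"] + spread["total_equity"]
    )
    if abs(gap) > tolerance:
        issues.append(f"balance sheet does not balance (gap {gap:.2f})")
    return issues


# Made-up figures: the first spread balances, the second is off by 1.
balanced = check_spread(
    {"total_assets": 100.0, "total_liabilities": 60.0, "total_equity": 40.0}
)
unbalanced = check_spread(
    {"total_assets": 100.0, "total_liabilities": 60.0, "total_equity": 41.0}
)
```

A spread that fails the check would be routed to a human reviewer rather than silently published.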

FEED

News & market signals

SET news feed + public sources, as streams. Powers EWS / red-flag features and context. Cheap to ingest, high signal.

API

DBD registration data

For the non-listed names. Bulk/API, structured. Needed the moment the book goes beyond SET-listed corporates.

What waits: the human-touched

Valuable, but the source is a person or a gated internal system. AI can assist, but it cannot fix the upstream. Sequence these after the spine is proven.

GATED

Internal exposure & relationship data

Lives in KTB source systems / the DWH. Access-gated and inconsistently defined; needs governance + entity reconciliation before it's trustworthy.

HUMAN

Auditor notes & qualitative judgement

Semi-structured, often tacit. AI can summarise and tag, but the value depends on a human having written it well.

HUMAN

Past CA memos & RM know-how

The knowledge moat, but authored by people, unevenly. Best captured at source: the CA app itself becomes the structured-capture point.

The sequencing rule: ingest machines first, humans last, and where the source is a human, don't try to clean it downstream. Capture structured data at the point of work (the CA app, ViaLink at the BSC) so the human's normal workflow produces clean data as a by-product.
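Capture at the point of work can be sketched as validation at entry time: the form refuses inconsistent input instead of leaving it for a downstream cleaner. The example below is a hypothetical illustration; the field names, the allowed currency list and the `capture` function are assumptions, not the CA app's real interface.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_CURRENCIES = {"THB", "USD"}  # hypothetical list, for illustration only


@dataclass(frozen=True)
class CapturedFigure:
    entity_id: str
    fiscal_year_end: date
    currency: str
    revenue: float


def capture(entity_id: str, fiscal_year_end: str,
            currency: str, revenue: str) -> CapturedFigure:
    """Validate raw form input; raise ValueError listing every problem."""
    errors = []
    if not entity_id.strip():
        errors.append("entity_id is required")
    if currency not in ALLOWED_CURRENCIES:
        errors.append(f"currency must be one of {sorted(ALLOWED_CURRENCIES)}")
    try:
        fye = date.fromisoformat(fiscal_year_end)
    except ValueError:
        errors.append("fiscal_year_end must be an ISO date (YYYY-MM-DD)")
    try:
        amount = float(revenue)
    except ValueError:
        errors.append("revenue must be a number")
    if errors:
        raise ValueError("; ".join(errors))
    return CapturedFigure(entity_id.strip(), fye, currency, amount)


# Clean input is stored as typed data (made-up values).
ok = capture("ENT-001", "2023-12-31", "THB", "1250.0")

# Inconsistent input is rejected at the point of work.
try:
    capture("ENT-001", "31/12/2023", "Baht", "1,250")
    rejected = ""
except ValueError as exc:
    rejected = str(exc)
```

Because the rejection happens while the person is still in the form, the correction costs seconds rather than a reconciliation ticket later.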

The part nobody says: ingestion is the easy 20%. Reconciliation & upkeep is the 80%.

The same company is “CKP” in SET, a different legal entity in DBD, and three exposures internally, with different fiscal years, restatements and currencies. Keeping BASIN trustworthy is an ongoing data-engineering job, not a one-off build: schemas drift, the SET API changes, filings reformat, new corporates appear, quality decays.
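The "CKP" example can be sketched as a crosswalk: one canonical entity key that every source identifier resolves to. This is a minimal illustration under assumptions. The `ENT-001` key, the placeholder DBD number and the `EXP-…` exposure IDs are all hypothetical, and a production crosswalk would also need effective dates and an audit trail.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    entity_id: str                      # canonical key (hypothetical scheme)
    set_symbol: Optional[str] = None
    dbd_reg_no: Optional[str] = None
    internal_ids: list = field(default_factory=list)


class Crosswalk:
    """Maps (system, source_id) pairs to one canonical entity."""

    def __init__(self) -> None:
        self._index: dict = {}

    def register(self, entity: Entity) -> None:
        keys = []
        if entity.set_symbol:
            keys.append(("SET", entity.set_symbol))
        if entity.dbd_reg_no:
            keys.append(("DBD", entity.dbd_reg_no))
        keys += [("INTERNAL", i) for i in entity.internal_ids]

        # Check for conflicts first so a failed register changes nothing.
        for key in keys:
            existing = self._index.get(key)
            if existing is not None and existing.entity_id != entity.entity_id:
                raise ValueError(f"{key} already mapped to {existing.entity_id}")
        for key in keys:
            self._index[key] = entity

    def resolve(self, system: str, source_id: str) -> Optional[str]:
        entity = self._index.get((system, source_id))
        return entity.entity_id if entity else None


xw = Crosswalk()
xw.register(Entity(
    entity_id="ENT-001",
    set_symbol="CKP",
    dbd_reg_no="DBD-PLACEHOLDER",       # placeholder, not a real number
    internal_ids=["EXP-1", "EXP-2", "EXP-3"],
))
resolved = {xw.resolve("INTERNAL", i) for i in ("EXP-1", "EXP-2", "EXP-3")}
```

The conflict check is the upkeep part: when a new filing or restatement tries to attach an existing identifier to a different entity, the crosswalk raises instead of quietly overwriting.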

✓ Where AI genuinely helps

Entity / group resolution (fuzzy-matching names & IDs), schema mapping, extraction from poor-quality PDFs, anomaly & quality flags, and suggested reconciliations a human confirms. This is where ~3–4 engineers + agents replace a small army.
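As a rough sketch of "suggested reconciliations a human confirms", the snippet below normalises company names and scores candidates with Python's standard-library `difflib.SequenceMatcher`. The company names, the legal-word list and the 0.85 threshold are illustrative assumptions; a real matcher would also use registration IDs and would need tuning on actual data.

```python
import re
from difflib import SequenceMatcher

# Legal-form words dropped before comparison (illustrative, not exhaustive).
LEGAL_WORDS = {"public", "company", "limited", "co", "ltd", "pcl", "plc"}


def normalise(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal-form words."""
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_WORDS)


def suggest_matches(name: str, candidates: list, threshold: float = 0.85) -> list:
    """Return (candidate, score) pairs at or above threshold, best first.

    These are suggestions only; a human confirms before anything is merged.
    """
    target = normalise(name)
    scored = [
        (c, SequenceMatcher(None, target, normalise(c)).ratio())
        for c in candidates
    ]
    return sorted(
        [(c, s) for c, s in scored if s >= threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical company names, for illustration only.
suggestions = suggest_matches(
    "Siam Example Power PCL",
    [
        "Siam Example Power Public Company Limited",
        "Northern Sample Foods Co., Ltd.",
    ],
)
```

Keeping the output as scored suggestions, rather than automatic merges, matches the division of labour described above: the agent proposes, the human confirms.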

✗ Where AI can't save you

If a human is typing inconsistent data upstream, AI can flag it but not fix the root. The fix is structural: capture at source, not a smarter cleaner. Be clear that BASIN needs a standing owner, not a project that ends.

Krungthai Frontier Lab · BASIN-for-CA, the thin spine first
Confidential, Siametrics Consulting