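Let me push back on a naive assumption at the implementation level first. OP and reply #1 are both talking high-level compliance and concepts, but nobody has mentioned the ontology engineering pit. It's like talking architecture diagrams and never writing a unit test: sooner or later it blows up in production.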
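The biggest technical debt in traditional food-drug interactions (F-DI) is the impedance mismatch between folksonomy and formal ontology. The Bangkok granny's "mint-chili balm" that reply #2 brought up is a nightmare-grade edge case in a database. What schema do you store it in? Is "mint" Mentha haplocalyx or Mentha spicata? Is the dose unit "a handful", "three leaves", or standardized grams? In a resource-constrained setting like a field hospital you can't run an NLP model for entity extraction, so you need pre-structured data entry. But pre-structured means high friction, and local communities simply won't use it. That's the UX vs. data quality trade-off. My last startup died on exactly this and cost me 300k. Lesson learned: never make end users fill in an ontology form.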
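Now data modeling specifically. Capsicum annuum and Capsicum frutescens are both just "chili" in folk usage, yet their capsaicinoid content can differ by an order of magnitude. Without a canonical identifier (something like a PubChem CID), your queries return garbage. Open PHACTS uses URIs for compound mapping, but traditional foods have no CAS numbers, so you end up building your own taxonomy. I'd suggest just using Wikidata QIDs as foreign keys: at least they're crowd-maintained, which beats reinventing the wheel.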
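To make the "chili" problem concrete, here is a rough sketch of a folk-name index that keeps every candidate taxon instead of guessing one. Everything in it is my assumption for illustration: the `Taxon` and `FOLK_NAME_INDEX` names are made up, and the QID strings are placeholders, not real Wikidata identifiers (look up the real ones before using this).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxon:
    qid: str       # Wikidata QID used as the foreign key (placeholder values below)
    binomial: str  # scientific name, for display only


# Placeholder identifiers, NOT real QIDs: replace after looking them up on Wikidata.
ANNUUM = Taxon("Q_PLACEHOLDER_ANNUUM", "Capsicum annuum")
FRUTESCENS = Taxon("Q_PLACEHOLDER_FRUTESCENS", "Capsicum frutescens")

# One folk name may point at several taxa, so the index stores all candidates.
FOLK_NAME_INDEX: dict[str, tuple[Taxon, ...]] = {
    "辣椒": (ANNUUM, FRUTESCENS),
    "chili": (ANNUUM, FRUTESCENS),
}


def resolve(folk_name: str) -> tuple[Taxon, ...]:
    """Return every candidate taxon for a folk name (empty tuple if unknown)."""
    return FOLK_NAME_INDEX.get(folk_name.strip().lower(), ())


def is_ambiguous(folk_name: str) -> bool:
    """True when the folk name alone cannot pick a single species."""
    return len(resolve(folk_name)) > 1
```

The point of returning a tuple is that the ambiguity gets surfaced to whoever is curating the entry, rather than being silently collapsed into the wrong species at query time.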
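Then there's the hard part: distributed systems. The network at a field hospital isn't "unstable"; partitions are the normal state, so the system has to be partition-tolerant by design. You need an offline-first architecture built on CRDTs (Conflict-free Replicated Data Types). Picture this scenario: a local medic in the jungle updates a "turmeric + black pepper" anti-inflammatory protocol while offline, then syncs to the central node three days later back at base. Meanwhile a field hospital in another theater has updated the same entry based on local observation, claiming "double the dose". How do you merge? LWW (Last Write Wins) silently drops one of the two updates, so you need state-based CRDTs with custom merge functions. The implementation complexity... it's more of a slog than me, a security guard, sitting the CCIE exam.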
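For the merge scenario above, the simplest state-based CRDT I can think of is a grow-only set of observations, where merge is just set union. This is only a sketch under my own assumptions (the `Observation` and `ProtocolEntry` names and fields are hypothetical), and it deliberately does not decide which dose is right: it only guarantees that neither site's update is lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    site: str       # which field hospital or medic recorded it
    counter: int    # per-site sequence number, so repeated notes stay distinct
    dose_note: str  # free-text claim, e.g. "dose doubled"


@dataclass(frozen=True)
class ProtocolEntry:
    """Grow-only set (G-Set): the replica state is the set of all observations seen."""

    observations: frozenset = frozenset()

    def record(self, obs: Observation) -> "ProtocolEntry":
        # Local update: add one observation to this replica's state.
        return ProtocolEntry(self.observations | {obs})

    def merge(self, other: "ProtocolEntry") -> "ProtocolEntry":
        # Set union is commutative, associative, and idempotent, so replicas
        # converge regardless of sync order or repeated syncs.
        return ProtocolEntry(self.observations | other.observations)


jungle = ProtocolEntry().record(Observation("jungle-medic", 1, "baseline dose"))
other = ProtocolEntry().record(Observation("other-theater", 1, "dose doubled"))
merged = jungle.merge(other)  # both claims survive; LWW would keep only one
```

Resolving the conflicting claims is then a separate, human-in-the-loop step on top of the merged state, which is where the "custom merge function" pain really starts.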
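For validation, don't fixate on double-blind clinical trials; that's a peacetime luxury. You can use Bayesian confidence scoring plus a reputation system. Each entry gets a prior probability based on phytochemical similarity (the Tanimoto coefficient between molecular fingerprints). Community upvotes and downvotes then update the posterior probability. But that opens a sybil attack vector: how do you stop a pharma shill farm from vote-bombing an effective folk remedy? You need a web of trust or proof-of-work (not crypto mining; I mean requiring uploaders to provide a geotagged photo of the voucher specimen).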
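A minimal sketch of that scoring idea, with my own modeling assumptions spelled out: fingerprints are represented as sets of on-bit indices, the prior is a Beta distribution whose mean is the Tanimoto similarity, and `prior_strength` (how many pseudo-votes the prior is worth) is a knob I invented. Votes could be reputation-weighted floats as a partial sybil defense.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty fingerprints count as no evidence
    return len(a & b) / len(union)


def posterior_mean(prior_mean: float, prior_strength: float,
                   upvotes: float, downvotes: float) -> float:
    """Beta-Bernoulli update: prior Beta(alpha0, beta0) plus vote counts."""
    alpha = prior_mean * prior_strength + upvotes
    beta = (1.0 - prior_mean) * prior_strength + downvotes
    return alpha / (alpha + beta)


# Toy fingerprints: 2 shared bits out of 4 distinct bits gives similarity 0.5.
prior = tanimoto({1, 2, 3}, {2, 3, 4})
# Prior worth 4 pseudo-votes, then 6 upvotes and 0 downvotes from the community.
score = posterior_mean(prior, prior_strength=4, upvotes=6, downvotes=0)
```

With these toy numbers the prior is 0.5 and the score after voting is 0.8; a larger `prior_strength` makes the chemistry harder for votes to override.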
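Integration cost is also badly underestimated. ChEMBL's API is clean RESTful JSON, but ethnographic field notes are unstructured text or even oral history. You need an ETL pipeline doing NER (Named Entity Recognition), and its accuracy on low-resource languages (Karen, Hmong, and so on)... is probably not much better than asking a security guard to debug a kernel panic. Part of the 300k I lost went straight into the long tail of this kind of ETL: 80% of the engineering effort spent on 20% of the edge cases, and in the end a negative ROI.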
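The most practical MVP would be: use Schema.org's MedicalEntity as a minimal viable ontology; make GPS coordinates plus a voucher photo mandatory (to prevent species misidentification, since botanical accuracy matters more than chemical purity); use IPFS for storage with decentralized pinning (it avoids a single point of failure, which suits broken-supply-chain scenarios); and validate with a tiered system where L1 is expert curation (retired ethnobotanists) and L2 is community consensus. Wait until MAU passes a thousand before thinking about a compliance shield and fancy graph neural networks.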
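Roughly what the intake check for that MVP could look like. Treat it as a sketch: the `Submission` fields, the `Tier` names, and the error strings are all hypothetical, the photo reference is only checked for presence (not as a valid IPFS content identifier), and the JSON-LD output uses nothing beyond `@context`, `@type`, and `name`.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    EXPERT_CURATED = 1       # L1: reviewed by an ethnobotanist
    COMMUNITY_CONSENSUS = 2  # L2: accepted by community voting


@dataclass(frozen=True)
class Submission:
    name: str
    latitude: float
    longitude: float
    voucher_photo_ref: str  # e.g. an IPFS content identifier; format not validated here
    tier: Tier = Tier.COMMUNITY_CONSENSUS


def validation_errors(s: Submission) -> list[str]:
    """Return human-readable problems; an empty list means the entry is accepted."""
    errors = []
    if not s.name.strip():
        errors.append("name is required")
    if not -90.0 <= s.latitude <= 90.0:
        errors.append("latitude out of range")
    if not -180.0 <= s.longitude <= 180.0:
        errors.append("longitude out of range")
    if not s.voucher_photo_ref.strip():
        errors.append("voucher photo is required")
    return errors


def to_jsonld(s: Submission) -> dict:
    """Minimal JSON-LD stub typed as Schema.org MedicalEntity."""
    return {"@context": "https://schema.org", "@type": "MedicalEntity", "name": s.name}
```

Note that this is four required fields, not an ontology form: the hard classification work stays with the L1 curators instead of the person standing in the field with a phone.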
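Honestly though, the biggest killer for this kind of project isn't tech debt, it's incentive alignment. Academics want Nature papers, communities want data sovereignty, frontline medics want Ctrl+F speed. Without a reputation credit system (not money; citation counts and reviewer badges), it's very hard to sustain contributor engagement. It's like me grinding the fan charts for my idol: without Super Topic levels, who would do the data chores every day?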