inappropriately change or overwrite JSON files compared to Markdown files
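This is a highly insightful piece of engineering experience. Markdown is too "free-form" for an LLM, so the model can easily tamper with it or overwrite it with hallucinated content, whereas JSON is bound by strict syntax and schema constraints. Choosing the right data format is itself an implicit prompt guardrail.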
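A minimal sketch of that guardrail idea, assuming a hypothetical state file with three fields (`task`, `status`, `attempts`). The schema, the field names and the `safe_overwrite` helper are illustrative and do not come from the quoted source. The model's proposed content is parsed and checked before it may replace the existing file, a check that free-form Markdown offers no direct equivalent for.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"task": str, "status": str, "attempts": int}
ALLOWED_STATUS = {"todo", "doing", "done"}


def validate(proposed_text):
    """Return the parsed object if it matches SCHEMA, else raise ValueError."""
    try:
        data = json.loads(proposed_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top level must be an object")
    if set(data) != set(SCHEMA):
        odd = sorted(set(data) ^ set(SCHEMA))
        raise ValueError(f"unexpected or missing keys: {odd}")
    for key, expected in SCHEMA.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(data[key], bool) or not isinstance(data[key], expected):
            raise ValueError(f"{key!r} must be {expected.__name__}")
    if data["status"] not in ALLOWED_STATUS:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUS)}")
    return data


def safe_overwrite(current_text, proposed_text):
    """Keep the current content unless the proposed content validates."""
    try:
        return json.dumps(validate(proposed_text), indent=2)
    except ValueError:
        return current_text
```

With this in place, a proposal such as `"# Notes\nall done!"` or an object with an extra key is rejected and the existing content is kept.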
Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.
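Most people assume that a larger model architecture necessarily brings better performance, but the authors, through data engineering and training-strategy optimization alone and with the 1.2B-parameter architecture unchanged, surpassed existing models with over 200 times as many parameters. This challenges the industry consensus that 'bigger is better' and demonstrates the importance of data quality.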
One example could be putting all files into an Amazon S3 bucket. It’s versatile, cheap and integrates with many technologies. If you are using Redshift for your data warehouse, it has great integration with that too.
Essentially the raw data needs to be vaguely homogenised and put into a single place
It took me a while to grok where dbt comes in the stack but now that I (think) I have it, it makes a lot of sense. I can also see why, with my background, I had trouble doing so. Just as Apache Kafka isn’t easily explained as simply another database, another message queue, etc, dbt isn’t just another Informatica, another Oracle Data Integrator. It’s not about ETL or ELT - it’s about T alone. With that understood, things slot into place. This isn’t just my take on it either - dbt themselves call it out on their blog:
Also - just because their "pricing" page caught me off guard and their website isn't that clear (until you click through to the technical docs) - I thought it's worth calling out that dbt appears to be an open-core platform. They have a SaaS offering and also an open source Python command-line tool - it seems that these articles are about the latter.
Working with the raw data has lots of benefits, since at the point of ingest you don’t know all of the possible uses for the data. If you rationalise that data down to just the set of fields and/or aggregate it up to fit just a specific use case then you lose the fidelity of the data that could be useful elsewhere. This is one of the premises and benefits of a data lake done well.
Absolutely right - there's also a data provenance angle here. It is useful to be able to point to a data point that is 5 or 6 transformations from the raw input and be able to say "yes, I know exactly where this came from, here are all the steps that came before".
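One way to picture that provenance angle is a value that carries its own list of steps. This is only a sketch: the `Traced` class, the step names and the `raw:orders.csv` label are all hypothetical, and real pipelines would normally get lineage from their tooling rather than hand-rolling it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traced:
    """A value plus the ordered names of the steps that produced it."""
    value: object
    lineage: tuple = ()

    def apply(self, step_name, fn):
        # Return a new Traced value; the earlier one is left untouched,
        # so the raw input stays available for other uses.
        return Traced(fn(self.value), self.lineage + (step_name,))


# Hypothetical raw record and two illustrative transformations.
raw = Traced({"amount": "12.50", "currency": "usd"}, ("raw:orders.csv",))
cleaned = raw.apply(
    "parse_amount", lambda r: {**r, "amount": float(r["amount"])}
)
normalised = cleaned.apply(
    "upper_currency", lambda r: {**r, "currency": r["currency"].upper()}
)

# normalised.lineage lists every step back to the raw input, in order.
```

Because each step returns a new object, the raw record keeps its full fidelity while any derived value can still answer "where did this come from?".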
Romeo, N. (n.d.). What Can America Learn from Europe About Regulating Big Tech? The New Yorker. Retrieved August 19, 2020, from https://www.newyorker.com/tech/annals-of-technology/what-can-america-learn-from-europe-about-regulating-big-tech
Unless you need to push the boundaries of what these technologies are capable of, you probably don’t need a highly specialized team of dedicated engineers to build solutions on top of them. If you manage to hire them, they will be bored. If they are bored, they will leave you for Google, Facebook, LinkedIn, Twitter, … – places where their expertise is actually needed. If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously overcomplicated, awful-to-work-with messes they call “solutions”. Messes tend to necessitate specialization.
We noticed that the people who use the data are usually not the same people who produce the data, and they often don’t know where to find the information about the data they try to use. Since the Schematizer already has the knowledge about all the schemas in the Data Pipeline, it becomes an excellent candidate to store information about the data. Meet our knowledge explorer, Watson. The Schematizer requires schema registrars to include documentation along with their schemas. The documentation then is extracted and stored in the Schematizer. To make the schema information and data documentation in the Schematizer accessible to all the teams at Yelp, we created Watson, a webapp that users across the company can use to explore this data. Watson is a visual frontend for the Schematizer and retrieves its information through a set of RESTful APIs exposed by the Schematizer.
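The "documentation is required at registration" rule can be sketched as a toy in-memory registry. This is an assumption-laden illustration, not the Schematizer's real API: the class name, the method names and the Avro-style `doc` keys on the schema and its fields are all hypothetical choices made for the example.

```python
class DocumentedSchemaRegistry:
    """Toy registry that refuses schemas without documentation.

    Hypothetical stand-in for illustration only.
    """

    def __init__(self):
        self._schemas = {}

    def register(self, name, schema):
        # Require a non-empty doc string on the schema itself...
        if not schema.get("doc", "").strip():
            raise ValueError(f"schema {name!r} has no documentation")
        # ...and on every field it declares.
        missing = [
            f["name"]
            for f in schema.get("fields", [])
            if not f.get("doc", "").strip()
        ]
        if missing:
            raise ValueError(f"undocumented fields in {name!r}: {missing}")
        self._schemas[name] = schema

    def describe(self, name):
        """Return the extracted docs, the kind of view a frontend could show."""
        schema = self._schemas[name]
        return {
            "doc": schema["doc"],
            "fields": {f["name"]: f["doc"] for f in schema.get("fields", [])},
        }
```

A frontend in the role the passage gives Watson would then only need to call something like `describe` over HTTP, since the documentation is guaranteed to exist for every registered schema.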