- Sep 2024
-
github.com
-
I don't think anyone has reliable information about post-2021 language usage by humans. The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies. Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere.
Robyn Speer will no longer update wordfreq, stating that there is no reliable post-2021 language usage data. Wordfreq was using open web sources, but these are getting polluted by #algogens output.
-
The field I know as "natural language processing" is hard to find these days. It's all being devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money. It's rare to see NLP research that doesn't have a dependency on closed data controlled by OpenAI and Google
Robyn Speer's view is that natural language processing as a field has been taken over by #algogens, and that most NLP research now depends on closed data from the #algogens providers.
-
Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.
Reddit was another key data source for wordfreq, but it too no longer provides public archives and now sells them at high prices (to the likes of the #algogens).
-
Twitter is gone anyway, its public APIs have shut down
Twitter was a key resource for wordfreq for colloquial use of words. No longer: the API has shut down, and the population of X is skewed towards hatemongering in a way that makes it lose its utility as a data source.
-
-
www.404media.co
-
Paywalled article.
Wordfreq is shutting down because LLM output on the web is polluting its data to the point of uselessness. It tracked changes in word use over time across a variety of languages. Cf. human centipede epistemology in [[Talk The Expanding Dark Forest and Generative AI]] by [[Maggie Appleton]]
-