2 Matching Annotations
  1. Last 7 days
    1. I don't think anyone has reliable information about post-2021 language usage by humans. The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies. Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere.

Robyn Speer will no longer update wordfreq. She states that there is no reliable post-2021 language usage data! wordfreq was using open web sources, but these are getting polluted by #algogens output.

    2. As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.

Example of how #algogens slop pollutes corpus data: ChatGPT uses the word 'delve' far more than people ever have, raising its overall frequency by an order of magnitude. #openvraag Is this related to the 'need' for #algogens to vary their wording to sound more human? (Dial down the randomness and they will give the same output every time, but that would stand out immediately as computer-generated too.)
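The skew described above is easy to demonstrate on a toy corpus. A minimal sketch (the corpora and the word counts are invented for illustration, not taken from wordfreq's actual data): mixing model-generated text that overuses a word like 'delve' into a human-written corpus inflates that word's relative frequency dramatically.

```python
from collections import Counter

def word_freqs(corpus: str) -> dict[str, float]:
    """Relative frequency of each token in a whitespace-tokenized corpus."""
    tokens = corpus.lower().split()
    total = len(tokens)
    return {word: n / total for word, n in Counter(tokens).items()}

# Hypothetical human-written sample: 'delve' appears rarely.
human = "we explore the data and explore the results " * 99 + "we delve into it "
# Hypothetical model-generated slop: 'delve' is heavily overrepresented.
slop = "let us delve into the topic and delve deeper " * 100

before = word_freqs(human)
after = word_freqs(human + slop)

print(f"'delve' before pollution: {before['delve']:.4f}")
print(f"'delve' after pollution:  {after['delve']:.4f}")
```

In this contrived mix the relative frequency of 'delve' jumps by roughly two orders of magnitude, and nothing in the frequency table itself distinguishes the slop tokens from human ones, which is exactly why the pollution is hard to filter out.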