9 Matching Annotations
  1. Last 7 days
    1. I don't think anyone has reliable information about post-2021 language usage by humans. The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies. Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere.

      Robyn Speer will no longer update wordfreq. States that there is no reliable post-2021 language usage data! Wordfreq was using open web sources, but those are getting polluted by #algogens output.

    2. As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.

      Example of how #algogens slop pollutes corpus data: ChatGPT uses the word 'delve' a lot, an order of magnitude above human usage. #openvraag Is this related to the 'need' for #algogens to sound more human by varying their word choice (dial the randomness down and they will give the same output every time, but that would immediately stand out as computer generated)?
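
      A quick sketch (my own addition, not from the annotated page), assuming the wordfreq Python package is installed: this is how its frequencies are queried, and what an order-of-magnitude shift like the 'delve' one means on its Zipf scale.

      ```python
      # Minimal sketch of querying wordfreq, the package discussed above.
      # Its data comes from pre-2021 corpora and will no longer be refreshed.
      from wordfreq import word_frequency, zipf_frequency

      print(word_frequency("delve", "en"))  # proportion of English words that are "delve"
      print(zipf_frequency("delve", "en"))  # the same value on the log10 Zipf scale
      # Zipf 3.0 is roughly once per million words; an order-of-magnitude rise
      # in usage corresponds to a +1.0 shift on this scale.
      ```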

  2. Sep 2024
    1. https://web.archive.org/web/20240916043530/https://euansemple.blog/2024/09/13/bending-the-truth/

      Euan Semple describes how he has to fill in an online appointment form for medical care because his doctor asked him to make an appointment, but none of the pre-listed answer options match that case. It reads like #prompting #promptengineering as we do in #algogens: you change the input because of a desired output, but the input itself is just a means and becomes meaningless in the process. Yet in this case that input is kept as 'truth' in a database, impacting #dataquality.

    1. Saul Justin Newman's 2018 paper criticising some papers wrt longevity and aging. Says their results can be reproduced by introducing a few randomly distributed age-misreporting errors; Barbi et al.'s models turn out to be sensitive to that. Barbi et al. posit that their data points to an evolutionary dimension in aging, whereas Newman says the effect is simply caused by faulty data (see the toy sketch below). Won a 2024 Ig Nobel for this topic.

      https://doi.org/10.1371/journal.pbio.3000048
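
      A toy sketch of the mechanism (my own illustration with made-up parameters, not Newman's actual model): because the true extreme-age tail is so thin, even a tiny rate of randomly distributed age overstatements can outnumber genuine cases at the highest ages.

      ```python
      # Toy Monte Carlo illustration; parameters are illustrative, not fitted to real data.
      import numpy as np

      rng = np.random.default_rng(42)
      n = 5_000_000                     # simulated deaths

      # Draw death ages from a Gompertz distribution via inverse transform sampling;
      # b and c are rough, illustrative values for adult human mortality.
      b, c = 3e-5, 0.095
      u = rng.uniform(size=n)
      true_age = np.log(1 - (c / b) * np.log(u)) / c

      # A small, randomly distributed misreporting error: 0.1% of records get
      # 15 years added to the recorded age (clerical errors, identity mix-ups).
      error_rate, overstatement = 0.001, 15
      has_error = rng.uniform(size=n) < error_rate
      recorded_age = true_age + np.where(has_error, overstatement, 0.0)

      for threshold in (100, 105, 110):
          print(f"age >= {threshold}: "
                f"true {int((true_age >= threshold).sum()):7d}  "
                f"recorded {int((recorded_age >= threshold).sum()):7d}")
      ```

      With these made-up parameters the spurious records are a rounding error at 100 but outnumber the genuine ones above 110, which is the shape of Newman's argument.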

    1. analysing the last 72 years of UN data on mortality. The places consistently reaching 100 at the highest rates according to the UN are Thailand, Malawi, Western Sahara (which doesn’t have a government) and Puerto Rico, where birth certificates were cancelled completely as a legal document in 2010 because they were so full of pension fraud. This data is just rotten from the inside out.

      Longevity data from the UN is highly suspect, says the researcher, given where the highest rates of centenarians are reported. The places topping the list are often poor, lacking administrative systems, or even without a government. Says the data is untrustworthy from the start.

    2. The clear way out of this is to involve physicists to develop a measure of human age that doesn’t depend on documents. We can then use that to build metrics that help us measure human ages.

      Relying on documentation for age measurement is highly problematic. Yet it determines a lot in terms of pension rates, insurance and health care cost planning.

      Researcher proposes developing a way to measure human age independent of documentation. (What would that be? Telomeres? X-rays, like those used to determine whether refugees are rightfully claiming to be underage?)

    3. Regions where people most often reach 100-110 years old are the ones where there’s the most pressure to commit pension fraud, and they also have the worst records. For example, the best place to reach 105 in England is Tower Hamlets. It has more 105-year-olds than all of the rich places in England put together. It’s closely followed by downtown Manchester, Liverpool and Hull. Yet these places have the lowest frequency of 90-year-olds and are rated by the UK as the worst places to be an old person

      High registered ages are more likely caused by bad administration and pension-fraud pressures. The places rated worst for growing old, which also list the lowest numbers of 90-year-olds in the UK, have the highest numbers of 100-year-olds.

    4. In Okinawa, the best predictor of where the centenarians are is where the halls of records were bombed by the Americans during the war.

      The largest predictor of having many 100+ year-olds in Okinawa is the records having been destroyed in WWII.

    5. https://web.archive.org/web/20240915125021/https://theconversation.com/the-data-on-extreme-human-ageing-is-rotten-from-the-inside-out-ig-nobel-winner-saul-justin-newman-239023

      Saul Justin Newman won an Ig Nobel for finding that most claims about people living past 105 are wrong or faulty.

      Blue zones wrt human aging are actually bad-data zones, either because actual birth data is missing (war, poor administrative quality) or because pension fraud is rife.

      Jumped out at me as just yesterday I saw snippets of a documentary about increasing personal longevity, which visited Sardinia, one of the blue zones.