9 Matching Annotations
  1. Last 7 days
    1. I don't think anyone has reliable information about post-2021 language usage by humans. The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies. Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere.

      Robyn Speer will no longer update wordfreq. States that there is no reliable post-2021 language usage data! Wordfreq was using open web sources, but those are getting polluted by #algogens output.

    2. As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.

      Example of how #algogens slop pollutes corpus data: ChatGPT uses the word 'delve' a lot, an order of magnitude above human usage. #openvraag Is this related to the 'need' for #algogens to sound more human by varying their word choice (dial the randomness down and they will give the same output every time, but that would immediately stand out as computer generated)?
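
      A quick sketch (my own addition, not from the annotated page), assuming the wordfreq Python package is installed: this is how its frequencies are queried, and what an order-of-magnitude shift like the 'delve' one means on its Zipf scale.

      ```python
      # Minimal sketch of querying wordfreq, the package discussed above.
      # Its data comes from pre-2021 corpora and will no longer be refreshed.
      from wordfreq import word_frequency, zipf_frequency

      print(word_frequency("delve", "en"))  # proportion of English words that are "delve"
      print(zipf_frequency("delve", "en"))  # the same value on the log10 Zipf scale
      # Zipf 3.0 is roughly once per million words; an order-of-magnitude rise
      # in usage corresponds to a +1.0 shift on this scale.
      ```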

  2. Sep 2024
    1. https://web.archive.org/web/20240916043530/https://euansemple.blog/2024/09/13/bending-the-truth/

      Euan Semple describes how he has to fill in an online appointment form for medical care because his doctor asked him to make an appointment, but none of the pre-listed answer options match that case. It reads like #prompting #promptengineering as we do in #algogens: you change the input because of a desired output, but the input itself is just a means and becomes meaningless in the process. Yet in this case that input is kept as 'truth' in a database, impacting #dataquality.

    1. Saul Justin Newman's 2018 paper criticising some papers wrt longevity and aging. Says their results can be reproduced by introducing a few randomly distributed age-misreporting errors; Barbi et al.'s models turn out to be sensitive to that. Barbi et al. posit that their data points to an evolutionary dimension in aging, whereas Newman says the effect is simply caused by faulty data (see the toy sketch below). Won a 2024 Ig Nobel for this topic.

      https://doi.org/10.1371/journal.pbio.3000048
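
      A toy sketch of the mechanism (my own illustration with made-up parameters, not Newman's actual model): because the true extreme-age tail is so thin, even a tiny rate of randomly distributed age overstatements can outnumber genuine cases at the highest ages.

      ```python
      # Toy Monte Carlo illustration; parameters are illustrative, not fitted to real data.
      import numpy as np

      rng = np.random.default_rng(42)
      n = 5_000_000                     # simulated deaths

      # Draw death ages from a Gompertz distribution via inverse transform sampling;
      # b and c are rough, illustrative values for adult human mortality.
      b, c = 3e-5, 0.095
      u = rng.uniform(size=n)
      true_age = np.log(1 - (c / b) * np.log(u)) / c

      # A small, randomly distributed misreporting error: 0.1% of records get
      # 15 years added to the recorded age (clerical errors, identity mix-ups).
      error_rate, overstatement = 0.001, 15
      has_error = rng.uniform(size=n) < error_rate
      recorded_age = true_age + np.where(has_error, overstatement, 0.0)

      for threshold in (100, 105, 110):
          print(f"age >= {threshold}: "
                f"true {int((true_age >= threshold).sum()):7d}  "
                f"recorded {int((recorded_age >= threshold).sum()):7d}")
      ```

      With these made-up parameters the spurious records are a rounding error at 100 but outnumber the genuine ones above 110, which is the shape of Newman's argument.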

    1. analysing the last 72 years of UN data on mortality. The places consistently reaching 100 at the highest rates according to the UN are Thailand, Malawi, Western Sahara (which doesn’t have a government) and Puerto Rico, where birth certificates were cancelled completely as a legal document in 2010 because they were so full of pension fraud. This data is just rotten from the inside out.

      Longevity data from the UN is highly suspect, says the researcher, given where the highest rates of centenarians are reported. The places topping the list are often poor, lacking administrative systems, or even without a government. Says the data is untrustworthy from the start.

    2. The clear way out of this is to involve physicists to develop a measure of human age that doesn’t depend on documents. We can then use that to build metrics that help us measure human ages.

      Relying on documentation for age measurement is highly problematic. Yet it determines a lot in terms of pension rates, insurance and health care cost planning.

      Researcher proposes developing a way to measure human age independent of documentation. (What would that be? Telomeres? X-rays, like those used to determine whether refugees are rightfully claiming to be underage?)

    3. Regions where people most often reach 100-110 years old are the ones where there’s the most pressure to commit pension fraud, and they also have the worst records. For example, the best place to reach 105 in England is Tower Hamlets. It has more 105-year-olds than all of the rich places in England put together. It’s closely followed by downtown Manchester, Liverpool and Hull. Yet these places have the lowest frequency of 90-year-olds and are rated by the UK as the worst places to be an old person

      High registered ages are more likely caused by bad administration and pension-fraud pressures. The places rated worst for growing old, which also list the lowest numbers of 90-year-olds in the UK, have the highest numbers of 100-year-olds.

    4. In Okinawa, the best predictor of where the centenarians are is where the halls of records were bombed by the Americans during the war.

      The largest predictor of having many 100+ year-olds in Okinawa is the records having been destroyed in WWII.

    5. https://web.archive.org/web/20240915125021/https://theconversation.com/the-data-on-extreme-human-ageing-is-rotten-from-the-inside-out-ig-nobel-winner-saul-justin-newman-239023

      Saul Justin Newman won an Ig Nobel for finding that most claims about people living past 105 are wrong or faulty.

      Blue zones wrt human aging are actually bad-data zones, either because actual birth data is missing (war, poor administrative quality) or because pension fraud is rife.

      Jumped out at me as just yesterday I saw snippets of a documentary about increasing personal longevity, which visited Sardinia, one of the blue zones.