4 Matching Annotations
  1. Last 7 days
    1. An important caveat is that social movements which are poorly documented and which do not receive significant media attention will not be captured at all.

      So the model ultimately reflects whatever stories major media outlets choose to tell, and we already know that coverage can be very selective. It is a reminder that “what’s in the data” is also shaped by “what’s ignored.”

    2. discussions which will be included via the crawling methodology, and finally the texts likely to be contained after the crawled data are filtered.

      I find it interesting that bias is introduced at multiple stages, not just one. First, not everyone has equal access to the internet. Then, the way the data is collected (such as scraping it from Reddit) filters out even more voices. Finally, the "cleaning" process removes still more content. So by the time the data reaches the model, its scope has been narrowed three separate times. This challenges the notion that online data is “neutral” or “representative” of everyone.

    3. Such systems are unsupervised and when deployed, take a text as input, commonly outputting scores or string predictions. Initially proposed by Shannon in 1949 [117], some of the earliest implemented LMs date to the early 1980s and were used as components in systems for automatic speech recognition (ASR),

      This definition helped me understand what an LM really does at its core: it predicts the next token based on what came before (or on the surrounding context). It doesn't sound so magical when you put it like that. I think this relates directly to the "stochastic parrot" idea later in the article: if all the model does is guess the most likely next word, it makes sense that it doesn't actually "understand" anything. It is simply very good at pattern matching.

    4. Just as environmental impact scales with model size, so does the difficulty of understanding what is in the training data. In §4, we discuss how large datasets based on texts from the Internet overrepresent hegemonic viewpoints and encode biases potentially damaging to marginalized populations

      Since most internet content comes from already dominant groups, the model ends up reinforcing those same views. It's a twofold problem: not only is the data biased, but it is also too large for anyone to fully inspect or fix.
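
The "predict the next token from what came before" idea in the third annotation can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: it uses a bigram count model, which is far simpler than the neural LMs the paper discusses, and the corpus and the function names `train_bigram` and `predict_next` are invented here, not taken from the paper.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if `prev` was never seen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
print(predict_next(model, "dog"))  # never seen in the corpus, so None
```

The model can only echo patterns that appear in its training text, and it has nothing to say about words it never saw, which is a small-scale version of the point that what is missing from the data shapes the output as much as what is present.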