33 Matching Annotations
  1. Jul 2024
    1. Whoosh provides methods for computing the “key terms” of a set of documents. For these methods, “key terms” basically means terms that are frequent in the given documents, but relatively infrequent in the indexed collection as a whole.

      Very interesting method and way of looking at the signal: what makes a document exceptional is that something is common within it yet uncommon in the collection at large.
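
      A minimal sketch of that scoring idea in plain Python (my own illustration, not Whoosh's actual implementation, which uses more sophisticated weighting models under the hood): rank terms by how much more frequent they are in the chosen documents than in the collection as a whole.

      ```python
      from collections import Counter

      def key_terms(selected_docs, all_docs, numterms=5):
          """Rank terms frequent in selected_docs but rare in the collection."""
          sel = Counter(t for doc in selected_docs for t in doc)
          coll = Counter(t for doc in all_docs for t in doc)
          sel_total, coll_total = sum(sel.values()), sum(coll.values())
          # Score = local relative frequency / collection relative frequency.
          scores = {t: (n / sel_total) / (coll[t] / coll_total)
                    for t, n in sel.items()}
          return sorted(scores, key=scores.get, reverse=True)[:numterms]

      docs = [["search", "index", "fish"], ["search", "query"],
              ["cat", "dog", "cat"], ["dog", "fish"]]
      print(key_terms(docs[2:], docs, numterms=2))  # ['cat', 'dog']
      ```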

  2. Feb 2023
    1. An AI model that can learn and work with this kind of problem needs to handle order in a very flexible way. The old models—LSTMs and RNNs—had word order implicitly built into the models. Processing an input sequence of words meant feeding them into the model in order. A model knew what word went first because that’s the word it saw first. Transformers instead handled sequence order numerically, with every word assigned a number. This is called "positional encoding." So to the model, the sentence “I love AI; I wish AI loved me” looks something like (I 1) (love 2) (AI 3) (; 4) (I 5) (wish 6) (AI 7) (loved 8) (me 9).

      Google’s “the transformer”

      One breakthrough was positional encoding, versus having to process the input in the order it was given; a second was operating on matrices rather than individual vectors. The research grew out of Google's machine translation work.
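
      A minimal sketch of the actual mechanism (the integer pairs above are a simplification; in the original transformer paper each position gets a whole vector of sinusoids, added to the word's embedding):

      ```python
      import numpy as np

      def positional_encoding(seq_len, d_model):
          """Sinusoidal encoding from 'Attention Is All You Need' (d_model even):
          PE[pos, 2i] = sin(pos / 10000**(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
          pos = np.arange(seq_len)[:, None]          # shape (seq_len, 1)
          i = np.arange(d_model // 2)[None, :]       # shape (1, d_model/2)
          angles = pos / 10000 ** (2 * i / d_model)  # shape (seq_len, d_model/2)
          pe = np.empty((seq_len, d_model))
          pe[:, 0::2] = np.sin(angles)
          pe[:, 1::2] = np.cos(angles)
          return pe

      # Nine tokens of "I love AI; I wish AI loved me": each position gets a
      # distinct vector that is added to the corresponding word embedding.
      print(positional_encoding(9, 8).shape)  # (9, 8)
      ```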

  3. Jan 2023
    1. a common technique in natural language processing is to operationalize certain semantic concepts (e.g., "synonym") in terms of syntactic structure (two words that tend to occur nearby in a sentence are more likely to be synonyms, etc). This is what word2vec does.

      Can I use some of these methods in corpus linguistics over time to better identify calcified words or archaic phrases that stick with the language but are heavily limited to narrower (and narrowing) contexts?
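
      One plausible way to attack that question, sketched with gensim's Word2Vec (the corpora and the time-slicing strategy here are invented for illustration, not an established pipeline): train one model per time slice and watch how a word's nearest neighbours shift or narrow.

      ```python
      from gensim.models import Word2Vec

      # Toy tokenized corpora standing in for two diachronic slices of a corpus.
      slice_1900 = [["wireless", "telegraph", "operator"],
                    ["wireless", "signal", "station"]] * 50
      slice_2000 = [["wireless", "router", "network"],
                    ["wireless", "internet", "connection"]] * 50

      def neighbours(corpus, word, topn=3):
          model = Word2Vec(corpus, vector_size=50, window=2, min_count=1,
                           seed=1, workers=1)
          return [w for w, _ in model.wv.most_similar(word, topn=topn)]

      # A word whose neighbourhood narrows across slices is a candidate for the
      # calcified, context-narrowing vocabulary the note asks about.
      print(neighbours(slice_1900, "wireless"))
      print(neighbours(slice_2000, "wireless"))
      ```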

    2. Friedberg Judeo-Arabic Project, accessible at http://fjms.genizah.org. This project maintains a digital corpus of Judeo-Arabic texts that can be searched and analyzed.

      The Friedberg Judeo-Arabic Project contains a large corpus of Judeo-Arabic text which can be searched manually to help improve translations of texts, but it might also be profitably mined using information-theoretic and corpus-linguistic methods (one small sketch below) to provide translation candidates and suggestions at a much grander scale.
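
      As one hedged example of what such mining could mean (my sketch, not anything the project itself provides): pointwise mutual information over adjacent word pairs surfaces collocations that a translator might want to treat as fixed phrases.

      ```python
      import math
      from collections import Counter

      def pmi_bigrams(tokens, min_count=2):
          """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))) over adjacent word pairs."""
          unigrams = Counter(tokens)
          bigrams = Counter(zip(tokens, tokens[1:]))
          n_uni, n_bi = len(tokens), len(tokens) - 1
          return sorted(
              ((pair, math.log2((c / n_bi) /
                                ((unigrams[pair[0]] / n_uni) *
                                 (unigrams[pair[1]] / n_uni))))
               for pair, c in bigrams.items() if c >= min_count),
              key=lambda kv: kv[1], reverse=True)

      tokens = "in the name of god the merciful in the name of god".split()
      print(pmi_bigrams(tokens))  # pairs sorted by PMI, strongest first
      ```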

  4. Dec 2022
    1. Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21). Association for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922

  5. Nov 2022
    1. Robert Amsler is a retired computational lexicologist, computational linguist, and information scientist. His Ph.D. was from UT-Austin in 1980. His primary work was in understanding how machine-readable dictionaries could be used to create a taxonomy of dictionary word senses (which served as motivation for the creation of WordNet) and how lexical information can be extracted from text corpora. He also invented a new technique in citation analysis that bears his name. His work is mentioned in the Wikipedia articles on machine-readable dictionaries, computational lexicology, bibliographic coupling, and text mining. He currently lives in Vienna, VA and reads email at robert.amsler at utexas.edu. He is currently interested in chronological studies of vocabulary, especially computer terms.

      https://www.researchgate.net/profile/Robert-Amsler

      Apparently he follows my blog. :)

      Makes me wonder how we might better process and semantically parse people's personal notes, particularly when they're atomic and cross-linked?

  6. Oct 2022
    1. https://www.explainpaper.com/

      Another in a growing line of research tools for processing and making sense of research literature including Research Rabbit, Connected Papers, Semantic Scholar, etc.

      Functionality includes the ability to highlight sections of research papers and have natural language processing explain what those sections mean. There's also a "chat" that lets you ask questions about the paper and attempts to return reasonable answers: an artificial-intelligence means of having a "conversation with the text".

      cc: @dwhly @remikalir @jeremydean

  7. Dec 2021
    1. Catala, a programming language developed by Protzenko's graduate student Denis Merigoux, who is working at the National Institute for Research in Digital Science and Technology (INRIA) in Paris, France. It is not often lawyers and programmers find themselves working together, but Catala was designed to capture and execute legal algorithms and to be understood by lawyers and programmers alike in a language "that lets you follow the very specific legal train of thought," Protzenko says.

      A domain-specific language for encoding legal interpretations.
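
      Not Catala syntax, but a hedged Python sketch of what "capturing a legal algorithm" means in practice (the rule and all figures are invented for illustration): the statute's conditions and exceptions become explicit, executable branches.

      ```python
      from dataclasses import dataclass

      @dataclass
      class Household:
          income: int      # annual income in dollars (invented figures)
          dependents: int

      def benefit(h: Household) -> int:
          """An invented benefit rule, written the way a statute reads:
          a base amount, an increase per dependent, phased out above a threshold."""
          base, per_dependent, phase_out_start = 1000, 300, 40_000
          amount = base + per_dependent * h.dependents
          if h.income > phase_out_start:
              # Exception clause: reduce by 5% of income over the threshold.
              amount = max(0, amount - (h.income - phase_out_start) // 20)
          return amount

      print(benefit(Household(income=35_000, dependents=2)))  # 1600
      print(benefit(Household(income=50_000, dependents=2)))  # 1100
      ```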

  8. Mar 2021
    1. Originally he had used the terms usage scenarios and usage case – the latter a direct translation of his Swedish term användningsfall – but found that neither of these terms sounded natural in English, and eventually he settled on use case.
  9. Apr 2020
    1. Just as with wine-tasting, having a bigger vocabulary for colours allows specific colours to be perceived more readily and remembered more easily, even if not done consciously.
  10. Mar 2020
    1. This will of course depend on your perspective, but: beware Finnish and other highly inflected languages. As a grammar nerd, I actually love this stuff. But judging by my colleagues, you won’t.