88 Matching Annotations
  1. Mar 2026
    1. We found that using MINE directly gave identical performance when the task was nontrivial, but became very unstable if the target was easy to predict from the context (e.g., when predicting a single step in the future and the target overlaps with the context).

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    2. We note that better [49, 27] results have been published on these target datasets, by transfer learning from a different source task.

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    3. We also found that not all the information encoded is linearly accessible. When we used a single hidden layer instead, the accuracy increased from 64.6 to 72.5, which is closer to the accuracy of the fully supervised model.

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    4. For lasertag_three_opponents_small, contrastive loss neither helps nor hurts. We suspect that this is due to the task design, which does not require memory and thus yields a purely reactive policy.

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    5. Although this is a standard transfer learning benchmark, we found that models that learn better relationships in the children's books did not necessarily perform better on the target tasks (which are very different: movie reviews, etc.).

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    6. We found that more advanced sentence encoders did not significantly improve the results, which may be due to the simplicity of the transfer tasks (e.g., in MPQA most datapoints consist of one or a few words), and the fact that bag-of-words models usually perform well on many NLP tasks [48].

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    7. It is important to note that the window size (maximum context size for the GRU) has a big impact on the performance, and longer segments would give better results. Our model had a maximum of 20480 timesteps to process, which is slightly longer than a second.

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    8. Interestingly, CPCs capture both speaker identity and speech contents, as demonstrated by the good accuracies attained with a simple linear classifier, which also gets close to the oracle, fully supervised networks.

      please point only to the details of the most successful version of this system, especially in tables when there are many options, and also highlight sections that provide supporting context for these conditions, if appropriate

    9. Figure 6 shows that for 4 out of the 5 games performance of the agent improves significantly with the contrastive loss after training on 1 billion frames.

      please point only to the details of the most successful version of this system, especially in tables when there are many options, and also highlight sections that provide supporting context for these conditions, if appropriate

    10. Despite being relatively domain agnostic, CPCs improve upon state-of-the-art by 9% absolute in top-1 accuracy, and 4% absolute in top-5 accuracy.

      please point only to the details of the most successful version of this system, especially in tables when there are many options, and also highlight sections that provide supporting context for these conditions, if appropriate

    11. We also found that not all the information encoded is linearly accessible. When we used a single hidden layer instead, the accuracy increased from 64.6 to 72.5, which is closer to the accuracy of the fully supervised model.

      please point only to the details of the most successful version of this system, especially in tables when there are many options, and also highlight sections that provide supporting context for these conditions, if appropriate
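
The highlights above refer to a "contrastive loss" without spelling it out. As a rough aid, here is a minimal sketch of an InfoNCE-style loss of the kind used in contrastive predictive coding: one positive target is scored against a set of negatives, and the loss is the negative log-softmax of the positive score. The function names `info_nce` and `bilinear_score`, and the toy vectors in the usage lines, are assumptions made for illustration and are not taken from the highlighted passages.

```python
import math


def bilinear_score(z, W, c):
    """Score a target encoding z against a context c as z^T W c.

    z and c are plain lists of floats; W is a list of rows (hypothetical shapes).
    """
    Wc = [sum(w_ij * c_j for w_ij, c_j in zip(row, c)) for row in W]
    return sum(z_i * v for z_i, v in zip(z, Wc))


def info_nce(pos_score, neg_scores):
    """Negative log-softmax of the positive score among all candidate scores."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score


# Usage with toy numbers: identity W, one positive and two negatives.
W = [[1.0, 0.0], [0.0, 1.0]]
context = [2.0, 3.0]
positive = bilinear_score([1.0, 1.0], W, context)
negatives = [bilinear_score(z, W, context) for z in ([0.0, 0.0], [-1.0, 0.0])]
loss = info_nce(positive, negatives)
```

With N candidates the loss equals log N when all scores are tied and approaches 0 as the positive score dominates, which may help when reading the caveat about targets that are easy to predict from the context.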

    1. Provide your best guess for the following question, and describe how likely it is that your guess is correct as one of the following expressions: ${EXPRESSION_LIST}. Give ONLY the guess and your confidence, no other words or explanation. For example:\n\nGuess: <most likely guess, as short as possible; not a complete sentence, just the guess!>\nConfidence: <description of confidence, without any extra commentary whatsoever; just a short phrase!>\n\nThe question is: ${THE_QUESTION}

      please find the barebones practical information i need to implement this system or strategy

    2. Provide your ${k} best guesses and the probability that each is correct (0.0 to 1.0) for the following question. Give ONLY the guesses and probabilities, no other words or explanation. For example:\n\nG1: <first most likely guess, as short as possible; not a complete sentence, just the guess!>\n\nP1: <the probability between 0.0 and 1.0 that G1 is correct, without any extra commentary whatsoever; just the probability!>

      please find the barebones practical information i need to implement this system or strategy

    3. Each linguistic likelihood expression is mapped to a probability using responses from a human survey on social media with 123 respondents (Fagen-Ulmschneider, 2023). Ling. 1S-opt. uses a held out set of calibration questions and answers to compute the average accuracy for each likelihood expression, using these 'optimized' values instead.

      please find the barebones practical information i need to implement this system or strategy

    4. Finally, our study is limited to short-form question-answering; future work should extend this analysis to longer-form generation settings.

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    5. While our work demonstrates a promising new approach to generating calibrated confidences through verbalization, there are limitations that could be addressed in future work. First, our experiments are focused on factual recall-oriented problems, and the extent to which our observations would hold for reasoning-heavy settings is an interesting open question.

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    6. the 1-stage and 2-stage verbalized numerical confidence prompts sometimes differ drastically in the calibration of their confidences. How can we reduce sensitivity of a model's calibration to the prompt?

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    7. Provide your best guess and the probability that it is correct (0.0 to 1.0) for the following question. Give ONLY the guess and probability, no other words or explanation. For example:\n\nGuess: <most likely guess, as short as possible; not a complete sentence, just the guess!>\n Probability: <the probability between 0.0 and 1.0 that your guess is correct, without any extra commentary whatsoever; just the probability!>\n\nThe question is: ${THE_QUESTION}

      please find the barebones practical information i need to implement this system or strategy

    8. Provide your best guess for the following question, and describe how likely it is that your guess is correct as one of the following expressions: ${EXPRESSION_LIST}. Give ONLY the guess and your confidence, no other words or explanation.

      please find the barebones practical information i need to implement this system or strategy

    9. To fit the temperature that is used to compute ECE-t and BS-t we split our total data into 5 folds. For each fold, we use it once to fit a temperature and evaluate metrics on the remaining folds. We find that fitting the temperature on 20% of the data yields relatively stable temperatures across folds.

      please find the barebones practical information i need to implement this system or strategy

    10. Additionally, the lack of technical details available for many state-of-the-art closed RLHF-LMs may limit our ability to understand what factors enable a model to verbalize well-calibrated confidences and differences in this ability across different models.

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    11. With Llama2-70B-Chat, verbalized calibration provides improvement over conditional probabilities across some metrics, but the improvement is much less consistent compared to GPT-* and Claude-*.

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    12. The verbal calibration of the open source model Llama-2-70b-chat is generally weaker than that of closed source models but still demonstrates improvement over its conditional probabilities by some metrics, and does so most clearly on TruthfulQA.

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper

    13. Among the methods for verbalizing probabilities directly, we observe that generating and evaluating multiple hypotheses improves calibration (see Figure 1), similarly to humans (Lord et al., 1985), and corroborating a similar finding in LMs (Kadavath et al., 2022).

      please point only to the details of the most successful version of this system, especially in tables when there are many options, and also highlight sections that provide supporting context for these conditions, if appropriate
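
The implementation-oriented highlights above (the "Guess / Probability" prompt and the 5-fold temperature fit) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's code: the regex assumes replies follow the quoted "Guess: ... Probability: ..." format, the rescaling in logit space and the grid search over temperatures are assumptions (the highlighted text only says a temperature is fit on one fold and metrics are evaluated on the remaining folds), and all function names are hypothetical.

```python
import math
import re


def parse_response(text):
    """Extract (guess, probability) from a 'Guess: ... Probability: ...' reply."""
    guess = re.search(r"Guess:\s*(.+)", text)
    prob = re.search(r"Probability:\s*([01](?:\.\d+)?)", text)
    if not guess or not prob:
        return None  # malformed reply; the caller decides how to handle it
    p = min(max(float(prob.group(1)), 0.0), 1.0)
    return guess.group(1).strip(), p


def ece(confs, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    total = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            total += len(b) / len(confs) * abs(avg_conf - acc)
    return total


def rescale(p, t):
    """Assumed form of temperature scaling: divide the logit of p by t."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return 1 / (1 + math.exp(-math.log(p / (1 - p)) / t))


def fit_temperature(confs, correct, grid=None):
    """Pick the temperature on a coarse grid (an assumption) that minimizes ECE."""
    grid = grid or [0.25 * k for k in range(1, 41)]
    return min(grid, key=lambda t: ece([rescale(p, t) for p in confs], correct))


def cross_fold_ece_t(confs, correct, k=5):
    """Fit a temperature on each fold once; evaluate ECE on the other folds."""
    n, scores = len(confs), []
    for f in range(k):
        fit = [i for i in range(n) if i % k == f]
        rest = [i for i in range(n) if i % k != f]
        t = fit_temperature([confs[i] for i in fit], [correct[i] for i in fit])
        scores.append(ece([rescale(confs[i], t) for i in rest],
                          [correct[i] for i in rest]))
    return sum(scores) / k
```

For example, `parse_response("Guess: Paris\nProbability: 0.8")` yields the guess and its stated probability, and `cross_fold_ece_t` takes the list of stated probabilities plus a parallel list of 0/1 correctness labels.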

    1. IRK was supported by funding from the Prins Bernhard Cultuurfonds (The Netherlands). This project was also funded by a Canadian Social Sciences and Humanities Research Council Insight Grant (435-2021-0224), a Social Sciences and Humanities Research Council Partnership Grant (895-2018-1023), and a Canada Research Chair (950-231872) to SMc.

      reference to Montreal the city or any institution or author based there

    2. Part of this research was presented at the Society for Music Perception and Cognition Conference, Portland, Oregon (2022). The authors would like to thank Bennett K. Smith for programming the experimental interface and assisting with the experiment execution on Prolific, and Philippe Macnab-Seguin for creating the chromatic scales for the second experiment.

      reference to Montreal the city or any institution or author based there

    3. Grimaud and Eerola (2022) compared instrument ensembles of strings, woodwinds, and brass in a study where participants either rated the emotions they perceived or manipulated musical parameters to produce a certain emotion. They found that strings were associated with increased anger and fear, woodwinds with decreased anger and fear, and brass with decreased fear, in the cases of both emotion perception and production. For the other emotions (joy, sadness, calmness, power, surprise), however, results were less consistent between perception and production, indicating that the emotion-instrument association may also depend on context of the task.

      makes an explicit connection between a music theory concept and cognition

    4. This research follows a constructionist approach to musical affect (Cespedes-Guevara & Eerola, 2018). That is, although we are interested in the "bottom-up" influence of certain musical features on musical affect, we believe these cannot be adequately evaluated without considering the "top-down" effects of context and individual differences that are present when affects are constructed. The perception or induction of affect does not merely arise in response to a stimulus but is also formed in relation to the individual and the context.

      makes an explicit connection between a music theory concept and cognition

    1. Although there are many idiosyncrasies in what may trigger a person with misophonia, the most common triggers are created by other humans, such as the sound of someone chewing, clearing their throat, tapping their foot, or typing on a keyboard.

      any sentences referring to misophonia verbatim

    2. an fMRI study found that people with misophonia show increased response in the anterior insular cortex (AIC) in response to misophonic sounds, compared to control participants and other unpleasant or neutral sounds (Kumar et al., 2017).

      any sentences referring to misophonia verbatim

    3. Both the subjective judgment of aversiveness and the physiological measure of skin conductance response (SCR) increase when people with misophonia are presented with triggers (Edelstein et al., 2013).

      any sentences referring to misophonia verbatim

    4. The disorder is not yet recognized by the Diagnostic and Statistical Manual − 5th version (DSM-5; American Psychiatric Association, 2013), but there has been an increasing amount of research on the characterization and treatment of misophonia (Vitoratou et al., 2021; see also Brout et al., 2018, for a review).

      any sentences referring to misophonia verbatim

    1. Composers and music researchers had previously analyzed and annotated 65 movements from the Classical, Romantic, and early Modern repertoire in terms of the Taxonomy of Orchestral Grouping Effects (McAdams et al., 2022).

      please find any claims that depend on citations referring to works by any of the present authors

    2. These results confirm with orchestral excerpts the findings of studies on isolated tones with dyads or triads of instruments in which the presence of impulsive instruments reduces the perception of blend (Lembke et al., 2019; Reuter, 1996; Tardieu & McAdams, 2012).

      please find any claims that depend on citations referring to works by any of the present authors

    3. structuring by affecting sequential grouping through the segregation of auditory streams played by different instruments and segmental grouping through timbral contrasts (McAdams et al., 2022).

      please find any claims that depend on citations referring to works by any of the present authors

    4. Several other spectral and spectrotemporal descriptors were found to play a role in blend perception in orchestral works by Fischer et al. (2021). These include spectral flatness and spectral crest (different measures of the degree to which the spectrum is denser or has more emergence of spectral components), and spectral variation (the degree of variation of the spectral shape over time).

      please find any claims that depend on citations referring to works by any of the present authors

    5. Fischer et al. (2021) studied the blends of multi-instrument streams in the context of orchestral stream segregation in predominantly Romantic orchestral excerpts. They found that within-family instrument combinations blended better than between-family combinations. They demonstrated the role played by overlap in timbre correlates of spectral flatness (a measure of the tonalness/noisiness or density of the spectrum), spectral skewness (related to the shape of the spectral envelope), and spectral variation (evolution of the spectral envelope over time), as well as cues derived from the scores such as onset synchrony and the consonance of concurrent pitch relations.

      please find any claims that depend on citations referring to works by any of the present authors

    1. When the sudden drop to a pianissimo occurred towards the ending of the piece, the perceived arousal responses of CHM and WM dropped slightly but rose again immediately to end on a high arousal. These two groups of listeners appear to have anticipated a return to a loud and majestic close and therefore kept their arousal responses higher than those of the NM.

      please highlight anything related to music performance practice

    2. CHM, who are more experienced with the instruments and compositional techniques used in Chinese orchestral music, might have had an idea of which features figure more prominently in the communication of particular intentions, and therefore would have more information available for their judgments.

      please highlight anything related to music performance practice

    3. The perception of affective intentions in music is influenced by the degree of familiarity listeners have with a musical tradition, the content implicated in the music, and the complex sonic environment created by the composer's creation and the musicians' interpretation.

      please highlight anything related to music performance practice

    4. Iqa' (plural iqa'at) is used to describe a rhythmic cycle. Iqa'at are made up of two different basic building blocks, the dum and tak, onomatopoeias derived from the sound produced on membranophones such as the darabuka.

      please highlight anything related to music theory

    5. H5. Being more culturally bound, musical cues that are learned, such as modal structures, metrical relations, and so on, will exert a greater influence on listeners' perceived valence ratings than on their arousal ratings.

      please highlight anything related to music theory

    1. In this work, we introduce a new paradigm for exploring a large corpus of small documents by identifying roles at the phrasal and sentence levels, then slice on, reify, group, and/or align the text itself on those roles, with sentences left intact.

      please find me the main contributions of this paper

    2. AbstractExplorer instantiates new minimally lossy SMT-informed techniques for skimming, reading, and reasoning about a corpus of similarly structured short documents: phrase-level role classification that drives sentence ordering, highlighting, and spatial alignment.

      please find me the main contributions of this paper

    3. AbstractExplorer has a unique combination of LLM-powered (1) faceted comparative close reading with (2) role highlighting enhanced by (3) structure-based ordering and (4) alignment. An ablation study (N=24) validated that these features work best together. A summative study (N=16) describes how these features support users in familiarizing themselves with a corpus of paper abstracts from a single large conference with over 1000 papers.

      please find me the main contributions of this paper

    4. We contribute: • Novel SMT theory-informed text analysis and rendering techniques for enabling cross-document skimming and comparative close reading at scale • AbstractExplorer, which instantiates these techniques for familiarizing oneself with a corpus of ∼1000 CHI paper abstracts. • Three studies informing and evaluating the benefits, challenges, and interactions between these techniques.

      please find me the main contributions of this paper