50 Matching Annotations
  1. Last 7 days
    1. Still, we can look for telltale signs. Another symptom of memorization is that GPT is highly sensitive to the phrasing of the question. Melanie Mitchell gives an example of an MBA test question where changing some details in a way that wouldn’t fool a person is enough to fool ChatGPT (running GPT-3.5). A more elaborate experiment along these lines would be valuable.

      OpenAI has memorised MBA tests- when these are rephrased or certain details are changed, the system fails to answer

    2. In fact, we can definitively show that it has memorized problems in its training set: when prompted with the title of a Codeforces problem, GPT-4 includes a link to the exact contest where the problem appears (and the round number is almost correct: it is off by one). Note that GPT-4 cannot access the Internet, so memorization is the only explanation.

      GPT4 knows the link to the coding exams that it was evaluated against but doesn't have "internet access" so it appears to have memorised this as well

    3. To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.

      OpenAI was only able to pass questions available before september 2021 and failed to answer new questions - strongly suggesting that it has simply memorised the answers as part of its training

  2. Mar 2023
    1. OpenChatKit은 다양한 응용 프로그램을위한 특수 및 범용 챗봇을 모두 생성 할 수있는 강력한 오픈 소스 기반을 제공합니다. 우리는 협력 법과 온 토코교육 데이터 세트를 작성합니다. 모델 릴리스 그 이상으로 이것은 오픈 소스 프로젝트의 시작입니다. 우리는 지역 사회 공헌으로 지속적인 개선을위한 도구와 프로세스를 발표하고 있습니다.Together는 오픈 소스 기초 모델이보다 포괄적이고 투명하며 강력하며 능력이 있다고 생각합니다. 우리는 공개하고 있습니다 OpenChatKit 0.15 소스 코드, 모델 가중치 및 교육 데이터 세트에 대한 전체 액세스 권한이있는 Apache-2.0 라이센스에 따라. 이것은 커뮤니티 중심의 프로젝트이며, 우리는 그것이 어떻게 발전하고 성장하는지 보게되어 기쁩니다!유용한 챗봇은 자연 언어로 된 지침을 따르고 대화 상자에서 컨텍스트를 유지하며 응답을 조정해야합니다. OpenChatKit은이베이스에서 특수 제작 된 챗봇을 도출하기위한 기본 봇과 빌딩 블록을 제공합니다.이 키트에는 4 가지 주요 구성 요소가 있습니다:100 % 탄소 음성 계산에 대한 4,300 만 건 이상의 명령으로 EleutherAI의 GPT-NeoX-20B에서 채팅을 위해 미세 조정 된 명령 조정 된 대용량 언어 모델;작업을 정확하게 수행하기 위해 모델을 미세 조정하는 사용자 정의 레시피;추론시 문서 저장소, API 또는 기타 실시간 업데이트 정보 소스의 정보로 봇 응답을 보강 할 수있는 확장 가능한 검색 시스템;봇이 응답하는 질문을 필터링하도록 설계된 GPT-JT-6B로 미세 조정 된 조정 모델.OpenChatKit에는 사용자가 피드백을 제공하고 커뮤니티 구성원이 새로운 데이터 세트를 추가 할 수 있도록하는 도구가 포함되어 있습니다. 시간이 지남에 따라 LLM을 개선 할 수있는 개방형 교육 데이터 모음에 기여합니다.

      OpenChatKit은 다양한 응용 프로그램을위한 특수 및 범용 챗봇을 모두 생성 할 수있는 강력한 오픈 소스 기반을 제공합니다. 우리는 협력 법과 온 토코교육 데이터 세트를 작성합니다. 모델 릴리스 그 이상으로 이것은 오픈 소스 프로젝트의 시작입니다. 우리는 지역 사회 공헌으로 지속적인 개선을위한 도구와 프로세스를 발표하고 있습니다.

      Together는 오픈 소스 기초 모델이보다 포괄적이고 투명하며 강력하며 능력이 있다고 생각합니다. 우리는 공개하고 있습니다 OpenChatKit 0.15 소스 코드, 모델 가중치 및 교육 데이터 세트에 대한 전체 액세스 권한이있는 Apache-2.0 라이센스에 따라. 이것은 커뮤니티 중심의 프로젝트이며, 우리는 그것이 어떻게 발전하고 성장하는지 보게되어 기쁩니다!

      유용한 챗봇은 자연 언어로 된 지침을 따르고 대화 상자에서 컨텍스트를 유지하며 응답을 조정해야합니다. OpenChatKit은이베이스에서 특수 제작 된 챗봇을 도출하기위한 기본 봇과 빌딩 블록을 제공합니다.

      이 키트에는 4 가지 주요 구성 요소가 있습니다:

      100 % 탄소 음성 계산에 대한 4,300 만 건 이상의 명령으로 EleutherAI의 GPT-NeoX-20B에서 채팅을 위해 미세 조정 된 명령 조정 된 대용량 언어 모델;

      작업을 정확하게 수행하기 위해 모델을 미세 조정하는 사용자 정의 레시피;

      추론시 문서 저장소, API 또는 기타 실시간 업데이트 정보 소스의 정보로 봇 응답을 보강 할 수있는 확장 가능한 검색 시스템;

      봇이 응답하는 질문을 필터링하도록 설계된 GPT-JT-6B로 미세 조정 된 조정 모델.

  3. Feb 2023
  4. Jan 2023
    1. Figure 3. The average drop in log probability (perturbation discrep-ancy) after rephrasing a passage is consistently higher for model-generated passages than for human-written passages. Each plotshows the distribution of the perturbation discrepancy d (x, pθ , q)for human-written news articles and machine-generated arti-cles; of equal word length from models GPT-2 (1.5B), GPT-Neo-2.7B (Black et al., 2021), GPT-J (6B; Wang & Komatsuzaki (2021))and GPT-NeoX (20B; Black et al. (2022)). Human-written arti-cles are a sample of 500 XSum articles; machine-generated textis generated by prompting each model with the first 30 tokens ofeach XSum article, sampling from the raw conditional distribution.Discrepancies are estimated with 100 T5-3B samples.

      quite striking here is the fact that more powerful/larger models are more capable of generating unusual or "human-like" responses - looking at the overlap in log likelihoods

    2. if we apply small perturbations to a passagex ∼ pθ , producing ̃x, the quantity log pθ (x) − log pθ ( ̃x)should be relatively large on average for machine-generatedsamples compared to human-written text.

      By applying small changes to text sample x, we should be able to find the log probs of x and the perturbed example and there should be a fairly big delta for machine generated examples.

    3. As in prior work, we study a ‘white box’ setting (Gehrmannet al., 2019) in which the detector may evaluate the log prob-ability of a sample log pθ (x). The white box setting doesnot assume access to the model architecture or parameters.While most public APIs for LLMs (such as GPT-3) enablescoring text, some exceptions exist

      The authors assume white-box access to the log probability of a sample \(log p_{\Theta}(x)\) but do not require access to the model's actual architecture or weights.

    4. Empirically, we find predictive entropy to be positively cor-related with passage fake-ness more often that not; there-fore, this baseline uses high average entropy in the model’spredictive distribution as a signal that a passage is machine-generated.

      this makes sense and aligns with the gltr - humans add more entropy to sentences by making unusual choices in vocabulary that a model would not.

    5. We find that supervised detectors can provide similardetection performance to DetectGPT on in-distribution datalike English news, but perform significantly worse than zero-shot methods in the case of English scientific writing andfail altogether for German writing. T

      supervised detection methods fail on out of domain examples whereas detectgpt seems to be robust to changes in domain.

    6. ex-tending DetectGPT to use ensembles of models for scoring,rather than a single model, may improve detection in theblack box setting

      DetectGPT could be extended to use ensembles of models allowing iot to work in black box settings where the log probs are unknown

    7. hile in this work, we use off-the-shelfmask-filling models such as T5 and mT5 (for non-Englishlanguages), some domains may see reduced performanceif existing mask-filling models do not well represent thespace of meaningful rephrases, reducing the quality of thecurvature estimate.

      The approach requires access to language models that can meaningfully and accurately rephrase (perturbate) the outputs from the model under evaluation. If these things do not align then it may not work well.

    8. For models be-hind APIs that do provide probabilities (such as GPT-3),evaluating probabilities nonetheless costs money.

      This does cost money to do for paid APIs and requires that log probs are made available.

    9. We simulate human re-vision by replacing 5 word spans of the text with samplesfrom T5-3B until r% of the text has been replaced, andreport performance as r varies.

      I question the trustworthiness of this simulation - human edits are probably going to be more sporadic and random.

    10. Figure 5. We simulate human edits to machine-generated text byreplacing varying fractions of model samples with T5-3B gener-ated text (masking out random five word spans until r% of text ismasked to simulate human edits to machine-generated text). Thefour top-performing methods all generally degrade in performancewith heavier revision, but DetectGPT is consistently most accurate.Experiment is conducted on the XSum dataset

      DetectGPT shows 95% AUROC for texts that have been modified by about 10% and this drops off to about 85% when text is changed up to 24%.

    11. DetectGPT’s performancein particular is mostly unaffected by the change in languagefrom English to Germa

      Performance of this method is robust against changes between languages (e.g. English to German)

    12. ecause the GPT-3 API does not provideaccess to the complete conditional distribution for each to-ken, we cannot compare to the rank, log rank, and entropy-based prior methods

      GPT-3 api does not expose the cond probs for each token so we can't compare to some of the prior methods. That seems to suggest that this method can be used with limited knowledge about the probabilities.

    13. improving detection offake news articles generated by 20B parameterGPT-NeoX

      The authors test their approach on GPT-NeoX. The question would be whether we can get hold of the log probs from ChatGPT to do the same

    14. his approach, which we call DetectGPT,does not require training a separate classifier, col-lecting a dataset of real or generated passages, orexplicitly watermarking generated text. It usesonly log probabilities computed by the model ofinterest and random perturbations of the passagefrom another generic pre-trained language model(e.g, T5)

      The novelty of this approach is that it is cheap to set up as long as you have the log probabilities generated by the model of interest.

    15. See ericmitchell.ai/detectgptfor code, data, and other project information.

      Code and data available at https://ericmitchell.ai/detectgpt

    1. Feng, 2022. "Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis"

      Shared and found via: Gowthami Somepalli @gowthami@sigmoid.social Mastodon > Gowthami Somepalli @gowthami StructureDiffusion: Improve the compositional generation capabilities of text-to-image #diffusion models by modifying the text guidance by using a constituency tree or a scene graph.

    1. Educators are now administering the Turing test in reverse: What are questions that only humans can answer well? What kinds of thinking does writing make possible for us? 
    2. GPT-3 threatens to “[undermine] the kind of writing intensive course that had served as the backbone of [his] teaching for two decades.” “I was less worried about whether GPT-3 is genuinely intelligent,” Symons writes, “and more worried about whether the development of these tools would make us less intelligent.” 
  5. Dec 2022
    1. Our method is based on the hypothesis that the weights of a generator act as Optimal Linear Associative Memory (OLAM). OLAM is a classic single-layer neural data structure for memorizing associations that was described by Teuvo Kohonen and James A Anderson (independently) in the 1970s. In our case, we hypothesize that within a large modern multilayer convolutional network, the each individual layer plays the role of an OLAM that stores a set of rules that associates keys, which denote meaningful context, with values, which determine output.
    1. natural-language processing is going to force engineers and humanists together. They are going to need each other despite everything. Computer scientists will require basic, systematic education in general humanism: The philosophy of language, sociology, history, and ethics are not amusing questions of theoretical speculation anymore. They will be essential in determining the ethical and creative use of chatbots, to take only an obvious example.
    2. The extraordinary ignorance on questions of society and history displayed by the men and women reshaping society and history has been the defining feature of the social-media era.
    1. Emergent abilities are not present in small models but can be observed in large models.

      Here’s a lovely blog by Jason Wei that pulls together 137 examples of ’emergent abilities of large language models’. Emergence is a phenomenon seen in contemporary AI research, where a model will be really bad at a task at smaller scales, then go through some discontinuous change which leads to significantly improved performance.

    1. Houston, we have a Capability Overhang problem: Because language models have a large capability surface, these cases of emergent capabilities are an indicator that we have a ‘capabilities overhang’ – today’s models are far more capable than we think, and our techniques available for exploring the models are very juvenile. We only know about these cases of emergence because people built benchmark datasets and tested models on them. What about all the capabilities we don’t know about because we haven’t thought to test for them? There are rich questions here about the science of evaluating the capabilities (and safety issues) of contemporary models. 
    1. As the metaphor suggests, though, the prospect of a capability overhang isn’t necessarily good news. As well as hidden and emerging capabilities, there are hidden and emerging threats. And these dangers, like our new skills, are almost too numerous to name.
    2. There’s a concept in AI that I’m particularly fond of that I think helps explain what’s happening. It’s called “capability overhang” and refers to the hidden capacities of AI: skills and aptitudes latent within systems that researchers haven’t even begun to investigate yet. You might have heard before that AI models are “black boxes” — that they’re so huge and complex that we don’t fully understand how they operate or come to specific conclusions. This is broadly true and is what creates this overhang.
    1. Which is why I wonder if this may be the end of using writing as a benchmark for aptitude and intelligence.
    2. Perhaps there are reasons for optimism, if you push all this aside. Maybe every student is now immediately launched into that third category: The rudiments of writing will be considered a given, and every student will have direct access to the finer aspects of the enterprise. Whatever is inimitable within them can be made conspicuous, freed from the troublesome mechanics of comma splices, subject-verb disagreement, and dangling modifiers.
    3. I’ve also long held, for those who are interested in writing, that you need to learn the basic rules of good writing before you can start breaking them—that, like Picasso, you have to learn how to reliably fulfill an audience’s expectations before you get to start putting eyeballs in people’s ears and things.
  6. Nov 2022
    1. “In literacy education, particularly for developing writers, instructors are looking for the level of desirable difficulty, or the point at which you are working yourself just as hard so that you don’t break but you also improve,” Laffin told Motherboard. “Finding the right, appropriate level of desirable difficulty level of instruction makes their capacity to write grow. So if you are doing compensation techniques that go beyond finding that level of desirable difficulty and instructing at that place, then you’re not helping them grow as a writer.”
  7. Aug 2022
  8. Jun 2022
    1. The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the value at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
  9. Apr 2022
    1. # Input Input: 123, Output: Input: 121, Output: Input: 111, Output: Input: 123454321, Output: Input 123123, Output: # Instruction Output true if input is a palindrome # Output Input: 123, Output: false Input: 121, Output: true Input: 111, Output: true Input: 123454321, Output: true Input 123123, Output: false

      Example of using GPT-3 for programming

  10. Nov 2021
    1. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
    1. These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction
  11. Jun 2021
    1. When creating a BIOS Boot Partition on a GPT system, you should make sure that it is at least 31 KiB in size.

      This is important. If not set this, the OS won't be detected when grub is used with GPT system.

  12. Apr 2021
  13. Feb 2021
  14. Jul 2020