- Apr 2022
Yeshiva teaching in the modern period famously relied on memorization of the most important texts, but a few medieval Hebrew manuscripts from the twelfth or thirteenth centuries include examples of alphabetical lists of words with the biblical phrases in which they occurred, but without precise locations in the Bible, presumably because the learned would know them.
Prior to concordances of the Christian Bible, there are examples of Hebrew manuscripts from the twelfth and thirteenth centuries that list words alongside the sentences or phrases in which they occurred. They didn't include exact locations, on the presumption that most scholars would know the texts well enough to quickly find them based on the phrases used.
Early concordances were later made unnecessary as tools once digital search dramatically decreased the load. However, digital search might miss the value found in the serendipity of browsing through broad word lists.
Has anyone made a concordance search and display tool to automatically generate concordances of any particular texts? Do professional indexers use these? What might be the implications of overlapping concordances of seminal texts within the corpus linguistics space?
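As a rough illustration of what such a tool might do at its core, here is a minimal keyword-in-context (KWIC) sketch in Python. The function name `build_concordance`, the `width` parameter, and the sample sentence are all my own assumptions for illustration; a real tool would need proper tokenization, lemmatization, and location references.

```python
import re
from collections import defaultdict


def build_concordance(text, width=30):
    """Map each lowercased word to its keyword-in-context snippets.

    `width` is the number of characters of context kept on each side.
    (Hypothetical helper for illustration, not an existing tool's API.)
    """
    concordance = defaultdict(list)
    # Naive tokenization: runs of letters and apostrophes count as words.
    for match in re.finditer(r"[A-Za-z']+", text):
        start, end = match.span()
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        snippet = f"{left}[{match.group()}]{right}".replace("\n", " ")
        concordance[match.group().lower()].append(snippet)
    # Alphabetical order, like the medieval word lists.
    return dict(sorted(concordance.items()))


sample = (
    "In the beginning God created the heaven and the earth. "
    "And the earth was without form."
)
conc = build_concordance(sample, width=15)
for line in conc["earth"]:
    print(line)
```

Like the early Hebrew lists, this shows each word with its surrounding phrase but no precise location; adding the character offset or a chapter and verse reference to each snippet would be the natural next step.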
Fun tools like the Bible Munger now exist to play around with find and replace functionality. https://biblemunger.micahrl.com/munge
Online tools also have multi-translation versions that will show translational differences between the seemingly ever-growing number of English translations of the Bible.
- Feb 2022
Together: responsive, inline “autocomplete” powered by an RNN trained on a corpus of old sci-fi stories.
I can't help but think, what if one used their own collected corpus of ideas based on their ever-growing commonplace book to create a text generator? Then by taking notes, highlighting other work, and doing your own work, you're creating a corpus of material that's eminently interesting to you. This also means that by subsuming text over time in making your own notes, the artificial intelligence will more likely also be using your own prior thought patterns to make something that, from an information-theoretic standpoint, looks and sounds more like you. It would have your "hand" so to speak.
- Jan 2022
from: Eyeo Conference 2017
Robin Sloan at Eyeo 2017 | Writing with the Machine | Language models built with recurrent neural networks are advancing the state of the art on what feels like a weekly basis; off-the-shelf code is capable of astonishing mimicry and composition. What happens, though, when we take those models off the command line and put them into an interactive writing environment? In this talk Robin presents demos of several tools, including one presented here for the first time. He discusses motivations and process, shares some technical tips, proposes a course for the future, and along the way writes at least one short story together with the audience: all of us, and the machine.
Robin created a corpus using If Magazine and Galaxy Magazine from the Internet Archive and used it as a writing tool. He talks about using a few other models for generating text.
Some of the ideas here are reminiscent of the way John McPhee used the 1913 Webster Dictionary for finding words (or le mot juste) for his work, as tangentially suggested in Draft #4 in The New Yorker (2013-04-22).
Croatian a cappella singing: klapa https://www.youtube.com/watch?v=sciwtWcfdH4
Writing using the adjacent possible.
Corpus building as an art [~37:00]
Forgetting what one trained their model on and then seeing the unexpected come out of it. This is similar to Luhmann's use of the zettelkasten as a serendipitous writing partner.
How might we use information theory to do this more easily?
What does a person or machine's "hand" look like in the long term with these tools?
Can we use corpus linguistics in reverse for this?
What sources would you use to train your model?
- Andrej Karpathy. 2015. "The Unreasonable Effectiveness of Recurrent Neural Networks."
- Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, et al. 2015. "Generating Sentences from a Continuous Space." arXiv:1511.06349
- Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. "A Hybrid Convolutional Variational Autoencoder for Text Generation." arXiv:1702.02390
- Soroush Mehri, et al. 2017. "SampleRNN: An Unconditional End-to-End Neural Audio Generation Model." arXiv:1612.07837 (applies neural networks to sound and sound production)
- neural networks
- Milman Parry
- Draft #4
- le mot juste
- Webster's dictionary
- throat singing
- corpus linguistics
- artificial intelligence
- tools for thought
- Eyeo Festival
- adjacent possible
- Andrej Karpathy
- John McPhee
- Robin Sloan
- Jun 2021
The viciousness of church politics can rival pretty much any other politics you can name; the difference is that the viciousness within churches is often cloaked in lofty spiritual language and euphemisms.
It would be interesting to examine some of this language and these euphemisms to uncover the change over time.
- Feb 2021
Sanders, J., Tosi, A., Obradović, S., Miligi, I., & Delaney, L. (2021). Lessons from lockdown: Media discourse on the role of behavioural science in the UK COVID-19 response. PsyArXiv. https://doi.org/10.31234/osf.io/dw85a
Only fifteen of the thirty-seven commonplace books were written in his hand. He might have dictated the others to a secretary, but the nature of his authorship, if it existed, remains a matter of conjecture. A great deal of guesswork also must go into the interpretation of the entries in his own hand, because none of them are dated. Unlike the notes of Harvey, they consist of endless excerpts, which cannot be connected with anything that was happening in the world of politics.
I find myself wondering what this study of his commonplace books would look like if they were digitized and cross-linked. Sadly, the lack of dates on the entries would prevent some knowledge from being captured, but what would the broader corpus look like?
Consider the broader digital humanities perspective of this. Something akin to corpus linguistics, but at the level of what a single person reads, thinks, and reacts to over the course of their own lifetime.
How much of a person could be recreated from such a collection?
- Oct 2020
To have, but maybe not to read. Like Stephen Hawking’s “A Brief History of Time,” “Capital in the Twenty-First Century” seems to have been an “event” book that many buyers didn’t stick with; an analysis of Kindle highlights suggested that the typical reader got through only around 26 of its 700 pages. Still, Piketty was undaunted.
Interesting use of digital highlights--determining how "read" a particular book is.
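The quoted piece doesn't say how the estimate was made, so the following is only a sketch of one crude way it might be done: average the page numbers of a book's most popular highlights and divide by the total page count. The function name `read_through_fraction`, the `top_n` cutoff, and the page numbers are all hypothetical; I chose the pages so their mean matches the 26 pages mentioned in the quote.

```python
def read_through_fraction(highlight_pages, total_pages, top_n=5):
    """Crude proxy for how far readers get into a book.

    `highlight_pages` is assumed to hold the page numbers of the most
    popular highlights, ordered from most to least popular.
    (Hypothetical helper, not the method used in the quoted analysis.)
    """
    if total_pages <= 0 or not highlight_pages:
        raise ValueError("need at least one highlight and a positive page count")
    top = highlight_pages[:top_n]
    mean_page = sum(top) / len(top)
    return mean_page / total_pages


# Made-up page numbers for illustration only.
popular_highlight_pages = [1, 9, 20, 35, 65]
fraction = read_through_fraction(popular_highlight_pages, total_pages=700)
print(f"{fraction:.1%}")
```

A measure like this says more about where highlighting stops than about where reading stops, so it is best read as a rough signal and not a measurement.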
- Nov 2019
From this perspective, GPT-2 says less about artificial intelligence and more about how human intelligence is constantly looking for, and accepting of, stereotypical narrative genres, and how our mind always wants to make sense of any text it encounters, no matter how odd. Reflecting on that process can be the source of helpful self-awareness—about our past and present views and inclinations—and also, some significant enjoyment as our minds spin stories well beyond the thrown-together words on a page or screen.
It's not just happening with text; it also happens with speech, as I've written before in "Complexity isn’t a Vice: 10 Word Answers and Doubletalk in Election 2016." In fact, in that case, looking at transcripts actually helps to reveal that the emperor had no clothes, because there's so much missing from the speech that the text doesn't have enough space to fill in the gaps the way the live speech did.
The most interesting examples have been the weird ones (cf. HI7), where the language model has been trained on narrower, more colorful sets of texts, and then sparked with creative prompts. Archaeologist Shawn Graham, who is working on a book I’d like to preorder right now, An Enchantment of Digital Archaeology: Raising the Dead with Agent Based Models, Archaeogaming, and Artificial Intelligence, fed GPT-2 the works of the English Egyptologist Flinders Petrie (1853-1942) and then resurrected him at the command line for a conversation about his work. Robin Sloan had similar good fun this summer with a focus on fantasy quests, and helpfully documented how he did it.
Circle back around and read this when it comes out.
Similarly, these other references should be an interesting read as well.
For those not familiar with GPT-2, it is, according to its creators OpenAI (a socially conscious artificial intelligence lab overseen by a nonprofit entity), “a large-scale unsupervised language model which generates coherent paragraphs of text.” Think of it as a computer that has consumed so much text that it’s very good at figuring out which words are likely to follow other words, and when strung together, these words create fairly coherent sentences and paragraphs that are plausible continuations of any initial (or “seed”) text.
This isn't a very difficult problem and the underpinnings of it are well laid out by John R. Pierce in An Introduction to Information Theory: Symbols, Signals and Noise. In it he has a lot of interesting tidbits about language and structure from an engineering perspective including the reason why crossword puzzles work.
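To make the "which words are likely to follow other words" idea concrete, here is a toy sketch of a word-level bigram model, in the spirit of the statistical approximations to English that Pierce discusses. It is far simpler than GPT-2, which is a large neural network, and only illustrates the underlying intuition. The names `train_bigrams` and `generate` and the tiny corpus are my own assumptions.

```python
import random
from collections import defaultdict


def train_bigrams(text):
    """Record, for each word, every word that follows it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, seed, length=12, rng=None):
    """Continue `seed` by repeatedly sampling a word seen after the last one."""
    rng = rng or random.Random(0)  # fixed default seed for repeatable runs
    out = [seed]
    for _ in range(length - 1):
        options = followers.get(out[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        # Repeated followers appear more than once in the list, so frequent
        # continuations are proportionally more likely to be chosen.
        out.append(rng.choice(options))
    return " ".join(out)


corpus = "the cat sat on the mat and the dog sat on the rug"
model = train_bigrams(corpus)
print(generate(model, "the"))
```

Swapping the toy corpus for one's own notes or a set of old magazines is the same move, at a much smaller scale, as training on a narrower and more colorful set of texts.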
close reading, distant reading, corpus linguistics
- Sep 2019
He is now intending to collaborate with Bourne on a series of articles about the find. “Having these annotations might allow us to identify further books that have been annotated by Milton,” he said. “This is evidence of how digital technology and the opening up of libraries [could] transform our knowledge of this period.”
- Apr 2019
Digital sociology needs more big theory as well as testable theory.
I can't help but think here about the application of digital technology to large bodies of literature in the creation of the field of corpus linguistics.
If traditional sociology means anything, then a digital incarnation of it should create physical and trackable means that can potentially be more easily studied as a result. In the same way that Mark Dredze has been able to look at Twitter data to analyze public health data like influenza, we should be able to more easily quantify sociological phenomena in aggregate by looking at larger and richer data sets of online interactions.
There's also likely some value in studying the quantities of digital exhaust that companies like Google, Amazon, Facebook, etc. are using for surveillance capitalism.