- Feb 2024
-
-
Contemporary language models are sustained not only by a global network of human labor, but by physical infrastructure. The computer warehouses where language models are trained are large: a modern data center can be millions of square feet (the size of several football stadiums) and require a lot of water to prevent the machines from overheating. For instance, a data center outside of Des Moines, Iowa, identified as the “birthplace” of GPT-4, used 11.5 million gallons of water for cooling in 2022, drawn from rivers that also provide the city’s drinking water. These challenges have led to decisions to build data centers in regions with cooler climates and more water to draw from; some companies have experimented with putting data centers underwater. (Data centers are used for a lot more than language models, of course; the entire internet lives on these machines.)
Ok, this is terrible environmentally. We have found yet another way to drain away our natural resources. This is infuriating. Instead of putting so much time and effort into a non-essential, harmful technology, how about we try to solve the problems that are impacting basic quality of life for millions, rather than contributing to those problems?
-
The data workers whose labor is essential to modern AI systems include prisoners in Finland and employees of data annotation agencies in Kenya, Uganda, and India.
What exactly are these workers doing to maintain these systems?
-
The kinds of “dispreferred” texts to which data labelers are exposed, in practice, have tended to describe horrific scenarios, following a well-entrenched pattern of offloading the most traumatic parts of maintaining automated systems to workers who are often precariously employed and given insufficient psychological support.
This really confused me. What is this trying to say? Why describe horrific scenarios?
-
While many authors, programmers, and other people who publish writing online are aghast to find that their work has been stolen and used as language model training fodder, some are enthusiastic about being included in AI training data. These debates have raised questions about what constitutes labor and what fair compensation might look like for (unwitting) intellectual contributions to the development of what are ultimately commercial systems being licensed for profit. How does the labor involved in maintaining a personal blog as a hobby compare with that of reporters and authors who are commissioned and paid to publish their work? With that of volunteer Wikipedia editors? How does the “labor” of posting online compare with the labor performed by workers conversing with prototypical chatbots and labeling text?
This is scary. As someone who wants to go into journalism and writing, the idea that my ideas and writing can be used without my consent by this type of technology is alarming. It is one thing to analyze that which is already written and collect data, but I do not like that programs are being trained to synthesize and replicate writing to create their own material.
-
One of those times, it is followed by “page”, and the other time it is followed by “article”—so the probability that it is followed by “page” is 50 percent. This is not a very robust language model for English—the vocabulary is incredibly small, and there is no variety of syntactic structures. A more representative sample of English, then, would require a much larger collection of sentences. We’ll return to this in a moment.
This seems like the same style of model Voyant uses to detect word frequency in a corpus. It's cool how it uses word choice frequency to predict sentence structure based on basic statistics. It is now starting to make sense why Voyant cares so much about measuring word choice frequency!
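To make the passage's arithmetic concrete, here is a minimal sketch of that kind of bigram counting in Python (the toy corpus is invented; the passage's own example has a word followed once by “page” and once by “article”):

```python
from collections import Counter, defaultdict

# Count, for every word in a toy corpus, which words follow it, then turn
# those counts into probabilities, as the passage describes.
corpus = "the page is short . the article is long".split()

follows = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    follows[word][nxt] += 1

def prob_next(word, candidate):
    counts = follows[word]
    total = sum(counts.values())
    return counts[candidate] / total if total else 0.0

print(prob_next("the", "page"))     # 0.5: "the" appears twice, once before "page"
print(prob_next("the", "article"))  # 0.5
```

As the passage notes, a model built from so few sentences is brittle; the same counting only becomes useful over a much larger corpus.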
-
-
every.to
-
not actually that complicated.
Umm...
-
The placement of dishes in meal-space isn’t random anymore. In fact, there are underlying, hidden mathematical patterns that mean every food is placed in some logic relative to every other food.
Ok, this is pretty cool.
-
Notice that we didn’t look for any meals in which caesar and caprese salads occur together. They never need to occur together for us to deem the dishes similar. They simply need to be found among the same other dishes.
Ok. This makes sense. It identifies patterns and recreates patterns rather than copying the exact contents of something.
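A minimal sketch of that idea, with invented meal data: the two salads below never appear in the same meal, yet they come out maximally similar because they keep the same company.

```python
meals = [
    {"caesar salad", "garlic bread", "spaghetti"},
    {"caprese salad", "garlic bread", "lasagna"},
    {"caesar salad", "breadsticks", "lasagna"},
    {"caprese salad", "breadsticks", "spaghetti"},
]

def companions(dish):
    # Every other dish that shares a meal with `dish`.
    return {d for meal in meals if dish in meal for d in meal} - {dish}

def similarity(a, b):
    # Jaccard overlap of the two dishes' companion sets.
    ca, cb = companions(a), companions(b)
    return len(ca & cb) / len(ca | cb)

# Prints 1.0 even though the salads never co-occur in any meal.
print(similarity("caesar salad", "caprese salad"))
```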
-
Given a few regions of food, it just needs to find the most common region the next dish would be in…
Gone are the days of trying to figure out what to eat for dinner and spending all day thinking about it. Using data and AI, you could hypothetically list the ingredients in your fridge and ask it to make a meal with a salad and a main. How amazing! Has anyone given this a try before? I think I might have to.
-
To me, this is why the AI phenomenon we’re living through is so fascinating. Considering how transformative this technology is, it’s not actually that complicated. A few simple mathematical concepts, a whole lot of training data, a sprinkle of salt and pepper, and you’ve essentially built yourself a thinking machine.
Sure, the algorithm behind AI is, I suppose, basic in its intrinsic form, but the actual development of AI tools and applications comes with complexities of its own, no? I think the author wants us to see the basic forms of how AI works, but its integration with software, and the various hurdles that brings, can be complex in its own right. For example, the development of AI tools like ChatGPT obviously requires a talented team of engineering and software professionals to figure things out, probably using methods that are extremely complicated to manage.
-
Train a model to understand the relationships between words based on how often they appear in similar contexts. “A word is categorized by the company it keeps.” Feed it a ton of human-written data (and when I say a ton, I essentially mean the entire internet), and let it nudge word coordinates around appropriately.
When training a model, how do we establish what we are looking for? Sure, we are giving it vast amounts of data to find patterns in the vector space, but how do we train it to stick to a particular path? There's a concept called an objective function, which basically refers to what an AI model is optimizing for; why was this left out? https://www.larksuite.com/en_us/topics/ai-glossary/objective-function
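This isn't Word2Vec or any production training method, just a toy sketch (all data and numbers invented) of the "nudge word coordinates around" idea, with the objective function the comment above asks about made explicit: pull the vectors of co-occurring words together, push randomly paired words apart.

```python
import random

random.seed(0)

# Toy corpus and 2-D word vectors at random starting positions.
sentences = [["dog", "barks"], ["cat", "meows"], ["dog", "bites"], ["cat", "purrs"]]
words = {w for s in sentences for w in s}
vec = {w: [random.uniform(-1, 1), random.uniform(-1, 1)] for w in words}

def nudge(a, b, toward, lr=0.05):
    # Step vec[a] slightly toward (or away from) vec[b].
    sign = 1 if toward else -1
    for i in range(2):
        vec[a][i] += sign * lr * (vec[b][i] - vec[a][i])

for _ in range(50):
    for sentence in sentences:
        for a in sentence:
            for b in sentence:
                if a != b:
                    nudge(a, b, toward=True)   # words sharing a context attract
            other = random.choice(sorted(words - set(sentence)))
            nudge(a, other, toward=False)      # a random non-neighbor repels

def dist(a, b):
    return sum((vec[a][i] - vec[b][i]) ** 2 for i in range(2)) ** 0.5

print(dist("dog", "barks"), dist("dog", "meows"))  # the first should be smaller
```

The attract/repel rule is the objective here; real systems express the same idea as a loss function minimized by gradient descent over billions of examples.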
-
-
engl201.opened.ca
-
Recent debates may also tend to overstate the technical challenges of interdisciplinarity. Distant readers admittedly enjoy discussing new unsupervised algorithms that are hard to interpret.5 But many useful methods are supervised, comparatively straightforward, and have been in social-science courses for decades. A grad student could do a lot of damage to received ideas with a thousand novels, manually gathered metadata, and logistic regression.
Any time people are afraid of something new or dismiss it for its problems, they risk missing out on an opportunity to use a tool, or they risk allowing the tool to be used poorly. As soon as someone tries to scare you away from exploring something, that should be a signal to immediately learn everything you can, I would think.
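A rough sketch of the workflow the quoted passage gestures at, with placeholder data (the texts, labels, and the "literary prestige" framing are all invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# In practice: thousands of novels, plus one manually gathered metadata
# label per novel (say, whether period reviewers treated it as "literary").
texts = ["full text of novel one ...", "full text of novel two ..."]
labels = [1, 0]

# Bag-of-words features, then a plain supervised logistic regression.
X = CountVectorizer(max_features=5000).fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# The fitted coefficients show which words pull a novel toward each label,
# which is where the "damage to received ideas" would come from.
```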
-
Instead of simply counting words or volumes, distant readers increasingly treat writing as a field of relations to be modeled, using equations that connect linguistic variables to social ones.4 Once we grasp how this story fits into the larger intellectual history of our time, it no longer makes much sense to frame it as a debate within literary studies
This is good. If this technology can help place data within the context of a much larger grouping of related data, I think it could be very beneficial for understanding large-scale themes and overarching relationships between disciplines.
-
“big data,” for instance, because the term is new, terrifying, and so poorly defined that it can signify a wide range of threats.
The term definitely seems coined to instil fear, and terminology like this should always be treated with caution. It reminds me of when people talk about “big pharma” as a money-hungry business out to get everyone. It is not as scary as it seems, though; after a quick search, big data simply refers to data sets that are too large or complex to be dealt with by traditional data-processing software.
-
“distant reading.” It’s vivid, it doesn’t overemphasize technology, and it candidly admits that new methods are mainly useful at larger scales of analysis
Distant reading is starting to make a lot more sense to me, although I feel that, to some extent, the objective is to be able to read more without actually reading the content. For some it probably means working smarter, not harder. I wonder if this could potentially miss important pieces of the reading.
-
Because changes of scale are easy to describe, journalists often stop here—reducing recent intellectual history to the buzzword “big data.” The more interesting part of the story is philosophical rather than technical, and involves what Leo Breiman, fifteen years ago, called a new “culture” of statistical modeling (Breiman)
I totally agree that a lot of people who use language/text models or AI in their professional lives throw the term "big data" around quite frequently. Does it just refer to the general computational handling of large data sets, analogous to distant reading? Or could it imply something deeper: not only handling large amounts of material, but also finding a specific pattern or objective within those vast datasets?
-
-
uta.pressbooks.pub
-
For instance, the word “Indians” nearly fell out of all addresses after Teddy Roosevelt, while the word “God” has been a part of every presidential speech since Franklin Roosevelt. Also, a word like “women” was rarely used in the president’s annual address. Franklin Roosevelt, Harry Truman, and John F. Kennedy used the word to address both men and women. However, Jimmy Carter and Ronald Reagan began speaking directly to women about gendered issues. This visualization is an example of distant reading.
This is fascinating. I am very interested to learn about the use of words over time, changing connotations, and the contexts in which they were used. This is such a cool way of looking at how different people and issues have been recognized by the government over the years. Names are so important, and I wonder if some words have seen an increase in use as marginalized groups reclaim previously derogatory terms as their own.
-
Managing it effectively can mean the difference between finding what you are after and getting lost in a jumble of data. Distant reading is one process where we might use text mining software to analyze several textual documents.
It is still so easy to get lost in a jumble of information. I only just discovered Ctrl + F this academic year. I do not know how I managed before I found that shortcut. It has allowed me to quickly scan long, dry academic articles when researching for papers to see if the information is relevant to my topic. It has also saved me when I know I've read something I want to quote but cannot remember which of my 27 tabs it was in.
-
We could then use a Topic Modeling Tool to perform a statistical analysis that scans a set of documents to detect linguistic patterns and then to cluster the words or “topics” together as groups.
This is really interesting! Honestly, as a student I wasn't aware of this kind of software prior to last semester. I feel as though I was living under a rock. A student was caught plagiarizing on their exam, and their answers were clearly identified using this kind of tool.
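Not the Topic Modeling Tool itself (which, as I understand it, is a graphical front end), but a rough Python equivalent of the statistical analysis the passage describes, using scikit-learn's LDA on a few invented stand-in documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-ins for the album documents described above.
documents = [
    "love heart night stars dream sky",
    "money power street hustle grind city",
    "love dream night sky moon heart",
    "street money grind city lights power",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the most prominent words in each discovered "topic".
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}:", ", ".join(top))
```

Interpreting those word strings as themes is still the user's job, just as the passage says.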
-
“distant reading”
At first I thought distant reading included the concept of almost reading between the lines, or having a view from afar, like reading something and coming away with a completely different understanding. I'm interested to better understand how to analyze the larger amounts of data pulled from distant reading.
-
the word “Indians” nearly fell out of all addresses after Teddy Roosevelt
I was curious as to why the word "Indians" was nearly removed from the vocabulary. It turns out it was very similar to the assimilation efforts carried out in Canada: Teddy Roosevelt took land away from its rightful owners and wanted Native Americans to be otherwise known as "white men".
-
For my purposes, I primarily rely on two main software: Voyant Tools and Topic Modeling Tool. Both tools are good at analyzing large bodies of text in different ways.
Analyzing large bodies of text can indeed be useful in the digital humanities space. However, I think it limits us to text, when many works out there may not be text at all but rather audio. This video discusses how models in the music realm can predict recommendations based on more than lyrical content: they certainly use text analysis and mining tools, but they can also rely on factors like sample spectrum analysis to analyze large bodies of music and recommend tracks with similar frequencies and amplitudes in a given beat. Pretty cool stuff!
https://www.youtube.com/watch?v=PFAu93xxoGA&ab_channel=EuroPythonConference
-
mission to “explain ideas debated in culture with visual essays.” I have grown fond of the work of Matt Daniels, founder of the Pudding, over the last few years. His continued work producing hip hop visualizations shows me how I might incorporate DH concepts such as “distant reading” into my own projects.
This definitely has potential to become an influential tool in the music space. If large language/text models can analyze musical works and understand the lyrical patterns a given genre of music uses, this definitely could be incorporated into our daily lives or the current apps we use. For example, Apple Music has a feature which can recommend new music based on styles you have listened to in the past. I wonder if a "distant reading" style of algorithm is used in this context? Perhaps it's a combination of many different models in one tool? I'm sure Apple has the resources for this regardless!
-
-
-
At the same time, like most women with public personas on the Internet, Porpentine has also received her share of hostile feedback: emails and tweets wishing her dead, and at least one detractor who called the existence of Howling Dogs “a crime.”
People can get really invested in the gaming scene, and this is a common thing to see. When The Last of Us Part II came out, a lot of the voice cast received death threats from fans due to how the plot was designed in the sequel.
-
- Oct 2022
-
engl201.opened.ca
-
"e change we are experiencing is precisely that quantitative and qualitativeevidence are becoming easier to combine
Distant reading is not solely a product of the digital age and exists in many forms
-
We’ve become so used to ignorance at this scale, and so good at bluffing our way around it, that we tend to overestimate our actual knowledge.6
I really enjoy how this was written. I think a lot of people in this world overestimate what they actually know and the knowledge they actually have access to.
-
Much of this boils down to gatekeeping, and it is rarely informed by a clear understanding of the thing that is to be kept out.
I personally find it interesting that people are gatekeeping this type of advancement. I understand that people love to be in competition with one another, but this technology has great capabilities to help us look through lots of information quickly. Why should we gatekeep such a thing, big data or not?
-
we can learn a lot from computational social science
I am interested in seeing how far we can (if there truly is a limit) push the application of computation to the social sciences. Will we one day have such a deep understanding of society and culture that we can boil every interaction down to a single algorithm? Or are life and society too complex and random to be defined so precisely? Only time will tell.
-
that we tend to overestimate our actual knowledge
The idea that history is written by the victors also applies to the publishing world.
-
gatekeeping
I was wondering when this word was going to crop up in this DH course.
-
A theory of learning that emphasizes generalization has shown researchers how to train models that have thousands of variables without creating the false precision called “overfitting.”
I had to look up overfitting, and it wasn't quite what I thought. It happens when a model fits too closely to its training data. This can result from using too few data points, and it limits the model's ability to handle other data/situations. I thought this was interesting, as it reminded me of those videos about how algorithms can be biased; I had never heard the term overfitting before.
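A small illustration of the "false precision" the passage names (the numbers are invented): with only six points, a degree-five polynomial can pass through every training point exactly, yet stray far from the underlying trend as soon as it leaves them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 6)
y = 2 * x + rng.normal(0, 0.1, size=x.size)  # roughly linear data plus noise

line = np.polyfit(x, y, deg=1)    # simple model: generalizes
wiggle = np.polyfit(x, y, deg=5)  # one coefficient per point: fits "too closely"

x_new = 1.5  # a point outside the training range
print(np.polyval(line, x_new))    # near the true trend value of 3
print(np.polyval(wiggle, x_new))  # typically far off: the model memorized noise
```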
-
an increase in the sheer availability of data, mediated by the Internet and digital libraries.
With the availability of the internet and modern technology, distant reading has become, or can become, much easier to do and more widely used to study intellectual history.
-
Social scientists can now connect structured social evidence to loosely structured texts or images or sounds, and they’re discovering that this connection opens up fascinating questions.
It will be interesting to see what the near future brings, given how fast technology is advancing.
-
We have spent too much time on inward-looking debates that pit distant against close reading, and not enough time understanding connections to other disciplines.
Of course, with innovation comes backlash; it is within human nature to resist change.
E.g., technology such as the newest phones, and older generations not wanting to learn, or not knowing how to understand them.
We tend to pit every new idea against an assortment of ways/methods/things we already know rather than exploring it for what it was meant to do. So seeing that there was an inward debate on the subject wasn't much of a surprise but rather a given! Knowing of critics such as Stephen Marche and Stanley Fish, it is easy to see why it is the way it is.
I do, however, wonder which other disciplines we could better connect to, and whether it would be a better use of our time just understanding distant reading at its surface level or to keep comparing it with others too. (I would say comparing it to others may help in the overall scheme of things.)
-
But many useful methods are supervised, comparatively straightforward, and have been in social-science courses for decades
This is an interesting point that reminded me of Occam's razor. It is normal to overcomplicate things when sometimes it is the simplest answer or method that is correct/best. Just because something is simple does not mean it is not a good method to use.
-
-
www.digitalhumanities.org
-
If we take a long view of disciplinary history, recent research on large digital libraries is just one expression of a much broader trend, beginning around the middle of the twentieth century, that has tended to reinstate the original historical ambitions of literary scholarship.
Historical records are exploding in the time of digital humanities, making it far easier to explore trends throughout history and better keep track of our history as a society.
-
distant reading is presented as a recent change of course
Seems to me distant reading is the computer form of having grad students do a bunch of grunt work. I think, as the article goes on to argue, distant reading has its own computational history, and the humanities also have their own history of big data analysis.
-
Radway’s quantitative methods may at first seem remote from familiar examples of distant reading. She doesn’t discuss algorithms. Instead she uses numbers simply to count and compare — in order to ask, for instance, which elements of a romance novel are most valued by readers.
This method resonates a lot with me, as numbers are generally something that stick with me. Understanding the importance of these numbers and knowing what the results mean allows for a better interpretation of the population based on a given sample size. Although I would partially disagree, as I think algorithms give great insight into what is or may need to be understood, as well as help automate otherwise very strenuous and labor-intensive duties. (Radway's method seems to run fine without automation, but a larger sample size may warrant it.)
-
As long as readers remember that many ingredients of this history have longer backstories elsewhere, no one will be misled.
I think this is an important point that is often overlooked. There are many things that are observed without actually looking at their background or history. It is normal to judge based on what you know, but oftentimes we do not know the full background of what we are judging. Everything has a backstory, but "how it came to be" is often neglected.
-
Computer-aided literature studies have failed to have a significant impact on the field as a whole
I think the paragraph goes on to essentially say that the questions being asked were not well suited to this tool. There's a connection to earlier, when they talk about the fact that computational methods could be introduced to literary studies with the view of filling a gap in knowledge. So it makes sense that initially these computer-based methods weren't seen as effective, because scholars didn't know the right questions to ask.
-
Readers of Moretti’s early experiments on large collections were accordingly tempted to interpret them as a normative argument that the only valid sample of literature is the largest possible one
I think this is something people misinterpret about big data, as this article discusses at the beginning. It's the classic quantity vs. quality argument. Again, regarding Paul Schacht's research: he talks about analyzing Dickens's writings, and a big data method let him do that, but analyzing more and more novels beyond the ones Dickens wrote wouldn't have made his research better.
-
criticism would gain nothing if we let meticulous hypothesis-testing drain all the warmth and flexibility from our writing
It's intriguing; I think this paragraph has summed up digital humanities in a way all those articles from week 1 weren't able to do for me. DH seems to be a constant balancing act (and then analysis) between the digitization of research and presentation and the flexible way humans have of thinking. Maybe this has been really obvious to everyone else, but the topics we have covered in the last few weeks (digitizing physical literature and now big data research) have really highlighted that aspect of DH.
-
That is the point of using a clearly-defined sample of readers and novels
I have to admit, probably because I come from a science background, that I like this way of doing research, using numbers and data wherever possible. I don't have fond memories of trying to analyze books in high school for metaphors or other literary tools.
-
Literary scholars have been much slower to imitate her methods, which depended on questionnaires, interviews, and numbers.
It's really interesting to read about the kinds of questions being asked by these scholars. They are all quite creative questions, and it makes me think about what I may want to do for my final project.
-
I want to emphasize that distant reading is not a new trend, defined by digital technology or by contemporary obsession with the word data
I think the Paul Schacht video for this week will have a lot of similar opinions to this article, just based on this first bit of reading. As he says in the video, the distant reading analysis helps him answer his question, but he still has to use his intuition to come up with the question and to interpret the results. The term Big Data can make it seem like a human element is lost, but I think that is very much not the case.
-
There is nothing wrong with writing a history of food in America
This analogy, although it seems kinda silly, actually helped me understand this bit.
-
-
-
Twine games look and feel profoundly different from other games,
Is a video game with no video still a video game? How are digital humanities defined and who gets to define them?
-
-
uta.pressbooks.pub
-
I have often returned to the Pudding for inspiration
These readings with links embedded within them provide great distractions :)
-
This visualization is an example of distant reading
Is Google also using distant reading when you look up a word and Google shows a little graph of its usage over time? Edit: oops, the paragraph below answers this question.
-
dozen speeches by Martin Luther King, Jr. Voyant Tools would make it possible for us to instantly tabulate the number of words in the speeches as well as word types,
I wonder if certain key historical figures who had a massive societal impact used specific strings of words, connections between words, and specific language that we can correlate with one another. If such connections exist, the applications could be endless in our society. Imagine having the cheat sheet to grow a social movement of any sort. This idea both excites and terrifies me at the same time.
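In miniature, the tabulation the quoted passage describes: total words ("tokens") versus distinct words ("types") for each speech. The file name and text below are placeholders, not the actual King corpus.

```python
import re

def tokens_and_types(text):
    # Split on letters/apostrophes; count every word, then the distinct ones.
    words = re.findall(r"[a-z']+", text.lower())
    return len(words), len(set(words))

speeches = {"speech_1.txt": "free at last free at last thank god almighty"}
for name, text in speeches.items():
    total, distinct = tokens_and_types(text)
    print(f"{name}: {total} tokens, {distinct} types")
```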
-
novels, speeches, song lyrics, poems, newspaper articles, movie and television scripts, and courtroom proceedings are increasingly available online to the public
It is increasingly more common to find these mediums of societal information available only in the digital space. To me, this means that there is not only an opportunity but rather a necessity for distant reading for those interested in pursuing the humanities of our own day and age as well as into the future of our society.
-
Whereas close reading relies on analysis about the apparent inner workings of a single literary text, distant reading takes account of hundreds and even thousands of compositions.
On more of a personal note, I find that the information to be garnered from distant reading can be much more interesting and insightful to more impactful subjects overall. One could read a singular novel and infer a few details about the author and what they are trying to convey or analyze several thousand novels from a specific time or culture and infer interesting notions such as the values of the time/culture and how they preferred to communicate.
-
Figure 1.4.2 This digital essay compares the number of unique words used by some of the most famous artists in hip hop by using each artist’s first 35,000 lyrics, so prolific artists, such as Jay-Z, can be compared to newer artists, such as Drake.
Seeing the comparison of both older and newer artists is amazing, but the red text highlighting that some rap artists actually surpassed Shakespeare in unique words used in their first 35,000 lyrics was astonishing! A big one that stuck with me was how Nicki Minaj had more unique words than Drake, along with an overall trend of older artists using unique words more frequently than newer artists do now (of course with some outliers here and there). I really enjoyed seeing this example and how diverse the scope of Voyant's uses can be!
-
refers to the processes by which computers detect information from a large body of compositions. Data mining typically concentrates on all kinds of elements in datasets, whereas text mining usually concentrates on the content, such as the words in novels or speeches.
Text mining seems to have a large role in today's society, from risk management (such as financial/insurance) to cybercrime prevention to better and more enhanced customer service, with services such as chatbots capturing data and extrapolating from it to give results relevant to one's problems and to speed up the process. Looking forward to using Voyant Tools!
-
To discover the themes, a user could create a separate document of each of the duo’s albums, upload the corpus to Topic Modeling Tool, and interpret the string of words that the tool finds to be most prominent.
This could also be a way for artists to strategize the use of specific words in songs to attract a larger audience. They could look at the similarities between top hits, find certain words that were used in all of them, and then include those words when advertising the music. For example, they could use one as a tag when posting on Instagram or Twitter, and it may attract more attention.
-
Some words remained consistent, some words fell out of use, and others grew in use over time.
Looking at the frequency with which words were used may be a good indicator of how history shaped literature. Certain words that were used more frequently may show what most of the population was feeling at that time, or even what was going on. Words that fell out of use could also mark the end of a certain period.
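A sketch of that kind of tracking (the dated addresses below are invented stand-ins): count how often a word appears in each document, normalize by length, and watch the rate change over time.

```python
from collections import Counter

addresses = {
    1900: "the nation must grow and the indians must yield",
    1950: "god bless this nation and its people",
    2000: "god bless the women and men of this nation",
}

word = "god"
for year, text in sorted(addresses.items()):
    counts = Counter(text.lower().split())
    rate = counts[word] / sum(counts.values())  # frequency per word of text
    print(year, round(rate, 3))
```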
-