34 Matching Annotations
  1. Jul 2019
    1. There are two main command-line interfaces, or ‘shells,’ that many digital historians use. On OS X or many Linux installations, the shell is known as bash, or the ‘Bourne-again shell.’
  2. www.themacroscope.org
    1. This is a generative approach: big data for the humanities is not only about justifying a story about the past, but generating new stories, new perspectives, given our new vantage points and tools.
    2. an approach to big data for the historian (we argue) needs to be a public approach
    3. This volume represents our view of what some of the most useful of these developing approaches are, how to use them, what to be wary of, and the kinds of questions and new perspectives our macroscope opens up.
    1. A macroscope is a bit like a microscope or a telescope, but instead of allowing you to see things that are small or far away, the macroscope makes it easier to grasp the incredibly large. It does so through a process of compression
    1. A key rule to remember is that there is no ‘right’ or ‘wrong’ way to do these forms of analysis: they are tools and for most historians, the real lifting will come once you have the results. Yet we do need to realize that these tools shape our research: they can occasionally occlude context, or mislead us. These questions are at the forefront of this chapter.
    1. While this is fraught with issues – words change meaning over time, different terms are used to describe similar concepts, and we still face the issues outlined above – we can arguably still learn something from this.
    2. It also represents the inversion of the traditional historical process: rather than looking at documents that we think may be important to our project and pre-existing thesis, we are looking at documents more generally to see what they might be about. With Big Data, it is sometimes important to let the sources speak to you, rather than looking at them with pre-conceptions of what you might find.
  3. www.themacroscope.org
    1. AntConc is an invaluable way to carry out some forms of textual analysis on data sets. While it does not scale to the largest datasets terribly well, if you have somewhere in the ballpark of 500 or even 1,000 newspaper-length articles you should be able to crunch data and receive tangible results.
    1. Along with these benefits, however, digitisation has wrought concomitant problems. In a consumer-driven academic environment, funding for digitisation may be tied not only to concerns about access and preservation, but also to the need to increase visibility to ensure viability
    2. I offer three questions researchers should consider before consulting materials in a digital archive. Have the individuals whose work appears in these materials consented to this? Whose labour was used and how is it acknowledged? What absences must be attended to among an abundance of materials? Finally, I suggest that researchers should draw on the existing body of scholarship about these issues by librarians and archivists.
    1. Topic modeling allows us to quantify and visualize this pattern, a pattern not immediately visible to a human reader.
    2. The question remains, how does a reader (computer or human) recognize and conceptualize the recurrent themes that run through nearly 10,000 entries?
    3. Short, content-driven entries that usually touch upon a limited number of topics appear to produce remarkably cohesive and accurate topics. In some cases (especially in the case of the EMOTION topic), MALLET did a better job of grouping words than a human reader.
    1. We cannot rely only on the computer-driven groups to use in analyzing texts.  The next step is to look at the texts that contain repeating word patterns and conduct a close reading to see what we can learn about the topic. Plotting the topic over time enables us to locate trends in how important the topic was to the author, or when we compare them with other authors, we can investigate differences in the ways that two authors valued these topics or the different ways that they expressed themselves.
    1. The ‘model’ in a topic model is the idea of how texts get written: authors compose texts by selecting words from a distribution of words (or ‘bag of words’ or ‘bucket of words’) that describe various thematic topics.
    1. Back to my spreadsheet I went where I discovered that variations in the title of the essay and in the author credits obfuscated the connections, as well as those for the other two essays by black women, Frances M. Beal and Maryanne Weathers.

      Shows the importance of double-checking data and the hazards of working with others' data.

    2. Altmetrics

      Non-traditional bibliometrics, seen as alternatives or complements to more traditional citation metrics. Can include people, journals, books, data sets, presentations, videos, source code repositories, web pages, etc.

    1. Notice how in the chart in figure 5.3, it can easily be noticed that whoever entered the data on book publication dates accidentally typed “1909” rather than “1990” for one of the books.

      Like last week when "Metadata" was the most frequent word in the Colored Conference documents.

    2. Visualization is a method of deforming, compressing, or otherwise manipulating data in order to see it in new and enlightening ways.

      definition

    1. Rectangles are sized proportionally to the amount of money received per category in 2013, and coloured by the percentage that amount had changed since the previous fiscal year.

      It is a visually appealing map but I don't know what half the funds are used for due to the blocks being so small.

    1. The most important aspect of choosing an appropriate graphic variable is to know the nature of your data variables.
    1. One popular service is colorbrewer (http://colorbrewer2.org/), which allows you to create a color scheme that fits whatever set of parameters you may need.
    1. Adobe Photoshop and Illustrator, as well as the free Inkscape and Gimp, are all good tools for creating legends.

      Tools to use.

    1. Formal networks are mathematical instantiations of the idea that entities and connections between them exist in consort. They embody the idea that connectivity is key in understanding how the world works, both at an individual and a global scale.
    1. Historians will want to note when their networks are explicit / physically instantiated, and when they are implicit / derived. An explicit network could be created from letters between correspondents, or roads that physically exist between cities. A derived network might be that of the subjectively-defined similarity between museum artefacts or the bibliographic coupling network connecting articles together if they reference similar sources.

      Explicit & Natural vs. Implicit & Derived

    2. A directed edge is one that is part of an asymmetrical relationship, and an easy way of thinking about them is by imagining arrow tips on the edges.
    3. Transitivity is the concept that when A is connected to B and C, B and C will also be connected. Some networks, like those between friends, feature a high degree of transitivity; others do not.

      Transitivity definition.

    4. Assortativity, also called homophily, is the measure of how much like attracts like among dyads in a network. On the web, for instance, websites (nodes) tend to link (via edges) to topically similar sites. When dyads connect assortatively, a network is considered assortatively mixed. Networks can also experience disassortative mixing, for example when people from isolated communities with strong family ties seek dissimilarity in sexual partners.

      Assortative vs Disassortative mixing.

    5. Aggregate all of the data into one giant network representing the entire span of time, whether it is a day, a year, or a century. The network is static. Slowly build the network over time, creating snapshots that include the present moment and all of the past. Each successive snapshot includes more and more data, and represents each moment of time as an aggregate of everything that led up to it. The network continues to grow over time. Create a sliding window of time, e.g. a week or five years, and analyze successive snapshots over that period. Each snapshot then only contains data from that time window. The network changes drastically in form over time.

      Ways to implement networks.

    6. Edges

      Edges definition

    7. Nodes

      Node definition

    8. Entities are called nodes (figure 6.1) and the relationships between them are called edges (figure 6.2), and everything about a network pivots on these two building blocks.

      Nodes & Edges explanations

    9. Despite their name, networks are dichotomous, in that they only give you two sorts of things to work with: entities and relationships. Entities are called nodes (figure 6.1) and the relationships between them are called edges (figure 6.2), and everything about a network pivots on these two building blocks.
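
The network highlights above (nodes and edges, transitivity, and the three ways of handling time) can be tried out on a toy example. What follows is only a minimal sketch in plain Python, not code from the book: the `LETTERS` data, the correspondents' names, and every function name are hypothetical, and the directed letters (sender to recipient) are deliberately collapsed into undirected edges to keep the transitivity calculation simple.

```python
from itertools import combinations

# Hypothetical correspondence data: (sender, recipient, year).
# Each person is a node; each letter is a directed edge.
LETTERS = [
    ("Ada", "Ben", 1900),
    ("Ben", "Cy", 1900),
    ("Ada", "Cy", 1901),
    ("Cy", "Dee", 1902),
    ("Dee", "Eve", 1903),
]


def undirected_edges(letters):
    """Collapse directed letters into a set of undirected edges (self-loops dropped)."""
    return {frozenset((a, b)) for a, b, _ in letters if a != b}


def transitivity(edges):
    """Fraction of connected triples (B - A - C) in which B and C are also linked."""
    neighbours = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triples = closed = 0
    for nbrs in neighbours.values():
        for b, c in combinations(sorted(nbrs), 2):
            triples += 1
            if c in neighbours[b]:
                closed += 1
    return closed / triples if triples else 0.0


def aggregate(letters):
    """One static network covering the whole span of time."""
    return undirected_edges(letters)


def cumulative(letters, years):
    """Growing snapshots: each year includes everything up to and including it."""
    return {y: undirected_edges([l for l in letters if l[2] <= y]) for y in years}


def sliding(letters, years, width):
    """Sliding window: each snapshot holds only the last `width` years ending at y."""
    return {
        y: undirected_edges([l for l in letters if y - width < l[2] <= y])
        for y in years
    }


if __name__ == "__main__":
    years = range(1900, 1904)
    print("aggregate transitivity:", transitivity(aggregate(LETTERS)))
    for y, edges in cumulative(LETTERS, years).items():
        print("cumulative", y, len(edges), "edges")
    for y, edges in sliding(LETTERS, years, width=2).items():
        print("sliding", y, len(edges), "edges")
```

In this toy data the cumulative snapshots only ever gain edges, while the two-year sliding window drops older letters as it moves forward, which is the contrast the highlighted passage draws between a network that grows and one that changes form over time.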