  1. Dec 2018
    1. Data Cleaning A basic axiom of log analysis is that the raw data cannot be assumed to correctly and completely represent the data being recorded. Validation is really the point of data cleaning: to understand any errors that might have entered into the data and to transform the data in a way that preserves the meaning while removing noise. Although we discuss web log cleaning in this section, it is important to note that these principles apply more broadly to all kinds of log analysis; small datasets often have similar cleaning issues as massive collections. In this section, we discuss the issues and how they can be addressed. How can logs possibly go wrong? Logs suffer from a variety of data errors and distortions. The common sources of errors we have seen in practice include:

      Common sources of errors:

      • Missing events

      • Dropped data

      • Misplaced semantics (encoding log events differently)
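
      A minimal sketch of what this validation step can look like in practice (Python; the Apache-style combined-log format, field names, and rejection rules are illustrative assumptions, not something the chapter prescribes):

      ```python
      import re
      from datetime import datetime

      # Illustrative pattern for Apache combined-log lines (an assumption,
      # not a format the chapter prescribes).
      LINE_RE = re.compile(
          r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" '
          r'(?P<status>\d{3}) (?P<size>\S+)'
      )

      def clean_log(lines):
          """Split raw lines into well-formed events and rejects.

          Returning the rejects preserves the error signal: a spike in
          malformed lines is itself evidence of dropped or garbled data.
          """
          events, rejects = [], []
          for line in lines:
              m = LINE_RE.match(line)
              if m is None:
                  rejects.append(line)  # dropped or truncated record
                  continue
              try:
                  ts = datetime.strptime(m['ts'], '%d/%b/%Y:%H:%M:%S %z')
              except ValueError:
                  rejects.append(line)  # unparseable timestamp
                  continue
              events.append({'ip': m['ip'], 'time': ts,
                             'request': m['req'], 'status': int(m['status'])})
          return events, rejects
      ```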

    2. In addition, real world events, such as the death of a major sports figure or a political event can often cause people to interact with a site differently. Again, be vigilant in sanity checking (e.g., look for an unusual number of visitors) and exclude data until things are back to normal.

      Important consideration for temporal event RQs in refugee study -- whether external events influence use of natural disaster metaphors.

    3. Recording accurate and consistent time is often a challenge. Web log files record many different timestamps during a search interaction: the time the query was sent from the client, the time it was received by the server, the time results were returned from the server, and the time results were received on the client. Server data is more robust but includes unknown network latencies. In both cases the researcher needs to normalize times and synchronize times across multiple machines. It is common to divide the log data up into “days,” but what counts as a day? Is it all the data from midnight to midnight at some common time reference point or is it all the data from midnight to midnight in the user’s local time zone? Is it important to know if people behave differently in the morning than in the evening? Then local time is important. Is it important to know everything that is happening at a given time? Then all the records should be converted to a common time zone.

      Challenges of using time-based log data are similar to difficulties in the SBTF time study using Slack transcripts, social media, and Google Sheets
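
      To make the "common reference point vs. user's local time" choice concrete, a small sketch (Python; storing timestamps in UTC and knowing each user's offset are my assumptions, not the chapter's):

      ```python
      from datetime import datetime, timedelta, timezone

      def common_day(ts_utc: datetime) -> str:
          """Day bucket in one shared zone (UTC): use this to see
          everything that is happening at a given time."""
          return ts_utc.astimezone(timezone.utc).date().isoformat()

      def local_day(ts_utc: datetime, offset_hours: float) -> str:
          """Day bucket in the user's local zone: use this to compare
          morning behavior against evening behavior."""
          tz = timezone(timedelta(hours=offset_hours))
          return ts_utc.astimezone(tz).date().isoformat()

      # The same event can land on different "days" under the two rules.
      event = datetime(2018, 12, 1, 2, 30, tzinfo=timezone.utc)
      assert common_day(event) == '2018-12-01'
      assert local_day(event, -8) == '2018-11-30'  # e.g., a US Pacific user
      ```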

    4. Log Studies collect the most natural observations of people as they use systems in whatever ways they typically do, uninfluenced by experimenters or observers. As the amount of log data that can be collected increases, log studies include many different kinds of people, from all over the world, doing many different kinds of tasks. However, because of the way log data is gathered, much less is known about the people being observed, their intentions or goals, or the contexts in which the observed behaviors occur. Observational log studies allow researchers to form an abstract picture of behavior with an existing system, whereas experimental log studies enable comparisons of two or more systems.

      Benefits of log studies:

      • Complement other types of lab/field studies

      • Provide a portrait of uncensored behavior

      • Easy to capture at scale

      Disadvantages of log studies:

      • Lack of demographic data

      • Non-random sampling bias

      • Provide info on what people are doing but not their "motivations, success or satisfaction"

      • Can lack needed context (software version, what is displayed on screen, etc.)

      Ways to mitigate: see the "Collecting, Cleaning and Using Log Data" section

    5. Two common ways to partition log data are by time and by user. Partitioning by time is interesting because log data often contains significant temporal features, such as periodicities (including consistent daily, weekly, and yearly patterns) and spikes in behavior during important events. It is often possible to get an up-to-the-minute picture of how people are behaving with a system from log data by comparing past and current behavior.

      Bookmarked for time reference.

      Mentions challenges of accounting for time zones in log data.
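
      A sketch of both partitions (pandas is my choice of tool, not the chapter's; it assumes a DataFrame with illustrative `user_id` and timezone-aware `time` columns):

      ```python
      import pandas as pd

      def partition_logs(df: pd.DataFrame):
          """Partition log events by time and by user (column names assumed)."""
          # By time: daily counts expose periodicities and event-driven spikes.
          daily = df.set_index('time').resample('D').size()

          # By user: per-user activity volumes support cohort comparisons.
          per_user = df.groupby('user_id').size().sort_values(ascending=False)

          # Past vs. current behavior: compare the latest day against a
          # trailing weekly baseline (NaN until seven days of data exist).
          baseline = daily.rolling(7).mean()
          spike_ratio = daily.iloc[-1] / baseline.iloc[-1]
          return daily, per_user, spike_ratio
      ```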

    6. An important characteristic of log data is that it captures actual user behavior and not recalled behaviors or subjective impressions of interactions.

      Logs can be captured on client-side (operating systems, applications, or special purpose logging software/hardware) or on server-side (web search engines or e-commerce)

    7. Large-scale log data has enabled HCI researchers to observe how information diffuses through social networks in near real-time during crisis situations (Starbird & Palen, 2010), characterize how people revisit web pages over time (Adar, Teevan, & Dumais, 2008), and compare how different interfaces for supporting email organization influence initial uptake and sustained use (Dumais, Cutrell, Cadiz, Jancke, Sarin, & Robbins, 2003; Rodden & Leggett, 2010).

      Wide variety of uses of log data

    1. Ethnographic findings are not privileged, just particular: another country heard from. To regard them as anything more (or anything less) than that distorts both them and their implications, which are far profounder than mere primitivity, for social theory.

      This tension exists in HCI as well.

      Interpreted data vs empirical data and how each is systematically analyzed.

  2. Nov 2018
    1. One way to think about "core" biodiversity data is as a network of connected entities, such as taxa, taxonomic names, publications, people, species, sequences, images, collections, etc. (Fig. 1)
    1. “It’s about embracing the inscrutable nature of human interactions,” says Chang. Evidence-based medicine was a massive improvement over intuition-based medicine, he says, but it only covers traditionally quantifiable data, or those things that are easy to measure. But we’re now quantifying information that was considered qualitative a generation ago.

      Biggest challenges to redesigning the health care system in a way that would work better for patients and improve health

    2. “Our biggest opportunity is leaning into that. It’s either embracing the qualitative nature of that and designing systems that can act just on the qualitative nature of their experience, or figuring how to quantitate some of those qualitative measures,” says Chang. “That’ll get us much further, because the real value in health care systems is in the human interactions. My relationship with you as a doctor and a patient is far more valuable than the evidence that some trial suggests.”

      Biggest challenges to redesigning the health care system in a way that would work better for patients and improve health

    1. Unless you need to push the boundaries of what these technologies are capable of, you probably don’t need a highly specialized team of dedicated engineers to build solutions on top of them. If you manage to hire them, they will be bored. If they are bored, they will leave you for Google, Facebook, LinkedIn, Twitter, … – places where their expertise is actually needed. If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”. Messes tend to necessitate specialization.
    1. A KG typically spans across several domains and is built on top of a conceptual schema, or ontology, which defines what types of entities (classes) are allowed in the graph, alongside the types of properties they can have

      Wikidata differs from a typical KG as it is not built on top of classes (entity types). Any item (entity) can be connected by any property. Wikidata's only strict "classes" in the sense of KG classes are its data types (item, lemma, monolingual string...).
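
      A toy contrast of the two modeling styles (Python; every name and rule here is an illustrative assumption): a schema-constrained KG checks whether the subject's class may carry the property at all, while a Wikidata-style store only enforces the value's data type.

      ```python
      # property -> subject class allowed to carry it (illustrative ontology)
      ONTOLOGY = {'birth_date': 'Person', 'population': 'City'}

      def add_typed(graph, subject, subj_class, prop, value):
          """Classic KG: the ontology restricts properties by class."""
          if ONTOLOGY.get(prop) != subj_class:
              raise ValueError(f'{subj_class} may not carry {prop}')
          graph.append((subject, prop, value))

      def add_untyped(graph, subject, prop, value, datatype, expected):
          """Wikidata-style: any item may take any property; only the
          value's data type (item, string, time, ...) is checked."""
          if datatype != expected:
              raise ValueError(f'{prop} expects {expected}, got {datatype}')
          graph.append((subject, prop, value))
      ```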

    1. Does the widespread and routine collection of student data in ever new and potentially more-invasive forms risk normalizing and numbing students to the potential privacy and security risks?

      What happens if we turn this around - given a widespread and routine data collection culture which normalizes and numbs students to risk as early as K-8, what are our responsibilities (and strategies) to educate around this culture? And how do our institutional practices relate to that educational mission?

  3. Oct 2018
    1. As a recap, Chegg discovered on September 19th a data breach dating back to April, in which "an unauthorized party" accessed a database containing "a Chegg user’s name, email address, shipping address, Chegg username, and hashed Chegg password" but no financial information or social security numbers. The company has not disclosed, or is unsure of, how many of the 40 million users had their personal information stolen.

  4. Sep 2018
    1. End-Users

      Because Grafoscopio was used in critical digital literacy workshops dealing with data activism and journalism, the intended users are people who don't necessarily know how to program, but are not afraid of learning to code to express their concerns (as activists, journalists, and citizens in general) and in fact are willing to do so.

      Tool adaptation was a "natural" part of the workshops, because the idea was to extend the tool so it could deal with the authentic problems at hand (as reported extensively in the PhD thesis), and the digital citizenship curriculum was built during the events as a memory of how we dealt with those problems. But critical digital literacy is a long process, so developing coding as a non-programmer's skill, in service of wider populations able to express citizen concerns in code, data, and visualizations, is a long-term undertaking.

      The visibility, scalability, and sustainability of such critical digital literacy endeavors, where communities and digital tools mutually change each other, remain an open problem, even more so considering their location in the Global South (despite addressing contextualized global problems).

  5. Aug 2018