1,080 Matching Annotations
  1. Last 7 days
    1. “But then again,” a person who used information in this way might say, “it’s not like I would be deliberately discriminating against anyone. It’s just an unfortunate proxy variable for lack of privilege and proximity to state violence.”

      Twitter makes a number of predictions about users that could also be used as proxy variables for economic and cultural characteristics. It can display things like your audience's net worth as well as indicators commonly linked to political orientation. Triangulating some of this data could allow for other forms of intended or unintended discrimination.

      I've already been able to view a wide range of (possibly spurious) information about my own reading audience through these analytics. On September 9th, 2019, I started a Twitter account for my 19th Century Open Pedagogy project and began serializing installments of a critical edition, The Woman in White: Grangerized. The @OPP19c Twitter account has 62 followers as of September 17th.

      Having followers means I have access to an audience analytics toolbar. Some of the account's followers are nineteenth-century studies or pedagogy organizations rather than individuals. Twitter tracks each account as an individual, however, and I was surprised to see some of the demographics Twitter broke them down into. (If you're one of these followers: thank you and sorry. I find this data uncomfortable to look at.)

      Within this dashboard, I have a "Consumer Buying Styles" display that identifies categories such as "quick and easy," "ethnic explorers," "value conscious," and "weight conscious." These categories strike me as equal parts confusing and problematic.

      I have a "Marital Status" toolbar alleging that 52% of my audience is married and 49% single.

      I also have a "Home Ownership" chart. (I'm presuming that the Elizabeth Gaskell House Museum's Twitter is counted as an owner...)

      ....and more

    1. More conspicuously, since Trump’s election, the RNC — at his campaign’s direction — has excluded critical “voter scores” on the president from the analytics it routinely provides to GOP candidates and committees nationwide, with the aim of electing down-ballot Republicans. Republican consultants say the Trump information is being withheld for two reasons: to discourage candidates from distancing themselves from the president, and to avoid embarrassing him with poor results that might leak. But they say its concealment harms other Republicans, forcing them to campaign without it or pay to get the information elsewhere.
  2. Sep 2019
    1. Methodology To determine the link between heat and income in U.S. cities, NPR used NASA satellite imagery and U.S. Census American Community Survey data. An open-source computer program developed by NPR downloaded median household income data for census tracts in the 100 most populated American cities, as well as geographic boundaries for census tracts. NPR combined these data with TIGER/Line shapefiles of the cities.

      This is an excellent example of data journalism.
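      The core of such an analysis is a join of two public datasets keyed by census-tract GEOID: ACS income figures and TIGER/Line tract geometries. A minimal sketch in plain Python — field names and values are illustrative, not NPR's actual pipeline:

```python
# Sketch of the core join in this kind of analysis: median household income per
# census tract merged with tract geometry by GEOID. All values are made up.
income_by_tract = {       # from the ACS: tract GEOID -> median household income
    "13121003500": 28400,
    "13121009601": 94750,
}
tract_shapes = {          # from TIGER/Line shapefiles: GEOID -> geometry (stub)
    "13121003500": "POLYGON(...)",
    "13121009601": "POLYGON(...)",
}

# Keep only tracts that appear in both datasets.
merged = {
    geoid: {"income": income_by_tract[geoid], "geometry": shape}
    for geoid, shape in tract_shapes.items()
    if geoid in income_by_tract
}
print(len(merged))  # number of tracts with both income and geometry
```

      In a real pipeline the geometry stubs would be parsed shapes (e.g., via a GIS library) and the merged records would then be overlaid with the satellite heat data.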

    1. On the other hand, a resource may be generic in that as a concept it is well specified but not so specifically specified that it can only be represented by a single bit stream. In this case, other URIs may exist which identify a resource more specifically. These other URIs identify resources too, and there is a relationship of genericity between the generic and the relatively specific resource.

      I was not aware of this page when the Web Annotations WG was working through its specifications. The term "Specific Resource" used in the Web Annotations Data Model Specification always seemed adequate, but now I see that it was actually quite a good fit.

  3. Aug 2019
    1. Material Design guidelines, Data visualization: Data visualization depicts information in graphical form.

      Contents: Principles, Types, Selecting charts, Style, Behavior, Dashboards.

      Principles. Data visualization is a form of communication that portrays dense and complex information in graphical form. The resulting visuals are designed to make it easy to compare data and use it to tell a story – both of which can help users in decision making. Data visualization can express data of varying types and sizes: from a few data points to large multivariate datasets. Accurate: prioritize data accuracy, clarity, and integrity, presenting information in a way that doesn’t distort it. Helpful: help users navigate data with context and affordances that emphasize exploration and comparison. Scalable: adapt visualizations for different device sizes, while anticipating user needs on data depth, complexity, and modality.

      Types. Data visualization can be expressed in different forms. Charts are a common way of expressing data, as they depict different data varieties and allow data comparison. The type of chart you use depends primarily on two things: the data you want to communicate, and what you want to convey about that data. These guidelines describe various types of charts and their use cases.

      Change over time charts show data over a period of time, such as trends or comparisons across multiple categories. Use cases include stock price performance, health statistics, and chronologies. They include line charts, bar charts, stacked bar charts, candlestick charts, area charts, timelines, horizon charts, and waterfall charts.

      Category comparison charts compare data between multiple distinct categories. Use cases include income across different countries, popular venue times, and team allocations. They include bar charts, grouped bar charts, bubble charts, multi-line charts, parallel coordinate charts, and bullet charts.

      Ranking charts show an item’s position in an ordered list. Use cases include election results and performance statistics. They include ordered bar charts, ordered column charts, and parallel coordinate charts.

      Part-to-whole charts show how partial elements add up to a total. Use cases include consolidated revenue of product categories and budgets. They include stacked bar charts, pie charts, donut charts, stacked area charts, treemap charts, and sunburst charts.

      Correlation charts show correlation between two or more variables. Use cases include income and life expectancy. They include scatterplot charts, bubble charts, column and line charts, and heatmap charts.

      Distribution charts show how often each value occurs in a dataset. Use cases include population distribution and income distribution. They include histogram charts, box plot charts, violin charts, and density charts.

      Flow charts show movement of data between multiple states. Use cases include fund transfers and vote counts in elections. They include Sankey charts, Gantt charts, chord charts, and network charts.

      Relationship charts show how multiple items relate to one another. Use cases include social networks and word charts. They include network charts, Venn diagrams, chord charts, and sunburst charts.

      Selecting charts. Multiple types of charts can be suitable for depicting data; the guidelines below provide insight into how to choose one chart over another. Change over time can be expressed using a time series chart, which represents data points in chronological order; line charts, bar charts, and area charts all express change over time. A line chart expresses minor variations in data, takes any baseline value, works with any number of time series (well with 8 or more), and suits continuous data. A bar chart expresses larger variations in data, how individual data points relate to a whole, comparisons, and ranking; it uses a zero baseline, 4 or fewer time series, and discrete or categorical data. An area chart summarizes relationships between datasets and how individual data points relate to a whole; it uses a zero baseline (when there’s more than one series), 8 or fewer time series, and continuous data. (The baseline value is the starting value on the y-axis.)

      Bar and pie charts can both show proportion, which expresses a partial value in comparison to a total value. Bar charts express quantities through a bar’s length, using a common baseline; pie charts express portions of a whole, using arcs or angles within a circle. Bar charts, line charts, and stacked area charts are more effective at showing change over time than pie charts: because all three share the same baseline of possible values, it’s easier to compare value differences based on bar length. Do: use bar charts to show changes over time or differences between categories. Don’t: use multiple pie charts to show changes over time – it’s difficult to compare the difference in size across each slice of the pie.

      Area charts come in several varieties, including stacked area charts, which show multiple time series (over the same time period) stacked on top of one another, and overlapped area charts, which show multiple time series overlapping one another. Overlapping area charts are not recommended with more than two time series, as doing so can obscure the data. Instead, use a stacked area chart to compare multiple values over a time interval (with time represented on the horizontal axis). Do: use a stacked area chart to represent multiple time series and maintain legibility. Don’t: use overlapped area charts, as they obscure data values and reduce readability.

      Style. Data visualizations use custom styles and shapes to make data easier to understand at a glance, in ways that suit the user’s needs and context. Charts can benefit from customizing graphical elements, typography, iconography, axes and labels, and legends and annotations. Visual encoding is the process of translating data into visual form. Unique graphical attributes can be applied to both quantitative data (such as temperature, price, or speed) and qualitative data (such as categories, flavors, or expressions). These attributes include shape, color, size, area, volume, length, angle, position, direction, and density. Multiple visual treatments can be applied to more than one aspect of a data point: for example, a bar color can represent a category, while a bar’s length can express a value (like population size).

      Shape can be used to represent qualitative data: in one example chart, each category is represented by a specific shape (circles, squares, and triangles), which makes it easy to compare data both within a specific range and against other categories. Charts can use shapes to display data in a range of ways; a shape can be styled as playful and curvilinear, or precise and high-fidelity, among other ways in between. Charts can represent data at varying levels of precision: data intended for close exploration should be represented by shapes that are suitable for interaction (in terms of touch target size and related
  4. Jul 2019
    1. Every time your child opens the email, that person knows generally where they are (or specifically, if they have other info to triangulate against).
    1. In contrast to such pseudonymous social networking, Facebook is notable for its longstanding emphasis on real identities and social connections.

      Lack of anonymity also increases Facebook's ability to properly link shadow profiles purchased from other data brokers.

    1. our sum of squares is 41.1879

      This considers only the Y, not the X, calculating the residuals from the average (mean) of Y.
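      The quantity being described — the total sum of squares of Y about its own mean, ignoring X — is easy to compute directly. A minimal sketch with made-up values (the 41.1879 figure comes from the source's own dataset, which isn't reproduced here):

```python
# Total sum of squares: squared residuals of each y value from the mean of y,
# ignoring x entirely. The values below are illustrative only.
def total_sum_of_squares(ys):
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)

ys = [2.0, 4.0, 5.0, 4.0, 5.0]
print(total_sum_of_squares(ys))  # mean is 4.0, so residuals are -2, 0, 1, 0, 1 -> 6.0
```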

    1. in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance

      Use standardization, not min-max scaling, for clustering and PCA.

    2. As a rule of thumb I’d say: When in doubt, just standardize the data, it shouldn’t hurt.
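      That rule of thumb is a one-liner in practice; a minimal z-score standardization sketch in NumPy, assuming column-wise features:

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Z = standardize(X)
# Each column of Z now has mean 0 and unit variance, so distance-based
# methods (clustering) and variance-based methods (PCA) treat them equally.
```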
    1. driven by data—where schools use data to identify a problem, select a strategy to address the problem, set a target for improvement, and iterate to make the approach more effective and improve student achievement.

      Gates data model.

    1. many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
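      The domination effect described here can be seen directly in a distance computation, which is what kernels like the RBF are built on. A small illustration with made-up numbers:

```python
import numpy as np

# Two points: feature 1 differs by 1 unit, while feature 2 (measured on a
# raw scale ~1000x larger) differs by 1000 raw units.
a = np.array([0.0, 0.0])
b = np.array([1.0, 1000.0])

# The squared Euclidean distance is almost entirely feature 2's contribution,
# so feature 1 is effectively invisible to the objective function.
d2 = np.sum((a - b) ** 2)       # 1 + 1_000_000
share = (1000.0 ** 2) / d2      # fraction contributed by feature 2
print(share)                    # ~0.999999
```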
  5. Jun 2019
  6. varsellcm.r-forge.r-project.org
    1. missing values are managed, without any pre-processing, by the model used to cluster with the assumption that values are missing completely at random.

      VarSelLCM package

    1. Success in a data science project comes not from access to any one exotic tool, but from having quantifiable goals, good methodology, cross-discipline interactions, and a repeatable workflow.



    1. Academics are also at fault here: a recent analysis of 29 million papers in over 15,000 peer-reviewed titles published around the time of the Zika and Ebola epidemics found that less than 1% explored the gendered impact of the outbreaks

      How do we prevent this pattern here at Georgia Tech? There is a very obvious gender gap, especially in STEM fields like medicine and engineering where biased data are collected. What are some small steps we can take to encourage people from different backgrounds to pursue data work? Education is always first, starting with classes similar to this one that inform people about how gender plays a role. Perhaps then we can create projects exploring gender issues in the data related to each person's major.



    1. However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0–1 scale.

      Use min-max scaling for image processing & neural networks.
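      Min-max scaling itself is simple to sketch. The helper below (an illustrative name, not a library function) rescales into an arbitrary [lo, hi] range, so the same code covers both the 0–1 case for neural networks and 0–255 for pixel intensities:

```python
import numpy as np

def min_max_scale(x, lo=0.0, hi=1.0):
    """Rescale values linearly so the minimum maps to lo and the maximum to hi."""
    x = np.asarray(x, dtype=float)
    x01 = (x - x.min()) / (x.max() - x.min())
    return lo + x01 * (hi - lo)

pixels = np.array([10, 20, 30, 40])
scaled01 = min_max_scale(pixels)          # rescaled to [0, 1]
scaled255 = min_max_scale(pixels, 0, 255) # rescaled to the RGB range
```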

    2. The result of standardization (or Z-score normalization) is that the features will be rescaled so that they’ll have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (average) and σ is the standard deviation from the mean
  7. May 2019

      This is an interesting fact; usually when I think of visualization and data I go to the classic default charts. I'll have to keep this in mind.


      I really like this because I don't see it often and it actually does draw my eye to the data and capture my interest.

    1. Virtually all BPMs have utilities for creating simple, data-gathering forms. And in many types of workflows, these simple forms may be adequate. However, in any workflow that includes complex document assembly (such as loan origination workflows), BPM forms are not likely to get the job done. Automating the assembly of complex documents requires ultra-sophisticated data-gathering forms, which can only be designed and created after the documents themselves have been automated. Put another way, you won't know which questions need to be asked to generate the document(s) until you've merged variables and business logic into the documents themselves. The variables you merge into the document serve as question fields in the data gathering forms. And here's the key point - since you have to use the document assembly platform to create interviews that are sophisticated enough to gather data for your complex documents, you might as well use the document assembly platform to generate all data-gathering forms in all of your workflows.
    1. The pace of community network design and installation activities in rural districts of the municipality of Fusagasugá has been boosted by the internal research calls of the Universidad de Cundinamarca, which throughout the life of Red FusaLibre have been a financial muscle that lets them accelerate the proc

      Interesting link between community and university. In our case we have not achieved a permanent link; although some money from university research calls and international calls helped pay for part of the Data Weeks, along with a smaller contribution from some attendees, it has generally been a project financed with our own resources and family loans.

    1. Developing economies’ copper demand has steadily grown over the last decades, fueling economic and social improvement. By 2011, China already represented 40% of the demand.

      Why does China need so much?

    2. Codelco is a state-owned Chilean mining company and the world’s largest copper producer. Based on their annual report and USGS statistics, they produced ~10% of the world’s copper in 2015 and own 8% of global reserves. They are also a large producer of greenhouse gas emissions. Last year, Codelco produced 3.2 t CO2e per million tmf from both indirect and direct effects, and in 2011 it consumed 12% of the total national electricity supply.

      Goddamn, they should start recycling

    1. Methodology: The classic OSINT methodology you will find everywhere is straightforward: 1. Define requirements: what are you looking for? 2. Retrieve data. 3. Analyze the information gathered. 4. Pivoting & reporting: either define new requirements by pivoting on the data just gathered, or end the investigation and write the report.

      Etienne's blog! Amazing resource for OSINT; particularly focused on technical attacks.

  8. Apr 2019
    1. Powered by Data wrote 4 of the resources on this page. "Measuring Outcomes" is about admin data. "Understanding the Philanthropic Landscape" is about open data – specifically open grants data. "Effective Giving" is an intro. And "Emerging Data Practices" is a tech backgrounder from June 2015.

    1. Instead of encouraging more “data-sharing”, the focus should be the cultivation of “data infrastructure”,¹⁴ maintained for the public good by institutions with clear responsibilities and lines of accountability.

  9. Mar 2019
  10. www.archivogeneral.gov.co
    1. Normalization of descriptive entries: People, Places, Institutions (using Linked Open Data (LOD) whenever possible).

      Which knowledge organization system makes this possible for them? What are they using to link the data, and in what format?

    1. The government needs to place tough restrictions on data collection and storage by businesses to limit the amount of damage in the event of a cyber breach.

      I find it hard to imagine how this could be usefully implemented. How is monitoring of data collection going to be done?

      Even simpler ideas, like the Do Not Call registry, have difficulty clamping down on businesses that breach regulations.

    1. Mithering about the unmodellable. "Sometime late last year I went to the Euro IA conference with Anya and Silver to give a talk on the domain modelling work we've been doing in UK Parliament."

    1. DXtera Institute is a non-profit, collaborative member-based consortium dedicated to transforming student and institutional outcomes in higher education. We specialize in helping higher education professionals drive more efficient access to information and insights for effective decision-making and realize long-term cost savings, by simplifying and removing barriers to systems integration and improving data aggregation and control.

      With partners across the U.S. and Europe, our consortium includes some of the brightest minds in education and technology, all working together to solve critical higher education issues on a global scale.

    1. Data journalism produced by two of the nation’s most prestigious news organizations — The New York Times and The Washington Post — has lacked transparency, often failing to explain the methods journalists or others used to collect or analyze the data on which the articles were based, a new study finds. In addition, the news outlets usually did not provide the public with access to that data

      While this is a worthwhile topic, I would like to see more exploration of data journalism in the 99.99999 percent of news organizations that are NOT the New York Times or the Washington Post and don't have the resources to publish so many data stories, despite the desperate need for them across the nation. Also, why were no digital news outlets included?

    2. Worse yet, it wouldn’t surprise me if we saw more unethical people publish data as a strategic communication tool, because they know people tend to believe numbers more than personal stories. That’s why it’s so important to have that training on information literacy and methodology.”

      Like the way unethical people use statistics in general? This should be a concern, especially as government data, long considered the gold standard of data, undergoes attacks that would skew the data toward political ends (see the 2020 census).

    3. fall short of the ideal of data journalism

      Is this the ideal of data journalism? Where is this ideal spelled out, and is there any sign that the NYT and WaPo have agreed to abide by this ideal?

  11. Feb 2019
    1. set; if this is higher, the tree can be considered to fit the data less well

      To test the fit between data and more than one alternative tree, you can just do a bootstrap analysis, and map the results on a neighbour-net splits graph based on the same data.

      Note that the phangorn library includes functions to transfer information between trees/tree samples and trees and networks: Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution, DOI: 10.1111/2041-210X.12760 (http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12760/full) – the basic functions and script templates are provided in the associated vignette.

    1. For example, the idea of “data ownership” is often championed as a solution. But what is the point of owning data that should not exist in the first place? All that does is further institutionalise and legitimate data capture. It’s like negotiating how many hours a day a seven-year-old should be allowed to work, rather than contesting the fundamental legitimacy of child labour. Data ownership also fails to reckon with the realities of behavioural surplus. Surveillance capitalists extract predictive value from the exclamation points in your post, not merely the content of what you write, or from how you walk and not merely where you walk. Users might get “ownership” of the data that they give to surveillance capitalists in the first place, but they will not get ownership of the surplus or the predictions gleaned from it – not without new legal concepts built on an understanding of these operations.
    1. These models are emerging, which is why it's exciting to be involved on the ground floor of this sector. However, some models clearly make sense already, and that's largely because they closely follow the models free software itself has shaped. If you want status, you can make a name for yourself by leading a team to write the docs, à la free software itself; if you want money, you can build the reputation of the documentation team and contract out your knowledge (e.g., extend the docs on contract, à la free software).

      I think this should be connected with micro-funding models and independent stores like Itch.io, and the experiment should be progressive while leaving a possible map of its own future. We will attempt something like this in the 13th edition of the Data Week.

    1. Dissecting Flavivirus Biology in Salivary Gland Cultures from Fed and Unfed Ixodes scapularis (Black-Legged Tick)

      Data worth viewing: a tick trachea with viral infection in its salivary glands.

    1. [illegible extracted text]

      Could empirical data made up of experiences present in the form of an ethnography? Or autoethnography? I'm not sure if this is what you were getting at here, but it is a thought that came to mind!

  12. Jan 2019
    1. Nyhan and Reifler also found that presenting challenging information in a chart or graph tends to reduce disconfirmation bias. The researchers concluded that the decreased ambiguity of graphical information (as opposed to text) makes it harder for test subjects to question or argue against the content of the chart.

      Amazingly important double-edged finding for discussions of data visualization!

    1. Big Data is a buzzword which works on the platform of large data volume and aggregated data sets. The data sets can be structured or unstructured. The data that is kept and stored at a global level keeps on growing, so this big data has big potential. Big Data is generated from every little thing around us, all the time. It has changed as the people in organizations have changed: new skills are being offered to harness the newly generated power of Big Data. Nowadays organizations are focusing on new roles, new challenges, and creating new business.
  13. demandlab.weebly.com
    1. y bosses want to see quick wins, but I know we can achieve big w

      add "My data (database) quality sucks"

    1. The main thing Smith has learned over the past seven years is “the importance of ownership.” He admitted that Tumblr initially helped him “build a community around the idea of digital news.” However, it soon became clear that Tumblr was the only one reaping the rewards of its growing community. As he aptly put it, “Tumblr wasn’t seriously thinking about the importance of revenue or business opportunities for their creators.”
    1. You may not access or use the Site in any manner that could damage or overburden any MIT server, or any network connected to any MIT server. You may not use the Site in any manner that would interfere with any other party’s use of the Site.

      We will do some small scraping, which will not overload the server, so we are complying with this part; in fact, after we do our work, it will help distribute the server load, since a copy will be on our servers.

    1. Adoption of good practice to generate high quality data will depend on sharing the burden of capacity building in some way. That, in turn, cannot happen until there is a framework that provides sufficient trust to allow the sharing and comparison of data and its management.

      harkening to the 'data trust' concept being discussed in U.S. Mellon-funded projects, also co-authored by the authors of this paper.

    1. I tried very hard in that book, when it came to social media, to be platform agnostic, to emphasize that social media sites come and go, and to always invest first and foremost in your own media. (Website, blog, mailing list, etc.)
  14. Dec 2018
    1. Outliers: All data sets have an expected range of values, and any actual data set also has outliers that fall below or above the expected range. (Space precludes a detailed discussion of how to handle outliers for statistical analysis purposes; see Barnett & Lewis, 1994 for details.) How to clean outliers strongly depends on the goals of the analysis and the nature of the data.

      Outliers can be signals of unanticipated range of behavior or of errors.
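      One common way to flag such values — consistent with the point that handling depends on the goals of the analysis — is the interquartile-range rule. A sketch that flags rather than deletes (the 1.5×IQR threshold is the conventional default, an assumption, not something the chapter prescribes):

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]. Flag, don't delete:
    an 'outlier' may be real, unanticipated behavior rather than an error."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 300]))  # [300]
```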

    2. Understanding the structure of the data: In order to clean log data properly, the researcher must understand the meaning of each record, its associated fields, and the interpretation of values. Contextual information about the system that produced the log should be associated with the file directly (e.g., “Logging system recorded this file on 12-3-2012”) so that if necessary the specific code that generated the log can be examined to answer questions about the meaning of the record before executing cleaning operations. The potential misinterpretations take many forms, which we illustrate with encoding of missing data and capped data values.

      Context of the data collection and how it is structured is also a critical need.

      For example, coding missing info as "0" risks misinterpretation; coding it as NIL, NDN, or something else distinguishable from other data avoids this

    3. Data transformations: The goal of data-cleaning is to preserve the meaning with respect to an intended analysis. A concomitant lesson is that the data-cleaner must track all transformations performed on the data.

      Changes to data during clean up should be annotated.

      Incorporate metadata about the "chain of change" to accompany the written memo

    4. Data Cleaning: A basic axiom of log analysis is that the raw data cannot be assumed to correctly and completely represent the data being recorded. Validation is really the point of data cleaning: to understand any errors that might have entered into the data and to transform the data in a way that preserves the meaning while removing noise. Although we discuss web log cleaning in this section, it is important to note that these principles apply more broadly to all kinds of log analysis; small datasets often have similar cleaning issues as massive collections. In this section, we discuss the issues and how they can be addressed. How can logs possibly go wrong? Logs suffer from a variety of data errors and distortions. The common sources of errors we have seen in practice include:

      Common sources of errors:

      • Missing events

      • Dropped data

      • Misplaced semantics (encoding log events differently)
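      Missing events and dropped data can sometimes be surfaced with a simple sanity check: flag suspiciously long silences between consecutive log timestamps. The event times and the 60-second threshold below are invented assumptions, not from the chapter.

```python
# Sketch of a sanity check for dropped data: flag gaps between
# consecutive timestamps that exceed an expected maximum interval.

def find_gaps(timestamps, max_gap=60):
    """Return (start, end) pairs where the gap exceeds max_gap seconds."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > max_gap]

events = [0, 10, 25, 40, 500, 510]  # hypothetical seconds since session start
print(find_gaps(events))  # one suspicious silence between 40s and 500s
```

      A flagged gap is only a lead: it may be a logging outage (dropped data) or a genuinely idle user, which is why such checks complement rather than replace the manual validation the chapter describes.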

    5. In addition, real world events, such as the death of a major sports figure or a political event can often cause people to interact with a site differently. Again, be vigilant in sanity checking (e.g., look for an unusual number of visitors) and exclude data until things are back to normal.

      Important consideration for temporal event RQs in refugee study -- whether external events influence use of natural disaster metaphors.

    6. Recording accurate and consistent time is often a challenge. Web log files record many different timestamps during a search interaction: the time the query was sent from the client, the time it was received by the server, the time results were returned from the server, and the time results were received on the client. Server data is more robust but includes unknown network latencies. In both cases the researcher needs to normalize times and synchronize times across multiple machines. It is common to divide the log data up into “days,” but what counts as a day? Is it all the data from midnight to midnight at some common time reference point or is it all the data from midnight to midnight in the user’s local time zone? Is it important to know if people behave differently in the morning than in the evening? Then local time is important. Is it important to know everything that is happening at a given time? Then all the records should be converted to a common time zone.

      Challenges of using time-based log data are similar to difficulties in the SBTF time study using Slack transcripts, social media, and Google Sheets
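      The normalization step the excerpt calls for can be sketched briefly: convert client timestamps recorded in different zones onto one UTC timeline before comparing them. The timestamps and UTC offsets are illustrative assumptions.

```python
# Sketch: normalizing local timestamps to UTC so records from different
# machines share a common timeline. Offsets and times are hypothetical.
from datetime import datetime, timezone, timedelta

def to_utc(local_str, utc_offset_hours):
    """Attach a fixed-offset zone to a naive local timestamp, convert to UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromisoformat(local_str).replace(tzinfo=tz)
    return local.astimezone(timezone.utc)

# The same wall-clock time logged in Seattle (UTC-8) and Berlin (UTC+1):
a = to_utc("2012-12-03 09:00:00", -8)
b = to_utc("2012-12-03 09:00:00", +1)
print((a - b).total_seconds() / 3600)  # 9.0 hours apart in real time
```

      Keeping both the local and the UTC timestamp preserves the two analyses the chapter distinguishes: morning-vs-evening behavior needs local time, while cross-user simultaneity needs the common zone.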

    7. Log Studies collect the most natural observations of people as they use systems in whatever ways they typically do, uninfluenced by experimenters or observers. As the amount of log data that can be collected increases, log studies include many different kinds of people, from all over the world, doing many different kinds of tasks. However, because of the way log data is gathered, much less is known about the people being observed, their intentions or goals, or the contexts in which the observed behaviors occur. Observational log studies allow researchers to form an abstract picture of behavior with an existing system, whereas experimental log studies enable comparisons of two or more systems.

      Benefits of log studies:

      • Complement other types of lab/field studies

      • Provide a portrait of uncensored behavior

      • Easy to capture at scale

      Disadvantages of log studies:

      • Lack of demographic data

      • Non-random sampling bias

      • Provide info on what people are doing but not their "motivations, success or satisfaction"

      • Can lack needed context (software version, what is displayed on screen, etc.)

      Ways to mitigate: Collecting, Cleaning and Using Log Data section

    8. Two common ways to partition log data are by time and by user. Partitioning by time is interesting because log data often contains significant temporal features, such as periodicities (including consistent daily, weekly, and yearly patterns) and spikes in behavior during important events. It is often possible to get an up-to-the-minute picture of how people are behaving with a system from log data by comparing past and current behavior.

      Bookmarked for time reference.

      Mentions challenges of accounting for time zones in log data.
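      Partitioning by time interacts directly with the "what counts as a day" question: two records minutes apart can land in different day-buckets depending on the reference zone. A minimal sketch with invented records:

```python
# Sketch of partitioning log records by UTC day. The two records are
# hypothetical and chosen to straddle a day boundary.
from collections import defaultdict
from datetime import datetime, timezone

records = [
    {"user": "a", "ts": datetime(2012, 12, 3, 23, 30, tzinfo=timezone.utc)},
    {"user": "b", "ts": datetime(2012, 12, 4, 0, 15, tzinfo=timezone.utc)},
]

by_utc_day = defaultdict(list)
for r in records:
    by_utc_day[r["ts"].date().isoformat()].append(r["user"])

# Only 45 minutes apart, yet they fall into two different "days":
print(dict(by_utc_day))
```

      Re-bucketing the same records by each user's local midnight would group them differently, which is why the partitioning rule should be chosen to match the research question and recorded with the data.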

    9. An important characteristic of log data is that it captures actual user behavior and not recalled behaviors or subjective impressions of interactions.

      Logs can be captured client-side (by operating systems, applications, or special-purpose logging software/hardware) or server-side (e.g., by web search engines or e-commerce sites)

    10. Table 1 Different types of user data in HCI research

    11. Large-scale log data has enabled HCI researchers to observe how information diffuses through social networks in near real-time during crisis situations (Starbird & Palen, 2010), characterize how people revisit web pages over time (Adar, Teevan, & Dumais, 2008), and compare how different interfaces for supporting email organization influence initial uptake and sustained use (Dumais, Cutrell, Cadiz, Jancke, Sarin, & Robbins, 2003; Rodden & Leggett, 2010).

      Wide variety of uses of log data

    12. Behavioral logs are traces of human behavior seen through the lenses of sensors that capture and record user activity.

      Definition of log data

    1. Ethnographic findings are not privileged, just particular: another country heard from. To regard them as anything more (or anything less) than that distorts both them and their implications, which are far profounder than mere primitivity, for social theory.

      This tension exists in HCI as well.

      Interpreted data vs empirical data and how each is systematically analyzed.

    1. With Alphabet Inc.’s Google, and Facebook Inc. and its WhatsApp messaging service used by hundreds of millions of Indians, India is examining methods China has used to protect domestic startups and take control of citizens’ data.

      Governments owning citizens' data directly?? Why not have the government empower citizens to own their own data?