29 Matching Annotations
  1. Dec 2018
  2. Nov 2018
    1. Unless you need to push the boundaries of what these technologies are capable of, you probably don’t need a highly specialized team of dedicated engineers to build solutions on top of them. If you manage to hire them, they will be bored. If they are bored, they will leave you for Google, Facebook, LinkedIn, Twitter, … – places where their expertise is actually needed. If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”. Messes tend to necessitate specialization.
  3. Oct 2018
    1. tl;dr: data engineer = software, coding, cleaning data sets data architects = structure the technology to manage data models and database admin data scientist = stats + math models business analysts = communication and domain expertise

  4. May 2018
    1. Negative values included when assessing air quality In computing average pollutant concentrations, EPA includes recorded values that are below zero. EPA advised that this is consistent with NEPM AAQ procedures. Logically, however, the lowest possible value for air pollutant concentrations is zero. Either it is present, even if in very small amounts, or it is not. Negative values are an artefact of the measurement and recording process. Leaving negative values in the data introduces a negative bias, which potentially under represents actual concentrations of pollutants. We noted a considerable number of negative values recorded. For example, in 2016, negative values comprised 5.3 per cent of recorded hourly PM2.5 values, and 1.3 per cent of hourly PM10 values. When we excluded negative values from the calculation of one‐day averages, there were five more exceedance days for PM2.5 and one more for PM10 during 2016.
  5. Sep 2017
    1. We’re delighted to announce that the California Digital Library has been awarded a 2-year NSF EAGER grant to support active, machine-actionable data management plans (DMPs).
  6. Mar 2017
  7. Feb 2017
    1. After a brief training session, participants spent six hours archiving environmental data from government websites, including those of the National Oceanic and Atmospheric Administration and the Interior Department.

      A worthwhile effort.

    2. An anonymous donor has provided storage on Amazon servers, and the information can be searched from a website at the University of Pennsylvania called Data Refuge. Though the Federal Records Act theoretically protects government data from deletion, scientists who rely on it say would rather be safe than sorry.

      Data refuge.

  8. Oct 2016
    1. (courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf)

      nice reference !

  9. Sep 2016
    1. Activities such as time spent on task and discussion board interactions are at the forefront of research.

      Really? These aren’t uncontroversial, to say the least. For instance, discussion board interactions often call for careful, mixed-method work with an eye to preventing instructor effect and confirmation bias. “Time on task” is almost a codeword for distinctions between models of learning. Research in cognitive science gives very nuanced value to “time spent on task” while the Malcolm Gladwells of the world usurp some research results. A major insight behind Competency-Based Education is that it can allow for some variance in terms of “time on task”. So it’s kind of surprising that this summary puts those two things to the fore.

  10. Jul 2016
    1. p. 141

      Initially, the digital humanities consisted of the curation and analysis of data that were born digital, and the digitisation and archiving projects that sought to render analogue texts and material objects into digital forms that could be organised and searched and be subjects to basic forms of overarching, automated or guided analysis, such as summary visualisations of content or connections between documents, people or places. Subsequently, its advocates have argued that the field has evolved to provide more sophisticated tools for handling, searching, linking, sharing and analysing data that seek to complement and augment existing humanities methods, and facilitate traditional forms of interpretation and theory building, rather than replacing traditional methods or providing an empiricist or positivistic approach to humanities scholarship.

      summary of history of digital humanities

  11. Apr 2016
    1. Great Principles of Computing<br> Peter J. Denning, Craig H. Martell

      This is a book about the whole of computing—its algorithms, architectures, and designs.

      Denning and Martell divide the great principles of computing into six categories: communication, computation, coordination, recollection, evaluation, and design.

      "Programmers have the largest impact when they are designers; otherwise, they are just coders for someone else's design."

  12. Mar 2016
  13. Feb 2016
    1. Great explanation of 15 common probability distributions: Bernouli, Uniform, Binomial, Geometric, Negative Binomial, Exponential, Weibull, Hypergeometric, Poisson, Normal, Log Normal, Student's t, Chi-Squared, Gamma, Beta.

    1. Since its start in 1998, Software Carpentry has evolved from a week-long training course at the US national laboratories into a worldwide volunteer effort to improve researchers' computing skills. This paper explains what we have learned along the way, the challenges we now face, and our plans for the future.

      http://software-carpentry.org/lessons/<br> Basic programming skills for scientific researchers.<br> SQL, and Python, R, or MATLAB.

      http://www.datacarpentry.org/lessons/<br> Managing and analyzing data.

  14. Jan 2016
    1. 50 Years of Data Science, David Donoho<br> 2015, 41 pages

      This paper reviews some ingredients of the current "Data Science moment", including recent commentary about data science in the popular media, and about how/whether Data Science is really di fferent from Statistics.

      The now-contemplated fi eld of Data Science amounts to a superset of the fi elds of statistics and machine learning which adds some technology for 'scaling up' to 'big data'.

    1. "A friend of mine said a really great phrase: 'remember those times in early 1990's when every single brick-and-mortar store wanted a webmaster and a small website. Now they want to have a data scientist.' It's good for an industry when an attitude precedes the technology."
    1. paradox of unanimity - Unanimous or nearly unanimous agreement doesn't always indicate the correct answer. If agreement is unlikely, it indicates a problem with the system.

      Witnesses who only saw a suspect for a moment are not likely to be able to pick them out of a lineup accurately. If several witnesses all pick the same suspect, you should be suspicious that bias is at work. Perhaps these witnesses were cherry-picked, or they were somehow encouraged to choose a particular suspect.

  15. Dec 2015
    1. Big Sur is our newest Open Rack-compatible hardware designed for AI computing at a large scale. In collaboration with partners, we've built Big Sur to incorporate eight high-performance GPUs
  16. Nov 2015
    1. TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming. Think of TPOT as your “Data Science Assistant”: TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines, then recommending the pipelines that work best for your data.

      https://github.com/rhiever/tpot TPOT (Tree-based Pipeline Optimization Tool) Built on numpy, scipy, pandas, scikit-learn, and deap.

  17. Apr 2015
    1. Wouldn’t it be useful, both to the scientific community or the wider world, to increase the publication of negative results?