18 Matching Annotations
  1. Nov 2015
    1. ess legible or cannot be known about users falls awa

      I think this is better understood as less quantifiable facets of a user rather than less knowable ones. Things which are not discrete, or at least not easily proxied by discrete data, are very difficult to roll into most mathematical models. So they tend to get ignored, even if they are easily recorded by a system.
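      To make the point concrete, here is a minimal hypothetical sketch (the record and field names are invented for illustration) of how a feature-extraction step silently drops anything that isn't already discrete:

```python
# Hypothetical sketch: a typical feature-extraction step keeps only
# fields that map cleanly onto numbers, silently dropping the rest.
user_record = {
    "age": 34,                                    # discrete: kept
    "logins_per_week": 12,                        # discrete: kept
    "sense_of_humour": "dry, self-deprecating",   # recorded, but not quantifiable
}

def extract_features(record):
    # Only numeric fields survive into the model's feature vector;
    # the free-text facet is never rejected, just never considered.
    return {k: v for k, v in record.items() if isinstance(v, (int, float))}

features = extract_features(user_record)
```

The system stored the non-discrete facet perfectly well; it was the modelling step, not the recording step, that made it fall away.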

    2. algorithms

      sic: 'algorithm' should be 'model', or perhaps 'trained model'.

      Here it's really important to separate the algorithm that is explicitly designed by a person or people from the inferred model that the algorithm creates from a training data set. I think the latter is what is meant in this instance.
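      A toy sketch of the distinction (the procedure here is invented for illustration, not drawn from the article):

```python
# The *algorithm* is the explicitly designed training procedure;
# the *model* is the artefact it infers from a particular data set.

def train_threshold(samples):
    """The algorithm: a human-written procedure (here, just take the mean)."""
    return sum(samples) / len(samples)

# The trained model: a value inferred from data, not authored by anyone.
model = train_threshold([2.0, 4.0, 6.0])

def classify(x, threshold):
    """Applying the model to new input."""
    return "above" if x > threshold else "below"
```

Nobody wrote the model's value down; it emerged from the data. When the output seems biased or opaque, the designed procedure and the inferred artefact are different places to look.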

    3. Even when mistakes

      So, this is a really valid point, but my concern is misdirection. This reads to me as a clear case of a lack of transparency in curatorial policy, with far-reaching implications. But blaming data persistence and modelling is a distraction. These are (corporate) policy (in)decisions.

    4. A useful example

      See, this is what I was afraid of. This is an example of poor curatorial policy; it has literally no dependency on engineering decisions in relational database technology.

    5. flexible forms of databases

      So, like, I am really unclear about what tech is being discussed here, given the really loose language. In the previous paragraph I assumed they were talking about Relational Database Management Systems (RDBMSs), then the specific modern subset of RDBMSs queried via Structured Query Language (SQL) expressions. But now I'm entirely unsure which is which, making it a bit difficult to resolve the sociological statements against the tech that might drive them. Fixed schemas vs. the various SQL dialects vs. key-value stores vs. graph/triple stores: these are very different technologies with very different implications, but here they are getting all mixed up.
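      To illustrate why conflating these matters, here is a minimal sketch (entirely hypothetical data, with each store mocked as plain Python rather than a real database) of the same fact held under three of the data models being mixed up:

```python
# The same fact -- "alice follows bob" -- under three different data models.
# The storage model shapes which questions are cheap to ask, which is
# exactly where the sociological implications start to diverge.

# 1. Fixed-schema relational row: columns are declared in advance.
relational_rows = [{"follower": "alice", "followee": "bob"}]

# 2. Key-value store: one opaque value per key; no schema, no joins.
kv_store = {"follows:alice": ["bob"]}

# 3. Graph/triple store: subject-predicate-object statements.
triples = [("alice", "follows", "bob")]

# "Who does alice follow?" is answered very differently in each:
rel_answer = [r["followee"] for r in relational_rows if r["follower"] == "alice"]
kv_answer = kv_store["follows:alice"]
graph_answer = [o for (s, p, o) in triples if s == "alice" and p == "follows"]
```

All three give the same answer here, but the reverse question ("who follows bob?") is trivial in the relational and graph models and requires scanning every key in the key-value one; which model was chosen is a consequential engineering decision, not an interchangeable detail.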

    6. functionally automatic

      So, obviously, the part that is done in software is automatic; that much is tautological. But there's certainly a trend in information retrieval and recommender systems toward functionally semiautomatic systems and tools for curatorial assistance (cf. http://link.springer.com/chapter/10.1007/978-3-540-78246-9_63#page-1 and http://link.springer.com/chapter/10.1007/11811305_110#page-1 to get started, though there are many, many more). So, I suppose what I'm saying is that a single reference from 1978 is a pretty weak cite for a very strong assertion.

    7. collected, readied for the algorithm, and sometimes excluded or demoted

      Again, this is a lack of precision that makes things a bit tricky to understand. The quote prior sets up the dichotomy of data structures and algorithms within computation, but you cannot simply apply this dichotomy to the kind of unstructured data being referred to. Each of these three items (collected, readied for the algorithm, excluded or demoted) represents a mechanised decision (ahem: an algorithm). How do you find the data to collect? (Spidering, in a Web context.) How do you map from the data source's ontological concepts to the learning algorithm's (i.e. 'readied for the algorithm')? Should boosting/filtering be applied as a preprocess?

      Perhaps the confusion stems from these 'patterns of inclusion' frequently not being considered as explicit and separate by the practitioner, and usually being domain-specific solutions, whereas what is called 'the algorithm' here is the part of the mechanised decision-making that is generalisable?
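      The three mechanised decisions above can be sketched as a toy pipeline (all names, fields, and thresholds are invented for illustration):

```python
# Each stage below is an algorithm in its own right, even though none
# of them is "the algorithm" in the article's sense.

raw_pages = [
    {"url": "https://a.example", "text": "cats cats cats", "spam_score": 0.1},
    {"url": "http://b.example", "text": "also cats", "spam_score": 0.2},
    {"url": "https://c.example", "text": "buy pills now", "spam_score": 0.9},
]

def collect(pages):
    """Collection: e.g. a spider's policy about which URLs to fetch at all."""
    return [p for p in pages if p["url"].startswith("https://")]

def ready(page):
    """Readying: mapping the source's fields onto the learner's concepts."""
    return {"tokens": page["text"].split(), "spam_score": page["spam_score"]}

def include(doc):
    """Exclusion/demotion: a boosting/filtering preprocess before ranking."""
    return doc["spam_score"] < 0.5

corpus = [d for d in (ready(p) for p in collect(raw_pages)) if include(d)]
```

Only one page survives into "the algorithm's" input, and every elimination happened in a domain-specific preprocessing decision rather than in the generalisable part.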

    8. meaninglessmachines

      This whole paragraph is kinda sloppy, but this in particular is missing some key insights. The algorithm has meaning: it is a precise encoding of assumptions about the expected data. That is neither inert nor meaningless. (Also, the casual use of 'database' to mean 'dataset' is driving me nuts, but I'm trying to work through it.)
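      Even the most trivial procedure illustrates this. A sketch (the function is a made-up example, not from the article):

```python
# Even a one-line "inert" algorithm encodes substantive assumptions
# about the data it expects.

def mean(values):
    # Assumption 1: values is non-empty (otherwise ZeroDivisionError).
    # Assumption 2: every value is numeric (otherwise TypeError).
    # Assumption 3: there is no sentinel for "unknown" that needs handling.
    return sum(values) / len(values)
```

Each of those assumptions is a claim about the world the data describes; violate one and the "meaningless machine" either crashes or quietly produces a wrong answer.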

    9. are caught up in and are influencing the ways in which we ratify knowledge for civic life,

      But then, this has been widespread for some time, e.g. 'Code is Law', though that is principally about code (in the form of DRM) being used by copyright holders to extend copyright law outside of a civil process. But the point is generalisable.

    10. how data is made algorithm ready

      I can't help but think here of the meme in Data Science, uh, Thought Leading that most of the work on non-toy datasets is in data clean-up (i.e. making the data algorithm ready), and that some fairly boring math (i.e. the algorithm) is fine, given that cleaned data. This can have an obfuscating effect on the overall mechanism, if most of it is 'pre-processing'. Cf. http://www.kdnuggets.com/2015/05/data-science-inconvenient-truth.htm http://blog.revolutionanalytics.com/2014/08/data-cleaning-is-a-critical-part-of-the-data-science-process.html http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
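      A toy sketch of the proportions involved (the messy input values are invented for illustration): nearly all of the code is clean-up, and "the algorithm" is one line.

```python
# Most of the mechanism is janitor work; the math is an afterthought.

raw = ["42", " 17 ", "N/A", "", "23.5", "nineteen"]

def clean(values):
    """The bulk of the mechanism: coercing messy input into numbers."""
    out = []
    for v in values:
        v = v.strip()
        if not v or v.upper() == "N/A":
            continue  # drop known missing-value markers
        try:
            out.append(float(v))
        except ValueError:
            continue  # drop anything that still won't parse
    return out

data = clean(raw)
average = sum(data) / len(data)  # the "fairly boring math"
```

Every decision with real consequences (what counts as missing, what gets silently dropped) lives in `clean`, not in the line that computes the average; that is the obfuscating effect in miniature.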