40 Matching Annotations
  1. May 2023
    1. agency

      And well-being. I've started to realize that as much as our society prioritizes "comfort" and "convenience," and people gravitate towards them, they are very often bad for us. Metabolic syndrome, loneliness, depression, fees, shitty Mechanical-Turk-like jobs, etc.

    2. How did we come to imagine ourselves as so powerless?

      A huge part of this is because people believe "computers are hard," and tech companies encourage that.

      We're happy to defer all that technical stuff to experts who make idiot-proof tools, and they all happen to be centralized in big for-profit companies.

      Decentralized, open source solutions aren't perceived as a viable alternative, mostly because they are obscure, complicated, ugly, and poorly supported. Or, in other words: they lack funding.

    3. .

      By... ?

    4. They are invoked as a complementary strategy with deep-learning based approaches as a means of realizing “explainable AI” since they can provide explicit provenance and constraints to results

      Yes, and this is a very problematic stance.

      Populating, validating, and maintaining high-quality KGs is really hard. Especially big ones that are meant to be comprehensive. It depends on batch analysis jobs over vast quantities of web documents, and NLP techniques to extract "facts" from the text and decide whether / how to integrate them into the KG.

      In other words, we want KGs to make LLMs safe, but increasingly we should expect LLMs to factor into how we build KGs, both by generating web documents and by interpreting them. It's a circular argument, with Mechanical-Turk-style data validation as the only barrier preventing propaganda or hallucinated "facts" from becoming ground truth simply by being in a KG.

    5. .

      Yes, and the deeper reason is to monetize users. It's all about selling the powers that be access to the masses, to size them up and influence their behavior at scale.

      You mention this later, but it's a little weird to leave it unstated here, especially when the intentions you list aren't actually end goals.

    6. The intention is to make the assistant the primary means of interacting with apps and other digital systems

      And products, services, businesses, places, other human beings...

      They promote assistants as the ultimate convenience, but so much of what they do is automate things that were already trivial and replace human / human interaction with human / cloud interaction. This benefits the platforms way more than the users, which is probably why adoption has been so tepid...

    7. won’t work

      I don't feel like you've actually presented much evidence to back this claim.

      My biggest concern is that these technologies will sorta work. They will deliver some impressive results and have devoted fans / users. They'll also have serious limitations, failure modes, and side effects, but in the framing of the Cloud Orthodoxy, this is just a short-term inconvenience to be fixed with more data.

      I think the critical thing here is what counts as "working" and who gets to define that. If you say "this will fail," they'll say "nu-uh," but that's just because you have different notions of what it means for this technology to "work."

    8. it merely serves the role as a natural language interface to other systems.

      This is a really important point that I wish more people realized: this is the "killer app" for LLMs. The direct query / response model of today's chatbots is just a toy by comparison.

    9. This is, again, true73, and also the goal

      Not exactly. Google would actually really like to handle deliberative searches, because they know it's a gap in their product and because it means users would dwell on the page with the ads. They have tried repeatedly to launch features like this, but it's hard, and few have been successful.

      This may change with Google adding more shopping features to Search? Who knows.

    10. the strong structuring influence of Cloud Orthodoxy’s convenience-oriented platform service is clear on the direction of LLM research

      This also ties into the word "Intelligence" and the talk of AI eliminating jobs. With the focus on "consumer convenience," these tech companies have realized they are displacing human beings who would have filled that niche.

      They could be building tools to empower and bridge between creators and consumers, but that would mean disintermediating themselves, and that's not possible if you believe in the Cloud Orthodoxy.

    11. Large Language Models are interfaces to knowledge graphs

      You might want to change the wording.

      In their raw form, LLMs are very much not KGs, just abstract vectors representing text content. In fact, many of their failure modes come from not having ground truth factual data, like what you find in a KG.

      But it's not hard to integrate an LLM with a KG; doing so makes both tools vastly more powerful, and that is obviously what's coming next.
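
      To make this concrete, here's a minimal sketch (all data, names, and functions invented for illustration) of the pattern: retrieve explicit facts from a triple store and prepend them to the model's prompt, so the LLM acts as a language interface over the KG rather than as a source of facts itself.

      ```python
      # A toy knowledge graph as (subject, predicate, object) triples.
      KG = [
          ("LeBron James", "occupation", "basketball player"),
          ("LeBron James", "team", "Los Angeles Lakers"),
          ("Moscow", "country", "Russia"),
      ]

      def facts_for(entity):
          """Return all triples whose subject matches the entity."""
          return [t for t in KG if t[0] == entity]

      def grounded_prompt(question, entities):
          """Build a prompt pairing retrieved KG facts with the user question."""
          lines = ["Known facts:"]
          for e in entities:
              for s, p, o in facts_for(e):
                  lines.append(f"- {s} {p}: {o}")
          lines.append(f"Question: {question}")
          return "\n".join(lines)

      prompt = grounded_prompt("What does LeBron James do?", ["LeBron James"])
      ```

      The LLM never has to "know" the fact; it only has to verbalize what the KG lookup handed it, which is exactly the interface role the quoted passage describes.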

    12. sequential behavioral information like the multiple searches someone will do for a single topic

      This is true, but don't forget that most of this sort of data analysis is done in aggregate. Yes, if you search for "Jaguar" right after searching "Ferrari" you probably mean the car. What's more powerful, though, is looking at the thousands or millions of searches per day that involve an entity in aggregate.

      For instance, how does Google know that LeBron James is primarily a basketball player, rather than an actor? It's because they know what questions people ask about him, what results they click on, what factboxes they interact with, etc. If he suddenly became a MUSICIAN overnight, how would they know? A handful of news articles and a deluge of search queries.
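
      A toy sketch of what that aggregate analysis might look like (the query log and helper function are invented): count the terms that co-occur with an entity across many queries, and let the dominant terms classify it.

      ```python
      from collections import Counter

      # Hypothetical query log; in reality this would be millions of rows.
      queries = [
          "lebron james points tonight",
          "lebron james stats",
          "lebron james space jam",
          "lebron james stats career",
      ]

      def cooccurring_terms(entity_tokens, query_log):
          """Count terms appearing alongside the entity across the log."""
          counts = Counter()
          for q in query_log:
              tokens = q.split()
              if all(t in tokens for t in entity_tokens):
                  counts.update(t for t in tokens if t not in entity_tokens)
          return counts

      top = cooccurring_terms(["lebron", "james"], queries).most_common(1)
      ```

      Here "stats" dominates, pointing at the sports sense of the entity; no single user's query mattered, only the aggregate.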

    13. PageRank64

      PageRank is always presented as the "secret sauce" for Google, but that's misleading. By far the most valuable ranking signal Google has is the billions of searches its users do every day. Every time a user searches for a query, clicks on a result, and either stays there or comes back to the search page, they are teaching Google what content is relevant. Not to mention the more fine-grained interaction data they track for the custom factbox UIs and the like, which feeds back to improve their KG.

      The upshot of this is that the quality of Google's search depends on both the Platform paradigm and their dominance of the search product space. The search box can only be useful like that when most of the world is funneled through it.

    14. The Platform matches different kinds of users like advertisers to customers, law enforcement agencies to suspects, etc. in order to maximize the overall value of all Platforms

      It's more granular than this, and that's significant. It's not just pairing advertisers with customers, it's connecting particular ad verticals, formats, and inventories with customers with specific demographics, classifications, and interests.

      That means potentially sensitive information about end users gets shared (anonymously, in theory) with advertisers who get to leverage that knowledge to manipulate them. It also means advertisers compete for access, and iterate with Platforms on how to most efficiently monetize the end users.

      And, of course, the same is true of law enforcement, etc.

    15. freely56

      More often, in exchange for viewing ads. Very little is presented as fully free these days. Think of all the sites that ask you to disable your ad blocker to proceed, for instance.

    16. As a platform owner I have a right to demand whatever the users will give me in exchange for my services

      Which is fully justified (in this ideology) by the (false) idea that users are free to leave or jump to a competitor if they don't like the terms.

    17. The Cloud Orthodoxy

      Having lived in Silicon Valley and worked on Google Search for 14 years, I can say this is an extremely thorough, accurate, and well presented summary of The Cloud Ideology. Love it.

    18. On their own, each of these groups describes noble goals: decreasing bias in the justice system, providing resources to formerly incarcerated or unhoused people, making government decisions more efficient. Taken together, however, the projects describe a panoptical surveillance system that wouldn’t even need to be reconfigured to be used for algorithmically-enhanced oppression.

      I actually think it's worse than you make it sound.

      They will design "metrics" to capture the health of these "verticals." This means flattening the diversity and lived experience of millions of people down to a few vectors. Once that happens, they will stop looking at the people and look only at the metrics, since they're "more objective" and "easier to understand." They will then proceed to turn the knobs on the systems (the ones that are easy and desirable for them to turn), until the metrics go up enough that they can declare victory and pat themselves on the back.

      Under a fascist regime or the influence of corporate greed this sort of thing could be wielded as a devastating weapon. But even if they are well-meaning, incorruptible, respectful, competent, and vigilant, a system like this would naturally produce algorithmically-enhanced oppression as a side effect. Any top-down system that operates on groups of people is going to screw over certain marginalized and intersectional communities. Bottom-up is the only humane way to do it, "efficiency" be damned.

    19. the NSF is building them for everything else

      Mixing what you said earlier about the messy collaborative nature of the semantic web with a little philosophy of science I read recently, this line hits hard.

      I don't want to see us entrench our current scientific thinking and colonialist ontology with technical debt and rigid schemas. Scientific stagnation is bad enough already, but imagine if we also had absolutely massive, "essential," and "irreplaceable" software systems that would need heavy refactoring before we could make progress?

    20. a small group of experts wave a wand of unknowable algorithms over a bulging plastic trash bag of data to pull out the Magic Knowledge Rabbit

      It would be powerful if you could say something about the fruits of this approach, or lack thereof. Why do they think this works? Why should we doubt that it works?

      You pointed out a few examples where the black box can dispense garbage. But what about the exciting new discoveries and treatments they claim are powered by this model? Are they really happening? Are they making a material difference? How do those benefits weigh against the harms you raised earlier?

      This is similar to my comment about the benefits of Google's KG and the benefits of well-funded, product-oriented efforts. I'm just less informed about the healthcare side.

    21. Google

      Don't forget Alphabet's in-house biomedical research arm, Verily. From their website: "Advancing precision health by closing the gap between research & care. We are an Alphabet company focused on applying AI & data science to accelerate evidence generation and enable more precise interventions."

    22. outmoded definition of “transsexualism” as a disease

      This is also a great example of a more general problem with KGs and LLMs: language and the world are constantly changing, but their data sources are rarely updated to reflect that. While a human brain can dynamically reorganize itself to work out the consequences of any new data point, and has no difficulty discerning counterfactual or deprecated data, our tech does not have that capacity, and we often forget it.

    23. “knowledge” is not a social, contextual, or dialogical phenomenon, but a “natural resource” that can be mined from information that is “out there.” A scientific paper is a neutral carrier of a factual link between entities.

      This is a really key point, and I would actually emphasize it more and flesh it out with a little philosophy.

      The critical thing here is that they are assuming that research produces a consistent, objective view of reality and that data which is framed differently or is otherwise incompatible is merely flawed or inconsistent and can safely be discarded.

      That's not how language, knowledge, or science works. That's how you entrench one particular cultural perspective and cherry pick results to make it look more accurate and self-consistent than it really is.

    24. highly curated biomedical informatics platforms, rather than basic researchers or the public at large

      This is not unreasonable. If their priority is volume and consistency, they should exclude these direct contributions. This is one of many disadvantages faced by small-scale research, but it's not clear there's anything wrong with that. No product is good for all use cases.

    25. The risk posed by a lack of a universal “language” was not being able to index all possible data, rather than inaccuracy or inequity20

      When you put it like this, you can see the hidden TESCREAL aspect to it: large-scale, long-term benefit automatically outweighs localized harm.

    26. It encodes the notion that there should be one “neutral” means of representing information for one (or a few) global search engines to understand, rather than for local negotiation over meaning

      I think it might be important to bring up the notion of "ontological violence." Having a rigid, generic schema limits what you can express with your structured data, and how it can be used. This isn't so big of a deal with Google's factboxes (though some data providers might disagree!), but it's a huge problem in the context of the NIH and NSF's KG projects. People get forced into categories and given generic treatments that don't fit. Intersectional identities and unique experiences get ignored, with real and direct impact on people's mental and physical health.

    27. In 2015, the increasing prevalence of Google’s information boxes caused a substantial decline in Wikipedia page views [68, 69] as its information was harvested into Google’s knowledge graph, and a “will she, won’t she” search engine arguably intended to avoid dependence on Google was at the heart of its 2014-2016 leadership crisis [70, 71]

      In the early days, Freebase and thus Google's KG consisted almost entirely of data from Wikipedia. It's also worth noting that Wikimedia was involved and supportive. They were happy to have their data reach more people, even if it meant less traffic to their site.

    28. Microsoft, a famously good actor in software, took this several steps further with GitHub, VSCode, and later Copilot, capturing a large chunk of the software development process in order to trick programmers to be the “humans in the loop” refining the neural network to write code and dilute their labor power [64, 65, 66, 67].

      This is a very good example. For me, it highlights the ambiguity of "openness." In this case, there is a mostly public platform for open software development collecting private data for the benefit of one company. The useful product lures developers into the ecosystem, where they serve as a resource. It's a bit like farming, with animals you can't fully control.

    29. it is their “graph plus compute” structure

      Glad to see you make explicit what you think is wrong here. This is an interesting framing. It's not what I would have said, but I think I like it.

      Why is this bad? Just thinking out loud: by bundling the computation, you remove direct access to the data. By making the computation a black box, you hide its bias and side-effects. You can also hide a vast pipeline with many inputs and outputs by presenting a much simpler subset API to the public, as if that were the whole thing. This lets you hide your profit model and the asymmetry in how much value the user / company gets out of the product.

    30. RELX’s risk division also provides “comprehensive data, analytics, and decision tools for […] life insurance carriers”

      Okay, wow. This point was really compelling to me. It's really striking how many different games this company is involved in, sharing data between patients, scientists, and competing companies in disparate industries. Surely this is not what people expect their data will be used for. Surely this represents a serious conflict of interest for RELX, but also for the folks who (knowingly or unknowingly) contribute data that gets used against their interests.

      Perhaps you get to this later, but it would be interesting to see how this intersects with the more mundane data collection of, say, Google or Meta tracking web browsing activity. If you want to argue that the pattern of KGs generally is unhealthy for society, making that link clear would be very compelling.

    31. So if you are a data broker, and you just made a hostile acquisition of another data broker who has additional surveillance information to fill the profiles of the people in your existing dataset, you can just stitch those new properties on like a fifth arm on your nightmarish data Frankenstein

      Love this point. It might also be worth mentioning that it's totally possible (and common, I assume) to join seemingly safe and innocuous KGs with private data / surveillance KGs. This works really well, and some of the mundane data might end up being the missing link that binds together the more controversial stuff and makes it useful.
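
      A minimal sketch (with invented records) of how easily that stitching works once two datasets share any stable key; here an email address links an innocuous loyalty-card record to a location trace.

      ```python
      # Two "separate" datasets keyed on the same quasi-identifier.
      loyalty = {"a@example.com": {"name": "A. Smith", "purchases": ["insulin"]}}
      locations = {"a@example.com": {"last_seen": "clinic on 5th Ave"}}

      def stitch(*graphs):
          """Merge per-key property dicts across datasets, Frankenstein-style."""
          merged = {}
          for g in graphs:
              for key, props in g.items():
                  merged.setdefault(key, {}).update(props)
          return merged

      profile = stitch(loyalty, locations)
      ```

      Neither dataset alone looks alarming; the join is where the surveillance value appears.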

    32. The same technologies, with minor variation, that were intended to keep the internet free became emblematic of and coproductive with the surveillance/platform model that has enclosed it.

      You keep suggesting KGs are used for surveillance, but you haven't yet explained how, which weakens your point. I'd like to see a sentence or two (either here or in the introduction where you first bring it up) explaining what you mean by this. What is the threshold between "data collection" and "surveillance"?

      You do explain this much better near the end, but that leaves people wondering for a long time.

    33. We vulgar commoners, we data subjects, are not allowed to touch the graph — even if it is built from our disembodied bits.

      Not entirely true. Google supports suggested edits, and can even allow someone to "claim" an entity and edit it freely. You can also indirectly add stuff to the KG by annotating web pages with schema markup. Of course, this is mostly to outsource quality control, or to empower brand managers and content providers within the Google ecosystem, and what gets in is at Google's whim.
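
      For reference, the schema markup route looks like this: a page embeds a snippet of schema.org JSON-LD (real vocabulary, invented values) that crawlers can lift directly into a KG.

      ```json
      {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": "Peter Kropotkin",
        "birthPlace": "Moscow"
      }
      ```

      The page author chooses what to assert, but whether it lands in the KG is, again, at Google's whim.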

    34. The mutation from “Linked Open Data” [16] to “Knowledge Graphs” is a shift in meaning from a public and densely linked web of information from many sources to a proprietary information store used to power derivative platforms and services.

      This is true, but it's one sided and missing something very important.

      Having powerful, well funded groups make products out of structured data also had several large, positive effects. The schemas became more practical for addressing the needs of real people. The volume, quality, and consistency of structured data on the web went up. The accessibility and usage of structured data went way up. Many useful products got built that probably never would have emerged without the corporate efforts.

      You do a good job of highlighting many downsides to this, which are also true. But if you don't bring up the good sides, people will accuse you of cherry picking. If you tell both sides, they'll argue that the benefits outweigh the drawbacks.

      So, I think it's worth disentangling the threads of not just what they did but how they did it. Funding products and product-oriented data curation was positive. But they had ulterior motives, which is why they designed the systems to be closed, proprietary, and unaccountable. It didn't have to be like that. Could we do it differently, avoid most of the harms, and still reap most of the benefits?

      I like your vulgar data vision, but it doesn't address this key point. How do we get high quality useful tools without handing everything over to a rich corporation? We do need an effective model for organizing and funding the vulgar efforts so they can compete.

    35. the privatization of technologies with initially liberatory aspirations

      Yeah. It's notable that Freebase was meant to be an open common space like Wikipedia. Eventually Google shut it down, not because they wanted to kill it, but because they had no good reason to keep paying for it.

    36. On platforms, rather than a system that “belongs” to everyone, you are granted access to some specific set of operations through an interface so that you can be part of a social process of producing and curating information for the platform holder.

      This is an interesting perspective. Web 2.0 has two sides. To users, the platforms were tools for getting things done. To the platforms, all that activity served as a source of data, validation, and ranking signals.

      Usually you hear only the first perspective. It's rare to see only the second, as you wrote here. To me, they're two sides of the same coin, and we should always emphasize both.

      That said, there could be an interesting discussion about whether it's a "fair coin" given the power / value asymmetry here.

    37. It imagined the use of triplet links and shared ontologies at a protocol level as a way of organizing the information on the web into a richly explorable space: rather than needing to rely on a search bar, one could traverse a structured graph of information [16, 17] to find what one needed without mediation by a third party.

      This strikes me as a little strange. I'm not familiar with the Semantic Web project specifically, though, so it may be true.

      It's just that a decentralized KG with custom schemas is not so different from a web of documents tied together with links. You still need some sort of third-party aggregator to find the right content. The big difference is you could go to a KG hub (like Freebase) instead of a Search engine and work with structured data instead of raw text. But this was still a third party.

    38. For example, in Wikidata, Peter Kropotkin (Q5752) is an instance of the “human” type, which has properties like sex or gender (male) and place of birth (Moscow), but also has additional properties not in the human type like signature

      This is more an example of how entities can have multiple types, and how those types can be dynamically determined from the properties present. That doesn't mean the schema changes, though.

      For instance, LeBron James was a PERSON and an ATHLETE, but when he starred in Space Jam he became an ACTOR. The properties and types on the LeBron James entity changed, but the schemas for PERSON, ATHLETE, and ACTOR did not.
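
      A tiny sketch of that distinction, using invented schemas: the entity's type set and properties change, while the type definitions themselves stay put.

      ```python
      # Fixed type schemas: which properties each type declares.
      SCHEMAS = {
          "PERSON": {"name", "birthPlace"},
          "ATHLETE": {"sport", "team"},
          "ACTOR": {"filmography"},
      }

      lebron = {
          "types": {"PERSON", "ATHLETE"},
          "name": "LeBron James",
          "sport": "basketball",
      }

      # Starring in Space Jam adds a type and its properties to the entity...
      lebron["types"].add("ACTOR")
      lebron["filmography"] = ["Space Jam"]

      # ...but none of the type schemas needed to change.
      ```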

    39. not anticipated by a schema

      In theory, but in practice many KGs are populated by automated jobs that then have a validation pass that strips out anything that doesn't match the schema, orphaned nodes, and dangling edges. Often, to express a new relationship (certainly to express it in a useful way), one needs to update the schema.
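
      A sketch of that kind of validation pass (hypothetical schemas and property names): anything outside the declared schema gets silently dropped, so novel relationships never survive ingestion.

      ```python
      # Fixed schema: the only properties a PERSON is allowed to carry.
      SCHEMAS = {"PERSON": {"name", "birthPlace"}}

      def validate(entity):
          """Keep only properties declared by the entity's type schemas."""
          allowed = set()
          for t in entity["types"]:
              allowed |= SCHEMAS.get(t, set())
          return {
              "types": entity["types"],
              **{k: v for k, v in entity.items()
                 if k != "types" and k in allowed},
          }

      raw = {"types": {"PERSON"}, "name": "Peter Kropotkin",
             "mutualAidNetwork": "unmodeled"}  # novel property, not in schema
      clean = validate(raw)  # the novel property is stripped
      ```

      To express the new relationship at all, you'd have to update the schema first, which is exactly the bottleneck described above.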

    40. Introduction

      This section comes across as ideologically motivated, with dramatic claims that aren't clear yet because the paper is just getting started. When I read this, I strap myself in for the ride, but I worry some other readers would become defensive (and close their minds) before you have a chance to land any points.