1,186 Matching Annotations
  1. Sep 2024
    1. In sum, IA has not met its “burden of proving that the secondary use does not compete in the relevant market[s].” Warhol I, 11 F.4th at 49. Its empirical evidence does not disprove market harm, and Publishers convincingly claim both present and future market harm. Any short-term public benefits of IA’s Free Digital Library are outweighed not only by harm to Publishers and authors but also by the long-term detriments society may suffer if IA’s infringing use were allowed to continue. For these reasons, the fourth fair use factor favors Publishers.

      Factor 4: "does not disprove market harm, and Publishers convincingly claim both present and future market harm"

    2. In this case, the relevant market for purposes of analyzing the fourth fair use factor is the market for the Works in general, without regard to format. The Copyright Act protects authors’ works in whatever format they are produced. See generally 17 U.S.C. § 106 (granting copyright owners the exclusive right to reproduce their works in numerous formats). Even if we consider the print editions to be the “original” forms of the Works in this case―and we recognize that the concept of a singular “original” form may be difficult to grasp in the digital age―the Act recognizes that presentation of the same content in a different medium is a derivative work subject to the same legal protections as the purported original. 17 U.S.C. § 106(2). Here, Publishers obtained from authors the exclusive right to publish their Works in numerous formats, including print and eBooks, and it is this exclusive right that IA is alleged to have violated via its Free Digital Library. For that reason, the relevant harm―or lack thereof―is to Publishers’ markets for the Works in any format.

      "Copyright Act protects authors' works in whatever format they are produced"

    3. Here, IA makes unauthorized digital copies of the Works and reveals those copies to the public in their entirety. The “amount and substantiality” of the copying is not necessary to achieve a transformative secondary purpose, but serves to substitute Publishers’ books. 17 U.S.C. § 107(3). For that reason, the third fair use factor favors Publishers.

      Factor 3: "...reveals those copies to the public in their entirety"

    4. Here, while the nonfiction Works undoubtedly convey factual information and ideas, they also represent the authors’ original expressions of those facts and ideas—and those “subjective descriptions and portraits” reflect “the author’s individualized expression.” Harper & Row, 471 U.S. at 563. Thus, because the Works in Suit are “of the type that the copyright laws value and seek to protect,” the second fair use factor favors Publishers.

      Factor 2: Works...are of the type that copyright laws value and seek to protect

    5. We conclude, contrary to the district court, that IA’s use of the Works is not commercial in nature. It is undisputed that IA is a nonprofit entity and that it distributes its digital books for free. Of course, IA must solicit some funds to keep the lights on: its website includes a link to “Donate” to IA, and it has previously received grant funding to support its various activities. App’x 6091. But unlike the defendant in TVEyes, who charged users a fee to use its text-searchable database, IA does not profit directly from its Free Digital Library. 883 F.3d at 175. It offers this service free of charge.

      Factor 1: "IA's use of the Works is not commercial in nature"

    6. But this characterization confuses IA’s practices with traditional library lending of print books. IA does not perform the traditional functions of a library; it prepares derivatives of Publishers’ Works and delivers those derivatives to its users in full. That Section 108 allows libraries to make a small number of copies for preservation and replacement purposes does not mean that IA can prepare and distribute derivative works en masse and assert that it is simply performing the traditional functions of a library.

      "IA does not perform the traditional functions of a library"

    7. We conclude that IA’s use of the Works is not transformative. IA creates digital copies of the Works and distributes those copies to its users in full, for free. Its digital copies do not provide criticism, commentary, or information about the originals. Nor do they “add[] something new, with a further purpose or different character, altering the [originals] with new expression, meaning or message.” Campbell, 510 U.S. at 579. Instead, IA’s digital books serve the same exact purpose as the originals: making authors’ works available to read. IA’s Free Digital Library is meant to―and does―substitute for the original Works.

      Factor 1: "IA's use of the Works is not transformative"

    8. IA hosts over 3.2 million digital copies of copyrighted books on its website. Its 5.9 million users effectuate about 70,000 book “borrows” a day―approximately 25 million per year. Critically, IA and its users lack permission from copyright holders to engage in any of these activities. They do not license these materials from publishers, nor do they otherwise compensate authors in connection with the digitization and distribution of their works.

      Lack of license permissions

      Nor, of course, do libraries have to seek permission from publishers for any of these activities for printed books...but let's read on.

    9. Hachette Book Group, Inc. v. Internet Archive (23-1260)

      Court of Appeals for the Second Circuit

      via Recap archive: https://www.courtlistener.com/docket/67801014/hachette-book-group-inc-v-internet-archive/?order_by=desc

      This appeal presents the following question: Is it “fair use” for a nonprofit organization to scan copyright-protected print books in their entirety, and distribute those digital copies online, in full, for free, subject to a one-to-one owned-to-loaned ratio between its print copies and the digital copies it makes available at any given time, all without authorization from the copyright-holding publishers or authors? Applying the relevant provisions of the Copyright Act as well as binding Supreme Court and Second Circuit precedent, we conclude the answer is no. We therefore AFFIRM.

  2. Aug 2024
    1. Chris Hayes speaks with Internet Archive founder Brewster Kahle and Library Futures co-founder Kyle Courtney about why megapublishers are suing to redefine e-books as legally different from paper books.

      Could the future of libraries as we’ve known them be completely different? Our guests this week say so. Megapublishers are suing the Internet Archive, perhaps best known for its Wayback Machine, to redefine e-books as legally different from paper books. A difference in how they are classified would mean sweeping changes for the way libraries operate. Brewster Kahle is a digital librarian at the Internet Archive. Kyle Courtney is a lawyer, librarian, director of copyright and information policy for Harvard Library. He’s the co-founder of Library Futures, which aims to empower the digital future for America’s libraries. They join to discuss what’s animating the lawsuit, information as a public good and the consequences should the publishers ultimately prevail.

      July 9, 2024

  3. Jul 2024
    1. Federal Justice Statistics, 2020 — May 2022, revised July 2023 U.S. Department of Justice, Office of Justice Programs, Bureau of Justice Statistics

    2. Total cases adjudicated (all offenses): 71,126
       Convicted: 92.6% (guilty plea: 90.9%; bench/jury trial: 1.7%)
       Not convicted: 7.4% (bench/jury trial: 0.3%; dismissed: 7.1%)

      In fiscal year 2020, there were 71k cases in U.S. district court. Of those, 92.6% were convicted: 90.9% by guilty plea and 1.7% by bench/jury trial. 7.4% were not convicted, 0.3% by bench/jury trial and 7.1% dismissed.

    1. Right. So let's start here: an Alford plea is fundamentally a form of coercion, because it's basically telling a person, "Admit to this crime or else we'll kill you."

      Alford Plea as a form of judicial coercion

    2. Johanna Hellgren, who has for years been researching this Alford plea. JOHANNA HELLGREN: Yeah. PETER: So could we just start—like, where did the Alford plea—where did it even come from? JOHANNA HELLGREN: Yeah, the way it came to be was this guy, Henry Alford, was accused of first-degree murder. PETER: This is, like, the early 1960s. So first-degree murder meant he was facing the death penalty. JOHANNA HELLGREN: But he took a plea for, I believe, second-degree murder. PETER: Which meant instead he'd get life in prison. JOHANNA HELLGREN: Yeah. PETER: But when he gets up to enter his plea in front of the judge, Alford says—and I'll quote from the transcript right here, hold on one second, he says, "I just pleaded guilty because they said if I didn't they'd gas me." JOHANNA HELLGREN: "I'm just pleading because I don't want to get the death penalty, but I didn't do it." PETER: And then later he said, "I'm not guilty, but I plead guilty."

      Origin of the "Alford Plea"

    3. "The Alford Plea", Radiolab, Jun 28, 2024

      In 1995, a tragic fire in Pittsburgh set off a decades-long investigation that sent Greg Brown Jr. to prison. But, after a series of remarkable twists, Brown found himself contemplating a path to freedom that involved a paradoxical plea deal—one that peels back the curtain on the criminal justice system and reveals it doesn’t work the way we think it does.

    1. The realization that you have to live in a society with people who are harmed by injustice, even if you personally escape that justice? It's the whole basis for solidarity.

      Defining solidarity

    1. But actually a physical library cannot make a physical copy of a book and circulate it in place of the original copy. It can't do that, right? So if it can't do that, then you're not doing just what a library is doing. So your argument actually depends on the physical, the digital copy being transformative and the physical copy not being transformative. And that's the thing that makes the difference, not the one-to-one ratio and the analogy to physical books.

      Is circulating a photocopy of a book to preserve the original like CDL?

    2. The struggle I'm having with your response to these questions is on the one hand you want to say, look, this is transformative because it's efficient and we can get people to read more books faster. They don't have to go to libraries. The efficiency is the value of this to the public. But at the same time, you're saying, but that efficiency has absolutely no impact on whether the publishers can sell the e-books or the hard copies. And it sounds wonderful when you're saying it, but when I step back and listen, I'm having trouble reconciling those two.

      Digitized lending efficiency has two sides

    3. that efficiency may or may not have an effect on either the number of copies that get sold or on the market for the OverDrive service, which has a variety of different sort of different aspects and benefits over and above CDL. I mean, CDL is largely sort of image-scanned images of pages of paper books because it's the paper book. The OverDrive service has a lot of [features]. You can flow the text. You can do different features, and that is one explanation for the data that you see that there is no reduction in demand for OverDrive.

      Digitized versus Digital books, explicitly

    4. you're reducing the market from the number of people who might want to read... Let's look at even the paper books. Let's pretend, like, take out the digital market for a second. The number of people who might want to read it ever, down to the number of people who might want to read it simultaneously. And if you put digital books into the mix, it's the same idea, right?

      The market becomes only as large as the number of people who simultaneously want to read a work

    5. That IA's brief and amici try to create the impression that the public interest is on their side. And it is not. The protection of copyright is in the US Constitution and federal law because it creates an incentive for writers and artists to create new works to benefit our broader society. Internet Archive's controlled digital lending is in direct conflict with that basic principle. And as I previously... You don't really think people are going to stop writing books because of controlled digital lending, do you? Well, I think publishers are going to go down the tubes if they do not have the revenues. They're not going to publish your books. You think that that's really... I do, Your Honor. There's no question. I mean, and the standard here is not, will this eliminate... No, I understand. ...the... It's just a part. But this question about balancing the incentive to create a work with the larger distribution of it, that is the question to be decided in this case.

      "Publishers are going to go down the tubes is they don't have the revenues"

      Authors: the publishers are not necessarily your friends here...

    6. In the same way, controlled digital lending is a contrived construct that was put together at the behest of Internet Archive back in 2018 when they confronted the fact that libraries didn't want to deal with them. Libraries didn't want to give copies of their works to be digitized because they were concerned about copyright arguments. So they got in a room together with various people and contrived this principle of controlled digital lending to rationalize what they were doing.

      CDL was conceived by IA in 2018 because libraries didn't want to give IA digital copies?

      WHAT?!?

    7. that very point has been made by this court that you transform a book or work when you convert it into a new format. But that is not the type of transformativeness that the first factor looks at. You're converting it to a derivative form, not a transformative form.

      Digitized books as a derivative form, not a transformative form

    8. Really in question that if the comparator on the transformative issue were between the digital book and the ebook, that it really wouldn't be transformative because it is just a different version of a digital book. But he's saying, well, we have a right to use our physical copy and this is a transformative use of the physical copy. Is that the right way to think about it? Are you suggesting that actually the book is something more than the physical copy? And so when we think about the CDL version of the book, we should compare it both to the physical copy and to the ebook because those are both versions of the book that the publisher produces. They are not distributing the physical copy, Your Honor. That's the whole point of controlled digital lending. That is not the right way to think about it. They are taking the physical copy and transforming it into a new and different format with different capabilities that has a different market.

      Comparison between the digital (digitized) book and the ebook

      The judge here is attempting to home in on the question of digitized versus digital book. The libraries are lending the digitized book. Publishers have a market for the digital book. They are similar, but not the same.

    9. You are still distributing the physical copy of the book. And as Your Honor recognized, there's a lot of friction involved with distribution of physical copies that is significantly different than what is capable with digital copies. That's why they are two independent markets with very distinct capabilities, and the law in the digital economy, with books as with everything else, turns on the very key principles that the copyright owner owns the right to distribute their works in different formats and to distribute them under the terms which they deem to be appropriate.

      Physical books and ebooks are two different markets

      The difference in friction between lending a physical book and an ebook means that these are two separate markets with "very distinct capabilities." CDL usurps publishers' rights in the ebook market.

      I could be more sympathetic to this argument if the "ebook" we were talking about was a digital book, not a digitized book.

    10. there is no market in controlled digital lending. But of course, there's no market in controlled digital lending. Controlled digital lending is predicated on infringement and the nonpayment of any fees.

      CDL is predicated on infringement and non-payment of fees

    11. this is so utterly transformative and so utterly, it's substantive. I'm sorry, it's utterly derivative is your position. Exactly, it's a derivative work. And it's doing nothing more than repackaging and repurposing the derivative work.

      CDL is a transformative, derivative work

      Implication being, I think, that new copyright rights come to the library because of this transformation/derivation.

    12. let me focus for one moment on just a little bit of the evidence here on the commerciality of Internet Archive. Unlike most libraries, Internet Archive is owned by an individual largely who has funded this, and that is Brewster Kahle, as you well know. And virtually every page of the Internet Archive has a button that says you can buy this on Better World Books, which is giving an incredible amount of PR and certain revenue that flows between them.

      Commerciality of Internet Archive

      And the intersection with Better World Books, which makes for a distinction between this and public libraries.

    13. What about the evidence that certainly the district court cited to it, admissions or undisputed evidence in the 56.1 statement about pitching CDL or pitching, joining the Open Library project as a way to save money. And are you relying on that at all as a basis to show? Absolutely, Your Honor. It's very rare in a record that you have actual admitted evidence that shows that a party is intending to supplant your market. And that's what we have in this record. The Internet Archive on this appeal tries to dismiss this as rhetorical flourishes. These pitches were made to hundreds of libraries and hundreds of slide decks provided to libraries with this exact same pitch. You don't have to buy it again.

      IA's marketing of the Open Libraries program demonstrates their intent to subvert the digital licensing market

    14. ASTM v. Public.Resource.Org decision
    15. then your argument actually is that, yeah, okay, there's a statute that talks about fair use, but there's a more specific statute that says when libraries can digitize books and that should control the fair use statute. Correct, your honor. That should control. And that's what was decided by this Court. They said that if you want to change the law, your job is to go to Congress. We're not in the position to change the statute towards you. In terms of what the statute says, the statute says, okay, you could do whatever you want with a physical book, but you can only create a digital copy for archival purposes or other limited purposes, but you're not allowed to distribute it. Correct, your honor. It does not envision in any way the practices of Internet Archive, which is digitizing literally millions of copies of books and making them available around the world to users.

      Appellee: fair use doesn't apply, the more specific statute applies

      Where Congress has said that digital copies can occur is the only place digital copies can occur.

    16. So what was not recognized in the argument you just heard is that the copyright office and Congress over the last decade has repeatedly been approached to say, you need to think about the digital economy. You need to think about digital works. You need to think about the first sale doctrine and whether that should apply in the digital world. They have consistently rejected the changes to the law, both by the copyright office as well as Congress.

      Congress and the Copyright Office have rejected digital economy changes

    17. So if Congress had not codified the first sale doctrine, it didn't have Section 109 that authorizes libraries. It only had to rely on the fair use doctrine. Would it be obvious that you could do whatever you want with a physical book? Like would that, would libraries fall under fair use if we didn't have the first sale doctrine in the statute? Yes, I think it would, Your Honor. I mean, I think the Supreme Court has recognized that, for physical books, you have an unlimited right to distribute them. It's not simply because it was codified in Section 109.

      Physical lending in libraries could rely on fair use if Section 109 didn't authorize libraries

    18. I want to start by reframing and step back to really focus on the practical realities of what Internet Archive is doing and what is before this court. Internet Archive is asking this court to disregard the controlling law of this court as well as the Supreme Court. And what it is seeking is a radical change in the law that if accepted would disable the digital economy. Not just for books, but for movies, for music, for TV and the like.

      Appellee's opening position: CDL is a destabilizing act on the digital economy

    19. You have not addressed the National Emergency Library. That's been sort of silent today. So given your statement now, you would agree that the National Emergency Library was a violation of copyright, because it wasn't one-to-one, correct? I would not agree. I mean, you were allowing multiple users to use the same digital copy of a hard book. The National Emergency Library does present different facts and different justifications, Your Honor.

      National Emergency Library

      Judge, paraphrased: are there other circumstances where libraries would come to the court saying it was legal to break the physical sequestration of loaned items?

    20. With constraints that impact the value of the library's ability to do that that are very much tied to the physical instantiation of the book, right? That's right. You can't rely on that one book to serve the serial needs of people globally because the costs of sending the book would exceed the costs of just getting another book on the other side of the world. I don't think I agree with that, Your Honor, and I don't think the record supports it.

      Question on the costs of shipping the physical book for lending versus purchasing a new copy

      Buying a copy carries additional costs for a library, such as putting it into the inventory and shelving the physical copy. Question: how does that compare to the costs between libraries of ILLing a book around?

    21. Like the Wikipedia links where we are, where people are able to... Yeah, but the way that works is like snippet view, right? You can click on it and go to the particular part of the book. But if you want the whole book, you have to do it through CDL. Again, this is not... That's not really part of CDL, the Wikipedia links, right?

      Wikipedia reference links are like Google Books snippet view

      This seems like a distraction from the core question...this really isn't a part of CDL.

    22. This is exactly what was going- this is exactly what the plaintiff said and what was going on in Sony as well. They said, well, you don't need to tape these movies off the air. We'll rent you a tape. We'll sell you a tape. You can get those benefits this other way just by paying us. You shouldn't be able to use technology yourself with the access you already have to get those benefits. That is the same thing that's going on here.

      Using technology to get the benefits that you already have

      Someone was already entitled to receive the content, a la the Sony case.

    23. if they already have a physical copy and they want a circulated digital copy now, in the absence of your program, they would have to license an ebook. But once the program is available, they don't need to and they can just digitize or rely on you to digitize the physical book they have, right? This offers them another way of using the access they've already got, the right they already have to lend it to one patron at a time. In exactly the same way that the VCR in Sony allowed the person to access the material later instead of right now

      CDL is an application of own-to-loan for physical items

    24. You are right that the license terms that the publishers offer to libraries do not allow them to have electronic materials in their permanent collection, whereas these libraries have print materials in their permanent collection. And if they want to use CDL as an alternative to rental, right, the OverDrive scenario, they need to buy those books.

      Publishers don't have license terms that allow for electronic materials in a library's permanent collection

    25. When they buy those books, they buy the physical copies to lend to their patrons one at a time or through an interlibrary exchange. They also buy e-books to make those available to their patrons. We're focused here on e-books and impacting e-licensing. I have a hard time reconciling those two, specifically as to e-licensing. Why would libraries ever pay for an e-license if they could have Internet Archive scan all the books, hard copies they buy, and make them available on an unlimited basis?

      Why would libraries buy ebook licenses when they could get the same from CDL?

    26. under factor 4 you say that actually there's one reason there would still be a market for e-books: because e-books are more attractive than digitized versions of physical books. Right? Because they have features and they're more user friendly or whatever. So what that kind of means is what you're saying is that your digital copies are more convenient or more attractive, I guess more convenient than physical books, but less convenient than e-books.

      Digitized physical books are different from publisher supplied ebooks

      Publishers have an inherently superior product with "born digital" ebooks than what libraries can produce with scanned physical books: reflowing pages, vector illustrations, enhanced tables of contents and indexes, etc.

    27. So that statute, Section 109, talks about you can lend out the physical copy, but then it also specifically delineates when you can make a copy of it or a digital copy and it limits when you can distribute that. So why wouldn't it conflict with what Congress has specified to say, well, this is really just the same as the physical copy.

      Section 109 "First Sale Doctrine"

    28. That's right, and that is why we have been doing this without molestation by the publishers since 2011. But it's your position that you could lend it out during the first five years and that would still be fair use? That would be a different case, Your Honor. And we've very... I think the answer to that is... My only reason is if you're just doing it in your discretion. So the answer to that is we think it would... We think that would be fair use, Your Honor, because we don't think that would have a market effect either. There might... If they could show there was, or if the facts were different, that's why fair use is case by case, and if there were a case presenting those different facts, that might be different.

      IA believes it is fair use immediately but has the 5-year embargo to assuage publisher concerns

    29. in the real world, there's a lot more friction in the sort of market for passing a paper book from one person to another. And I'm imagining that that's priced into the price of the paper book. Your premise is that a scanned digital version of that paper book is nothing more. It is tantamount to the same thing as the book. But we know there's a distinct market for those digital books. They're priced separately. So you're taking something from one market and you're inserting it into another market without ever having paid the premium in that new market.

      "Friction" of lending physical items

      Reducing friction is seen as a benefit of a transformative use?

    30. Hachette Book Group, Inc. v. Internet Archive Appeal Oral Argument Second Circuit (88 min audio)

      by United States Court of Appeals for the Second Circuit

    31. If the forms are considered distinct things and there are separate markets for them, why shouldn't the law recognize that converting the paper book into a digital book isn't just the same thing as passing around the paper book?

      Question early in the oral argument focuses on the first fair use factor

  4. May 2024
    1. Google Translate is generative AI

      Google Translate as generative AI

    2. the various contracts or license agreements that publishers require libraries to sign to access research content for our users. And this is generally described as the problem of contractual override.

      Contractual override of fair use

      ...includes proposed contract language starting in the 46th minute.

    3. why training artificial intelligence in research context is and should continue to be a fair use

      Examination of AI training relative to the four factors of fair use

    4. is the output copyrightable

      Is the output copyrightable?

    5. let's go to the second question. Does the output

      Does the output infringe?

      Is it substantially similar in protected expression? What is the liability of the service provider?

    6. And then Singapore and Japan also have provisions and they basically allow the reproduction necessary for text and data mining.

      Singapore and Japan have laws that allow for reproduction necessary for text and data mining

    7. So, let's look at the first question.

      Does ingestion for training AI constitute infringement?

      Later:

      supposedly no expression, none of the original expression from the works that were ingested ends up in the model. That the model should just be this, you know, kind of mass of relationships and patterns and algorithms and all that kind of stuff, but no expression itself.

      In the U.S., this would be seen as a matter of fair use: search engines, plagiarism software, Google Books. The underlying theory is that we'll ignore the copies made by the computer...the expression coming out isn't the same.

    8. three different issues that are being implicated by artificial intelligence. And this is true with, you know, all artificial intelligence, not just generative, but particularly generative.

      Three issues implicated by Generative AI

      1. Does ingestion for training AI constitute infringement?
      2. Does the output infringe?
      3. Is the output copyrightable?

      The answer is different in different jurisdictions.

    9. And one way we've seen artificial intelligence used in research practices is in extracting information from copyrighted works. So researchers are using this to categorize or classify relationships in or between sets of data. Now sometimes this is called using analytical AI and it involves processes that are considered part of text and data mining. So we know that text and data mining research methodologies can, but don't necessarily need to, rely on artificial intelligence systems to extract this information.

      Analytical AI: categorize and contextualize

      As distinct from generative AI...gun example in motion pictures follows in the presentation.

    10. Handling Academic Copyright and Artificial Intelligence Research Questions as the Law Develops

      Spring 2024 Member Meeting: CNI website | YouTube

      Jonathan Band, Copyright Attorney, Counsel to the Library Copyright Alliance

      Timothy Vollmer, Scholarly Communication & Copyright Librarian, University of California, Berkeley

      The United States Copyright Office and courts in many United States jurisdictions are struggling to address complex copyright issues related to the use of generative artificial intelligence (AI). Meanwhile, academic research using generative AI is proliferating at a fast pace and researchers still require legal guidance on which sources they may use, how they can train AI legally, and whether the reproduction of source material will be considered infringing. The session will include discussion of current perspectives on copyright and generative AI in academic research.

    1. the hardest part of doing this has not been the technology, it's not been the APIs, it's not been training engineers. It's been working with the practitioners, with the change management doing it this way requires. Because when we do this recontextualization, when we cross the streams between various disciplines, where we take the information that someone has spent their career cataloging and throw 90% of it out to present it to an audience who doesn't care, that really, really irks people who have spent their entire career doing it in a way that they were trained is the best way you could do this. And some of the things that you can do with technology run into conflict with longstanding practices that were put in place before that technology existed.

      The hardest part: helping practitioners understand the recontextualization potential, not building the technology

      The practitioners are used to dealing with data in their own silos. They carefully curate that data. They may not take kindly to you taking a portion of it to use in the global data model and discarding the rest.

    2. The other cool thing that we've added is this: knowing what records have changed is really valuable, but the other question, really a more scholarly question is, how did it change? How has information changed over time? And so we built our infrastructure to support Memento, which is one of the protocols behind the Internet Archive, to give our data the equivalent of that Wayback Machine, to be able to say, what did that record look like a year ago? What did that record look like two years ago? To give you that audit log, so you can go and look at how that changed over time, and to give ourselves the ability to audit what changed when?

      Use of Memento to expose how data has changed
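
      A minimal sketch of the Memento datetime negotiation being described (RFC 7089): a client asks a TimeGate for the version of a resource closest to a given moment. The TimeGate URL below is a hypothetical placeholder, not the Getty's actual endpoint; the header names are the real ones from the RFC.

      ```python
      import requests

      # Memento (RFC 7089): ask a TimeGate for the version of a resource
      # closest to a given datetime. Hypothetical TimeGate URL for illustration.
      TIMEGATE = "https://example.org/timegate/https://example.org/record/123"

      resp = requests.get(
          TIMEGATE,
          headers={"Accept-Datetime": "Mon, 01 May 2023 00:00:00 GMT"},
          allow_redirects=True,  # the TimeGate redirects to the chosen snapshot
      )

      # Memento-Datetime reports when this snapshot was captured; the Link
      # header advertises the TimeMap listing every known version of the record.
      print(resp.headers.get("Memento-Datetime"))
      print(resp.headers.get("Link"))
      ```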

    3. You need something that will help you show documents that will help you map to those understandings, those contexts that people bring to the information very quickly. And documents are also the way the internet works, right? You want to be able to use the sorts of affordances that the engineers understand: REST APIs, JSON documents, cache control. Because those are things that engineers know that will let this stuff work fast. It will also let you hire people who know how the internet works and don't have to understand all the complex, crazy stuff that we cultural heritage people do; they make it possible.

      Linked Data—and web architecture in general—as a hirable engineering skill

      If you can hire engineers that don't immediately need to know the intricacies of cultural heritage metadata, you can expand the pool of entry-level people you can hire and then train up on the cultural heritage details as needed.
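
      A small sketch of the affordances being named: everything here is plain HTTP that any web engineer already knows. The record URL is hypothetical.

      ```python
      import requests

      RECORD_URL = "https://example.org/api/object/123"  # hypothetical endpoint

      # First fetch: an ordinary GET for a JSON document.
      first = requests.get(RECORD_URL, headers={"Accept": "application/json"})
      record = first.json()
      etag = first.headers.get("ETag")

      # Later fetch: a standard conditional request. Cache-control semantics
      # come for free; no cultural-heritage-specific knowledge is required.
      later = requests.get(RECORD_URL, headers={"If-None-Match": etag})
      data = record if later.status_code == 304 else later.json()
      ```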

    4. In a context based on the sort of context that people would understand, we say, this is the record about an artwork because you have a shared idea of what an artwork's information might be. It's probably not deep in the weeds around, like, Van Gogh had a brother named Theo Van Gogh and Theo Van Gogh was married and that wife... it turns out that's not part of how we describe the artwork, that's part of the ecosystem knowledge around it. But graphs, on the other hand, are really optimized for asking those sorts of questions, for saying, I have a context, maybe I'm interested in siblings of artists and I want to be able to ask that kind of question.

      Van Gogh's family — not cataloged in the metadata of a painting but available elsewhere

      This is an interesting question that linked data enables. A painting would not have specific details about the artist's family, but that data could be linked to an artist's entity URI through another system.
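
      A sketch of the kind of graph question this enables, posed as SPARQL from Python. The endpoint and the property names are hypothetical stand-ins for whatever schema the real graph uses.

      ```python
      import requests

      ENDPOINT = "https://example.org/sparql"  # hypothetical SPARQL endpoint

      # "Paintings whose artist has a sibling" -- e.g., Van Gogh and Theo.
      QUERY = """
      PREFIX ex: <https://example.org/schema/>
      SELECT ?artwork ?sibling WHERE {
        ?artwork ex:createdBy ?artist .   # painting -> artist
        ?artist  ex:sibling   ?sibling .  # artist -> sibling, via another system
      }
      LIMIT 10
      """

      resp = requests.get(
          ENDPOINT,
          params={"query": QUERY},
          headers={"Accept": "application/sparql-results+json"},
      )
      for row in resp.json()["results"]["bindings"]:
          print(row["artwork"]["value"], "->", row["sibling"]["value"])
      ```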

    5. But when we want to provide access to that data, what we end up doing is recontextualizing that data. We have to change the lens from one around someone who knows the discipline, who knows the form, into one that reaches the way that users expect to see that data, which is often a really different shape, a different flavor, because their needs are different than the needs of catalogers. And so when we do that, we end up displaying records that are aggregations of data coming from multiple different systems, because the tool that you would use to catalog the source is different from the tool you'd use to capture digital media, which is different from the tool you'd use to capture collections records, which is different from where you put the audio for the audio guide.

      Enabling recontextualization of data

      Allow catalogers to work in the environments that suit them best, but then enable the data to move into different contexts.
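
      A toy sketch of the aggregation step described here: records about the same object live in several systems, and the access layer reshapes a slice of each into one display document. All system names and fields are hypothetical.

      ```python
      # Hypothetical per-system records for the same object, each maintained
      # in the tool that suits its specialists best.
      collections_record = {"id": "obj-123", "title": "Irises", "artist": "Vincent van Gogh"}
      media_record = {"object": "obj-123", "image": "https://example.org/iiif/obj-123"}
      audio_record = {"object": "obj-123", "audio": "https://example.org/audio/obj-123.mp3"}

      def recontextualize(collections: dict, media: dict, audio: dict) -> dict:
          """Build the user-facing record: keep only what this audience needs,
          discarding the rest of each silo's carefully curated detail."""
          return {
              "title": collections["title"],
              "artist": collections["artist"],
              "image": media["image"],
              "audio": audio["audio"],
          }

      print(recontextualize(collections_record, media_record, audio_record))
      ```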

    6. There's this research use case where there are people who are not looking for information, but they're looking for questions that haven't been asked, patterns that they haven't seen before. Things in the data that we don't already know and couldn't share with them, but that they could discover with their expertise.

      Answering the expert researcher's question

    7. Digital infrastructure isn't designed to make computers happy because computers aren't happy. We do it to empower people to be more effective in meeting their mission. And so when we think about who those people are, the reason we have the ecosystem we have is because we have many different constituents with many different kinds of needs.

      Digital infrastructure to meet constituent needs

    8. over the past five years, what we've done is said, what if we take those sorts of technologies, a single standard in Linked Art to link things together, the vocabularies as connective glue, and used it to power our entire discovery ecosystem, including our archival collections, but also our museum collections on top of that? What if we used it also to power the audio guide that we provide to our visitors, or to provide interesting novel experiential interfaces on top of the collections that we have? What if we worked with third parties to use it, both large parties like Google Arts & Culture, and small projects like Spanish Art in the US, which is a project of the cultural office of the Embassy of Spain to find Spanish artworks in American museums and bring them together? And so we've worked across projects at all of these scales to pull records together and say, what would happen if you really tried to build this system out? So we end up with a unified API model, a way to access that data that spans across all of these collections and brings them together into a single data model.

      What if we built a unified API model across projects at the Getty?

    9. American Art Collaborative. And this was a 2017 project to take 14 art museum collections together and use these sort of linked data principles that these things were built on to say, could you bridge across 14 institutions? Could you find connections? Could you provide a unified discovery environment for that many institutions at the same time? And what came out of that was a project that is called Linked Art. Rob Sanderson, who I'm sure many of you know, he's a regular here and a good colleague of mine. He and I worked together to create this data model called Linked Art, based on the data model of the American Art Collaborative, which was the underlying sort of connective data tissue, building on years and years of work in the academic community under CIDOC, to say, what if we had a tool that could bridge these things together?
    10. Linked Data in Production: Moving Beyond Ontologies

      Spring 2024 Member Meeting: CNI website | YouTube

      David Newbury, Assistant Director, Software and UX, Getty

      Over the past six years, Getty has been engaged in a project to transform and unify its complex digital infrastructure for cultural heritage information. One of the project’s core goals was to provide validation of the impact and value of the use of linked data throughout this process. With museum, archival, media, and vocabularies in production and others underway, this session shares some of the practical implications (and pitfalls) of this work—particularly as it relates to interoperability, discovery, staffing, stakeholder engagement, and complexity management. The session will also share examples of how other organizations can streamline their own, similar work going forward.

      http://getty.edu/art/collection/ http://getty.edu/research/collections/ http://vocab.getty.edu https://www.getty.edu/projects/remodeling-getty-provenance-index/

    1. So I'm not surprised, and perhaps the last question, about user privacy.

      Question about the privacy of the user interacting with the RAG

      It doesn't seem like JSTOR has thought about this thoroughly. In the beta there is kind of an expectation that the service is being validated, so what users are doing is being watched more closely.

    2. I quite often would like to ask my research assistant to go and explore for me. Part of what they would be looking for is who is mentioned, what reference, what arguments do they mention, what school of thoughts. So this would be a very simple way and it would certainly make their lives easier.

      Researcher thinks JSTOR's RAG would make their research assistants' lives easier

    3. we can see like there's a whole body of things that aren't working well.

      General feedback

      Users would like to see longer summaries. They talk-to-the-document to ask if it mentions the patron's topic — similar to a Control-F on the document. They also ask about the nature of the discussion on a topic — seemingly coming from an advanced researcher with "something specific in mind." They also ask what methods are used and whether concepts are related.

      There are also queries that seem to want to push the boundaries of the LLM RAG.

    4. So they say, I really love this tool. It is great how it attempts to relate the search query back to the content in the article. I find that I'm spending more time capturing and downloading the AI summaries than downloading the PDFs.

      Feedback: changed user behavior from downloading PDFs to downloading LLM-generated summaries

    5. And then we have, and this is a work in progress, we have automated mechanisms for assessing several different aspects of the responses.

      Assessment mechanisms

      In addition to the patron-submitted assessments of LLM output (positive/negative, qualitative comments), there are several automatic processes: toxicity (using a hate speech model), faithfulness (how close the response is to the document content), relevancy (measured similar to how it is done for search), and similarity (making sure the response is complete...similar to faithfulness).
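
      A sketch of how the faithfulness- and relevancy-style checks could be automated with off-the-shelf sentence embeddings. The model choice and the scoring are my assumptions, not ITHAKA's actual pipeline.

      ```python
      from sentence_transformers import SentenceTransformer, util

      model = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source encoder

      def score_response(response: str, source: str, query: str) -> dict:
          """Automated checks in the spirit described: faithfulness (response
          vs. the document text) and relevancy (response vs. the query)."""
          r, s, q = model.encode([response, source, query], convert_to_tensor=True)
          return {
              "faithfulness": float(util.cos_sim(r, s)),  # close to the document?
              "relevancy": float(util.cos_sim(r, q)),     # on the user's topic?
          }

      print(score_response(
          "The article argues open access increased citation rates.",
          "Our analysis shows open-access articles received more citations.",
          "does open access affect citations",
      ))
      ```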

    6. So how does this work? I wanted to give this picture of what's actually happening behind the scenes, especially with this question and answer. So first, I will say that we're using a combination of OpenAI's GPT-3.5 to do this, as well as some smaller open source models to generate the vectors for the semantic search.

      JSTOR implements a RAG

      RAG == Retrieval Augmented Generation
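
      A minimal sketch of the architecture described: open-source embeddings for retrieval, GPT-3.5 for generation. The chunking and prompt wording are assumptions about the general RAG pattern, not JSTOR's code.

      ```python
      from openai import OpenAI
      from sentence_transformers import SentenceTransformer, util

      embedder = SentenceTransformer("all-MiniLM-L6-v2")  # open-source vectors
      client = OpenAI()  # expects OPENAI_API_KEY in the environment

      def answer(question: str, segments: list[str]) -> str:
          # Retrieve: rank the document's segments by semantic similarity.
          seg_vecs = embedder.encode(segments, convert_to_tensor=True)
          q_vec = embedder.encode(question, convert_to_tensor=True)
          best = util.cos_sim(q_vec, seg_vecs)[0].argmax().item()

          # Generate: hand only the retrieved segment to the model as context,
          # which is the "guardrail" keeping answers inside this document.
          chat = client.chat.completions.create(
              model="gpt-3.5-turbo",
              messages=[
                  {"role": "system", "content": "Answer only from the passage given."},
                  {"role": "user", "content": f"Passage: {segments[best]}\n\nQuestion: {question}"},
              ],
          )
          return chat.choices[0].message.content
      ```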

    7. In most cases, the users will be able to trace back from the response to the point in the article where that information was pulled from. So you can see the arrow pointing to the highlighted text that is the start of the segment that was used to generate that answer.

      Footnotes in the LLM response go to specific passages in the text

      This helps the patron understand that the analyzed output can be more trustworthy than a general prompt to an LLM. That there are "guardrails" that "keep the user within the scope of this document".
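
      The traceback mechanics could be as simple as carrying each retrieved segment's character offsets alongside the generated answer, so the UI can jump to and highlight the supporting passage. A sketch, with the structure and offsets assumed for illustration:

      ```python
      from dataclasses import dataclass

      @dataclass
      class Segment:
          text: str
          start: int  # character offset into the full article text
          end: int

      def cite(answer: str, segment: Segment) -> dict:
          """Pair the generated answer with the span it was generated from."""
          return {
              "answer": answer,
              "source_start": segment.start,  # where the UI should highlight
              "source_end": segment.end,
              "source_preview": segment.text[:80],
          }

      seg = Segment("Open-access articles received more citations...", 1042, 1395)
      print(cite("The study links open access to higher citation counts.", seg))
      ```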

    8. And in this, on the side, you see we have this new chat box where the user can engage with the content. And this very first action, the user doesn't have to do anything. They land on the page and, as long as they ran a search, we immediately process a prompt that asks: how is the query you put in related to this document?

      Initial LLM chat prompt: why did this document come up

      Using the patron's keyword search phrase, the first chat shown is the LLM analyzing why this document matched the patron's criteria. Then there are preset prompts for summarizing what the text is about, recommended topics to search, and a prompt to "talk to the document".
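
      The zero-click first message could be little more than a template filled with the search the patron already ran. A hypothetical sketch of the preset prompts described; the wording is invented:

      ```python
      # Hypothetical preset prompt templates in the spirit of those described.
      PRESETS = {
          "why_matched": (
              "A reader searched for: {query}\n"
              "Explain briefly how this document relates to that search."
          ),
          "what_about": "Summarize what this text is about.",
          "topics": "Suggest related topics a reader of this text might search next.",
      }

      def first_message(query: str) -> str:
          # Runs automatically on page load, before the patron types anything.
          return PRESETS["why_matched"].format(query=query)

      print(first_message("municipal broadband adoption"))
      ```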

    9. Navigating Generative Artificial Intelligence: Early Findings and Implications for Research, Teaching, and Learning

      Spring 2024 Member Meeting: CNI website | YouTube

      Beth LaPensee, Senior Product Manager, ITHAKA

      Kevin Guthrie, President, ITHAKA

      Starting in mid-2023, ITHAKA began investing in and engaging directly with generative artificial intelligence (AI) in two broad areas: a generative AI research tool on the JSTOR platform and a collaborative research project led by Ithaka S+R. These technologies are so crucial to our futures that working directly with them to learn about their impact, both positive and negative, is extremely important.

      This presentation will share early findings that illustrate the impact and potential of generative AI-powered research based on what JSTOR users are expecting from the tool, how their behavior is changing, and implications for changes in the nature of their work. The findings will be contextualized with the cross-institutional learning and landscape-level research being conducted by Ithaka S+R. By pairing data on user behavior with insights from faculty and campus leaders, the session will share early signals about how this technology-enabled evolution is beginning to take shape.

      https://www.jstor.org/generative-ai-faq

    1. you can see across the bottom we have a set of, like, preset prompts. These are highly optimized to generate responses that we've crafted. So the first one is "what is this text about"; this is a summary of the entire document

      Default prompt: output a summary of the document

    2. Navigating Generative Artificial Intelligence: Early Findings and Implications for Research, Teaching, and Learning

      Spring 2024 Member Meeting: CNI website | YouTube

      Beth LaPensee, Senior Product Manager, ITHAKA

      Kevin Guthrie, President, ITHAKA

      https://www.jstor.org/generative-ai-faq

    1. In approaching this material, suspend your disbelief, avoid choosing a preferred scenario, and embrace the full set of possibilities included in this material. Remember, the future will not be as described in any one scenario but will be made up of components of all four scenarios

      I don't know if it is typical of such tools, but the scenarios presented seem written such that one is highly desirable, one is not, and two are somewhat realistic. That might be the nature of the two-axis outline for the scenarios. I almost wish I had read them in a different order to try to limit a best-to-worst bias. (Although, as presented, the fourth is not the worst-case scenario…it is the third.)

    2. Intellectual property rights and the impact of AI-consumed and -produced content on the rights of others are not mentioned in the brief, which seems like a significant omission.

    3. AI not used to benefit society's betterment, driver capitalism

      I think this is trying to say that AI is seen as benefiting capitalist-driven entities, not society as a whole. But in any case it is awkwardly worded.

    4. Alex was a couple years in now as a codirector of HIF with her partner, MITA. MITA had been her assistant for several years ahead of their promotion, but at the time of the promotion it was so clear that they were more partners than a boss and assistant relationship. So, HIF made the decision to opt for the codirector approach and so far it had been a rousing success.

      The digital assistant has been granted a kind of personhood status?

    5. The most advanced libraries operate almost exclusively on an AI platform

      What does this mean? What is an AI platform? “Operate” is a very broad term: content acquisition and licensing, inventory and description, publication and announcement/advertising, discovery and delivery.

    6. The ARL/CNI 2035 Scenarios: AI-Influenced Futures in the Research Environment. Washington, DC, and West Chester, PA: Association of Research Libraries, Coalition for Networked Information, and Stratus Inc., May 2024. https://doi.org/10.29242/report.aiscenarios2024

  5. Apr 2024
    1. These emails — which I encourage you to look up — tell a dramatic story about how Google’s finance and advertising teams, led by Raghavan with the blessing of CEO Sundar Pichai, actively worked to make Google worse to make the company more money. This is what I mean when I talk about the Rot Economy — the illogical, product-destroying mindset that turns the products you love into torturous, frustrating quasi-tools that require you to fight the company’s intentions to get the service you want.

      Rot Economy: taking value from the users

      Not [[enshittification]]…Value was taken directly from the users to the company. Is there a parallel to recent Boeing actions here?

    1. A grander sense of partnership is in the air now. What were once called AI bots have been assigned lofty titles like “copilot” and “assistant” and “collaborator” to convey a sense of partnership instead of a sense of automation. Large language models have been quick to ditch words like “bot” altogether.

      AI entities are now anthropomorphized

    2. Norman, now 88, explained to me that the term “user” proliferated in part because early computer technologists mistakenly assumed that people were kind of like machines. “The user was simply another component,” he said. “We didn’t think of them as a person—we thought of [them] as part of a system.” So early user experience design didn’t seek to make human-computer interactions “user friendly,” per se. The objective was to encourage people to complete tasks quickly and efficiently. People and their computers were just two parts of the larger systems being built by tech companies, which operated by their own rules and in pursuit of their own agendas.

      “User” as a component of the bigger system

      Thinking about this and any contrast between “user experience design” and “human computer interaction”. And about schema.org constructs embedded in web pages…creating web pages that were meant to be read by both humans and bots.

    3. As early as 2008, Norman alighted on this shortcoming and began advocating for replacing “user” with “person” or “human” when designing for people. (The subsequent years have seen an explosion of bots, which has made the issue that much more complicated.) “Psychologists depersonalize the people they study by calling them ‘subjects.’ We depersonalize the people we study by calling them ‘users.’ Both terms are derogatory,” he wrote then. “If we are designing for people, why not call them that?”

      “User” as a depersonalized, derogatory term

    1. We often think of software development as a ticket-in-code-out business but this is really only a very small portion of the entire thing. Completely independently of the work done as a programmer, there exist users with different jobs they are trying to perform, and they may or may not find it convenient to slot our software into that job. A manager is not necessarily the right person to evaluate how good a job we are doing because they also exist independently of the user–software–programmer network, and have their own sets of priorities which may or may not align with the rest of the system.

      Software development as a conversation

    1. I guess her own self-description that it doesn’t actually matter where she stops, that the important thing in the making of the painting is the making and destroying and making and destroying, that that’s actually what the whole thing is about.

      Deciding where to stop is a choice unto itself

      This session had me in a panic: how do we put descriptive metadata on this? Where would we draw the line between different representations of the work? Which representation becomes the featured one…the one the artist picked as a point on the timeline of creation, the one the describer picked for an aesthetic reason, the one that broke through in the public consciousness?

    2. There are, in my view, three stages of making art. One of them is the imagining, and the final one is the shaping. But in between, there is the judging, which is kind of what we’re talking about here, the editing.

      Three stages of creation: imagining, judging, shaping

      It is the middle one that is often invisible to the point of being lost. That is the role of editing.

    3. Now, many people, when they read, listen to anything, when they take in media, they don’t necessarily even know where it was from.

      The lost role of the editor with the decontainerization of digital media

    4. The Work of Art, How Something Comes From Nothing

      Publisher link. He talks later in the podcast about how the physical book itself is a work of art…from the texture of the paper (which he thought was too smooth) to the cloth cover (which he pointedly advocated for with the publisher).

    1. But the National Emergency Library (NEL) refutes that.

      First time the National Emergency Library is mentioned in the brief.

    2. Properly understood, controlled digital lending simply enables modernlibraries to carry out their time-honored missions in the more efficient and effectiveway digital technologies allow.

      CDL is an "efficient and effective digital way"

      I wonder if the point will be made later that there is a non-zero cost to providing CDL infrastructure. CDL, and the digital infrastructure required to support it, costs roughly the same as shelving the physical book.

      Edited to add: yes it is mentioned later in the brief.

    3. Publishers’ claim that the scope of IA’s lending is too small to calculate market harm (Resp.Br. 52-53) is equally unfounded. To start, Publishers ignore the huge increase in fixed costs—purchasing and storing books, building and expanding digital infrastructure, and more—that would necessarily accompany (and impose limits on) any expansion of controlled digital lending.

      CDL programs have their own costs

      ...so the costs of CDL do get mentioned in the brief.

    4. IA’s lending serves additional transformative purposes by enabling innovative interactions between books and the Internet, such as cite-checking online resources like Wikipedia.

      Wikipedia cite-checking as transformative CDL purpose

    5. Controlled digital lending permits libraries to build permanent collections and to archive and lend older books in a form that preserves their original printing. Publishers’ ebook licenses cannot serve this preservation mission because they are not photographs of the original editions, and ongoing access depends on Publishers’ discretion and is subject to change without notice.

      CDL supports preservation of published material

    6. Publishers erroneously claim that controlled digital lending does not expand utility because their ebook licenses already provide the same efficiency.

      Publishers can offer greater utility

      I propose a notion that may not be legally relevant in this case: publishers, who have the digital source file of the publication, can do so much more than a library can do with CDL. The perfect example of this is the X-Ray functionality in Kindle: "allows readers to learn more about a character, topic, event, place, or any other term, simply by pressing and holding on the word or phrase that interests them." To provide equivalent functionality, the library would need to OCR the image, correct it, and then layer on the additional functionality. In the non-fiction arena, the publisher can make interactive graphs and tables that would be difficult to do based on the page images.

    7. In any event, IA’s controlled digital lending program serves a different purpose than physically lending the books it owns. Although both ultimately enable patrons to read the content, that does not mean the purposes are the same—most uses of a book involve viewing its content. Controlled digital lending serves a different purpose by permitting libraries to lend the books they own to a broader range of people for whom physical lending would be impractical.

      CDL enables access beyond physical boundaries

      Here the brief says that libraries can use CDL to lend beyond the physical boundaries of its territory. This is one of the scenarios envisioned by the NISO IS-CDL working group's draft recommended practice. The recommended practice doesn't offer a legal opinion; instead, it leaves it up to the risk assessment of each organization.

    8. But what matters is whether the borrowers are “entitled to receive the content.” Redigi, 910 F.3d at 661 (emphasis added).

      Purchase of content versus purchase of format

      This is an interesting argument—that what is purchased is the content of the book, not the physical/digital artifact itself. This seems right to me (again, unencumbered by deep legal knowledge/reasoning).

      Side note: in library school I once argued that de-spining the pages of a book was a legitimate way to ensure good digital copies could be made. The professor was horrified at the suggestion.

    9. Publishers mischaracterize BWB’s ownership. Resp.Br. 17. As explained (IABr. 23 n.8), BWB is not owned by IA or Brewster Kahle, but by Better World Libraries, which has no owner. A-6087-89. BWB and Better World Libraries are operated by a three-member board. A-6089. IA has no control over either entity. While Kahle has leadership roles in each entity, some overlap in personnel does not undermine the separateness of corporate entities.

      Kahle participates in both BWB/BWL and IA

      I don't know the legal significance of this, but from a lay-person's view there does seem to be some entanglement.

    10. Second, Publishers present an erroneously cramped view of libraries’ missions. Libraries do not acquire and lend books solely to make them physically available to patrons within a restricted geographic area. Contra Resp.Br. 4, 9. Libraries provide readers more egalitarian access to a wider range of books, overcoming socioeconomic and geographic barriers by sharing resources with other libraries through interlibrary loans.

      Second of two "critical misconceptions": libraries' mission to provide broad access to information

    11. First, Publishers disregard the key feature of controlled digital lending: the controls that ensure borrowing a book digitally adheres to the same owned-to-loaned ratio inherent in borrowing a book physically. Publishers repeatedly compare IA’s lending to inapposite practices that lack this key feature.

      First of two "critical misconceptions": CDL controls on owned-to-loaned ratio

      CDL isn't the open redistribution of content (a la openly posting with no restrictions or digital resale).
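      A minimal sketch (my own illustration, not IA's actual system) of the owned-to-loaned invariant that distinguishes CDL from open posting:

      ```python
      # Minimal sketch (hypothetical, not IA's actual implementation) of the
      # owned-to-loaned control: digital checkouts may never exceed the number
      # of copies the library owns and has taken out of physical circulation.

      class CdlTitle:
          def __init__(self, copies_owned: int):
              self.copies_owned = copies_owned  # print copies held back from the shelf
              self.copies_loaned = 0            # digital loans currently outstanding

          def checkout(self) -> bool:
              """Lend one digital copy, or refuse if the ratio would be exceeded."""
              if self.copies_loaned >= self.copies_owned:
                  return False                  # patron waits, just as for a print copy
              self.copies_loaned += 1
              return True

          def checkin(self) -> None:
              self.copies_loaned = max(0, self.copies_loaned - 1)

      title = CdlTitle(copies_owned=2)
      assert title.checkout() and title.checkout()
      assert not title.checkout()  # a third simultaneous loan is refused
      ```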

    12. Hachette Book Group, Inc. v. Internet Archive (23-1260) Court of Appeals for the Second Circuit

      REPLY BRIEF, on behalf of Appellant Internet Archive, filed 04/19/2024

      https://www.courtlistener.com/docket/67801014/hachette-book-group-inc-v-internet-archive/

    1. Vaughn says the temperatures, along with carbon dioxide levels, have naturally fluctuated over earth's history in cycles lasting tens of thousands of years. Over the last million years, CO2 in the atmosphere, despite its ups and downs, has never really gone above maybe 280 parts per million. Until now. As of January 2024 the amount of heat-trapping carbon dioxide is a whopping 422 parts per million. We've had a wonderful party with fossil fuels for a couple of centuries. We have changed the world at a cost that's only now becoming evident.

      Ice cores provide a history of carbon dioxide in the atmosphere

    2. Each ice core is kind of unique and shows you a different climatic window. Vaughn uses water isotopes to determine what the temperature was when each layer of ice was formed. Isotopes are atoms that have the same number of protons and electrons but a different number of neutrons, affecting their mass. For example, the oxygen in water (H2O) has a molecular weight of either 16 or 18, and so there is heavy water and light water. Precipitation that falls in warmer temperatures tends to be heavier water, he says, but in colder air, like at the poles, the snow that falls is generally lighter water. By looking at these ratios of isotopes in ice cores, we are able to infer the temperature from when it fell as snow.

      Using the ratio of heavy to light water isotopes to determine temperature
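
      The episode gives the idea but not the math; the conventional way to express these ratios is "delta notation" (my addition, sketched in LaTeX):

      ```latex
      % Conventional "delta notation" for the heavy-to-light oxygen ratio
      % (my addition; the episode states the idea but not the formula).
      % R = 18O/16O measured in the ice, compared against a reference standard.
      \[
        \delta^{18}\mathrm{O} =
          \left( \frac{R_{\mathrm{sample}}}{R_{\mathrm{standard}}} - 1 \right)
          \times 1000
      \]
      ```

      The result is expressed in parts per thousand (‰); more negative values mean lighter water and therefore colder conditions when the snow fell.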

  6. Mar 2024
    1. Posted to YouTube on 12-Mar-2024

      Abstract

      Arial, Times New Roman, Consolas, Comic Sans... digital typography has turned us all into typesetters. The tools we use, the apps we build, the emails we send: with so much of our lives mediated by technology, something as seemingly innocuous as picking a typeface can end up defining our relationship with the systems we use, and become part of the identity that we project into the world. Typography is a fundamental part of modern information design, with implications for user experience, accessibility, even performance - and when it goes wrong, it can produce some of the most baffling bugs you've ever seen.

      Join Dylan Beattie for a journey into the weird and wonderful history of digital typography, from the origins of movable type in 8th century Asia, to the world of e-ink displays and web typography. We'll look at the relationship between technology and typography over the centuries: the Gutenberg Press, Linotype machines, WYSIWYG and the desktop publishing revolution. What was so special about the Apple II? How do you design a pixel font? We'll learn why they're called upper and lower case, we'll talk about why so many developers find CSS counter-intuitive - and we'll find out why so many emails used to end with the letter J.

    1. many of those servers are of course honeypots; some of them are staging and demo environments of the vendors. it's interesting to note that this protocol and technology is also used for animals, so many of the records on the internet actually are for cats and dogs, and many of the records are exposed via universities or research centers that just share anonymized data with other research centers

      Some DICOM servers are intended to be public

    2. many hospitals started moving their DICOM infrastructures to the cloud because it's cheaper, it's easier, it's faster, it's good. so they did the shift, and they used the legacy protocol DICOM without sufficient security

      Protocol intended for closed networks now found on open cloud servers

    3. DICOM is the standard that defines how these images should be digitally structured and stored; it also defines a network protocol that says how these images can be transferred in a network

      DICOM is a file structure and a network protocol

      I knew of DICOM as using JPEG2000 for image formats, but I didn't know it was a network protocol, too.
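
      To make the network side concrete, here is a minimal sketch using the pynetdicom library to send a C-ECHO, the DICOM-level equivalent of a ping; the host, port, and AE title are assumptions for a test server:

      ```python
      # Sketch: a DICOM C-ECHO ("is anyone there?") against a test server,
      # using the pynetdicom library. Host, port, and AE title are assumptions.
      from pynetdicom import AE

      ae = AE(ae_title="ECHOSCU")
      ae.add_requested_context("1.2.840.10008.1.1")  # Verification SOP Class UID

      assoc = ae.associate("127.0.0.1", 11112)       # hypothetical test endpoint
      if assoc.is_established:
          status = assoc.send_c_echo()
          print(f"C-ECHO status: 0x{status.Status:04X}")  # 0x0000 means success
          assoc.release()
      else:
          print("association rejected or unreachable")
      ```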

    4. Millions of Patient Records at Risk: The Perils of Legacy Protocols

      Sina Yazdanmehr | Senior IT Security Consultant, Aplite GmbH Ibrahim Akkulak | Senior IT Security Consultant, Aplite GmbH Date: Wednesday, December 6, 2023

      Abstract

      Currently, a concerning situation is unfolding online: a large amount of personal information and medical records belonging to patients is scattered across the internet. Our internet-wide research on DICOM, the decade-old standard protocol for medical imaging, has revealed a distressing fact – Many medical institutions have unintentionally made the private data and medical histories of millions of patients accessible to the vast realm of the internet.

      Medical imaging encompasses a range of techniques such as X-Rays, CT scans, and MRIs, used to visualize internal body structures, with DICOM serving as the standard protocol for storing and transmitting these images. The security problems with DICOM are connected to using legacy protocols on the internet as industries strive to align with the transition towards Cloud-based solutions.

      This talk will explain the security shortcomings of DICOM when it is exposed online and provide insights from our internet-wide research. We'll show how hackers can easily find, access, and exploit the exposed DICOM endpoints, extract all patients' data, and even alter medical records. Additionally, we'll explain how we were able to bypass DICOM security controls by gathering information from the statements provided by vendors and service providers regarding their adherence to DICOM standards.

      We'll conclude by providing practical recommendations for medical institutions, healthcare providers, and medical engineers to mitigate these security issues and safeguard patients' data.

    1. 109. On information and belief, in addition to her extensive online presence, she has a GitHub (a software code hosting platform) account called, “anarchivist,” and she developed a repository for a python module for interacting with OCLC’s WorldCat® Affiliate web services.

      Matienzo has a GitHub account with code that interacts with OCLC’s API

      Is this really the extent of the connection between Matienzo and Anna’s Archive? I don’t know what is required at the Complaint stage of a civil lawsuit to prove someone is connected to an anonymous collective, but surely something more than this plus public statements and an internet handle (“anarchivist”) is required to establish liability. Does Ohio have SLAPP protections?

    2. 99. This includes metadata unique to WorldCat® records and created by OCLC. For example, the Anna’s Archive’s blog post indicates that Defendants harvested metadata that denotes associations between records, such as between an original work and a parodying work. OCLC adds these associations data as part of its enrichment process.

      Example of enrichment process: association between original works and parodies

    3. 92. In total, WorldCat® has 1.4 billion OCNs, meaning Defendants claim that they were able to harvest, to some extent, 97.4% of unique WorldCat® records

      In complaint, OCLC says it has 1.4b OCNs

    4. 78. The bots also harvested data from WorldCat.org by pretending to be an internet browser, directly calling or “pinging” OCLC’s servers, and bypassing the search, or user interface, of WorldCat.org. More robust WorldCat® data was harvested directly from OCLC’s servers, including enriched data not available through the WorldCat.org user interface.

      Web scrapers web-scraped, but more robust data?

      The first sentence is the definition of a web scraper…having done an analysis of the URL structure, it goes directly to the page it is interested in rather than going through the search engine. (Does going through the search engine make a web scraper’s activities somehow legitimate?)

      The second sentence is weird…how did the web scraper harvest “more robust WorldCat data” that wasn’t available through WorldCat.org?
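
      For what it's worth, the behavior in that first sentence amounts to only a few lines of code. A sketch of the pattern the complaint alleges; the URL template and header are hypothetical, for illustration only:

      ```python
      # Sketch of the pattern the complaint describes: skip the search UI and
      # request record pages directly, with a browser-like User-Agent. The URL
      # template below is hypothetical, for illustration only.
      import requests

      HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

      def fetch_record(ocn: int) -> str:
          url = f"https://example.org/title/{ocn}"  # guessed URL structure
          resp = requests.get(url, headers=HEADERS, timeout=10)
          resp.raise_for_status()
          return resp.text  # HTML to be parsed for metadata fields
      ```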

    5. 38. This includes adding OCLC’s own unique identifying number, the “OCN,” which enhances queries and serves as an authoritative index for specific items or works.

      OCN, OCLC’s unique identifier

      Remember…OCNs are in the public domain: OCLC Control Numbers - Lots of them; all public domain

    6. 76. These attacks were accomplished with bots (automated software applications) that “scraped” and harvested data from WorldCat.org and other WorldCat®-based research sites and that called or pinged the server directly. These bots were initially masked to appear as legitimate search engine bots from Bing or Google.

      Bots initially masked themselves as search engine bots

    7. 58. When an individual searches on WorldCat.org, the individual agrees to the OCLC WorldCat.org Services Terms and Conditions (attached here as Exhibit B).

      Terms and Conditions

      I just tried in a private browser window to be sure, but there isn’t an up-front display of the terms and conditions to click through. I’m also not sure about the current legal validity of website terms and conditions. The Terms and Conditions link at the bottom of the page says they haven’t been updated since 2009.

    8. 51. The information available through WorldCat.org on a result page includes data that is freely accessible on the web, such as title, publication, copyright, author, and editor, and limited data “enriched” by OCLC, such as OCN, International Standard Book Number (“ISBN”), International Standard Serial Number (“ISSN”), and pagination. This enriched data is more difficult to find outside of WorldCat® and varies by each result in WorldCat.org.

    52. Most WorldCat® data available in a WorldCat® record is unavailable to an individual on WorldCat.org. This is because a full WorldCat® record is part of a member library’s subscription for cataloging and other library services.

      Subset of data from WorldCat is on the public WorldCat.org site

      “Most WorldCat data” is not available on the public-facing website? That would be an interesting comparison study. Notably missing from the list of publicly available data is subject headings and notes…I’m pretty sure those fields are available, too.

    9. 39. Of the entire WorldCat® collection, more than 93% of the records have been modified, improved, and/or enhanced by OCLC.

      Percentage of WorldCat that has been “modified, improved, and/or enhanced”

    10. OCLC is a non-profit, membership, computer library service and research organization dedicated to the public purposes of furthering access to the world’s information and reducing the rate of the rise in library costs

      How OCLC defines itself

      …with that phrase “reducing the rate of rise in library costs” — how about flat out reducing library costs, OCLC?

    11. By hacking WorldCat.org, scraping and harvesting OCLC’s valuable WorldCat

      Complaint equates “hacking” with “scraping and harvesting”

      This is a matter of some debate—notably the recent LLM web scraping cases.

    12. In the blog post announcing their hacking and scraping of the data via WorldCat.org, Defendants publicly thanked OCLC for “the decades of hard work you put into building the collections that we now liberate. Truly: thank you.”

      Anna’s Archive blog post announcing data

      1.3B WorldCat scrape & data science mini-competition. Lead paragraph:

      TL;DR: Anna’s Archive scraped all of WorldCat (the world’s largest library metadata collection) to make a TODO list of books that need to be preserved, and is hosting a data science mini-competition.

    13. 7. When OCLC member libraries subscribe to WorldCat® through OCLC’s WorldCat® Discovery Services/FirstSearch, the subscription includes the WorldCat.org service. Libraries are willing to pay for WorldCat.org as part of their WorldCat® subscription

      Libraries pay for WorldCat.org visibility

      The first sentence of the paragraph says that a subscription to WorldCat Discovery Services includes visibility on WorldCat.org. The second sentence says that libraries are willing to pay for this visibility. I’m not sure what else is included in a WorldCat Discovery Services subscription…is there a contradiction in these two sentences?

    14. 6. To accomplish this, WorldCat.org allows individuals to search member libraries’ catalogs as represented by their corresponding WorldCat® records in the WorldCat® database. When an individual views a search result from WorldCat.org, they see a more limited view of a WorldCat® record, i.e., with less metadata than is available for the record in the WorldCat® database for cataloging purposes

      WorldCat.org shows a subset of the WorldCat database

    15. Complaint

      OCLC Online Computer Library Center, Inc. v. Anna's Archive (2:24-cv-00144)

      District Court, Southern District of Ohio

    1. Proactively manage staff and user wellbeing: Cyber-incident management plans should include provisions for managing staff and user wellbeing. Cyber-attacks are deeply upsetting for staff whose data is compromised and whose work is disrupted, and for users whose services are interrupted

      Cybersecurity is a group effort

      It would be easy to pin this all on the tech who removed the block on the account that may have been the beachhead for this attack. As this report shows, though, the organization allowed an environment to flourish in which that one change could bring the whole organization down.

      I’ve never been in that position. I’m mindful that I could someday be in that position, looking back at what my action or inaction allowed to happen. I’ll probably risk being in that position until the day I retire and destroy my production work credentials.

    2. Manage systems lifecycles to eliminate legacy technology: ‘Legacy’ systems are not just hard to maintain and secure, they are extremely hard to restore. Regular investment in the lifecycle of all critical systems – both infrastructure and applications – is essential to guarantee not just security but also organisational resilience

      What is cutting edge today is legacy tomorrow

      As our layers of technology get stacked higher, the bottom layers get squeezed and compressed to thin layers that we assume will always exist. We must maintain visibility in those layers and invest in their maintenance and robustness.

    3. Enhance intrusion response processes: An in-depth security review should be commissioned after even the smallest signs of network intrusion. It is relatively easy for an attacker to establish persistence after gaining access to a network, and thereafter evade routine security precautions

      You have to be right 100% of the time; your attacker needs to be lucky once

    4. The need to embed security more deeply than ever into everything we do will require investment in culture change across different parts of the Library. There is a risk that the desire to return to ‘business as usual’ as fast as possible will compromise the changes in technology, policy, and culture that will be necessary to secure the Library for the future. A strong change management component in the Rebuild & Renew Programme will be essential to mitigate this risk, as will firm and well considered leadership from senior managers

      Actively avoiding a return to normal

      This will be among the biggest challenges, right? The I-could-do-this-before-why-can’t-I-do-it-now question. Somewhere I read that the definition of “personal character” is the ability to see an action through after the emotion of the commitment to the action has passed. The British Library was a successful institution and will want to return to that position of being seen as a successful institution as quickly as it possibly can.

    5. a robust and resilient backup service, providing immutable and air-gapped copies, offsite copies, and hot copies of data with multiple restoration points on a 4/3/2/1 model

      Backup models

      I’m familiar with the 3-2-1 strategy for backups (three copies of your data on two distinct media with one stored off-site), but I hadn’t heard of the 4-3-2-1 strategy. Judging from this article from Backblaze, the additional layer accounts for a fully air-gapped or unavailable-online copy. The AWS S3 “Object Lock” option noted earlier is one example: although the backed-up object is online and can be read, there are technical controls that prevent its modification until a set period of time elapses. (Presumably a time period long enough for you to find and extricate anyone who has compromised your systems before the object lock expires.)
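
      A sketch of what that looks like with boto3 (the AWS SDK for Python); the bucket name, region default, and 90-day retention window are assumptions:

      ```python
      # Sketch: an immutable backup object using S3 Object Lock via boto3.
      # Bucket name and retention period are placeholders; create_bucket as
      # written assumes the us-east-1 default region.
      import datetime
      import boto3

      s3 = boto3.client("s3")

      s3.create_bucket(
          Bucket="example-backup-bucket",
          ObjectLockEnabledForBucket=True,  # must be enabled at creation time
      )

      with open("catalog-backup.tar.gz", "rb") as body:
          s3.put_object(
              Bucket="example-backup-bucket",
              Key="backups/catalog-backup.tar.gz",
              Body=body,
              ObjectLockMode="COMPLIANCE",  # cannot be shortened, even by root
              ObjectLockRetainUntilDate=datetime.datetime.now(datetime.timezone.utc)
                  + datetime.timedelta(days=90),  # outlast a quiet intruder
          )
      ```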

    6. The substantial disruption of the attack creates an opportunity to implement a significant number of changes to policy, processes, and technology that will address structural issues in ways that would previously have been too disruptive to countenance

      Never let a good crisis go to waste

      Oh, yeah.

    7. our reliance on legacy infrastructure is the primary contributor to the length of time that the Library will require to recover from the attack. These legacy systems will in many cases need to be migrated to new versions, substantially modified, or even rebuilt from the ground up, either because they are unsupported and therefore cannot be repurchased or restored, or because they simply will not operate on modern servers or with modern security controls

      Legacy infrastructure lengthens recovery time

      I wonder how much of this “legacy infrastructure” is bespoke software systems that were created internally and no longer have relevant or reliable documentation. Yes, they may have the data, but they can’t reconstruct the software development environments that would be needed to upgrade or migrate to a new commercial or internally-developed system.

    8. some of our older applications rely substantially on manual extract, transform and load (ETL) processes to pass data from one system to another. This substantially increases the volume of customer and staff data in transit on the network, which in a modern data management and reporting infrastructure would be encapsulated in secure, automated end-to-end

      Reliance on ETL seen as risky

      I’m not convinced about this. Real-time API connectivity between systems is a great goal…very responsive to changes filtering through disparate systems. But a lot of “modern” processing is still done by ETL batches (sometimes daily, sometimes hourly, sometimes every minute).
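
      For contrast, a minimal sketch (mine, not from the report) of the kind of scheduled batch ETL step in question; the file names and fields are illustrative:

      ```python
      # Sketch (mine, not the Library's) of a scheduled batch ETL step: extract
      # rows from one system's export, transform, load into another's intake.
      import csv

      def run_batch() -> None:
          with open("patrons_export.csv", newline="") as src, \
               open("patrons_load.csv", "w", newline="") as dst:
              reader = csv.DictReader(src)
              writer = csv.DictWriter(dst, fieldnames=["id", "email"])
              writer.writeheader()
              for row in reader:
                  # Transform: normalize and drop fields the target doesn't need.
                  # Note the personal data sitting in transit between systems.
                  writer.writerow({"id": row["id"], "email": row["email"].lower()})

      run_batch()  # typically fired by a scheduler (cron), daily or hourly
      ```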

    9. our historically complex network topology (ie. the ‘shape’ of our network and how its components connect to each other) allowed the attackers wider access to our network than would have been possible in a more modern network design, allowing them to compromise more systems and services

      Historically complex network topology

      Reading between the lines, I think they dealt with the complexity by having a flat network…no boundaries between functions. If one needed high-level access to perform a function on one system, they had high-level access across large segments of the network.

    10. viable sources of backups had been identified that were unaffected by the cyber-attack and from which the Library’s digital and digitised collections, collection metadata and other corporate data could be recovered

      Viable backups

      I suddenly have a new respect for write-once-read-many (WORM) block storage like AWS’ Object Lock: https://aws.amazon.com/blogs/storage/protecting-data-with-amazon-s3-object-lock/

    11. The Library has not made any payment to the criminal actors responsible for the attack, nor engaged with them in any way. Ransomware gangs contemplating future attacks such as this on publicly-funded institutions should be aware that the UK’s national policy, articulated by NCSC, is unambiguously clear that no such payments should be made

      Government policy not to reward or engage with cyber attackers

    12. The lack of MFA on the domain was identified and raised as a risk at this time, but the possible consequences were perhaps under-appraised.

      No MFA on the remote access server

      If you, dear reader, are in the same boat now, seriously consider reprioritizing your MFA rollout.

    13. The intrusion was first identified as a major incident at 07:35 on 28 October 2023

      Attack started overnight Friday-to-Saturday

      If there were network and service availability alarms, were they disabled in the attack? Were they on internal or external systems? Was the overnight hours into a weekend a factor in how fast the problem was found?

    14. The criminal gang responsible for the attack copied and exfiltrated (illegally removed) some 600GB of files, including personal data of Library users and staff. When it became clear that no ransom would be paid, this data was put up for auction and subsequently dumped on the dark web. Our Corporate Information Management Unit is conducting a detailed review of the material included in the data-dump, and where sensitive material is identified they are contacting the individuals affected with advice and support.

      Ransom not paid and data published

      Not sure yet whether they will go into their thinking behind why they didn’t pay, but not paying is the recommended course of action.

    15. LEARNING LESSONS FROM THE CYBER-ATTACK British Library cyber incident review 8 MARCH 2024

    1. Actually, ChatGPT is INCREDIBLY Useful (15 Surprising Examples) by ThioJoe on YouTube, 8-Feb-2024

      • 0:00 - Intro
      • 0:28 - An Important Point
      • 1:26 - What If It's Wrong?
      • 1:54 - Explain Command Line Parameters
      • 2:36 - Ask What Command to Use
      • 3:04 - Parse Unformatted Data
      • 4:54 - Use As A Reverse Dictionary
      • 6:16 - Finding Hard-To-Search Information
      • 7:48 - Finding TV Show Episodes
      • 8:20 - A Quick Note
      • 8:37 - Multi-Language Translations
      • 9:21 - Figuring Out the Correct Software Version
      • 9:58 - Adding Code Comments
      • 10:18 - Adding Debug Print Statements
      • 10:42 - Calculate Subscription Break-Even
      • 11:40 - Programmatic Data Processing
  7. Feb 2024
    1. Bobbs-Merrill Company v. Straus, the 1908 Supreme Court case that established the First Sale Doctrine in United States common law, flowed directly from a publisher’s attempt to control the minimum price that the novel The Castaway could be sold for on the secondary market.15 In that case, The Castaway’s publisher, the Bobbs-Merrill Company, added a notice to each copy of the book that no dealer was “authorized” to sell the book for less than $1. When the Straus brothers purchased a number of copies and decided to sell them for less than $1, Bobbs-Merrill sued to enforce its $1 price floor. Ultimately, the US Supreme Court ruled that Straus did not need “authorization” from Bobbs-Merrill (or anyone else) to sell the books at whatever price they chose. Once Bobbs-Merrill sold the books, their preferences for how the books were used did not matter.

      1908 Supreme Court case established First Sale Doctrine

    2. Over the years, publishers have made many attempts to avoid this exchange, controlling both the purchase price and what purchasers do with the books after they are sold. For example, in the early 1900s, publishers tried to control resale prices on the books people bought from retailers by stamping mandatory resale prices on a book’s front page.6 (That attempt was rejected by the US Supreme Court).7 Publishers also tried to limit where people could resell books they bought, in one case claiming that a book sold in Thailand couldn’t be resold in the US.8 (That attempt was also rejected by the US Supreme Court, in 2013).9 These attempts failed because the publisher’s copyright does not give them absolute control of a book in perpetuity; the copyright system is a balance between publishers and purchasers.10 If publishers want the benefits of the copyright system, they also have to accept the limits it imposes on their power.

      Attempts by publishers to limit post-sale activities

    1. Moving scanning from the server to the client pushes it across the boundary between what is shared (the cloud) and what is private (the user device). By creating the capability to scan files that would never otherwise leave a user device, CSS thus erases any boundary between a user’s private sphere and their shared (semi-)public sphere [6]. It makes what was formerly private on a user’s device potentially available to law enforcement and intelligence agencies, even in the absence of a warrant. Because this privacy violation is performed at the scale of entire populations, it is a bulk surveillance technology.

      Client-side scanning is a bulk surveillance technology

    2. Many scanning systems make use of perceptual hash functions, which have several features that make them ideal for identifying pictures. Most importantly, they are resilient to small changes in the image content, such as re-encoding or changing the size of an image. Some functions are even resilient to image cropping and rotation.

      Perceptual hash function for content scanning

      One way to scan for target material: run a function on the content that results in a manipulation-resistant identifier that is easy to compare.
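
      A minimal sketch of that hash-then-compare step, using the Pillow and imagehash Python libraries; the threshold value is an assumption (real deployments tune it against false-positive rates):

      ```python
      # Sketch: perceptual-hash matching with the Pillow and imagehash libraries.
      # Small edits (re-encoding, resizing) flip only a few bits of the hash, so
      # matching uses a Hamming-distance threshold, not exact equality.
      from PIL import Image
      import imagehash

      THRESHOLD = 8  # max differing bits to call a match (an assumed value)

      known = imagehash.phash(Image.open("target.jpg"))      # from a match list
      candidate = imagehash.phash(Image.open("upload.jpg"))  # user's file

      distance = known - candidate  # imagehash overloads "-" as Hamming distance
      if distance <= THRESHOLD:
          print(f"flagged (distance {distance})")
      ```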

    3. The alternative approach to image classification uses machine-learning techniques to identify targeted content. This is currently the best way to filter video, and usually the best way to filter text. The provider first trains a machine-learning model with image sets containing both innocuous and target content. This model is then used to scan pictures uploaded by users. Unlike perceptual hashing, which detects only photos that are similar to known target photos, machine-learning models can detect completely new images of the type on which they were trained.

      Machine learning for content scanning
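
      A sketch of the inference side of that approach; the model file, its output shape, and the threshold policy are all placeholders, since production scanners are proprietary:

      ```python
      # Sketch of the classifier approach: a model trained offline on innocuous
      # vs. targeted images scores each upload. The model file and its output
      # shape are placeholders; real systems add thresholds and human review.
      import torch
      from torchvision import transforms
      from PIL import Image

      preprocess = transforms.Compose([
          transforms.Resize((224, 224)),
          transforms.ToTensor(),
      ])

      model = torch.jit.load("scanner_model.pt")  # hypothetical trained model
      model.eval()

      img = preprocess(Image.open("upload.jpg").convert("RGB")).unsqueeze(0)
      with torch.no_grad():
          score = torch.sigmoid(model(img))[0, 0].item()  # P(targeted content)
      print(f"score: {score:.3f}")
      ```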

    4. In what follows, we refer to text, audio, images, and videos as “content,” and to content that is to be blocked by a CSS system as “targeted content.” This generalization is necessary. While the European Union (EU) and Apple have been talking about child sex-abuse material (CSAM)—specifically images—in their push for CSS [12], the EU has included terrorism and organized crime along with sex abuse [13]. In the EU’s view, targeted content extends from still images through videos to text, as text can be used for both sexual solicitation and terrorist recruitment. We cannot talk merely of “illegal” content, because proposed UK laws would require the blocking online of speech that is legal but that some actors find upsetting [14].

      Defining "content"

      How you define "content" in client-side scanning is key. The scope of any policies will depend on the national (and local?) laws in place.

    5. Harold Abelson, Ross Anderson, Steven M Bellovin, Josh Benaloh, Matt Blaze, Jon Callas, Whitfield Diffie, Susan Landau, Peter G Neumann, Ronald L Rivest, Jeffrey I Schiller, Bruce Schneier, Vanessa Teague, Carmela Troncoso, Bugs in our pockets: the risks of client-side scanning, Journal of Cybersecurity, Volume 10, Issue 1, 2024, tyad020, https://doi.org/10.1093/cybsec/tyad020

      Abstract

      Our increasing reliance on digital technology for personal, economic, and government affairs has made it essential to secure the communications and devices of private citizens, businesses, and governments. This has led to pervasive use of cryptography across society. Despite its evident advantages, law enforcement and national security agencies have argued that the spread of cryptography has hindered access to evidence and intelligence. Some in industry and government now advocate a new technology to access targeted data: client-side scanning (CSS). Instead of weakening encryption or providing law enforcement with backdoor keys to decrypt communications, CSS would enable on-device analysis of data in the clear. If targeted information were detected, its existence and, potentially, its source would be revealed to the agencies; otherwise, little or no information would leave the client device. Its proponents claim that CSS is a solution to the encryption versus public safety debate: it offers privacy—in the sense of unimpeded end-to-end encryption—and the ability to successfully investigate serious crime. In this paper, we argue that CSS neither guarantees efficacious crime prevention nor prevents surveillance. Indeed, the effect is the opposite. CSS by its nature creates serious security and privacy risks for all society, while the assistance it can provide for law enforcement is at best problematic. There are multiple ways in which CSS can fail, can be evaded, and can be abused.

      Right off the bat, these authors are highly experienced and plugged into what is happening with technology.

    1. Less discussed than these broader cultural trends over which educators have little control are the major changes in reading pedagogy that have occurred in recent decades—some motivated by the ever-increasing demand to “teach to the test” and some by fads coming out of schools of education. In the latter category is the widely discussed decline in phonics education in favor of the “balanced literacy” approach advocated by education expert Lucy Calkins (who has more recently come to accept the need for more phonics instruction). I started to see the results of this ill-advised change several years ago, when students abruptly stopped attempting to sound out unfamiliar words and instead paused until they recognized the whole word as a unit. (In a recent class session, a smart, capable student was caught short by the word circumstances when reading a text out loud.) The result of this vibes-based literacy is that students never attain genuine fluency in reading. Even aside from the impact of smartphones, their experience of reading is constantly interrupted by their intentionally cultivated inability to process unfamiliar words.

      Vibe-based literacy

      Ouch! That is a pretty damning label.

  8. Jan 2024
    1. And those light sticks aren't handed out as part of the event, they're mementos that fans will sometimes spend more than $100 on.

      Bluetooth technology

      Lighting devices are tied to an app on a phone via Bluetooth. The user also puts their location into the app.

    2. But the more advanced wristbands, like what you see at the Super Bowl or Lady Gaga, use infrared technology.

      Infrared technology

      Transmitters on towers sweep over the audience with infrared signals. Masks in the transmitters can be used to create designs.

    3. Let's start off with the simplest, RF wristbands, that receive a radio frequency communicating the precise timing and colors for each band.

      RF technology

      Bands distributed to seating areas are assigned to one of several channels. The RF transmitter transmits the channel plus color information across a broad area.
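
      As a rough sketch (my own illustration, not PixMob's actual packet format): one broadcast cue carries a channel ID plus a color, and every band hears it but only the matching channel reacts:

      ```python
      # Sketch (my illustration, not PixMob's real packet format): a broadcast
      # cue carries a channel plus a color; every band receives it, but only
      # bands assigned to that channel change color.
      from dataclasses import dataclass

      @dataclass
      class RfCue:
          channel: int                 # seating-area group this cue targets
          color: tuple[int, int, int]  # RGB to display

      class Wristband:
          def __init__(self, channel: int):
              self.channel = channel
              self.color = (0, 0, 0)

          def on_cue(self, cue: RfCue) -> None:
              if cue.channel == self.channel:  # filtering happens in the band
                  self.color = cue.color

      floor, balcony = Wristband(1), Wristband(2)
      cue = RfCue(channel=1, color=(255, 0, 0))  # "floor seats: red"
      for band in (floor, balcony):
          band.on_cue(cue)
      assert floor.color == (255, 0, 0) and balcony.color == (0, 0, 0)
      ```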

    4. Jun 1, 2023

      Abstract

      WSJ goes behind the scenes with PixMob, a leading concert LED company, to see how they use “old tech” to build creative light shows, essentially turning the crowd into a video canvas.

    1. it seems that One Wilshire has always had a crisis of location. curiously, the building isn't actually located at the address One Wilshire Boulevard; it actually sits at 624 South Grand Avenue. "One Wilshire" was a marketing name developed afterward

      One Wilshire isn't really on Wilshire Boulevard

    2. in 2013 this building sold for 437 million dollars; that's 660 dollars per square foot of leasable space, by far the highest price paid for any office building in Downtown LA

      Most expensive commercial real estate in the US

      As a carrier hotel and co-location space for internet companies. 250 network service providers.

    3. Jun 1, 2023

      Abstract

      Sometimes buildings just don't look as important as they are. This the case of One Wilshire Blvd in Los Angeles. At first glance, its a generic office building in downtown. But, that blank facade is hiding one of the most important pieces of digital infrastructure within the United States. In this video we visit 1 Wilshire Blvd, explain how it works, and chat with Jimenez Lai who wrote a story about the building which explores its outsized role in our digital lives.

    1. Part of it is that old cardboard can't be recycled indefinitely. The EPA says it can only go through the process about seven times. Each time it goes through pulping and blending, the long, strong pine fibers get a bit shorter and weaker, and eventually the degraded paper bits simply wash through the screens and out of the process. So recycling is very important, but even if 100% of boxes got reused, making new ones would still mean cutting down trees.

      Paper degrades and can't be indefinitely recycled

    2. The southern US, sometimes called "America's wood basket," is home to 2% of the world's forested land, yet it produces nearly 20% of our pulp and paper products.

      Pulp and paper products produced overwhelmingly in the southern U.S.

    3. Forester Alex Singleton walked us through an area whose trees were sold to International Paper two years ago. Alex: It has since been replanted with longleaf pine. Narrator: But it will still take decades for the new crop to mature. For many foresters, we only see a site harvested once during our careers. From this stage to there would probably be around 30 years.

      30 years to get mature trees for corrugated packaging

    4. Sep 14, 2023

      Abstract

      Cardboard has a high recycling rate in the US. But it can't be reused forever, so the massive paper companies that make it also consume millions of trees each year.

    1. Law enforcement contends that they want front door access, where there is a clear understanding of when they are accessing a device, as the notion of a back door sounds secretive. This front door could be opened by whomever holds the key once investigators have demonstrated a lawful basis for access, such as probable cause that a crime is being committed. Whether front or back, however, building in an encrypted door that can be unlocked with a key—no matter who maintains the key—adds a potential vulnerability to exploitation by hackers, criminals, and other malicious actors. Researchers have yet to demonstrate how it would be possible to create a door that could only be accessed in lawful circumstances.

      Rebranding law enforcement access as "front-door"

      ...because "back door" sounds secretive. And it would be secretive if the user didn't know that their service provider opened the door for law enforcement.

    2. Some observers say law enforcement’s investigative capabilities may be outpaced by the speed of technological change, preventing investigators from accessing certain information they may otherwise be authorized to obtain. Specifically, law enforcement officials cite strong, end-to-end encryption, or what they have called warrant-proof encryption, as preventing lawful access to certain data.

      "warrant-proof" encryption

      Law enforcement's name for "end-to-end encryption"

    1. With respect to medical marijuana, a key difference between placement in Schedule I and Schedule III is that substances in Schedule III have an accepted medical use and may lawfully be dispensed by prescription, while substances in Schedule I cannot.

      Legal issues remain even if marijuana is rescheduled

      Schedule III allows for "accepted medical use", but the FDA has not approved marijuana as a drug.

    2. In each budget cycle since FY2014, Congress has passed an appropriations rider barring the Department of Justice (DOJ) from using taxpayer funds to prevent states from “implementing their own laws that authorize the use, distribution, possession, or cultivation of medical marijuana.” Courts have interpreted the appropriations rider to prohibit federal prosecution of state-legal activities involving medical marijuana. However, it poses no bar to federal prosecution of activities involving recreational marijuana.

      Marijuana still illegal from a federal standpoint, but federal prosecution is prohibited

      In states that have passed medical marijuana statutes, Congress has said that DOJ cannot prosecute through an annual appropriations rider. (e.g., "no money can be used...")

    3. Congress placed marijuana in Schedule I in 1970 when it enacted the CSA. A lower schedule number carries greater restrictions under the CSA, with controlled substances in Schedule I subject to the most stringent controls. Schedule I controlled substances have no currently accepted medical use. It is illegal to produce, dispense, or possess such substances except in the context of federally approved scientific studies, subject to CSA regulatory requirements designed to prevent abuse and diversion.

      Marijuana on CSA from the start

      Schedule I substances in the Controlled Substances Act have the most stringent regulations, and have no acceptable medical uses. Congress put Marijuana on the Schedule I list when it passed the CSA.

    4. Cannabis and its derivatives generally fall within one of two categories under federal law: marijuana or hemp.

      CSA definitions for marijuana and hemp

      Hemp is cannabis with a delta-9 tetrahydrocannabinol (THC) concentration of less than 0.3%. Marijuana is everything else. Hemp is not a controlled substance, while marijuana is.

    1. So we have 50 independent electoral systems that kind of work in conjunction in tandem, but they're all slightly different and they're all run by the state.

      It is worse than that. In Ohio, each county has its own election system. Rules are set at the state level, but each county buys and maintains the equipment, hires and does training, and reports its results.

    1. Images of women are more likely to be coded as sexual in nature than images of men in similar states of dress and activity, because of widespread cultural objectification of women in both images and its accompanying text. An AI art generator can “learn” to embody injustice and the biases of the era and culture of the training data on which it is trained.

      Objectification of women as an example of AI bias

    1. “Information has become a battlespace, like naval or aerial”, Carl Miller, Research Director of the Centre for the Analysis of Social Media, once explained to me. Information warfare impacts how information is used, shared and amplified. What matters for information combatants is not the truth, reliability, relevance, contextuality or accuracy of information, but its strategic impact on the battlespace; that is, how well it manipulates citizens into adopting desired actions and beliefs.

      Information battlespace

      Sharing “information” not to spread truth, but to influence behavior

    1. Harvard had posture tests as early as 1880, and many other colleges would follow suit. Harvard's program was developed by D.A. Sargent, who created a template of the statistically average American and measured Harvard students in an effort to get students to reach a perfect muscular form. women's college Vassar began keeping posture and other physical records in 1884. the ideal test was for a patient to stand naked in front of a mirror or have a nude photo taken, and for an expert to comment and offer remedies for poor posture

      Posture tests at Harvard (1880s) and Vassar (1884)

    2. Lord Chesterfield's widely published 1775 book Lord Chesterfield's Advice to His Son on Men and Manners recommended new standards, advising against odd motions, strange postures, and ental carriage
    3. Dec 29, 2023

      Abstract

      Most of us can probably say we struggle with posture, but for a long period after the turn of the twentieth century an American obsession with posture led to dramatic efforts to make students “straighten up."

    1. Facial Recognition Technology: Current Capabilities, Future Prospects, and Governance (2024) 120 pages | 8.5 x 11 | PAPERBACK ISBN 978-0-309-71320-7 | DOI 10.17226/27397

      http://nap.nationalacademies.org/27397

    1. Santosh Vempala, a computer science professor at Georgia Tech, has also studied hallucinations. “A language model is just a probabilistic model of the world,” he says, not a truthful mirror of reality. Vempala explains that an LLM’s answer strives for a general calibration with the real world—as represented in its training data—which is “a weak version of accuracy.” His research, published with OpenAI’s Adam Kalai, found that hallucinations are unavoidable for facts that can’t be verified using the information in a model’s training data.

      “A language model is just a probabilistic model of the world”

      Hallucinations are a result of an imperfect model, or attempting answers without the necessary data in the model.

    1. As with Midjourney, DALL-E 3 was capable of creating plagiaristic (near-identical) representations of trademarked characters, even when those characters were not mentioned by name. DALL-E 3 also created a whole universe of potential trademark infringements with this single two-word prompt: “animated toys” [bottom right].

      DALL-E 3 produced the same kinds of plagiaristic output

    2. Put slightly differently, if this speculation is correct, the very pressure that drives generative AI companies to gather more and more data and make their models larger and larger (in order to make the outputs more humanlike) may also be making the models more plagiaristic.

      Does the amount of training data affect the likelihood of plagiaristic output?

    3. Moreover, Midjourney apparently sought to suppress our findings, banning Southen from its service (without even a refund of his subscription fee) after he reported his first results, and again after he created a new account from which additional results were reported. It then apparently changed its terms of service just before Christmas by inserting new language: “You may not use the Service to try to violate the intellectual property rights of others, including copyright, patent, or trademark rights. Doing so may subject you to penalties including legal action or a permanent ban from the Service.” This change might be interpreted as discouraging or even precluding the important and common practice of red-team investigations of the limits of generative AI—a practice that several major AI companies committed to as part of agreements with the White House announced in 2023. (Southen created two additional accounts in order to complete this project; these, too, were banned, with subscription fees not returned.)

      Midjourney bans researchers and changes terms of service

    4. One user on X pointed to the fact that Japan has allowed AI companies to train on copyright materials. While this observation is true, it is incomplete and oversimplified, as that training is constrained by limitations on unauthorized use drawn directly from relevant international law (including the Berne Convention and TRIPS agreement). In any event, the Japanese stance seems unlikely to carry any weight in American courts.

      Specifics in Japan for training LLMs on copyrighted material

    5. Such examples are particularly compelling because they raise the possibility that an end user might inadvertently produce infringing materials. We then asked whether a similar thing might happen in the visual domain.

      Can a user inadvertently produce infringing material?

      Presumably, the LLM has been trained on copyrighted material because it is producing these textual (New York Times) and visual (motion pictures) plagiaristic outputs.

    6. After a bit of experimentation (and in a discovery that led us to collaborate), Southen found that it was in fact easy to generate many plagiaristic outputs, with brief prompts related to commercial films (prompts are shown).

      Plagiaristic outputs from blockbuster films in Midjourney v6

      Was the LLM trained on copyrighted material?

    7. We will call such near-verbatim outputs “plagiaristic outputs,” because if a human created them we would call them prima facie instances of plagiarism.

      Defining “plagiaristic outputs”

    1. Newspaper and magazine publishers could curate their content, as could the limited number of television and radio broadcasters. As cable television advanced, there were many more channels available to specialize and reach smaller audiences. The Internet and WWW exploded the information source space by orders of magnitude. For example, platforms such as YouTube receive hundreds of hours of video per minute. Tweets and Facebook updates must number in the hundreds of millions if not billions per day. Traditional media runs out of time (radio and television) or space (print media), but the Internet and WWW run out of neither. I hope that a thirst for verifiable or trustable facts will become a fashionable norm and part of the solution

      Broadcast/Print are limited by time and space; is digital infinite?

    1. So, Sam decided, why not count the adverts he himself saw. It's just one number and applies only to the editor at a marketing magazine living in London on one arbitrary day. And what I saw were 93 ads I tried to be as open as I could about the fact that it's likely that I didn't notice every ad I could have done, but equally, I didn't miss that many, I don't think Sam also persuaded other people in the industry to do their own count. The most I've seen is 100 and 54. And I think I was quite generous. The lowest I've seen is 26. The most interesting version of the experiment was that I tasked someone to see as many as he could in a day and he got to 512 what a way to spend the day. And you will have noticed it's nowhere close to 10,000 ads.

      One person counted 93 per day

    2. Sam Anderson didn't marketing gurus have been making claims about advertising numbers for a very long time and Sam followed the trail all the way back to the 19 sixties. The very start of it was this piece of research by a man called Edwin Abel, who was a marketer for General Foods. Edwin Abel wanted to do a rough calculation on how many adverts people saw. He looked at how many hours of TV and radio people watched or listened to every day and worked out the average number of ads per hour on those mediums. So he multiplied those two numbers together to come up with this number. And this is our 1500 this 1500 ads a day number is still kicking around today. Often as the lowest of the big numbers in the blogosphere. And it is potentially a kind of legitimate calculation for the number of ads seen or heard, albeit from a quite different time in history. But there's some fine print to consider that is a number for a family of four. So if you divided that between a family of four, you'd actually be looking at something like 375 ads in this estimation.

      Research from the 1960s suggests 375 ads/day
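
      Reconstructing Abel's arithmetic from the description above:

      ```latex
      % Abel's 1960s estimate, reconstructed from the description above:
      \[
        \text{hours of TV+radio per day} \times \text{ads per hour}
        \approx 1500\ \text{ads/day per household}
      \]
      \[
        \frac{1500\ \text{ads/day}}{4\ \text{people}} = 375\ \text{ads/day per person}
      \]
      ```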

    3. One of the numbers we quoted was a high estimate at the time, which was 5000 ads per day, and that's the number that got latched on to. So this 5000 number was not for the number of adverts that a consumer actually sees and registers each day, but rather for the clutter, as Walker calls it. These large numbers are not counts of the number of ads that people pay attention to.

      Number of impressions of advertising

      These advertising impressions may roll over us, but we don't actually see and register them.

    1. One of the arguments that they make is that the original parents' rights crusade in the US was actually in opposition to the effort to ban child labor. The groups really driving that opposition, like the National Association of Manufacturers, these conservative industry groups, what they were opposed to was an effort to muck up what they saw as the natural state of affairs that would be inequality.

      Earlier, "parents rights" promoted by National Association of Manufacturers

      The "conservative industry group" didn't want to have an enforced public education mandate take away from the inequitable "natural state of affairs".

    2. Well, there are some people who have never liked the idea of public education because it's the most socialist thing that we do in this country. We tax ourselves to pay for it and everybody gets to access it. That's not a very American thing to do. Then you have conservative religious activists. They see a real opening, thanks to a whole string of Supreme Court cases to use public dollars to fund religious education. Then you have people who don't believe in public education for other reasons. Education is the single largest budget item in most states. If your goal is to cut taxes way back, if your goal is to give a handout to the wealthiest people in your state, spending less on education is going to be an absolute requirement. The same states that are enacting these sweeping school voucher programs, if you look at states like Iowa and Arkansas, they've ushered in huge tax cuts for their wealthiest residents. That means that within the next few years, there will no longer be enough funds available to fund their public schools, even at a time when they have effectively picked up the tab for affluent residents of the state who already send their kids to private schools.

      Public education is the most socialist program in the U.S.

      Defunding it is seen as a positive among conservative/libertarian wings. Add tax cuts for the wealthy to sway the government more towards affluent people.

    3. As unregulated as these programs are, that minimal data is something we have access to. There's been great coverage of this, including a recent story in The Wall Street Journal by an education reporter named Matt Barnum, that what we are seeing in state after state is that in the early phases of these new programs, that the parents who are most likely to take advantage of them are not the parents of low-income and minority kids in the public schools despite that being the big sales pitch, that instead they are affluent parents whose kids already attended private school. When lawmakers are making the case for these programs, they are making the Moms for Liberty arguments.

      Benefits of state-based school choice programs are going to affluent parents

      The kids are already going to private schools; the money isn't going to low-income parents. As a result, private schools are emboldened to raise tuition.

    4. The Heritage Foundation has been an early and very loud backer of Moms for Liberty. There, I think it's really instructive to see that they are the leader of the project 2025 that's laying out the agenda for a next Trump administration. You can look at their education platform, it is not about taking back school boards. It's about dismantling public education entirely.

      Heritage Foundation backing this effort as part of a goal to "dismantle" public education

    5. I would point you to something like recent Gallup polling. We know, it's no secret that American trust and institutions has plummeted across the board, but something like only 26% of Americans say that they have faith in public schools. Among Republicans, it's even lower, it's 14%. Groups like Moms for Liberty have played a huge part in exacerbating the erosion of that trust.

      Gallup polling shows drop in confidence in public schools, especially among Republicans

      Losses by Moms-for-Liberty candidates reinforce the notion that there is partisanship in public education.

    6. Annotations are on the Transcript tab of this web page

      Abstract

      Last month, it seemed like Moms for Liberty, the infamous political group behind the recent push for book bans in schools across the country, might be on the wane. In November, a series of Moms for Liberty endorsed candidates lost school board elections, and in local district elections, the group took hit after hit. In Iowa, 12 of 13 candidates backed by the Moms were voted out, and in Pennsylvania, Democrats won against at least 11 of their candidates. But recently, Moms for Liberty co-founder Tiffany Justice claimed in an interview, "we're just getting started," boasting about the group's plans to ramp up efforts in 2024.

    1. mailtrap

      Mailtrap email testing

      https://mailtrap.io/email-sandbox/

    2. Papercut is a Windows application that just sits in the corner of your screen and your system notification tray. it's an SMTP server that doesn't send email; every email that you send, it intercepts
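
      A sketch of how an app under test talks to such a trap: point the SMTP settings at localhost and send normally. Papercut's default port 25 is assumed; adjust to your configuration:

      ```python
      # Sketch: send a test message to a local SMTP trap such as Papercut,
      # which intercepts instead of delivering. Port 25 is assumed.
      import smtplib
      from email.message import EmailMessage

      msg = EmailMessage()
      msg["From"] = "app@example.com"
      msg["To"] = "user@example.com"
      msg["Subject"] = "Password reset"
      msg.set_content("Testing the template without emailing a real user.")

      with smtplib.SMTP("localhost", 25) as smtp:
          smtp.send_message(msg)  # lands in Papercut's inbox, not the internet
      ```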
    3. mailjet markup language

      Origins of Mailjet Markup Language for richly formatted emails

      MJML → CSHTML (Razor) → HTML
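
      As a minimal illustration of the first step in that pipeline, the mjml CLI (from the npm mjml package) compiles the markup to email-safe HTML; a sketch driving it from Python, with placeholder file names:

      ```python
      # Sketch: compile an MJML template to email-safe HTML by shelling out to
      # the mjml CLI (npm install -g mjml). File names are placeholders.
      import subprocess

      subprocess.run(
          ["mjml", "template.mjml", "-o", "newsletter.html"],
          check=True,  # raise if the template fails to compile
      )
      ```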

    4. John Gilmore. John is a very interesting person; he's one of those people where I agree with everything he does right up to the point where I think he turns into a bit of a dick, and then he kind of stops just past that point. John was employee number five at Sun Microsystems, and he was one of the founders of the Electronic Frontier Foundation. I've seen him described as an extreme libertarian cypherpunk activist, and the most famous quote I've seen from John is this one: "the net interprets censorship as damage and routes around it." If you start blocking ports because you don't like what people are doing, the internet is designed to find another way around it. And he has taken this philosophy to an extreme: he runs an open mail relay. If you go to hop.toad.com, it will accept email from anyone on the planet on port 25 and it will deliver it. It doesn't care who you are, doesn't care where you came from. Which is kind of the libertarian ethos in a nutshell.

      John Gilmore's open SMTP relay

    5. I went through to see how many email addresses I could register

      Attempt to register quirky usernames at major email providers

      The RFC allows for "strange" usernames, but some mail providers are more restrictive.

    6. in 1978, Gary Thuerk was working for the Digital Equipment Corporation. he was a sales rep, and his job was to sell these: the DECSYSTEM-20. now this thing had built-in ARPANET protocol support; you didn't have to do anything special, you could plug it into a network and it would just work. and rightly or wrongly, Gary thought, well, I reckon people who are on the ARPANET might be interested in knowing about this computer. Digital didn't have a whole lot of sales going on on the US West Coast; they had a big office on the East Coast, but the West Coast, you know, California, Portland, those kinds of places, they didn't really have much of a presence. so he got his assistant to go through the ARPANET directory and type in the email addresses of everybody on the American West Coast who had an email address, 393 of them. now at this point they overflowed the header field, so all the people who got this email got an email which started with about 250 other people's email addresses, and then right down at the end of it, it says: hey, we invite you to come and see the DECSYSTEM-2020

      Gary Turk "invented" spam in 1978

    7. one person whose innovation is still a significant part of the way we work with it was this guy: Ray Tomlinson. he was working on an ARPANET mail system in 1971, and Ray is the person who went, well, hang on: if we know the user's name and we know the ARPANET host where they host their email, we could put an @ in the middle, because it's Alice at the machine

      Ray Tomlinson invented the use of @ in 1971