119 Matching Annotations
  1. Sep 2024
    1. databases are not designed to be browsed.

      Casey Newton makes this blanket statement. Any real evidence for this beyond his "gut"?

      Many "paper machines" like Niklas Luhmann's zettelkasten were almost custom made not just for searching, but for browsing through regularly much like commonplace books.

      Perhaps the question is really, how is your particular database designed?

  2. Aug 2024
  3. Apr 2024
    1. Inscriptions, Al-Jallad explained, tend to cluster on higher ground, where nomadic herders could keep an easier watch for predators. In a landscape with no other traces of human civilization, the rocks preserved the nomads’ names and genealogies, along with descriptions of their animals, their wars, their journeys, and their rituals. There were prayers to deities, worries about the lack of rain, and complaints about the cruelty of Romans.
  4. Feb 2024
  5. Nov 2023
  6. Oct 2023
  7. Sep 2023
    1. I wonder what you think of a distinction between the more traditional 'scholar's box', and the proto-databases that were used to write dictionaries and then for projects such as the Mundaneum. I can't help feeling there's a significant difference between a collection of notes meant for a single person, and a collection meant to be used collaboratively. But not sure exactly how to characterize this difference. Seems to me that there's a tradition that ended up with the word processor, and another one that ended up with the database. I feel that the word processor, unlike the database, was a dead end.

      reply to u/atomicnotes at https://www.reddit.com/r/Zettelkasten/comments/16njtfx/comment/k1tuc9c/?utm_source=reddit&utm_medium=web2x&context=3

      u/atomicnotes, this is an excellent question. (Though I'd still like to come to terms with people who don't think it acts as a knowledge management system; there's obviously something I'm missing.)

      Some of your distinction comes down to how one is using their zettelkasten and what sorts of questions are being asked of it. One of the earliest descriptions I've seen that begins to get at the difference is the description by Beatrice Webb of her notes (appendix C) in My Apprenticeship. As she describes what she's doing, I get the feeling that she's taking the same broad sort of notes we're all used to, but it's obvious from her discussion that she's also using her slips as a traditional database, but is lacking modern vocabulary to describe it as such.

      Early efforts like the OED, TLL, the Wb, and even Gertrud Bauer's Coptic linguistic zettelkasten of the late 1970s were narrow enough in scope and data collected to make them almost dead simple to define, organize and use as databases on paper. Of course how they were used to compile their ultimate reference books was a bit more complex in form than the basic data from which they stemmed.

      The Mundaneum had a much more complex flavor because it required a standardized system in which everyone could work in concert against far more freeform and more complex forms of collected data, while still being able to search for the answers to specific questions. While still somewhat database flavored, it was dramatically different from the others because of its scope and the much broader sorts of questions one could ask of it. I think that if you ask yourself what sorts of affordances you get from the two different groups (databases versus word processors, or even their typewriter precursors), you find even more answers.

      Typewriters and word processors allowed one to get words down on paper an order of magnitude or two faster, and in combination with reproduction equipment, made it a lot easier to spin off copies of a document for small-scale and local mass distribution. They do allow a few affordances like higher readability (compared with less standardized and slower handwriting), quick search (at least in the digital era), and moving pieces of text around (also in digital). Much beyond this, they aren't tremendously helpful as a composition tool. As a thinking tool, typewriters and word processors aren't significantly better than their analog predecessors, so you don't gain a huge amount of leverage by using them.

      On the other hand, databases and their spreadsheet brethren offer a lot more, particularly in digital realms. Data collection and collation become much easier. One can also form a massive variety of queries on such collected data, not to mention making calculations on those data or subjecting them to statistical analyses. Searching, sorting, and making direct comparisons also become far easier and quicker to do once you've amassed the data you need. Here again, Beatrice Webb's early experience and descriptions are very helpful, as are Hollerith's early work with punch cards and census data and the speed with which the results could be used.

      Now if you compare the affordances by each of these in the digital era and plot their shifts against increasing computer processing power, you'll see that the value of the word processor stays relatively flat while the database shows much more significant movement.

      Surely there is a lot more at play, particularly at scale and when taking network effects into account, but perhaps this quick sketch may explain to you a bit of the difference you've described.

      Another difference you may be seeing/feeling is that of contextualization. Databases usually cross-index much smaller and more discrete pieces of data (for example: a subject's name versus a weight with a value in pounds or kilograms). As a result, the amount of context required to use them is dramatically lower than for the sorts of data you might keep in an average atomic/evergreen note, which may need to be heavily recontextualized before you can use it in conjunction with other similar notes, which in turn may need to be recontextualized themselves before being used with or against one another.

      Some of this is why the cards in the Thesaurus Linguae Latinae are easier to use and understand out of the box (presuming you know Latin) than those you might find in the Mundaneum. They'll also be far easier to use than a stranger's notes, which will require even more contextualization on your part, especially when you haven't spent the time scaffolding the related and often unstated knowledge around them. This is why others' zettelkasten will be more difficult (but not wholly impossible) for a stranger to use. You might apply the analogy of the context gap between children and adults watching a typical Disney animated movie. If you're using someone else's zettelkasten, you'll potentially be able to follow a base-level story the way a child views a Disney cartoon. Compare this to the zettelkasten's creator, who will not only see that same story but will have a much higher level of associative memory at play, seeing and understanding the huge number of in-jokes, cultural references, and other associations that an adult watching the Disney movie catches and the child completely misses.

      I'm curious to hear your thoughts on how this all plays out for your way of conceptualizing it.

  8. Aug 2023
    1. The BTL Online database provides electronic access to all editions of Latin texts published in the Bibliotheca Teubneriana, ranging from antiquity and late antiquity to medieval and neo-Latin texts. A total of approximately 13 million word forms are thus accessible electronically.
  9. May 2023
    1. Make sure you are mentoring someone, always. Make sure you’re always mentoring others. And I have found that mentoring others gives me new perspectives on things. Because I may be telling them things, but I’m learning a lot while I’m telling them.

      Great

  10. Apr 2023
  11. Mar 2023
    1. The experiences of Erman and his collaborators show only too clearly that the book form is by no means well suited to presenting such a body of material.

      For some research the book form is simply not conducive to the most productive work. The experiences of both Beatrice Webb (My Apprenticeship, Appendix C) and Adolf Erman (on the Wb) show that database forms for sorting, filtering, and comparing have been highly productive and provide a wealth of insight that simply couldn't be had otherwise.

    2. The starting point and center of the work on the Altägyptisches Wörterbuch is the creation of an exhaustive corpus of Egyptian texts.

      In the early twentieth century one might have created a card index to study a large textual corpus, but in the twenty-first one is more likely to rely on a relational database instead.

    1. 9/8b2 "Multiple storage" als Notwendigkeit derSpeicherung von komplexen (komplex auszu-wertenden) Informationen.

      Seems like from a historical perspective hierarchical databases were more prevalent in the 1960s and relational databases didn't exist until the 1970s. (check references for this historically)

      Of course one must consider that within a card index or zettelkasten the ideas of both could have co-existed in essence even if they weren't named as such. Some of the business use cases as early as 1903 (earlier?) would have shown the idea of multiple storage and relational database usage. Beatrice Webb's usage of her notes in a database-like way may have indicated this as well.

  12. Jan 2023
    1. After browsing through a variety of the cards in Gertrud Bauer's Zettelkasten Online it becomes obvious that the collection was created specifically as a paper-based database for search, retrieval, and research. The examples and data within it are much more narrowly circumscribed for a specific use than those of other researchers like Niklas Luhmann whose collection spanned a much broader variety of topics and areas of knowledge.

      This particular use case makes the database nature of zettelkasten more apparent than some others, particularly in comparison with modern (post-2013) zettelkasten of a more personal nature.

      I'm reminded here of the use case(s) described by Beatrice Webb in My Apprenticeship for scientific note taking, by which she more broadly meant database creation and use.

  13. Dec 2022
    1. https://tellico-project.org/

      Tellico: collection management software, free and simple

      via Fernando Borretti in Unbundling Tools for Thought (12/29/2022 15:59:17)

    1. Good stories are of diminishing importance — ironic, given how audiences were traditionally drawn into a world, like that of The Odyssey, by way of a single character’s journey.

      There's a Le Guin quote in the piece on how the world of a story needs to be described within the story, and "that's tricky business". And this quote argues for the diminishing role of narratives. So what takes over their role? I think that, broadly speaking, databases and information: any narrative today quickly gets surrounded with a coral-like growth of commentary, reviews, fanfiction, and databases of the world's details. This has some reference to Johnson's work on the database as a modern media form.

    1. Keeping track of research materials used to require an excellent memory, exceptional bookkeeping skills or blind luck; now we have databases.

      Love the phrasing of this. :)

  14. Nov 2022
    1. Genealogy Garage: Researching at the Huntington Library

      <iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/0f2j2K6JWGg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
      • Julie Huffman jhuffman@lapl.org (host)
      • Stephanie Arias
      • Anne Blecksmith
      • Li Wei Yang
      • Clay Stalls cstalls@huntington.org

      ECPP

      Huntington Library

      Visit checklist

  15. Oct 2022
    1. A career-line study of the presidents, all cabinet members, and all members of the Supreme Court. This I already have on IBM cards from the constitutional period through Truman's second term, but I want to expand the items used and analyze it afresh.

      Notice that it's not just notes, but data on IBM cards that he's using for research here. This sort of data analysis is much easier now, but is also of the sort detailed by Beatrice Webb in her scientific note taking.


  16. Sep 2022
  17. Aug 2022
    1. Instead, they keep a Thing Table and a Data Table. Everything in Reddit is a Thing: users, links, comments, subreddits, awards, etc. Things keep common attribute like up/down votes, a type, and creation date. The Data table has three columns: thing id, key, value. There’s a row for every attribute. There’s a row for title, url, author, spam votes, etc. When they add new features they didn’t have to worry about the database anymore. They didn’t have to add new tables for new things or worry about upgrades. Easier for development, deployment, maintenance.

      Reddit uses only 2 tables, at the cost of not being able to use fancier relational features
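
      A minimal sketch of the "Thing table + Data table" idea (an entity-attribute-value layout), using SQLite to make the trade-off concrete; the table and column names here are illustrative, not Reddit's actual schema:

      # Entity-attribute-value sketch: one generic "thing" table plus a
      # key/value "data" table, so new attributes never require ALTER TABLE.
      import sqlite3

      con = sqlite3.connect(":memory:")
      con.executescript("""
          CREATE TABLE thing (id INTEGER PRIMARY KEY, type TEXT, ups INTEGER, downs INTEGER, created TEXT);
          CREATE TABLE data  (thing_id INTEGER, key TEXT, value TEXT);
      """)

      # A "link" thing whose attributes live as rows, not columns.
      con.execute("INSERT INTO thing VALUES (1, 'link', 10, 2, '2022-08-01')")
      con.executemany("INSERT INTO data VALUES (?, ?, ?)", [
          (1, "title", "Interesting article"),
          (1, "url", "https://example.com"),
          (1, "author", "alice"),
      ])

      # A new feature later is just another key; no schema migration needed.
      con.execute("INSERT INTO data VALUES (1, 'spam_votes', '0')")

      # The cost: reassembling an entity means pivoting rows back into attributes,
      # and typed columns, constraints, and joins on real fields are mostly given up.
      print(dict(con.execute("SELECT key, value FROM data WHERE thing_id = 1")))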

    2. Schema updates are very slow when you get bigger. Adding a column to 10 million rows takes locks and doesn’t work. They used replication for backup and for scaling. Schema updates and maintaining replication is a pain.

      Schema updates and replication are not easy to handle at scale

    1. In 1896, Dewey formed a partnership with Herman Hollerith and the Tabulating Machine Company (TMC) to provide the punch cards used for the electro-mechanical counting system of the US government census operations. Dewey’s relationship with Hollerith is significant as TMC would be renamed International Business Machines (IBM) in 1924 and become an important force in the information age and creator of the first relational database.
  18. Jul 2022
    1. It was not until we had completely re-sorted all our innumerable sheets of paper according to subjects, thus bringing together all the facts relating to each, whatever the trade concerned, or the place or the date—and had shuffled and reshuffled these sheets according to various tentative hypotheses—that a clear, comprehensive and verifiable theory of the working and results of Trade Unionism emerged in our minds; to be embodied, after further researches by way of verification, in our Industrial Democracy (1897).

      Beatrice Webb was using her custom note taking system in the lead up to the research that resulted in the publication of Industrial Democracy (1897).

      Is there evidence that she was practicing this note taking/database practice earlier than this?

    2. On many occasions we have been compelled to break off the writing of a particular chapter, or even of a particular paragraph, in order to test, by reshuffling the whole of our notes dealing with a particular subject, a particular place, a particular organisation or a particular date, the relative validity of hypotheses as to cause and effect. I may remark, parenthetically, that we have found this “game with reality”, this building up of one hypothesis and knocking it down in favour of others that had been revealed or verified by a new shuffle of the notes—especially when we severally “backed” rival hypotheses—a most stimulating recreation! In that way alone have we been able “to put our bias out of gear”, and to make our order of thought correspond, not with our own prepossessions, but with the order of things discovered by our investigations.

      Beatrice Webb's note taking system here shows indications of being actively used as a database system!

    3. An instance may be given of the necessity of the “separate sheet” system. Among the many sources of information from which we constructed our book The Manor and the Borough were the hundreds of reports on particular boroughs made by the Municipal Corporation Commissioners in 1835. These four huge volumes are well arranged and very fully indexed; they were in our own possession; we had read them through more than once; and we had repeatedly consulted them on particular points. We had, in fact, used them as if they had been our own bound notebooks, thinking that this would suffice. But, in the end, we found ourselves quite unable to digest and utilise this material until we had written out every one of the innumerable facts on a separate sheet of paper, so as to allow of the mechanical absorption of these sheets among our other notes; of their complete assortment by subjects; and of their being shuffled and reshuffled to test hypotheses as to suggested co-existences and sequences.

      Webb's use case here sounds like she had the mass of data, but what she really needed was a database she could query more easily to do her work and research. As a result, she took the flat-file data and made it into a manually sortable and searchable database.

  19. May 2022
    1. Create the new empty table; write to both old and new table; copy data (in chunks) from old to new; validate consistency; switch reads to new table; stop writes to the old table; cleanup old table

      7 steps required while migrating to a new table
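
      A rough sketch of the same sequence using SQLite, purely to show the order of operations; in production the dual-write and chunked-copy phases run at the application layer over hours or days, and the table names here are made up:

      # Illustrative online table migration: old table -> new table in chunks.
      import sqlite3

      con = sqlite3.connect(":memory:")
      con.execute("CREATE TABLE users_old (id INTEGER PRIMARY KEY, email TEXT)")
      con.executemany("INSERT INTO users_old (id, email) VALUES (?, ?)",
                      [(i, f"user{i}@example.com") for i in range(1, 101)])

      # 1. Create the new empty table (here with an extra column).
      con.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, email TEXT, created TEXT)")

      # 2. From this point the application writes to both tables (not shown).

      # 3. Copy data in chunks from old to new.
      CHUNK, last_id = 25, 0
      while True:
          rows = con.execute("SELECT id, email FROM users_old WHERE id > ? ORDER BY id LIMIT ?",
                             (last_id, CHUNK)).fetchall()
          if not rows:
              break
          con.executemany("INSERT OR IGNORE INTO users_new (id, email) VALUES (?, ?)", rows)
          last_id = rows[-1][0]

      # 4. Validate consistency.
      assert con.execute("SELECT count(*) FROM users_old").fetchone() == \
             con.execute("SELECT count(*) FROM users_new").fetchone()

      # 5./6. Switch reads to users_new and stop writes to users_old (application config).
      # 7. Cleanup.
      con.execute("DROP TABLE users_old")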

  20. Mar 2022
  21. Jan 2022
    1. Explore the Origins and Forced Relocations of Enslaved Africans Across the Atlantic World The SlaveVoyages website is a collaborative digital initiative that compiles and makes publicly accessible records of the largest slave trades in history. Search these records to learn about the broad origins and forced relocations of more than 12 million African people who were sent across the Atlantic in slave ships, and hundreds of thousands more who were trafficked within the Americas. Explore where they were taken, the numerous rebellions that occurred, the horrific loss of life during the voyages, the identities and nationalities of the perpetrators, and much more.
    1. It is thanks to decades of painstaking, difficult work that we know a great deal about the scale of human trafficking across the Atlantic Ocean and about the people aboard each ship. Much of that research is available to the public in the form of the SlaveVoyages database. A detailed repository of information on individual ships, individual voyages and even individual people, it is a groundbreaking tool for scholars of slavery, the slave trade and the Atlantic world. And it continues to grow. Last year, the team behind SlaveVoyages introduced a new data set with information on the domestic slave trade within the United States, titled “Oceans of Kinfolk.”
    1. https://www.youtube.com/watch?v=z3Tvjf0buc8

      graph thinking

      • intuitive
      • speed, agility
      • adaptability

      Graph thinking: focuses on relationships to turn data into information and uses patterns to find meaning

      property graph data model (see the sketch after these notes)

      • relationships (connectors with verbs which can have properties)
      • nodes (have names and can have properties)

      Examples:

      • Purchase recommendations for products in real time
      • Fraud detection

      Use for dependency analysis
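
      A toy illustration of that property graph model in plain Python; real graph databases (Neo4j and friends) add indexing and a query language, so the dictionaries below only show the shape of the data:

      # Nodes and relationships both carry properties; relationships are verbs.
      nodes = {
          "alice":  {"label": "Customer", "age": 34},
          "camera": {"label": "Product", "price": 299.0},
      }

      relationships = [
          # (from, verb, to, properties on the relationship itself)
          ("alice", "PURCHASED", "camera", {"date": "2022-01-15", "channel": "web"}),
      ]

      # "Graph thinking": answer a question by following relationships.
      def purchases_of(customer):
          return [(verb, target, props)
                  for source, verb, target, props in relationships
                  if source == customer and verb == "PURCHASED"]

      print(purchases_of("alice"))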

    1. https://www.goedel.io/p/tools-for-thought-but-not-for-search

      Searching for two ingredients in an effort to find a recipe that uses both should be de rigueur in a personal knowledge manager; sadly, that doesn't appear to be the case.


      This sort of simple search not working in these tools is just silly.

      They should be able to search across blocks, pages, and even provide graph views to help in this process. Where are all the overlaps of these words within one's database?

  22. Jul 2021
    1. databases is an async SQL query builder that works on top of the SQLAlchemy Core expression language.

      databases Python package
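
      A short usage sketch following the pattern in the package's documentation (async queries expressed as plain SQL on top of SQLAlchemy Core); the connection URL and table are placeholders:

      import asyncio
      from databases import Database

      async def main():
          database = Database("sqlite:///example.db")
          await database.connect()

          await database.execute(
              query="CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, text TEXT)")
          await database.execute(
              query="INSERT INTO notes (text) VALUES (:text)", values={"text": "hello"})

          rows = await database.fetch_all(query="SELECT id, text FROM notes")
          for row in rows:
              print(row[0], row[1])

          await database.disconnect()

      asyncio.run(main())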

  23. Jun 2021
    1. This is where off-site backups come into play. For this purpose, I recommend Borg backup. It has sophisticated features for compression and encryption, and allows you to mount any version of your backups as a filesystem to recover the data from. Set this up on a cronjob as well for as frequently as you feel the need to make backups, and send them off-site to another location, which itself should have storage facilities following the rest of the recommendations from this article. Set up another cronjob to run borg check and send you the results on a schedule, so that their conspicuous absence may indicate that something fishy is going on. I also use Prometheus with Pushgateway to make a note every time that a backup is run, and set up an alarm which goes off if the backup age exceeds 48 hours. I also have periodic test alarms, so that the alert manager’s own failures are noticed.

      Solution for human failures and existential threats:

      • Borg backup on a cronjob
      • Prometheus with Pushgateway
    2. RAID is complicated, and getting it right is difficult. You don’t want to wait until your drives are failing to learn about a gap in your understanding of RAID. For this reason, I recommend ZFS to most. It automatically makes good decisions for you with respect to mirroring and parity, and gracefully handles rebuilds, sudden power loss, and other failures. It also has features which are helpful for other failure modes, like snapshots. Set up Zed to email you reports from ZFS. Zed has a debug mode, which will send you emails even for working disks — I recommend leaving this on, so that their conspicuous absence might alert you to a problem with the monitoring mechanism. Set up a cronjob to do monthly scrubs and review the Zed reports when they arrive. ZFS snapshots are cheap - set up a cronjob to take one every 5 minutes, perhaps with zfs-auto-snapshot.

      ZFS is recommended (not only for beginners) over a complicated RAID setup

    3. these days hardware RAID is almost always a mistake. Most operating systems have software RAID implementations which can achieve the same results without a dedicated RAID card.

      According to the author, software RAID is preferable to hardware RAID

    4. Failing disks can show signs of it in advance — degraded performance, or via S.M.A.R.T reports. Learn the tools for monitoring your storage medium, such as smartmontools, and set it up to report failures to you (and test the mechanisms by which the failures are reported to you).

      Preventive maintenance of disk failures

    5. RAID gets more creative with three or more hard drives, utilizing parity, which allows it to reconstruct the contents of failed hard drives from still-online drives.

      If you are using RAID with three drives and one of them fails, you can still reconstruct its contents from the others thanks to the XOR-based parity
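
      A toy demonstration of the parity idea (assuming the simple three-drive case described above): the parity block is the XOR of the two data blocks, so any single lost block can be rebuilt from the other two.

      # RAID-style parity in miniature: parity = A XOR B, so B = A XOR parity.
      drive_a = bytes([0b1010_1010, 0b1111_0000])
      drive_b = bytes([0b0011_0011, 0b0000_1111])
      parity  = bytes(a ^ b for a, b in zip(drive_a, drive_b))

      # Suppose drive_b fails: rebuild it from the surviving drive and the parity.
      rebuilt_b = bytes(a ^ p for a, p in zip(drive_a, parity))
      assert rebuilt_b == drive_b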

    6. A more reliable solution is to store the data on a hard drive1. However, hard drives are rated for a limited number of read/write cycles, and can be expected to fail eventually.

      Hard drives are a better lifetime option than microSD cards but still not ideal

    7. The worst way I can think of is to store it on a microSD card. These fail a lot. I couldn’t find any hard data, but anecdotally, 4 out of 5 microSD cards I’ve used have experienced failures resulting in permanent data loss.

      microSD cards aren't recommended for storing lifetime data

  24. Mar 2021
    1. The console is a killer SQLite feature for data analysis: more powerful than Excel and more simple than pandas. One can import CSV data with a single command, the table is created automatically

      SQLite makes it fairly easy to import and analyse data. For example:

      • .import --csv city.csv city
      • select count(*) from city;
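
      The same import-and-count workflow can also be scripted from Python when the console isn't at hand; this sketch assumes a hypothetical city.csv whose first row is a header (and trusts that header for column names):

      import csv
      import sqlite3

      with open("city.csv", newline="") as f:
          reader = csv.reader(f)
          header = next(reader)          # e.g. ["name", "country", "population"]
          rows = list(reader)

      con = sqlite3.connect("city.db")
      cols = ", ".join(header)
      placeholders = ", ".join("?" for _ in header)
      con.execute(f"CREATE TABLE IF NOT EXISTS city ({cols})")
      con.executemany(f"INSERT INTO city ({cols}) VALUES ({placeholders})", rows)
      con.commit()

      print(con.execute("SELECT count(*) FROM city").fetchone()[0])
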
    1. This is not a problem if your DBMS supports SQL recursion: lots of data can be generated with a single query. The WITH RECURSIVE clause comes to the rescue.

      WITH RECURSIVE can help you quickly generate a series of random data.
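
      A small example of the idea in SQLite (the same WITH RECURSIVE construct exists, with minor dialect differences, in PostgreSQL, MySQL 8+, and others): a single query generates a thousand rows of throwaway data.

      import sqlite3

      con = sqlite3.connect(":memory:")
      rows = con.execute("""
          WITH RECURSIVE counter(n) AS (
              SELECT 1
              UNION ALL
              SELECT n + 1 FROM counter WHERE n < 1000
          )
          SELECT n, abs(random()) % 100 AS fake_value FROM counter
      """).fetchall()

      print(len(rows), rows[:3])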

  25. Oct 2020
    1. Queries became impractically slow around the 500,000 cell mark, but were still below 2 seconds for a 100,000 cell query. Therefore, if you anticipate a dataset larger than a few hundred thousand cells, it would probably be smart to choose a more scalable option.

      Scalability of Google Sheets. They have a hard limit of 5,000,000 cells (including blank ones)

  26. Sep 2020
    1. DuckDB is an embeddable SQL OLAP database management system

      A database that, like SQLite, does not require a server, while offering some of the analytical advantages of PostgreSQL
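
      A quick sketch of that serverless workflow with DuckDB's Python package (pip install duckdb); city.csv is a placeholder file and read_csv_auto infers its schema:

      import duckdb

      con = duckdb.connect("analytics.duckdb")   # a single local file, much like SQLite
      con.execute("CREATE TABLE city AS SELECT * FROM read_csv_auto('city.csv')")
      print(con.execute("SELECT count(*) FROM city").fetchall())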

  27. Aug 2020
    1. The Splitgraph DDN is a single SQL endpoint that lets you query over 40,000 public datasets hosted on or proxied by Splitgraph. You can connect to it from most PostgreSQL clients and BI tools without having to install anything else. It supports all read-only SQL constructs, including filters and aggregations. It even lets you run joins across distinct datasets.

      Splitgraph - efficient DDN (Data Delivery Network):

      • connect to it from most PostgreSQL clients and BI tools without having to install anything else
      • you can query over 40k public datasets hosted on or proxied by Splitgraph
      • supports all read-only SQL constructs (even joins across distinct datasets)
  28. Jul 2020
    1. So in brief, for our application service, if we understand the access patterns very well, they’re repeatable, they’re consistent, and scalability is a big factor, then NoSQL is a perfect choice.

      When NoSQL is a perfect choice

    2. Comparison Time … 🤞

      Brief comparison of SQL vs NoSQL across 8 aspects

  29. May 2020
    1. Which database technology to choose

      Which database to choose (advice from an Amazon employee):

      • SQL - ad hoc queries and/or support of ACID and transactions
      • NoSQL - otherwise. NoSQL is getting better with transactions and PostgreSQL is getting better with availability, scalability, durability
  30. Apr 2020
    1. From a narratological perspective, it would probably be fair to say that most databases are tragic. In their design, the configuration of their user interfaces, the selection of their contents, and the indexes that manage their workings, most databases are limited when set against the full scope of the field of information they seek to map and the knowledge of the people who created them. In creating a database, we fight against the constraints of the universe – the categories we use to sort out the world; the limitations of time and money and technology – and succumb to them.

      databases are tragic!

    1. I’m sharing a few insights I specifically found useful for developers who are not specialized in this domain.

      Insights on databases from a Google engineer:

      1. You are lucky if 99.999% of the time network is not a problem.
      2. ACID has many meanings.
      3. Each database has different consistency and isolation capabilities.
      4. Optimistic locking is an option when you can’t hold a lock (see the sketch after this list).
      5. There are anomalies other than dirty reads and data loss.
      6. My database and I don’t always agree on ordering.
      7. Application-level sharding can live outside the application.
      8. AUTOINCREMENT’ing can be harmful.
      9. Stale data can be useful and lock-free.
      10. Clock skews happen between any clock sources.
      11. Latency has many meanings.
      12. Evaluate performance requirements per transaction.
      13. Nested transactions can be harmful.
      14. Transactions shouldn’t maintain application state.
      15. Query planners can tell a lot about databases.
      16. Online migrations are complex but possible.
      17. Significant database growth introduces unpredictability.
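
      Expanding on insight 4 above, a minimal sketch of optimistic locking: each row carries a version number, and an update only succeeds if the version is still the one that was read, so no lock is held in the meantime. SQLite is used here purely for illustration.

      import sqlite3

      con = sqlite3.connect(":memory:")
      con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
      con.execute("INSERT INTO account VALUES (1, 100, 1)")

      def withdraw(amount):
          balance, version = con.execute(
              "SELECT balance, version FROM account WHERE id = 1").fetchone()
          # ... time passes; another writer may have updated the row meanwhile ...
          cur = con.execute(
              "UPDATE account SET balance = ?, version = version + 1 "
              "WHERE id = 1 AND version = ?", (balance - amount, version))
          return cur.rowcount == 1   # False means we lost the race and must retry

      print(withdraw(30))   # True on the first, uncontended attempt
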
    1. 1) Redash and Falcon focus on people that want to do visualizations on top of SQL 2) Superset, Tableau and PowerBI focus on people that want to do visualizations with a UI 3) Metabase and SeekTable focus on people that want to do quick analysis (they are the closest to an Excel replacement)

      Comparison of data analysis tools:

      1) Redash & Falcon - SQL focus

      2) Superset, Tableau & PowerBI - UI workflow

      3) Metabase & SeekTable - Excel like experience

  31. Mar 2020
    1. supporting this field is extremely easy. If you keep raw data, it's just a matter of adding a getter method to the Article class.

      Supporting a new field is much easier with raw JSON than in a relational database:

      @property
      def highlights(self) -> Sequence[Highlight]:
          default = [] # defensive to handle older export formats that had no annotations
          jsons = self.json.get('annotations', default)
          return list(map(Highlight, jsons))
      
    2. query language doesn't necessarily mean a database. E.g. see pandas which is capable of what SQL is capable of, and even more convenient than SQL for our data exploration purposes.

      Query language, not always = database. For example, see pandas
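
      A small illustration of pandas covering typical SQL ground (filter, group, aggregate) on made-up data:

      import pandas as pd

      df = pd.DataFrame({
          "city":  ["Berlin", "Berlin", "Paris", "Paris"],
          "year":  [2019, 2020, 2019, 2020],
          "sales": [10, 14, 7, 9],
      })

      # Roughly: SELECT city, sum(sales) FROM df WHERE year = 2020 GROUP BY city;
      print(df[df.year == 2020].groupby("city")["sales"].sum())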

    3. cachew lets you cache function calls into an sqlite database on your disk in a matter of single decorator (similar to functools.lru_cache). The difference from functools.lru_cache is that cached data is persisted between program runs, so next time you call your function, it will only be a matter of reading from the cache.

      cachew tool isolates the complexity of database access patterns in a Python library
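
      A sketch of the pattern from the project's README: decorate a function whose return type is annotated as an iterator of a dataclass/NamedTuple, and the results get persisted to a local sqlite cache between runs. Decorator options vary between versions, so treat the exact invocation as approximate.

      from dataclasses import dataclass
      from typing import Iterator

      from cachew import cachew

      @dataclass
      class Measurement:
          timestamp: str
          value: float

      @cachew  # first run computes and stores; later runs read back from the sqlite cache
      def measurements() -> Iterator[Measurement]:
          for i in range(1_000_000):   # stands in for slow parsing/scraping work
              yield Measurement(timestamp=f"2020-01-01T00:00:{i % 60:02d}", value=float(i))

      print(sum(m.value for m in measurements()))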

  32. Feb 2020
    1. Imagine that you're using a database to export them, so your schema is: TABLE Article(STRING id, STRING url, STRING title, DATETIME added). One day, the developers expose highlights (or annotations) from the private API and your export script stats receiving it in the response JSON. It's quite useful data to have! However, your database can't just magically change to conform to the new field.

      The relational model can sometimes tie your hands, unlike raw JSON

    2. Storage saved by using a database instead of plaintext is marginal and not worth the effort.

      Databases do save some of the space used by the data, but the savings are marginal

    3. if necessary use databases as an intermediate layer to speed access up and as an additional interface to your data. Nothing wrong with using databases for caching if you need it!

      You may want to use databases for:

      • speeding access up
      • creating additional layer
      • caching
    4. I want to argue very strongly against forcing the data in the database, unless it's really inevitable.

      After scraping some data, don't reach for a database immediately, unless you're dealing with a really large stream of data

  33. Jan 2020
  34. Nov 2019
    1. FKs don't work well with online schema migrations.

      3rd reason why at GitHub they don't rely on Foreign Keys: Working with online schema migrations.

      FKs impose a lot of constraints on what's possible and what's not possible

    2. FKs are a performance impact. The fact they require indexes is likely fine, since those indexes are needed anyhow. But the lookup made for each insert/delete is an overhead.

      2nd reason why at GitHub they don't rely on Foreign Keys: FK performance impact

    3. FKs are in your way to shard your database. Your app is accustomed to rely on FK to maintain integrity, instead of doing it on its own. It may even rely on FK to cascade deletes (shudder). When eventually you want to shard or extract data out, you need to change & test the app to an unknown extent.

      1st reason why at GitHub they don't rely on Foreign Keys: the app comes to rely on FKs to maintain integrity instead of doing it on its own

  35. Sep 2019
    1. To address the availability concern, new architectures were developed to minimize the impact of partitions. For instance, splitting data sets into smaller ranges called shards can minimize the amount of data that is unavailable during partitions. Furthermore, mechanisms to automatically alter the roles of various cluster members based on network conditions allow them to regain availability quickly

      Qualities of NewSQL - mainly minimisation of the impact of partitions

    2. typically less flexible and generalized than their more conventional relational counterparts. They also usually only offer a subset of full SQL and relational features, which means that they might not be able to handle certain kinds of usage. Many NewSQL implementations also store a large part of or their entire dataset in the computer's main memory. This improves performance at the cost of greater risk to unpersisted changes

      Differences between NewSQL and relational databases:

      • typically less flexible and generalized
      • usually only offer a subset of full SQL and relational features, which means that they might not be able to handle certain kinds of usage.
      • many NewSQL implementations also store a large part of or their entire dataset in the computer's main memory. This improves performance at the cost of greater risk to unpersisted changes.
    3. using a mixture of different database types is the best approach for handling the data of your projects

      Many times mixing different databases is a good approach.

      For example:

      • store user information - relational databases
      • configuration values - in-memory key-value store
    4. best suited for use cases with high volumes of relational data in distributed, cloud-like environments

      Best fit for NewSQL

    5. CAP theorem is a statement about the trade offs that distributed databases must make between availability and consistency. It asserts that in the event of a network partition, a distributed database can choose either to remain available or remain consistent, but it cannot do both. Cluster members in a partitioned network can continue operating, leading to at least temporary inconsistency. Alternatively, at least some of the disconnected members must refuse to alter their data during the partition to ensure data consistency

      CAP Theorem relating to distributed databases

      CAP

    6. NewSQL databases: bringing modern scalability and performance to the traditional relational pattern

      NewSQL databases - designed with scalability and modern performance requirements. Follow the relational structure and semantics, but are built using more modern, scalable design. Rise in popularity in 2010s.

      Examples:

      • MemSQL
      • VoltDB
      • Spanner
      • Calvin
      • CockroachDB
      • FaunaDB
      • yugabyteDB
    7. aggregate queries like summing, averaging, and other analytics-oriented processes can be difficult or impossible

      Disadvantage of column databases

    8. Column-family databases are good when working with applications that requires great performance for row-based operations and highly scalability

      Advantage of column databases. They also collect row data in a cluster on the same machine, simplifying data sharding and scaling

    9. it helps to think of column family databases as key-value databases where each key (row identifier) returns a dictionary of arbitrary attributes and their values (the column names and their values)

      Tip to remember the idea of column databases
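
      The same mental model in plain Python, purely as an illustration: a column-family store behaves like a key-value store whose value is a per-row dictionary of columns.

      column_family = {
          # row key     -> arbitrary columns for that row
          "user:alice": {"name": "Alice", "email": "alice@example.com"},
          "user:bob":   {"name": "Bob", "last_login": "2019-09-01", "plan": "pro"},
      }
      print(column_family["user:bob"])   # each row defines its own "schema"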

    10. Column-family databases: databases with flexible columns to bridge the gap between relational and document databases

      Column-family databases - also called as non-relational column stores, wide-column databases or column databases. Rise in popularity in 2000s. Look highly similar to relational databases. They have structure called column families, which contain rows of data, each of which define their own format. Therefore, each row in a column family defines its own schema.

      Examples:

      • Cassandra
      • HBase

      Diagram of column-family database structure

    11. querying for the connection between two users of a social media site in a relational database is likely to require multiple table joins and therefore be rather resource intensive. This same query would be straightforward in a graph database that directly maps connections

      Social media prefers graph databases over relational ones

    12. Graph databases are most useful when working with data where the relationships or connections are highly important

      Major use of graph databases

    13. network databases require step-by-step traversal to travel between items and are limited in the types of relationships they can represent.

      Difference between network databases (an older navigational model) and graph databases (NoSQL)

    14. Graph databases: mapping relationships by focusing on how connections between data are meaningful

      Graph databases - establishes connections using the concepts of nodes, edges, and properties. Rise in popularity in 2000s.

      Examples:

      • Neo4j
      • JanusGraph
      • Dgraph

      Diagram of a graph database structure

    15. Document databases: Storing all of an item's data in flexible, self-describing structures

      Document databases - also known as document-oriented databases or document stores, share the basic access and retrieval semantics of key-value stores. Rise in popularity in 2009.

      They also use keys to uniquely identify data; therefore, the line between advanced key-value stores and document databases can be fairly unclear.

      Instead of storing arbitrary blobs of data, document databases store data in structured formats called documents, often using formats like JSON, BSON, or XML.

      Examples:

      • MongoDB
      • RethinkDB
      • Couchbase

      Diagram of document database

    16. Document databases are a good choice for rapid development because you can change the properties of the data you want to save at any point without altering existing structures or data. You only need to backfill records if you want to. Each document within the database stands on its own with its own system of organization. If you're still figuring out your data structure and your data is mainly composed discrete entries that don't include a lot of cross references, a document database might be a good place to start. Be careful, however, as the extra flexibility means that you are responsible for maintaining the consistency and structure of your data, which can be extremely challenging

      Pros and cons of document databases

    17. Though the data within documents is organized within a structure, document databases do not prescribe any specific format or schema

      Therefore, unlike in key-value stores, the content stored in document databases can be queried and analysed
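
      A small sketch of that queryability using pymongo (assuming a MongoDB server on localhost; the database and collection names are made up): the document's fields are free-form, yet its contents can still be matched and filtered, unlike an opaque key-value blob.

      from pymongo import MongoClient

      client = MongoClient("mongodb://localhost:27017")
      articles = client.notes_db.articles

      articles.insert_one({
          "title": "Database types",
          "tags": ["databases", "nosql"],
          "highlights": [{"text": "documents are self-describing"}],   # nesting is fine
      })

      # Query inside the document structure.
      print(articles.find_one({"tags": "nosql"})["title"])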

    18. Key-value stores are often used to store configuration data, state information, and any data that might be represented by a dictionary or hash in a programming language. Key-value stores provide fast, low-complexity access to this type of data

      Use and advantages of of key-value stores

    19. Key-value databases: simple, dictionary-style lookups for basic storage and retrieval

      Key-value databases - one of the simplest database types. Initially introduced in the 1970s (rise in popularity: 2000-2010). They work by storing arbitrary data accessible through a specific key (see the sketch after this list).

      • to store data, you provide a key and the blob of data you wish to save, for example a JSON object, an image, or plain text.
      • to retrieve data, you provide the key and will then be given the blob of data back.

      Examples:

      • Redis
      • memcached
      • etcd

      Diagram of key-value store
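
      A minimal interaction with a key-value store using the redis-py client (assuming a Redis server on localhost); the entire API surface is essentially "store a blob under a key, get it back by key":

      import json
      import redis

      r = redis.Redis(host="localhost", port=6379)

      r.set("config:feature_flags", json.dumps({"dark_mode": True}))   # store a JSON blob
      flags = json.loads(r.get("config:feature_flags"))                # retrieve by key
      print(flags["dark_mode"])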

    20. NoSQL databases: modern alternatives for data that doesn't fit the relational paradigm

      NoSQL databases - stands for either non-SQL or not only SQL to clarify that sometimes they allow SQL-like querying.

      4 types:

      • Key-value
      • Document
      • Graph
      • Column-family
    21. relational databases are often a good fit for any data that is regular, predictable, and benefits from the ability to flexibly compose information in various formats. Because relational databases work off of a schema, it can be more challenging to alter the structure of data after it is in the system. However, the schema also helps enforce the integrity of the data, making sure values match the expected formats, and that required information is included. Overall, relational databases are a solid choice for many applications because applications often generate well-ordered, structured data

      Pros and cons of relational database

    22. querying language called SQL, or structured query language, was created to access and manipulate data stored with that format

      SQL was created for relational databases

    23. Relational databases: working with tables as a standard solution to organize well-structured data

      Relational databases - oldest general purpose database type still widely used today. They comprise the majority of databases currently used in production. Initially introduced in 1969.

      They organise data using tables - structures that impose a schema on the records that they hold.

      • each column has a name and a data type
      • each row represents an individual record

      Examples:

      • MySQL
      • MariaDB
      • PostgreSQL
      • SQLite

      Diagram of relational schema used to map entities for a school

    24. database schema is a description of the logical structure of a database or the elements it contains. Schemas often include declarations for the structure of individual entries, groups of entries, and the individual attributes that database entries are comprised of. These may also define data types and additional constraints to control the type of data that may be added to the structure

      Database schema

    25. Network databases: mapping more flexible connections with non-hierarchical links

      Network databases - built on the foundation provided by hierarchical databases by adding additional flexibility. Initially introduced in late 1960s. Instead of always having a single parent, as in hierarchical databases, network database entries can have more than one parent, which effectively allows them to model more complex relationships.

      Examples:

      • IDMS

      Have a graph-like structure

      Diagram of a network database

    26. Hierarchical databases: using parent-child relationships to map data into trees

      Hierarchical databases - the next evolution in database development. Initially introduced in 1960s. They encode a relationship between items where every record has a single parent.

      Examples:

      • Filesystems
      • DNS
      • LDAP directories

      Have a tree-like structure

      Diagram of a hierarchical database

    27. Hierarchical databases are not used much today due to their limited ability to organize most data and because of the overhead of accessing data by traversing the hierarchy

      Hierarchical databases aren't used as much anymore

    28. The first flat file databases represented information in regular, machine parse-able structures within files. Data is stored in plain text, which limits the type of content that can be represented within the database itself. Sometimes, a special character or other indicator is chosen to use as a delimiter, or marker for when one field ends and the next begins. For example, a comma is used in CSV (comma-separated values) files, while colons or white-space are used in many data files in Unix-like systems

      Flat-file databases - 1st type of databases with a simple data structure for organising small amounts of local data.

      Examples:

      • /etc/passwd and /etc/fstab on Linux and Unix-like systems
      • CSV files
    29. Some advantages of this format

      Advantages of flat-file format:

      • has robust, flexible toolkit
      • easily managed without specialised software
      • easy to understand and work with
    30. While flat file databases are simple, they are very limited in the level of complexity they can handle

      Disadvantages of flat-file databases:

      • system that reads or manipulates the data cannot make easy connections between the data represented
      • usually don't have any type of user or data concurrency features either
      • usually only practical for systems with small read or write requirements. For example, many operating systems use flat-files to store configuration data
  36. Dec 2018
  37. Aug 2018
  38. May 2018
  39. Oct 2017
    1. MySQL’s replication architecture means that if bugs do cause table corruption, the problem is unlikely to cause a catastrophic failure.

      I can't follow the reasoning here. I guess it's not guaranteed to replicate the corruption like Postgres would, but it seems totally possible to trigger similar or identical corruption because the implementation of the logical statement would be similar on the replica.

    2. The bug we ran into only affected certain releases of Postgres 9.2 and has been fixed for a long time now. However, we still find it worrisome that this class of bug can happen at all. A new version of Postgres could be released at any time that has a bug of this nature, and because of the way replication works, this issue has the potential to spread into all of the databases in a replication hierarchy.

      Not really a criticism of Postgres so much as it is a criticism of software in general.

  40. Aug 2017
  41. Jun 2016
    1. If the RRID is well-formed, and if the lookup found the right record, a human validator tags it a valid RRID — one that can now be associated mechanically with occurrences of the same resource in other contexts. If the RRID is not well-formed, or if the lookup fails to find the right record, a human validator tags the annotation as an exception and can discuss with others how to handle it. If an RRID is just missing, the validator notes that with another kind of exception tag.

      Sounds a lot like the way reference managers work. In many cases, people keep the invalid or badly-formed results.

  42. Apr 2016
  43. Jan 2016
  44. Dec 2015
    1. Data gathering is ubiquitous in science. Giant databases are currently being mined for unknown patterns, but in fact there are many (many) known patterns that simply have not been catalogued. Consider the well-known case of medical records. A patient’s medical history is often known by various individual doctor-offices but quite inadequately shared between them. Sharing medical records often means faxing a hand-written note or a filled-in house-created form between offices.
    2. I will use a mathematical tool called ologs, or ontology logs, to give some structure to the kinds of ideas that are often communicated in pictures like the one on the cover. Each olog inherently offers a framework in which to record data about the subject. More precisely it encompasses a database schema, which means a system of interconnected tables that are initially empty but into which data can be entered.
  45. May 2015
  46. Oct 2014
    1. This in turn means that Redis Cluster does not have to take meta data in the data structures in order to attempt a value merge, and that the fancy commands and data structures supported by Redis are also supported by Redis Cluster. So no additional memory overhead, no API limits, no limits in the amount of elements a value can contain, but less safety during partitions.

      A solid trade-off, I think, and says a lot about the intended use cases.

  47. Sep 2014
    1. Fast restart. If a server is temporarily taken down, this capability restores the index from a saved copy, eliminating delays due to index rebuilding.

      This point seems to be in direct contradiction to the claim above that "Indexes (primary and secondary) are always stored in DRAM for fast access and are never stored on Solid State Drives (SSDs) to ensure low wear."

    2. Unlike other databases that use the linux file system that was built for rotational drives, Aerospike has implemented a log structured file system to access flash – raw blocks on SSDs – directly.

      Does this really mean to suggest that Aerospike bypasses the linux block device layer? Is there a kernel driver? Does this mean I can't use any filesystem I want and know how to administrate? Is the claim that the "linux file system" (which I take to mean, I guess, the virtual file system layer) "built for rotation drives" even accurate? We've had ram disks for a long, long time. And before that we've had log structured filesystems, too, and even devices that aren't random access like tape drives. Seems like dubious claims all around.