78 Matching Annotations
  1. Jul 2021
    1. databases is an async SQL query builder that works on top of the SQLAlchemy Core expression language.

      databases Python package

  2. Jun 2021
    1. This is where off-site backups come into play. For this purpose, I recommend Borg backup. It has sophisticated features for compression and encryption, and allows you to mount any version of your backups as a filesystem to recover the data from. Set this up on a cronjob as well for as frequently as you feel the need to make backups, and send them off-site to another location, which itself should have storage facilities following the rest of the recommendations from this article. Set up another cronjob to run borg check and send you the results on a schedule, so that their conspicuous absence may indicate that something fishy is going on. I also use Prometheus with Pushgateway to make a note every time that a backup is run, and set up an alarm which goes off if the backup age exceeds 48 hours. I also have periodic test alarms, so that the alert manager’s own failures are noticed.

      Solution for human failures and existential threads:

      • Borg backup on a cronjob
      • Prometheus with Pushgateway
    2. RAID is complicated, and getting it right is difficult. You don’t want to wait until your drives are failing to learn about a gap in your understanding of RAID. For this reason, I recommend ZFS to most. It automatically makes good decisions for you with respect to mirroring and parity, and gracefully handles rebuilds, sudden power loss, and other failures. It also has features which are helpful for other failure modes, like snapshots. Set up Zed to email you reports from ZFS. Zed has a debug mode, which will send you emails even for working disks — I recommend leaving this on, so that their conspicuous absence might alert you to a problem with the monitoring mechanism. Set up a cronjob to do monthly scrubs and review the Zed reports when they arrive. ZFS snapshots are cheap - set up a cronjob to take one every 5 minutes, perhaps with zfs-auto-snapshot.

      ZFS is recommended (not only for the beginners) over the complicated RAID

    3. these days hardware RAID is almost always a mistake. Most operating systems have software RAID implementations which can achieve the same results without a dedicated RAID card.

      According to the author software RAID is preferable over hardware RAID

    4. Failing disks can show signs of it in advance — degraded performance, or via S.M.A.R.T reports. Learn the tools for monitoring your storage medium, such as smartmontools, and set it up to report failures to you (and test the mechanisms by which the failures are reported to you).

      Preventive maintenance of disk failures

    5. RAID gets more creative with three or more hard drives, utilizing parity, which allows it to reconstruct the contents of failed hard drives from still-online drives.

      If you are using RAID and one of the 3 drives fail, you can still recover its content thanks to XOR operation

    6. A more reliable solution is to store the data on a hard drive1. However, hard drives are rated for a limited number of read/write cycles, and can be expected to fail eventually.

      Hard drives are a better lifetime option than microSD cards but still not ideal

    7. The worst way I can think of is to store it on a microSD card. These fail a lot. I couldn’t find any hard data, but anecdotally, 4 out of 5 microSD cards I’ve used have experienced failures resulting in permanent data loss.

      microSD cards aren't recommended for storing lifetime data

  3. Mar 2021
    1. The console is a killer SQLite feature for data analysis: more powerful than Excel and more simple than pandas. One can import CSV data with a single command, the table is created automatically

      SQLite makes it fairly easy to import and analyse data. For example:

      • import --csv city.csv city
      • select count(*) from city;
    1. This is not a problem if your DBMS supports SQL recursion: lots of data can be generated with a single query. The WITH RECURSIVE clause comes to the rescue.

      WITH RECURSIVE can help you quickly generate a series of random data.

  4. Oct 2020
    1. Queries became impractically slow around the 500,000 cell mark, but were still below 2 seconds for a 100,000 cell query. Therefore, if you anticipate a dataset larger than a few hundred thousand cells, it would probably be smart to choose a more scalable option.

      Scalability of Google Sheets. They have a hard limit of 5,000,000 cells (including blank ones)

  5. Sep 2020
    1. DuckDB is an embeddable SQL OLAP database management system

      Database not requiring a server like SQLite and offering advantages of PostgreSQL

  6. Aug 2020
    1. The Splitgraph DDN is a single SQL endpoint that lets you query over 40,000 public datasets hosted on or proxied by Splitgraph.You can connect to it from most PostgreSQL clients and BI tools without having to install anything else. It supports all read-only SQL constructs, including filters and aggregations. It even lets you run joins across distinct datasets.

      Splitgraph - efficient DDN (Data Delivery Network):

      • connect to it from most PostgreSQL clients and BI tools without having to install anything else
      • you can queory +40k public datasets hosten on or proxied by Splitgraph
      • supports all SQL constructs (even SQL joins between tables)
  7. Jul 2020
    1. So in brief, for our application service, if we understand the access patterns very well, they’re repeatable, they’re consistent, and scalability is a big factor, then NoSQL is a perfect choice.

      When NoSQL is a perfect choice

    2. Comparison Time … 🤞

      Brief comparison of 8 aspects between SQL vs NoSQL

  8. May 2020
    1. Which database technology to choose

      Which database to choose (advice from an Amazon employee):

      • SQL - ad hoc queries and/or support of ACID and transactions
      • NoSQL - otherwise. NoSQL is getting better with transactions and PostgreSQL is getting better with availability, scalability, durability
  9. Apr 2020
    1. From a narratological perspective, it would probably be fair to say that most databases are tragic. In their design, the configuration of their user interfaces, the selection of their contents, and the indexes that manage their workings, most databases are limited when set against the full scope of the field of information they seek to map and the knowledge of the people who created them. In creating a database, we fight against the constraints of the universe – the categories we use to sort out the world; the limitations of time and money and technology – and succumb to them.

      databases are tragic!

    1. I’m sharing a few insights I specifically found useful for developers who are not specialized in this domain.

      Insights on databases from a Google engineer:

      1. You are lucky if 99.999% of the time network is not a problem.
      2. ACID has many meanings.
      3. Each database has different consistency and isolation capabilities.
      4. Optimistic locking is an option when you can’t hold a lock.
      5. There are anomalies other than dirty reads and data loss.
      6. My database and I don’t always agree on ordering.
      7. Application-level sharding can live outside the application.
      8. AUTOINCREMENT’ing can be harmful.
      9. Stale data can be useful and lock-free.
      10. Clock skews happen between any clock sources.
      11. Latency has many meanings.
      12. Evaluate performance requirements per transaction.
      13. Nested transactions can be harmful.
      14. Transactions shouldn’t maintain application state.
      15. Query planners can tell a lot about databases.
      16. Online migrations are complex but possible.
      17. Significant database growth introduces unpredictability.
    1. 1) Redash and Falcon focus on people that want to do visualizations on top of SQL2) Superset, Tableau and PowerBI focus on people that want to do visualizations with a UI3) Metabase and SeekTable focus on people that want to do quick analysis (they are the closest to an Excel replacement)

      Comparison of data analysis tools:

      1) Redash & Falcon - SQL focus

      2) Superset, Tableau & PowerBI - UI workflow

      3) Metabase & SeekTable - Excel like experience

  10. Mar 2020
    1. supporting this field is extremely easy If you keep raw data, it's just a matter of adding a getter method to the Article class.

      Way of supporting a new field in JSON is much easier than in a relational database:

      def highlights(self) -> Sequence[Highlight]:
          default = [] # defensive to handle older export formats that had no annotations
          jsons = self.json.get('annotations', default)
          return list(map(Highlight, jsons))
    2. query language doesn't necessarily mean a database. E.g. see pandas which is capable of what SQL is capable of, and even more convenient than SQL for our data exploration purposes.

      Query language, not always = database. For example, see pandas

    3. cachew lets you cache function calls into an sqlite database on your disk in a matter of single decorator (similar to functools.lru_cache). The difference from functools.lru_cache is that cached data is persisted between program runs, so next time you call your function, it will only be a matter of reading from the cache.

      cachew tool isolates the complexity of database access patterns in a Python library

  11. Feb 2020
    1. Imagine that you're using a database to export them, so your schema is: TABLE Article(STRING id, STRING url, STRING title, DATETIME added). One day, the developers expose highlights (or annotations) from the private API and your export script stats receiving it in the response JSON. It's quite useful data to have! However, your database can't just magically change to conform to the new field.

      Relational model can be sometimes hand tying, unlike JSON

    2. Storage saved by using a database instead of plaintext is marginal and not worth the effort.

      Databases save some space used by data, but it's marginal

    3. if necessary use databases as an intermediate layer to speed access up and as an additional interface to your data Nothing wrong with using databases for caching if you need it!

      You may want to use databases for:

      • speeding access up
      • creating additional layer
      • caching
    4. I want to argue very strongly against forcing the data in the database, unless it's really inevitable.

      After scraping some data, don't go immediately to databases, unless it's a great stream of data

  12. Jan 2020
  13. Nov 2019
    1. FKs don't work well with online schema migrations.

      3rd reason why at GitHub they don't rely on Foreign Keys: Working with online schema migrations.

      FKs impose a lot of constraints on what's possible and what's not possible

    2. FKs are a performance impact. The fact they require indexes is likely fine, since those indexes are needed anyhow. But the lookup made for each insert/delete is an overhead.

      2nd reason why at GitHub they don't rely on Foreign Keys: FK performance impact

    3. FKs are in your way to shard your database. Your app is accustomed to rely on FK to maintain integrity, instead of doing it on its own. It may even rely on FK to cascade deletes (shudder). When eventually you want to shard or extract data out, you need to change & test the app to an unknown extent.

      1st reason why at GitHub they don't rely on Foreign Keys: Relying on FK to maintain integrity, instead of doing it on its own

  14. Sep 2019
    1. To address the availability concern, new architectures were developed to minimize the impact of partitions. For instance, splitting data sets into smaller ranges called shards can minimize the amount of data that is unavailable during partitions. Furthermore, mechanisms to automatically alter the roles of various cluster members based on network conditions allow them to regain availability quickly

      Qualities of NewSQL - mainly minimisation of the impact of partitions

    2. typically less flexible and generalized than their more conventional relational counterparts. They also usually only offer a subset of full SQL and relational features, which means that they might not be able to handle certain kinds of usage. Many NewSQL implementations also store a large part of or their entire dataset in the computer's main memory. This improves performance at the cost of greater risk to unpersisted changes

      Differences between NewSQL and relational databases:

      • typically less flexible and generalized
      • usually only offer a subset of full SQL and relational features, which means that they might not be able to handle certain kinds of usage.
      • many NewSQL implementations also store a large part of or their entire dataset in the computer's main memory. This improves performance at the cost of greater risk to unpersisted changes.
    3. using a mixture of different database types is the best approach for handling the data of your projects

      Many times mixing different databases is a good approach.

      For example:

      • store user information - relational databases
      • configuration values - in-memory key-value store
    4. best suited for use cases with high volumes of relational data in distributed, cloud-like environments

      Best suit of NewSQL

    5. CAP theorem is a statement about the trade offs that distributed databases must make between availability and consistency. It asserts that in the event of a network partition, a distributed database can choose either to remain available or remain consistent, but it cannot do both. Cluster members in a partitioned network can continue operating, leading to at least temporary inconsistency. Alternatively, at least some of the disconnected members must refuse to alter their data during the partition to ensure data consistency

      CAP Theorem relating to distributed databases


    6. NewSQL databases: bringing modern scalability and performance to the traditional relational pattern

      NewSQL databases - designed with scalability and modern performance requirements. Follow the relational structure and semantics, but are built using more modern, scalable design. Rise in popularity in 2010s.


      • MemSQL
      • VoltDB
      • Spanner
      • Calvin
      • CockroachDB
      • FaunaDB
      • yugabyteDB
    7. aggregate queries like summing, averaging, and other analytics-oriented processes can be difficult or impossible

      Disadvantage of column databases

    8. Column-family databases are good when working with applications that requires great performance for row-based operations and highly scalability

      Advantage of column databases. They also collect row data in a cluster on the same machine, simplifying data sharding and scaling

    9. it helps to think of column family databases as key-value databases where each key (row identifier) returns a dictionary of arbitrary attributes and their values (the column names and their values)

      Tip to remember the idea of column databases

    10. Column-family databases: databases with flexible columns to bridge the gap between relational and document databases

      Column-family databases - also called as non-relational column stores, wide-column databases or column databases. Rise in popularity in 2000s. Look highly similar to relational databases. They have structure called column families, which contain rows of data, each of which define their own format. Therefore, each row in a column family defines its own schema.


      • Cassandra
      • HBase

      Diagram of column-family database structure

    11. querying for the connection between two users of a social media site in a relational database is likely to require multiple table joins and therefore be rather resource intensive. This same query would be straightforward in a graph database that directly maps connections

      Social media prefers graph databases over relational ones

    12. Graph databases are most useful when working with data where the relationships or connections are highly important

      Major use of graph databases

    13. network databases require step-by-step traversal to travel between items and are limited in the types of relationships they can represent.

      Difference between network databases (SQL) and graph databases (NoSQL)

    14. Graph databases: mapping relationships by focusing on how connections between data are meaningful

      Graph databases - establishes connections using the concepts of nodes, edges, and properties. Rise in popularity in 2000s.


      • Neo4j
      • JanusGraph
      • Dgraph

      Diagram of a graph database structure

    15. Document databases: Storing all of an item's data in flexible, self-describing structures

      Document databases - also known as document-oriented databases or document stores, share the basic access and retrieval semantics of key-value stores. Rise in popularity in 2009.

      They also used keys to uniquely identify data, therefore the line between advanced key-value stores and document databases can be fairly unclear.

      Instead of storing arbitrary blobs of data, document databases store data in structured formats called documents, often using formats like JSON, BSON, or XML.


      • MongoDB
      • RethinkDB
      • Couchbase

      Diagram of document database

    16. Document databases are a good choice for rapid development because you can change the properties of the data you want to save at any point without altering existing structures or data. You only need to backfill records if you want to. Each document within the database stands on its own with its own system of organization. If you're still figuring out your data structure and your data is mainly composed discrete entries that don't include a lot of cross references, a document database might be a good place to start. Be careful, however, as the extra flexibility means that you are responsible for maintaining the consistency and structure of your data, which can be extremely challenging

      Pros and cons of document databases

    17. Though the data within documents is organized within a structure, document databases do not prescribe any specific format or schema

      Therefore, unlike in key-value stores, the content stored in document databases can be queried and analysed

    18. Key-value stores are often used to store configuration data, state information, and any data that might be represented by a dictionary or hash in a programming language. Key-value stores provide fast, low-complexity access to this type of data

      Use and advantages of of key-value stores

    19. Key-value databases: simple, dictionary-style lookups for basic storage and retrieval

      Key-value databases - one of the simplest database types. Initially introduced in 1970s (rise in popularity: 2000-2010). Work by storing arbitrary data accessible through a specific key.

      • to store data, you provide a key and the blob of data you wish to save, for example a JSON object, an image, or plain text.
      • to retrieve data, you provide the key and will then be given the blob of data back.


      • Redis
      • memcached
      • etcd

      Diagram of key-value store

    20. NoSQL databases: modern alternatives for data that doesn't fit the relational paradigm

      NoSQL databases - stands for either non-SQL or not only SQL to clarify that sometimes they allow SQL-like querying.

      4 types:

      • Key-value
      • Document
      • Graph
      • Column-family
    21. relational databases are often a good fit for any data that is regular, predictable, and benefits from the ability to flexibly compose information in various formats. Because relational databases work off of a schema, it can be more challenging to alter the structure of data after it is in the system. However, the schema also helps enforce the integrity of the data, making sure values match the expected formats, and that required information is included. Overall, relational databases are a solid choice for many applications because applications often generate well-ordered, structured data

      Pros and cons of relational database

    22. querying language called SQL, or structured query language, was created to access and manipulate data stored with that format

      SQL was created for relational databases

    23. Relational databases: working with tables as a standard solution to organize well-structured data

      Relational databases - oldest general purpose database type still widely used today. They comprise the majority of databases currently used in production. Initially introduced in 1969.

      They organise data using tables - structures that impose a schema on the records that they hold.

      • each column has a name and a data type
      • each row represents an individual record


      • MySQL
      • MariaDB
      • PostgreSQL
      • SQLite

      Diagram of relational schema used to map entities for a school

    24. database schema is a description of the logical structure of a database or the elements it contains. Schemas often include declarations for the structure of individual entries, groups of entries, and the individual attributes that database entries are comprised of. These may also define data types and additional constraints to control the type of data that may be added to the structure

      Database schema

    25. Network databases: mapping more flexible connections with non-hierarchical links

      Network databases - built on the foundation provided by hierarchical databases by adding additional flexibility. Initially introduced in late 1960s. Instead of always having a single parent, as in hierarchical databases, network database entries can have more than one parent, which effectively allows them to model more complex relationships.


      • IDMS

      Have graph-like structure Diagram of a network database

    26. Hierarchical databases: using parent-child relationships to map data into trees

      Hierarchical databases - the next evolution in database development. Initially introduced in 1960s. They encode a relationship between items where every record has a single parent.


      • Filesystems
      • DNS
      • LDAP directories

      Have tree-like structure Diagram of a hierarchical database

    27. Hierarchical databases are not used much today due to their limited ability to organize most data and because of the overhead of accessing data by traversing the hierarchy

      Hierarchical databases aren't used as much anymore

    28. The first flat file databases represented information in regular, machine parse-able structures within files. Data is stored in plain text, which limits the type of content that can be represented within the database itself. Sometimes, a special character or other indicator is chosen to use as a delimiter, or marker for when one field ends and the next begins. For example, a comma is used in CSV (comma-separated values) files, while colons or white-space are used in many data files in Unix-like systems

      Flat-file databases - 1st type of databases with a simple data structure for organising small amounts of local data.


      • /etc/passwd and /etc/fstab on Linux and Unix-like systems
      • CSV files
    29. Some advantages of this format

      Advantages of flat-file format:

      • has robust, flexible toolkit
      • easily managed without specialised software
      • easy to understand and work with
    30. While flat file databases are simple, they are very limited in the level of complexity they can handle

      Disadvantages of flat-file databases:

      • system that reads or manipulates the data cannot make easy connections between the data represented
      • usually don't have any type of user or data concurrency features either
      • usually only practical for systems with small read or write requirements. For example, many operating systems use flat-files to store configuration data
  15. Dec 2018
  16. Aug 2018
  17. May 2018
  18. Oct 2017
    1. MySQL’s replication architecture means that if bugs do cause table corruption, the problem is unlikely to cause a catastrophic failure.

      I can't follow the reasoning here. I guess it's not guaranteed to replicate the corruption like Postgres would, but it seems totally possible to trigger similar or identical corruption because the implementation of the logical statement would be similar on the replica.

    2. The bug we ran into only affected certain releases of Postgres 9.2 and has been fixed for a long time now. However, we still find it worrisome that this class of bug can happen at all. A new version of Postgres could be released at any time that has a bug of this nature, and because of the way replication works, this issue has the potential to spread into all of the databases in a replication hierarchy.

      Not really a criticism of Postgres so much as it is a criticism of software in general.

  19. Aug 2017
  20. Jun 2016
    1. If the RRID is well-formed, and if the lookup found the right record, a human validator tags it a valid RRID — one that can now be associated mechanically with occurrences of the same resource in other contexts. If the RRID is not well-formed, or if the lookup fails to find the right record, a human validator tags the annotation as an exception and can discuss with others how to handle it. If an RRID is just missing, the validator notes that with another kind of exception tag.

      Sounds a lot like the way reference managers work. In many cases, people keep the invalid or badly-formed results.

  21. Apr 2016
  22. Jan 2016
  23. Dec 2015
    1. Data gathering is ubiquitous in science. Giant databases are currently being minedfor unknown patterns, but in fact there are many (many) known patterns that simplyhave not been catalogued. Consider the well-known case of medical records. A patient’smedical history is often known by various individual doctor-offices but quite inadequatelyshared between them. Sharing medical records often means faxing a hand-written noteor a filled-in house-created form between offices.
    2. I will use a mathematical tool calledologs, or ontology logs, to givesome structure to the kinds of ideas that are often communicated in pictures like theone on the cover. Each olog inherently offers a framework in which to record data aboutthe subject. More precisely it encompasses adatabase schema, which means a system ofinterconnected tables that are initially empty but into which data can be entered.
  24. May 2015
  25. Oct 2014
    1. This in turn means that Redis Cluster does not have to take meta data in the data structures in order to attempt a value merge, and that the fancy commands and data structures supported by Redis are also supported by Redis Cluster. So no additional memory overhead, no API limits, no limits in the amount of elements a value can contain, but less safety during partitions.

      A solid trade-off, I think, and says a lot about the intended use cases.

  26. Sep 2014
    1. Fast restart. If a server is temporarily taken down, this capability restores the index from a saved copy, eliminating delays due to index rebuilding.

      This point seems to be in direct contradiction to the claim above that "Indexes (primary and secondary) are always stored in DRAM for fast access and are never stored on Solid State Drives (SSDs) to ensure low wear."

    2. Unlike other databases that use the linux file system that was built for rotational drives, Aerospike has implemented a log structured file system to access flash – raw blocks on SSDs – directly.

      Does this really mean to suggest that Aerospike bypasses the linux block device layer? Is there a kernel driver? Does this mean I can't use any filesystem I want and know how to administrate? Is the claim that the "linux file system" (which I take to mean, I guess, the virtual file system layer) "built for rotation drives" even accurate? We've had ram disks for a long, long time. And before that we've had log structured filesystems, too, and even devices that aren't random access like tape drives. Seems like dubious claims all around.