2,548 Matching Annotations
  1. Apr 2020
    1. Sometimes, we don't want to keep the changes that were introduced by certain commits. Unlike a soft reset, we shouldn't need to have access to them any more.

      Hard reset - points HEAD to the specified commit.

      Discards the changes made since the commit that HEAD now points to, and deletes those changes from the working directory.

      git reset --hard HEAD~2
      git status
      

      hard reset animation

    2. git rebase copies the commits from the current branch, and puts these copied commits on top of the specified branch.

      Rebasing - copies the commits of the current branch on top of another branch without creating a merge commit, which keeps a linear history.

      Changes the history as new hashes are created for the copied commits.

      git rebase master
      

      A big difference compared to merging is that Git won't try to figure out which files to keep and which to discard. The branch that we're rebasing always has the latest changes that we want to keep! This way you won't run into merge conflicts, and you keep a nice linear Git history.

      Rebasing is great whenever you're working on a feature branch, and the master branch has been updated.

      rebasing animation

    3. This can happen when the two branches we're trying to merge have changes on the same line in the same file, or if one branch deleted a file that another branch modified, and so on.

      Merge conflict - you have to decide from which branch to keep the change.

      After:

      git merge dev
      

      Git will notify you about the merge conflict so you can manually remove the changes you don't want to keep, save them, and then:

      git add updated_file.md
      git commit -m "Merge..."
      

      merge conflict animation

    4. If we committed changes on the current branch that the branch we want to merge doesn't have, git will perform a no-fast-forward merge.

      No-fast-forward merge - default behavior when current branch contains commits that the merging branch doesn't have.

      Creates a new merge commit that joins the two branches together without modifying either of the existing branches.

      git merge dev
      

      no-fast-forward merge animation

    5. fast-forward merge can happen when the current branch has no extra commits compared to the branch we’re merging.

      Fast-forward merge - default behavior when the branch being merged in already contains all of the current branch's commits.

      Doesn't create a new commit, thus doesn't modify existing branches.

      git merge dev
      

      fast-forward merge animation

    6. soft reset moves HEAD to the specified commit (or the index of the commit compared to HEAD), without getting rid of the changes that were introduced on the commits afterward!

      Soft reset - points HEAD to the specified commit.

      Keeps the changes made since the commit that HEAD now points to, and keeps the modifications in the working directory.

      git reset --soft HEAD~2
      git status
      

      soft reset animation

    7. git reset gets rid of all the current staged files and gives us control over where HEAD should point to.

      Reset - a way to get rid of unwanted commits. There are soft and hard resets.

    8. There are 6 actions we can perform on the commits we're rebasing

      Interactive rebase - makes it possible to edit commits before rebasing.

      Creates new commits for the edited commits, so the history is rewritten.

      6 actions (options) of interactive rebase:

      • reword: Change the commit message
      • edit: Amend this commit
      • squash: Meld commit into the previous commit
      • fixup: Meld commit into the previous commit, without keeping the commit's log message
      • exec: Run a command on each commit we want to rebase
      • drop: Remove the commit
      git rebase -i HEAD~3
      

      drop animation

      squash animation

    1. Pharma, which is one of the biggest, richest, most rewarding and promising industries in the world. Especially now, when the pharmaceutical industry, including the FDA, allows R to be used in a domain occupied 110% by SAS.

      Pharma industry is one of the most rewarding industries, especially now

    2. CR is one of the most controlled industries in this world. It's insanely conservative in both the statistical methods and the programming it uses. Once a program is written and validated, it may be used for decades. There are SAS macros written in 1980 that still work today without any change. That's because of the brilliant backward compatibility of the SAS macro language. New features DO NOT cause the old mechanisms to be REMOVED. It's here FOREVER+1 day.

      Clinical Research is highly conservative, which keeps SAS macros in use for decades. Unfortunately, the same cannot be said of R.

  2. Mar 2020
    1. wearing simple face masks which exert a barrier function that blocks those big projectile droplets that land in the nose or throat may substantially reduce the reproduction rate R, to an extent that may be comparable to social distancing and washing hands.

      Most important message of the article

    2. avoiding large droplets, which cannot enter the lung anyway but land in the upper respiratory tracts, could be the most effective means to prevent infection. Therefore, surgical masks, perhaps even your ski-mask, bandanas or scarf

      Wear a mask!

    3. Surprisingly, ACE2 expression in the lung is very low: it is limited to a few molecules per cell in the alveolar cells (AT2 cells) deep in the lung. But a just published paper by the Human Cell Atlas (HCA) consortium reports that ACE2 is highly expressed in some type of (secretory) cells of the inner nose!

      Major route of viral entry is likely via large droplets that land in the nose — where expression of the viral entry receptor, ACE2 is highest. This is the transmission route that could be effectively blocked already by simple masks that provide a physical barrier.

    4. SARS-Cov-2 virus, like any virus, must dock onto human cells using a key-lock principle, in which the virus presents the key and the cell the lock that is complementary to the key to enter the cell and replicate. For the SARS-Cov-2 virus, the viral surface protein “Spike protein S” is the “key” and it must fit snugly into the “lock” protein that is expressed (=molecularly presented) on the surface of the host cells. The cellular lock protein that the SARS-Cov-2 virus uses is the ACE2 protein

      SARS-CoV-2 enters the host cell by docking with its Spike protein to the ACE2 protein on the cell surface.

    5. Filtering effect for small droplets (aerosols) by various masks; home-made of tea cloth, surgical mask (3M “Tie-on”) and a FFP2 (N95) respirator mask. The numbers are scaled to the reference of 100 (source of droplets) for illustrative purposes, calculated from the PF (protection factor) values

    6. The tacit notion at the CDC that the alveolae are the destination site for droplets to deliver the virus load (the alveolae are after all the anatomical site of life-threatening pneumonia), has elevated the apparent importance of N95 masks and led to the dismissal of surgical masks.

      Why N95 masks came to be seen as much better than surgical masks

    7. droplets of a typical cough expulsion have a size distribution such that approximately half of the droplet are in the categories of aerosols, albeit they collectively represent only less than 1/100,000 of the expelled volume

      Droplets of a typical cough

    8. For airborne particles to be inspired and reach deep into the lung, through all the air ducts down to the alveolar cells where gas-exchange takes place, it has to be small

      Only droplets < 10 um can reach the alveoli (deep in the lung). Larger droplets get stuck in the nose, throat, upper air ducts of the lung, trachea and large bronchi.

    9. Droplets can (for this discussion) be crudely divided in two large categories based on size

      2 categories of droplets:

      a) Droplets < 10 um: upper size limit of aerosols. They can float around rooms, carried by ventilation or winds, and can be filtered (at ~95%) by N95 facial masks, even for droplets down to 0.3 um. Surgical masks cannot help here.

      b) Droplets > 10 um (reaching 100+ um): called spray droplets. They can even be visible when expelled by coughing/sneezing (0.1+ mm).

    10. Droplets larger than aerosols, when exhaled (at velocity of <1m/s), evaporate or fall to the ground less than 1.5 m away. When expelled at high velocity through coughing or sneezing, especially larger droplets (> 0.1 micrometers), can be carried by the jet more than 2m or 6m, respectively, away.

    11. The official recommendation by CDC, FDA and others that masks worn by the non-health-care professionals are ineffective is incorrect at three levels: In the logic, in the mechanics of transmission, and in the biology of viral entry.
    12. "Flattening the curve". Effect of mitigating interventions that would decrease the initial reproduction rate R0 by 50% when implemented at day 25. Red curve is the course of numbers of infected individuals ("cases") without intervention. Green curve reflects the changed ("flattened") curve after intervention. Day 0 (March 3, 2020) is the time at which 100 cases of infections were confirmed (d100 = 0).

      If people started wearing masks:

    1. Ancestor of all animals identified in Australian fossils

      Summary:

      • First ancestor of most animals, including humans, has been discovered—Ikaria wariootia had a mouth, anus, gut, and a bilaterian body plan.
      • Bilateral symmetry was a critical step in evolution, enabling organisms to move purposefully, but so far the first organism to develop it wasn’t known.
      • Ikaria wariootia was discovered through careful analysis of 555 million-year-old samples.
      • It was a wormlike creature, up to 7mm (0.27in) long, with a distinct head and tail, as well as faintly grooved musculature.
      • This discovery confirms what evolutionary biologists previously predicted.
    1. The combined network of computers running as part of the Folding@Home initiative has surpassed the computing power of the seven most powerful supercomputers in the world. You read that right: the devices connected through F@H deliver computing power of about 470 PetaFLOPS - more than the seven most powerful supercomputers in the world combined! That is three times the performance of the currently most powerful supercomputer, SUMMIT (149 PFLOPS).

      Internet users have built a distributed computing network more powerful than the world's 7 most powerful supercomputers combined. The goal is to fight COVID-19.

      You can also join them by using Folding@home software

    1. This denotes the factorial of a number. It is the product of numbers starting from 1 to that number.

      Exclamation in Python: $$x!$$ is written as:

      x = 5
      fact = 1
      for i in range(x, 0, -1):
          fact = fact * i
      print(fact)
      

      it can be shortened as:

      import math
      math.factorial(x)
      

      and the output is:

      # 5*4*3*2*1
      120
      
    2. The hat gives the unit vector. This means dividing each component in a vector by its length (norm).

      Hat in Python: $$\hat{x}$$ is written as:

      import math

      x = [1, 2, 3]
      length = math.sqrt(sum([e**2 for e in x]))
      x_hat = [e/length for e in x]
      

      This makes the magnitude of the vector 1 and only keeps the direction:

      math.sqrt(sum([e**2 for e in x_hat]))
      # 1.0
      
    3. It gives the sum of the products of the corresponding entries of the two sequences of numbers.

      Dot Product in Python: $$X.Y$$ is written as:

      X = [1, 2, 3]
      Y = [4, 5, 6]
      dot = sum([i*j for i, j in zip(X, Y)])
      # 1*4 + 2*5 + 3*6
      # 32
      
    4. It means multiplying the corresponding elements in two tensors. In Python, this would be equivalent to multiplying the corresponding elements in two lists.

      Element wise multiplication in Python: $$z=x\odot y$$ is written as:

      import numpy as np

      x = [[1, 2],
           [3, 4]]
      y = [[2, 2],
           [2, 2]]
      z = np.multiply(x, y)
      

      and results in:

      [[2 4]
       [6 8]]
      
    5. This denotes a function which takes a domain X and maps it to range Y. In Python, it’s equivalent to taking a pool of values X, doing some operation on it to calculate pool of values Y.

      Function in Python: $$f:X \rightarrow Y$$ is written as:

      def f(X):
          Y = ...
          return Y
      

      Using R instead of X or Y means we're dealing with real numbers: $$f:R \rightarrow R$$. Then $$R^d$$ means we're dealing with a d-dimensional vector of real numbers (for example, with d=2, X = [1, 2]).

    6. The norm is used to calculate the magnitude of a vector. In Python, this means squaring each element of an array, summing them and then taking the square root.

      Norm of a vector in Python (it's like the Pythagorean theorem): $$\|x\|$$ is written as:

      import math

      x = [1, 2, 3]
      math.sqrt(x[0]**2 + x[1]**2 + x[2]**2)
      
    7. In Python, it is equivalent to looping over a vector from index 0 to index N-1 and multiplying them.

      Pi (product) in Python is the same as sigma, but you multiply (*) the numbers inside the for loop: $$\prod_{i=1}^N x_i$$
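
      A minimal sketch of this in Python (mirroring the sigma loop shown further below; the example values are my own):

      x = [1, 2, 3, 4, 5]
      result = 1
      N = len(x)
      for i in range(N):
          result = result * x[i]
      print(result)
      # 1*2*3*4*5 = 120

      In Python 3.8+ this can be shortened to math.prod(x).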

    8. we reuse the sigma notation and divide by the number of elements to get an average.

      Average in Python: $$\frac{1}{N}\sum_{i=1}^Nx_i$$ is written as:

      x = [1, 2, 3, 4, 5]
      result = 0
      N = len(x)
      for i in range(N):
          result = result + x[i]
      average = result / N
      print(average)
      

      or it can be shortened:

      x = [1, 2, 3, 4, 5]
      result = sum(x) / len(x)
      
    9. In Python, it is equivalent to looping over a vector from index 0 to index N-1

      Sigma in Python: $$\sum_{i=1}^Nx_i$$ is written as:

      x = [1, 2, 3, 4, 5]
      result = 0
      N = len(x)
      for i in range(N):
          result = result + x[i]
      print(result)
      

      or it can be shortened:

      x = [1, 2, 3, 4, 5]
      result = sum(x)
      
    1. 1–9–90 rule (sometimes 90–9–1 principle or the 89:10:1 ratio),[1] which states that in a collaborative website such as a wiki, 90% of the participants of a community only consume content, 9% of the participants change or update content, and 1% of the participants add content.

      1% rule = 1% of the users create and 99% watch the content.

      1-9-90 rule = 1% create, 9% modify and 90% watch

    1. Another nice SQL script paired with CRON jobs was the one that reminded people of carts that was left for more than 48 hours. Select from cart where state is not empty and last date is more than or equal to 48hrs.... Set this as a CRON that fires at 2AM everyday, period with less activity and traffic. People wake up to emails reminding them about their abandoned carts. Then sit watch magic happens. No AI/ML needed here. Just good 'ol SQL + Bash.

      Another example of using SQL + a CRON job + Bash to remind customers of carts that were left behind (again, no ML needed here)

    2. I will write a query like select from order table where last shop date is 3 or greater months. When we get this information, we will send a nice "we miss you, come back and here's X Naira voucher" email. The conversation rate for this one was always greater than 50%.

      Sometimes SQL is more than enough (you don't need ML)

    1. This volume of paper should be the same as the coaxial plug of paper on the roll.

      Calculating the volume of the paper roll: $$Lwt = \pi w(R^2 - r^2)$$ where L = length of the paper, w = width of the paper, t = thickness, R = outer radius, r = inner radius. This simplifies into a formula for R: $$R = \sqrt{\frac{Lt}{\pi}+r^2}$$
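
      A quick numeric check of the formula in Python (the values below are illustrative assumptions, not measurements from the article):

      import math

      L = 2000    # length of the paper in inches (assumed)
      t = 0.01    # paper thickness in inches (assumed)
      r = 0.8     # inner radius in inches (assumed)

      R = math.sqrt(L * t / math.pi + r**2)
      print(R)    # outer radius implied by the formula above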

    2. This shows the nonlinear relationship and how the consumption accelerates. The first 10% used makes just a 5% change in the diameter of the roll. The last 10% makes an 18.5% change.

      Consumption of a toilet paper roll has a nonlinear relationship between the:

      • y-axis (outer Radius of the roll (measured as a percentage of a full roll))
      • x-axis (% of the roll consumed)
    3. Toilet paper is typically supplied in rolls of perforated material wrapped around a central cardboard tube. There's a little variance between manufacturers, but a typical roll is approximately 4.5” wide with a 5.25” external diameter, and a central tube of diameter 1.6”. Toilet paper is big business (see what I did there?): worldwide, approximately 83 million rolls are produced per day; that's a staggering 30 billion rolls per year. In the USA, about 7 billion rolls a year are sold, so the average American citizen consumes two dozen rolls a year (two per month). Again, it depends on the thickness and luxuriousness of the product, but the perforations typically divide the roll into approximately 1,000 sheets (for single-ply), or around 500 sheets (for double-ply). Each sheet is typically 4” long, so the length of a (double-ply) toilet roll is approximately 2,000” or 167 feet (or less, if your cat gets to it).

      Statistics on the type and use of toilet paper in the USA.

      1" (inch) = 2.54 cm

    1. In the interval scale, there is no true zero point or fixed beginning. They do not have a true zero even if one of the values carries the name “zero.” For example, with temperature, there is no point where the temperature can be zero: zero degrees F does not mean the complete absence of temperature. Since the interval scale has no true zero point, you cannot calculate ratios. For example, it makes no sense for the ratio of 90 to 30 degrees F to be the same as the ratio of 60 to 20 degrees. A temperature of 20 degrees is not twice as warm as one of 10 degrees.

      Interval data:

      • show not only order and direction, but also the exact differences between the values
      • the distances between each value on the interval scale are meaningful and equal
      • no true zero point
      • no fixed beginning
      • no possibility to calculate ratios (only add and subtract)
      • e.g.: temperature in Fahrenheit or Celsius (but not Kelvin) or IQ test
    2. Like interval scales, ratio scales show us the order and the exact value between the units. However, in contrast with interval scales, ratio scales have an absolute zero that allows us to perform a huge range of descriptive and inferential statistics. Ratio scales possess a clear definition of zero. Any type of value that can be measured from absolute zero can be measured with a ratio scale. The most popular examples of ratio variables are height and weight. In addition, one individual can be twice as tall as another individual.

      Ratio data is like interval data, but with:

      • absolute zero
      • possibility to calculate ratio (e.g. someone can be twice as tall)
      • possibility to not only add and subtract, but multiply and divide values
      • e.g.: weight, height, Kelvin scale (50K is twice as hot as 25K)
    1. Javascript, APIs and Markup — this stack is all about finding middleground from the chaos of SSR+SPA. It is about stepping back and asking yourself, what parts of my page change and what parts don’t change?

      JavaScript, APIs and Markup (JAM Stack) - middleground between SSR + SPA.

      Advantages:

      • The parts that don’t change often are pre-rendered on the server and saved to static HTML files. Anything else is implemented in JS and run on the client using API calls.
      • Avoids excessive data transfer (like the hydration data needed for SSR), therefore striking a good tradeoff for shipping web content
      • Lets you leverage the power and low cost of Content Delivery Networks (CDNs) to serve your content effectively
      • With serverless apps your APIs will never need a server to SSH into and manage
    2. Somewhere on this path to render pages on the fly (SSR) and render pages on the client (SPA) we forgot about the performance of our webpages. We were trying to build apps. But the web is about presenting content first and foremost!

      Website performance suffered with Server-Side Rendering (SSR) and Single-Page Apps (SPA)

    3. We were not satisfied with the basic capabilities like bold and italics so we built CSS. Now, we wanted to modify some parts of the HTML/CSS in response to things like clicking things, so we implemented a scripting language to quickly specify such relations and have then run within the browser itself instead of a round trip to the server.

      Birth of CSS - advanced styling

      (history of websites)

    4. And so was born PHP; it feels like a natural extension to HTML itself. You write your code within your HTML file itself and are then able to run those parts on the server, which further generates HTML, and the final HTML gets sent to the browser. This was extremely powerful. We could serve completely different pages to different users even though all of them access the same URL, like Facebook. We could use a database on a server and store some data there, then based on some conditions use this data to modify the generated HTML and technically have an infinite number of pages available to serve (e-commerce).

      Birth of PHP - way to serve different content under the same URL

    1. TL;DR;

      Don't use:

      • Global vars.
      • "var" to declare variables.
      • The "function" keyword to declare functions.
      • "for" loops (prefer higher-order functions).
      • Array push (prefer immutability).
      • Classes.
      • delete to remove an object attribute.
      • Nested ifs.
      • else if.
      • Heavy nesting.
      • prototype on functions that could be used in a module.

      Use:

      • Common code in functions; follow the DRY principle.
      • Shortcut notation.
      • Spread operator over Object.assign (airbnb 3.8).
      • Pascal case naming.
      • Modularize your code in modules.
      • const and let!.
      • Literal syntax for object creation (airbnb 3.1).
      • Computed property names when creating objects (airbnb 3.2).
      • Property value shorthand (airbnb 3.4).
      • Group your shorthand properties at the beginning of your object (airbnb 3.5).
      • Use the literal syntax for array creation (airbnb 4.1).
      • Use array spreads ... to copy arrays. (airbnb 4.3).
      • use spreads ... instead of, Array.from. (airbnb 4.4).
      • Use return statements in array method callbacks (airbnb 4.7).
    2. Don’t use iterators, prefer js higher-order functions instead of for / for..in
      // bad
      const increasedByOne = [];
      for (let i = 0; i < numbers.length; i++) {
        increasedByOne.push(numbers[i] + 1);
      }
      
      // good
      const increasedByOne = [];
      numbers.forEach((num) => {
        increasedByOne.push(num + 1);
      });
      
    3. Ternaries should not be nested and generally be single line expressions.

      e.g.

      const foo = maybe1 > maybe2 ? 'bar' : maybeNull;
      

      Also:

      // bad
      const foo = a ? a : b;
      const bar = c ? true : false;
      const baz = c ? false : true;
      
      // good
      const foo = a || b;
      const bar = !!c;
      const baz = !c;
      
    4. Use JSDOC https://jsdoc.app/about-getting-started.html format.

      Standardise your JavaScript comments:

      1. Use block comment
        /** This is a description of the foo function. */
        function foo() {
        }
        
      2. Use JSDOC tags to describe a function:

        /**
         * Represents a book.
         * @constructor
         * @param {string} title - The title of the book.
         * @param {string} author - The author of the book.
         */
        function Book(title, author) {
        }
    5. When JavaScript encounters a line break without a semicolon, it uses a set of rules called Automatic Semicolon Insertion to determine whether or not it should regard that line break as the end of a statement, and (as the name implies) place a semicolon into your code before the line break if it thinks so. ASI contains a few eccentric behaviors, though, and your code will break if JavaScript misinterprets your line break.

      Better to place semicolons (;) explicitly in the right places and not rely on Automatic Semicolon Insertion

      e.g.

      const luke = {};
      const leia = {};
      [luke, leia].forEach((jedi) => {
        jedi.father = 'vader';
      });
      


    1. They showed that in rats on a low-fat diet, adverse structural and genetic changes within individual tissues develop more slowly. The rodents that ate less stayed young for longer.

      Low-calorie diet reduces inflammation, delays the onset of old age diseases, and generally prolongs life.

    1. If sentences contain eight words or less, readers understand 100 percent of the information. If sentences contain 43 words or longer, the reader's comprehension drops to less than 10 percent.

      <= 8 words <--- 100% understanding

      >= 43 words <--- <10% understanding

    1. if what you care about is downtime, your first thought shouldn’t be “how do I reduce deployment downtime from 1 second to 1ms”, it should be “how can I ensure database schema changes don’t prevent rollback if I screw something up.”

      Caring about downtime

    2. The features Kubernetes provides for reliability (health checks, rolling deploys), can be implemented much more simply, or already built-in in many cases. For example, nginx can do health checks on worker processes, and you can use docker-autoheal or something similar to automatically restart those processes.

      Kubernetes' health checks can be replaced with nginx on worker processes + docker-autoheal to automatically restart those processes

    3. Distributed applications are really hard to write correctly. Really. The more moving parts, the more these problems come in to play. Distributed applications are hard to debug. You need whole new categories of instrumentation and logging to getting understanding that isn’t quite as good as what you’d get from the logs of a monolithic application.

      Microservices remain a hard nut to crack.

      They are fine as an organisational scaling technique: when you have 500 developers working on one live website, they let teams work independently. For example, each team of 5 developers can be given one microservice.

    4. “Kubernetes is a large system with significant operational complexity. The assessment team found configuration and deployment of Kubernetes to be non-trivial, with certain components having confusing default settings, missing operational controls, and implicitly defined security controls.”

      Deployment of Kubernetes is non-trivial

    5. the Kubernetes codebase has significant room for improvement. The codebase is large and complex, with large sections of code containing minimal documentation and numerous dependencies, including systems external to Kubernetes.

      As of March 2020, the Kubernetes code base has more than 580 000 lines of Go code

    1. That makes sense, the new file gets created in the upper directory.

      If you add a new file, such as with:

      $ echo "new file" > merged/new_file

      It will be created in the upper directory

    2. Combining the upper and lower directories is pretty easy: we can just do it with mount!

      Combining lower and upper directories using mount:

      $ sudo mount -t overlay overlay -o lowerdir=/home/bork/test/lower,upperdir=/home/bork/test/upper,workdir=/home/bork/test/work /home/bork/test/merged

    3. Overlay filesystems, also known as “union filesystems” or “union mounts” let you mount a filesystem using 2 directories: a “lower” directory, and an “upper” directory.

      Docker doesn't make copies of images, but instead uses an overlay.

      Overlay filesystems let you mount a filesystem using 2 directories:

      • the lower directory (read-only)
      • the upper directory (read and write).

      When a process:

      • reads a file, the overlayfs filesystem driver looks into the upper directory and if it's not present, it looks into the lower one
      • writes a file, overlayfs will just use the upper directory
    1. Using Facebook ads, the researchers recruited 2,743 users who were willing to leave Facebook for one month in exchange for a cash reward. They then randomly divided these users into a Treatment group, that followed through with the deactivation, and a Control group, that was asked to keep using the platform. 

      The effects of not using Facebook for a month:

      • on average another 60 free mins per day
      • small but significant improvement in well-being, and in particular in self-reported happiness, life satisfaction, depression and anxiety
      • participants were less willing to use Facebook after the study ended
      • the group was less likely to follow politics
      • deactivation significantly reduced polarization of views on policy issues and a measure of exposure to polarizing news
      • 80% agreed that the deactivation was good for them
    1. While the cognitive benefits of caffeine — increased alertness, improved vigilance, enhanced focus and improved motor performance — are well established, she said, the stimulant’s affect on creativity is less known.

      Coffee:

      • + alertness
      • + vigilance
      • + focus
      • + motor performance
      • ? creativity

    1. The new virus is genetically 96% identical to a known coronavirus in bats and 86-92% identical to a coronavirus in pangolin. Therefore, the transmission of a mutated virus from animals to humans is the most likely cause of the appearance of the new virus.

      Source of COVID-19

    1. When asked why Wuhan was so much higher than the national level, the Chinese official replied that it was for lack of resources, citing as an example that there were only 110 critical care beds in the three designated hospitals where most of the cases were sent.

      Wuhan's rate then = 4.9%

      National rate = 2.1%

    2. The South Korean government is delivering food parcels to those in quarantine. Our national and local governments need to quickly organise the capacity and resources required to do this. Japanese schools are scheduled to be closed for March.

      Food delivery in South Korea and closing schools in Japan

    3. The limited availability of beds in Wuhan raised their mortality rate from 0.16% to 4.9%. This is why the Chinese government built a hospital in a week. Are our governments capable of doing the same?

      Case of Wuhan

    4. The UK population is 67 million people; that's 5.4 million infected. Current predictions are that 80% of the cases will be mild. If 20% of those people require hospitalization for 3–6 weeks? That's 1,086,176 people. Do you know how many beds the NHS has? 140,000

      There will be a great lack of beds

    5. Evolving to be observant of direct dangers to ourselves seems to have left us terrible at predicting second and third-order effects of events. When worrying about earthquakes we think first of how many people will die from collapsing buildings and falling rubble. Do we think of how many will die due to destroyed hospitals?

      Thinking of second and third-order effects of events

    6. Can you guess the number of people that have contracted the flu this year that needed hospitalisation in the US? 0.9%

      0.9% of flu cases that required hospitalisation vs 20% of COVID-19

    7. The UK has 2.8 million people over the age of 85. The US has 12.6 million people over the age of 80. Trump told people not to worry because 60,000 people a year die of the flu. If just 25% of the US over-80s cohort get infected, given current mortality rates that's 466,200 deaths in that age group alone, with the assumption that the healthcare system has the capacity to handle all of the infected.

      Interesting calculation of probable deaths among people > 80. Basically, if at least 25% of people in the US who are over 80 get infected, we might see almost 8x more deaths than from the flu

    1. dictionary (a piece of structured data) can be converted into n different possible documents (XML, PDF, paper or otherwise), where n is the number of possible permutations of the elements in the dictionary

      Dictionary

    2. The correct way to express a dictionary in XML is something like this

      Correct dictionary in XML:

      <root>
        <item>
          <key>Name</key>
          <value>John</value>
        </item>
        <item>
          <key>City</key>
          <value>London</value>
        </item>
      </root>
      
    1. JSON’s origins as a subset of JavaScript can be seen with how easily it represents key/value object data. XML, on the other hand, optimizes for document tree structures, by cleanly separating node data (attributes) from child data (elements)

      JSON for key/value object data

      XML for document tree structures (clearly separating node data (attributes) from child data (elements))

    2. The advantages of XML over JSON for trees become more pronounced when we introduce different node types. Assume we wanted to introduce departments into the org chart above. In XML, we can just use an element with a new tag name
    3. JSON is well-suited for representing lists of objects with complex properties. JSON’s key/value object syntax makes it easy. By contrast, XML’s attribute syntax only works for simple data types. Using child elements to represent complex properties can lead to inconsistencies or unnecessary verbosity.

      JSON works well for list of objects with complex properties. XML not so much

    4. UI layouts are represented as component trees. And XML is ideal for representing tree structures. It’s a match made in heaven! In fact, the most popular UI frameworks in the world (HTML and Android) use XML syntax to define layouts.

      XML works great for displaying UI layouts

    5. XML may not be ideal to represent generic data structures, but it excels at representing one particular structure: the tree. By separating node data (attributes) from parent/child relationships, the tree structure of the data shines through, and the code to process the data can be quite elegant.

      XML is good for representing tree structured data

    1. Hive is now trying to address consistency and usability. It facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.

      Apache Hive offers:

      • Streaming ingest of data - allowing readers to get a consistent view of data and avoiding too many files
      • Slow changing dimensions - dimensions of table change slowly over time
      • Data restatement - supported via INSERT, UPDATE, and DELETE
      • Bulk updates with SQL MERGE
    2. Delta Lake is an open-source platform that brings ACID transactions to Apache Spark™. Delta Lake is developed by Spark experts, Databricks. It runs on top of your existing storage platform (S3, HDFS, Azure) and is fully compatible with Apache Spark APIs.

      Delta Lake offers:

      • ACID transactions on Spark
      • Scalable metadata handling
      • Streaming and batch unification
      • Schema enforcement
      • Time travel
      • Upserts and deletes
    3. Apache Iceberg is an open table format for huge analytic data sets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table. Iceberg is focussed towards avoiding unpleasant surprises, helping evolve schema and avoid inadvertent data deletion.

      Apache Iceberg offers:

      • Schema evolution (add, drop, update, rename)
      • Hidden partitioning
      • Partition layout evolution
      • Time travel (reproducible queries)
      • Version rollback (resetting tables)
    4. rise of Hadoop as the defacto Big Data platform and its subsequent downfall. Initially, HDFS served as the storage layer, and Hive as the analytics layer. When pushed really hard, Hadoop was able to go up to few 100s of TBs, allowed SQL like querying on semi-structured data and was fast enough for its time.

      Hadoop's HDFS and Hive turned out to be unprepared for even larger sets of data

    5. Disaggregated model means the storage system sees data as a collection of objects or files. But end users are not interested in the physical arrangement of data, they instead want to see a more logical view of their data.

      File or Tables problem of disaggregated models

    6. ACID stands for Atomicity (an operation either succeeds completely or fails, it does not leave partial data), Consistency (once an application performs an operation the results of that operation are visible to it in every subsequent operation), Isolation (an incomplete operation by one user does not cause unexpected side effects for other users), and Durability (once an operation is complete it will be preserved even in the face of machine or system failure).

      ACID definition

    7. Currently this may be possible using version management of object store, but that as we saw earlier is at a lower layer of physical detail which may not be useful at higher, logical level.

      Change management issue of disaggregated models

    8. Traditionally Data Warehouse tools were used to drive business intelligence from data. Industry then recognized that Data Warehouses limit the potential of intelligence by enforcing schema on write. It was clear that all the dimensions of data-set being collected could not be thought of at the time of data collection.

      Data Warehouses were later replaced with Data Lakes to handle the growing volumes of big data

    9. As explained above, users are no longer willing to consider inefficiencies of underlying platforms. For example, data lakes are now also expected to be ACID compliant, so that the end user doesn’t have the additional overhead of ensuring data related guarantees.

      SQL Interface issue of disaggregated models

    10. Commonly used Storage platforms are object storage platforms like AWS S3, Azure Blob Storage, GCS, Ceph, MinIO among others. While analytics platforms vary from simple Python & R based notebooks to Tensorflow to Spark, Presto to Splunk, Vertica and others.

      Commonly used storage platforms:

      • AWS S3
      • Azure Blob Storage
      • GCS
      • Ceph
      • MinIO

      Commonly used analytics platforms:

      • Python & R based notebooks
      • TensorFlow
      • Spark
      • Presto
      • Splunk
      • Vertica
    11. Data Lakes that are optimized for unstructured and semi-structured data, can scale to PetaBytes easily and allowed better integration of a wide range of tools to help businesses get the most out of their data.

      Data Lake definition / what it offers us:

      • support for unstructured and semi-structured data.
      • scalability to PetaBytes and higher
      • SQL like interface to interact with the stored data
      • ability to connect various analytics tools as seamlessly as possible
      • modern data lakes are generally a combination of decoupled storage and analytics tools
    1. We save all of this code, the ui object, the server function, and the call to the shinyApp function, in an R script called app.R

      The same basic structure for all Shiny apps:

      1. ui object.
      2. server function.
      3. call to the shinyApp function.

      ---> examples <---

    2. server

      Server example of a Shiny app (check the code below):

      • random distribution is plotted as a histogram with the requested number of bins
      • code that generates the plot is wrapped in a call to renderPlot
    3. I want to get the selected number of bins from the slider and pass that number into a python method and do some calculation/manipulation (return: “You have selected 30bins and I came from a Python Function”) inside of it then return some value back to my R Shiny dashboard and view that result in a text field.

      Using Python scripts inside R Shiny (in 6 steps):

      1. In ui.R create a textOutput: textOutput("textOutput") (after plotOutput()).
      2. In server.R create a handler: output$textOutput <- renderText({ }).
      3. Create python_ref.py and insert this code:
      4. Import the reticulate library: library(reticulate).
      5. The source_python() function will make Python available in R:
      6. Make sure you have these files in your directory:
      • app.R
      • python_ref.py and that you've imported the reticulate package into the R environment and sourced the script inside your R code.

      Hit run.

    4. Currently Shiny is far more mature than Dash. Dash doesn’t have a proper layout tool yet, and also not build in theme, so if you are not familiar with Html and CSS, your application will not look good (You must have some level of web development knowledge). Also, developing new components will need ReactJS knowledge, which has a steep learning curve.

      Shiny > Dash:

      • Dash isn't as mature yet
      • Shiny has many more layout options, whereas in Dash you need to use HTML and CSS
      • developing new components in Dash needs ReactJS knowledge (not so easy)
    5. You can host standalone apps on a webpage or embed them in R Markdown documents or build dashboards. You can also extend your Shiny apps with CSS themes, Html widgets, and JavaScript actions.

      Typical tools used for working with Shiny

    1. Vaex supports Just-In-Time compilation via Numba (using LLVM) or Pythran (acceleration via C++), giving better performance. If you happen to have a NVIDIA graphics card, you can use CUDA via the jit_cuda method to get even faster performance.

      Tools supported by Vaex

    2. displaying a Vaex DataFrame or column requires only the first and last 5 rows to be read from disk

      Vaex tries to go over the entire dataset with as few passes as possible

    3. Why is it so fast? When you open a memory mapped file with Vaex, there is actually no data reading going on. Vaex only reads the file metadata

      Vaex only reads the file metadata:

      • location of the data on disk
      • data structure (number of rows, columns...)
      • file description
      • and so on...
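
      A minimal sketch of what this looks like in practice (the HDF5 file name is hypothetical):

      import vaex

      # Opening memory-maps the file; only the metadata is read at this point
      df = vaex.open('big_file.hdf5')

      # Displaying the DataFrame needs only the first and last 5 rows from disk
      print(df)
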
    4. When filtering a Vaex DataFrame no copies of the data are made. Instead only a reference to the original object is created, on which a binary mask is applied

      Filtering Vaex DataFrame works on reference to the original data, saving lots of RAM

    5. Vaex is an open-source DataFrame library which enables the visualisation, exploration, analysis and even machine learning on tabular datasets that are as large as your hard-drive. To do this, Vaex employs concepts such as memory mapping, efficient out-of-core algorithms and lazy evaluations.

      Vaex - library to manage as large datasets as your HDD, thanks to:

      • memory mapping
      • efficient out-of-core algorithms
      • lazy evaluations.

      All wrapped in a Pandas-like API

    6. The describe method nicely illustrates the power and efficiency of Vaex: all of these statistics were computed in under 3 minutes on my MacBook Pro (15", 2018, 2.6GHz Intel Core i7, 32GB RAM). Other libraries or methods would require either distributed computing or a cloud instance with over 100GB to preform the same computations.

      Possibilities of Vaex

    7. AWS offers instances with Terabytes of RAM. In this case you still have to manage cloud data buckets, wait for data transfer from bucket to instance every time the instance starts, handle compliance issues that come with putting data on the cloud, and deal with all the inconvenience that come with working on a remote machine. Not to mention the costs, which although start low, tend to pile up as time goes on.

      AWS as a solution to analyse data too big for RAM (like 30-50 GB range). In this case, it's still uncomfortable:

      • managing cloud data buckets
      • waiting for data transfer from bucket to instance every time the instance starts
      • handling compliance issues that come with putting data on the cloud
      • dealing with remote machines
      • costs
    1. git config --global alias.s status

      Replace git status with git s:

      git config --global alias.s status
      

      It will add the alias to the config in your .gitconfig file.

      Other set of useful aliases:

      [alias]
        s = status
        d = diff
        co = checkout
        br = branch
        last = log -1 HEAD
        cane = commit --amend --no-edit
        lo = log --oneline -n 10
        pr = pull --rebase
      

      You can apply them (^) with:

      git config --global alias.s status
      git config --global alias.d diff
      git config --global alias.co checkout
      git config --global alias.br branch
      git config --global alias.last "log -1 HEAD"
      git config --global alias.cane "commit --amend --no-edit"
      git config --global alias.pr "pull --rebase"
      git config --global alias.lo "log --oneline -n 10"
      
    1. The best commit messages I’ve seen don’t just explain what they’ve changed: they explain why

      Proper commits:

      • explains the reason for the change
      • is searchable (contains the error message)
      • tells a story (explains investigation process)
      • makes everyone a little smarter
      • builds compassion and trust (adds an extra bit of human context)
    1. If you use practices like pair or mob programming, don't forget to add your coworkers names in your commit messages

      It's good to give a shout-out to developers who collaborated on the commit. For example:

      $ git commit -m "Refactor usability tests.
      >
      >
      Co-authored-by: name <name@example.com>
      Co-authored-by: another-name <another-name@example.com>"
      
    2. I'm fond of gitmoji commit convention. It lies on categorizing commits using emojies. I'm a visual person so it fits well to me but I understand this convention is not made for everyone.

      You can add gitmojis (emojis) in your commits, such as:

      :recycle: Make core independent from the git client (#171)
      :whale: Upgrade Docker image version (#167)
      

      which will transfer on GitHub/GitLab to:

      ♻️ Make core independent from the git client (#171)
      🐳 Upgrade Docker image version (#167)
      
    3. Separate subject from body with a blank line
      • Limit the subject line to 50 characters
      • Capitalize the subject line
      • Do not end the subject line with a period
      • Use the imperative mood in the subject line
      • Wrap the body at 72 characters
      • Use the body to explain what and why vs. how

      7 rules of good commit messages.

      >more info<

    1. Don't commit directly to the master or development branches.
      • Don't hold up work by not committing local branch changes to remote branches.
      • Never commit application secrets in public repositories.
      • Don't commit large files in the repository; this will increase the size of the repository. Use Git LFS for large files. Learn more about what Git LFS is and how to utilize it in this advanced Learning Git with GitKraken tutorial.
      • Don't create one pull request addressing multiple issues.
      • Don't work on multiple issues in the same branch. If a feature is dropped, it will be difficult to revert changes.
      • Don't reset a branch without committing/stashing your changes. If you do so, your changes will be lost.
      • Don't do a force push until you're extremely comfortable performing this action.
      • Don't modify or delete public history.

      Git Don'ts

    2. Create a Git repository for every new project. Learn more about what a Git repo is in this beginner Learning Git with GitKraken tutorial.
      • Always create a new branch for every new feature and bug.
      • Regularly commit and push changes to the remote branch to avoid loss of work.
      • Include a gitignore file in your project to avoid unwanted files being committed.
      • Always commit changes with a concise and useful commit message.
      • Utilize git-submodule for large projects.
      • Keep your branch up to date with development branches.
      • Follow a workflow like Gitflow. There are many workflows available, so choose the one that best suits your needs.
      • Always create a pull request for merging changes from one branch to another. Learn more about what a pull request is and how to create them in this intermediate Learning Git with GitKraken tutorial.
      • Always create one pull request addressing one issue.
      • Always review your code once by yourself before creating a pull request.
      • Have more than one person review a pull request. It's not necessary, but it is a best practice.
      • Enforce standards by using pull request templates and adding continuous integrations. Learn more about enhancing the pull request process with templates.
      • Merge changes from the release branch to master after each release.
      • Tag the master sources after every release.
      • Delete branches if a feature or bug fix is merged to its intended branches and the branch is no longer required.
      • Automate general workflow checks using Git hooks. Learn more about how to trigger Git hooks in this intermediate Learning Git with GitKraken tutorial.
      • Include read/write permission access control to repositories to prevent unauthorized access.
      • Add protection for special branches like master and development to safeguard against accidental deletion.

      Git Dos

    1. To add the .gitattributes to the repo first you need to create a file called .gitattributes into the root folder for the repo.

      With the following content in .gitattributes:

      *.js    eol=lf
      *.jsx   eol=lf
      *.json  eol=lf
      

      the end of line will be the same for everyone

    2. On the Windows machine the default for the line ending is a Carriage Return Line Feed (CRLF), whereas on Linux/MacOS it's a Line Feed (LF).

      That is why you might want to use .gitattributes to prevent such differences.

      On a Windows machine, if the endOfLine property is set to lf (e.g. in the Prettier config):

      {
        "endOfLine": "lf"
      }
      

      On the Windows machine the developer will encounter linting issues from prettier:

    3. The above commands will now update the files for the repo using the newly defined line ending as specified in the .gitattributes.

      Use these lines to update the current repo files:

      git rm --cached -r .
      git reset --hard
      
    1. I feel great that all of my posts are now safely saved in version control and markdown. It’s a relief for me to know that they’re no longer an HTML mess inside of a MySQL database, but markdown files which are easy to read, write, edit, share, and backup.

      Good feeling of switching to GatsbyJS

    2. However, I realized that a static site generator like Gatsby utilizes the power of code/data splitting, pre-loading, pre-caching, image optimization, and all sorts of performance enhancements that would be difficult or impossible to do with straight HTML.

      Benefits of mixing HTML/CSS with some JavaScript (GatsbyJS):

      • code/data splitting
      • pre-loading
      • pre-caching
      • image optimisation
      • performance enhancements impossible with HTML
    3. A few things I really like about Gatsby

      Main benefits of GatsbyJS:

      • No page reloads
      • Image optimisation
      • Pre-fetch resources
      • Bundling and minification
      • Server-side rendered, at build time
      • Articles are saved in beautiful Markdown
      • With Netlify, your site automatically updates when you push to the repo
    4. I had over 100 guides and tutorials to migrate, and in the end I was able to move everything in 10 days, so it was far from the end of the world.

      If you're smart, you can move from WordPress to GatsbyJS in ~ 10 days

    5. There is a good amount of prerequisite knowledge required to set up a Gatsby site - HTML, CSS, JavaScript, ES6, Node.js development environment, React, and GraphQL are the major ones.

      There's a bit of technologies to be familiar with before setting up a GatsbyJS blog:

      • HTML
      • CSS
      • JavaScript
      • ES6
      • Node.js
      • React
      • GraphQL

      but you can be fine with the Gatsby Getting Started Tutorial

    1. Gatsby is a React based framework which utilises the powers of Webpack and GraphQL to bundle real React components into static HTML, CSS and JS files. Gatsby can be plugged into and used straight away with any data source you have available, whether that is your own API, Database or CMS backend (Spoiler Alert!).

      Good GatsbyJS explanation in a single paragraph

    1. Using either SRS has already given you a huge edge over not using any SRS:
      • No SRS: 70 hours
      • Anki: 10 hours
      • SuperMemo: 6 hours
      The difference between using any SRS (whether it's Anki or SM) and not using one is huge, but the difference between Anki and SM is not

      It doesn't matter that much which SRS you're using. What matters most is to use one of them at all

    1. And for the last three years, I've added EVERYTHING to Anki. Bash aliases, IDE Shortcuts, programming APIs, documentation, design patterns, etc. Having done that, I wouldn't recommend adding EVERYTHING

      Put just the relevant information into Anki

    2. Kyle had a super hero ability. Photographic memory in API syntax and documentation. I wanted that and I was jealous. My career was stuck and something needed to change. And so I began a dedicated journey into spaced repetition. Every day for three years, I spent one to three hours in spaced repetition

      Spaced repetition as a tool for photographic memory in API syntax and documentation

    1. First up, regular citizens who download copyrighted content from illegal sources will not be criminalized. This means that those who obtain copies of the latest movies from the Internet, for example, will be able to continue doing so without fear of reprisals. Uploading has always been outlawed and that aspect has not changed.

      In Switzerland you will be able to download, but not upload, pirated content

    1. Several cryptocurrencies use DAGs rather than blockchain data structures in order to process and validate transactions.

      DAG vs Blockchain:

      • DAG transactions are linked to each other rather than grouped into blocks
      • DAG transactions can be processed simultaneously with others
      • DAG results in a lessened bottleneck on transaction throughput; in a blockchain, throughput is limited by, for example, how many transactions fit in a single block
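
      A toy sketch of the structural difference (purely illustrative, not modelled on any specific protocol):

      # Blockchain: transactions are grouped into blocks; each block points to one parent block
      blockchain = [
          {"block": 0, "parent": None, "txs": ["tx1", "tx2"]},
          {"block": 1, "parent": 0,    "txs": ["tx3", "tx4"]},
      ]

      # DAG: each transaction links directly to one or more earlier transactions
      dag = {
          "tx1": [],               # genesis transaction
          "tx2": ["tx1"],
          "tx3": ["tx1"],          # tx2 and tx3 can be validated in parallel
          "tx4": ["tx2", "tx3"],
      }
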
    1. In multi-class model, we can plot N number of AUC ROC Curves for N number classes using One vs ALL methodology. So for Example, If you have three classes named X, Y and Z, you will have one ROC for X classified against Y and Z, another ROC for Y classified against X and Z, and a third one of Z classified against Y and X.

      Using AUC ROC curve for multi-class model
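
      A hedged sketch of the one-vs-all computation with scikit-learn (the library and the iris data are my assumptions, not from the source):

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_curve, auc
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import label_binarize

      X, y = load_iris(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
      y_score = clf.predict_proba(X_test)                  # shape (n_samples, 3)
      y_bin = label_binarize(y_test, classes=[0, 1, 2])    # one column per class

      # One ROC curve per class: that class against all the others
      for i in range(3):
          fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
          print("class", i, "AUC:", auc(fpr, tpr))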

    1. LR is nothing but the binomial regression with logit link (or probit), one of the numerous GLM cases. As a regression - itself it doesn't classify anything, it models the conditional (to linear predictor) expected value of the Bernoulli/binomially distributed DV.

      Logistic Regression - the ultimate definition (it's not a classification algorithm!)

      It's used for classification only when we apply a threshold (e.g. 50%) to the predicted probability.
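
      A small illustration of that distinction (scikit-learn and the toy data are my own assumptions; the 50% threshold is applied explicitly):

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
      y = np.array([0, 0, 0, 1, 1, 1])

      model = LogisticRegression().fit(X, y)

      # The regression itself only models P(y = 1 | x)...
      proba = model.predict_proba([[2.0]])[0, 1]

      # ...classification happens only once we apply a threshold (here 50%)
      label = int(proba >= 0.5)
      print(proba, label)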

    1. BOW is often used for Natural Language Processing (NLP) tasks like Text Classification. Its strengths lie in its simplicity: it’s inexpensive to compute, and sometimes simpler is better when positioning or contextual info aren’t relevant

      Usefulness of BOW:

      • simplicity
      • low on computing requirements
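
      A minimal BOW sketch (scikit-learn's CountVectorizer is my choice here; the two sentences are made up):

      from sklearn.feature_extraction.text import CountVectorizer

      docs = ["the cat sat on the mat", "the dog ate the cat"]

      vectorizer = CountVectorizer()
      bow = vectorizer.fit_transform(docs)       # sparse matrix of word counts

      print(vectorizer.get_feature_names_out())  # the vocabulary
      print(bow.toarray())                       # counts only; word order is gone
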
    2. Notice that we lose contextual information, e.g. where in the document the word appeared, when we use BOW. It’s like a literal bag-of-words: it only tells you what words occur in the document, not where they occurred

      The analogy behind using bag term in the bag-of-words (BOW) model.

    1. Softmax turns arbitrary real values into probabilities

      Softmax function -

      • outputs of the function are in range [0,1] and add up to 1. Hence, they form a probability distribution
      • the calculation involves e (the mathematical constant) and operates on n numbers: $$s(x_i) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$$
      • the bigger the value, the higher its probability
      • lets us answer classification questions with probabilities, which are more useful than simpler answers (e.g. binary yes/no)
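
      A direct translation of the formula above into plain Python (same style as the other notation examples; the input values are my own):

      import math

      def softmax(xs):
          exps = [math.exp(x) for x in xs]    # e^(x_i)
          total = sum(exps)                   # sum over j of e^(x_j)
          return [e / total for e in exps]

      probs = softmax([-1.0, 0.0, 3.0, 5.0])
      print(probs)       # each value is in [0, 1]
      print(sum(probs))  # ~1.0, so they form a probability distribution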
    1. 1. Logistic regression IS a binomial regression (with logit link), a special case of the Generalized Linear Model. It doesn't classify anything *unless a threshold for the probability is set*. Classification is just its application. 2. Stepwise regression is by no means a regression. It's a (flawed) method of variable selection. 3. OLS is a method of estimation (among others: GLS, TLS, (RE)ML, PQL, etc.), NOT a regression. 4. Ridge, LASSO - it's a method of regularization, NOT a regression. 5. There are tens of models for the regression analysis. You mention mainly linear and logistic - it's just the GLM! Learn the others too (link in a comment). STOP with the "17 types of regression every DS should know". BTW, there're 270+ statistical tests. Not just t, chi2 & Wilcoxon

      5 clarifications to common misconceptions shared over data science cheatsheets on LinkedIn

    1. 400 sized probability sample (a small random sample from the whole population) is often better than a millions sized administrative sample (of the kind you can download from gov sites). The reason is that an arbitrary sample (as opposed to a random one) is very likely to be biased, and, if large enough, a confidence interval (which actually doesn't really make sense except for probability samples) will be so narrow that, because of the bias, it will actually rarely, if ever, include the true value we are trying to estimate. On the other hand, the small, random sample will be very likely to include the true value in its (wider) confidence interval

      Summary of Lecture 01 (Data Science Lifecycle, Study Design) - Data 100 Su19

    1. Here’s a very simple example of how a VQA system might answer the question “what color is the triangle?”
      1. Look for shapes and colours using CNN.
      2. Understand the question type with NLP.
      3. Determine strength for each possible answer.
      4. Convert each answer strength to % probability
    1. Script mode takes a function/class, reinterprets the Python code and directly outputs the TorchScript IR. This allows it to support arbitrary code, however it essentially needs to reinterpret Python

      Script mode in PyTorch
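
      A small hedged example of script mode (assuming a recent PyTorch; the function itself is made up):

      import torch

      @torch.jit.script
      def relu_sum(x: torch.Tensor) -> torch.Tensor:
          # Python control flow is re-interpreted and preserved in the IR
          if x.numel() == 0:
              return torch.zeros(1)
          return torch.relu(x).sum()

      print(relu_sum.graph)  # inspect the generated TorchScript IR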

    2. In 2019, the war for ML frameworks has two remaining main contenders: PyTorch and TensorFlow. My analysis suggests that researchers are abandoning TensorFlow and flocking to PyTorch in droves. Meanwhile in industry, Tensorflow is currently the platform of choice, but that may not be true for long
      • in research: PyTorch > TensorFlow
      • in industry: TensorFlow > PyTorch
    3. Why do researchers love PyTorch?
      • simplicity <--- Pythonic, integrates easily with the Python ecosystem
      • great API <--- TensorFlow used to switch API many times
      • performance <--- it's not so clear if it's faster than TensorFlow
    4. Researchers care about how fast they can iterate on their research, which is typically on relatively small datasets (datasets that can fit on one machine) and run on <8 GPUs. This is not typically gated heavily by performance considerations, but by their ability to quickly implement new ideas. On the other hand, industry considers performance to be of the utmost priority. While 10% faster runtime means nothing to a researcher, that could directly translate to millions of savings for a company

      Researchers value how fast they can implement tools on their research.

      Industry values performance, as it directly translates to money.

    1. For the application of machine learning in finance, it’s still very early days. Some of the stuff people have been doing in finance for a long time is simple machine learning, and some people were using neural networks back in the 80s and 90s.   But now we have a lot more data and a lot more computing power, so with our creativity in machine learning research, “We are so much in the beginning that we can’t even picture where we’re going to be 20 years from now”

      We are just in time to apply modern ML techniques to the financial industry

    2. ability to learn from data e.g. OpenAI and the Rubik’s Cube and DeepMind with AlphaGo required the equivalent of thousands of years of gameplay to achieve those milestones

      Even with a near-perfect algorithm, we have to expect very long training times

    3. Pedro’s book “The Master Algorithm” takes readers on a journey through the five dominant paradigms of machine learning research on a quest for the master  algorithm. Along the way, Pedro wanted to abstract away from the mechanics so that a broad audience, from the CXO to the consumer, can understand how machine learning is shaping our lives

      "The Master Algorithm" book seems to be too abstract in such a case; however, it covers the following 5 paradigms:

      • Rule based learning (Decision trees, Random Forests, etc)
      • Connectionism (neural networks, etc)
      • Bayesian (Naive Bayes, Bayesian Networks, Probabilistic Graphical Models)
      • Analogy (KNN & SVMs)
      • Unsupervised Learning (Clustering, dimensionality reduction, etc)
    1. team began its analysis on YouTube 8M, a publicly available dataset of YouTube videos

      YouTube 8M - public dataset of YouTube videos. With this, we can analyse video features like:

      • color
      • illumination
      • many types of faces
      • thousands of objects
      • several landscapes
    2. The trailer release for a new movie is a highly anticipated event that can help predict future success, so it behooves the business to ensure the trailer is hitting the right notes with moviegoers. To achieve this goal, the 20th Century Fox data science team partnered with Google’s Advanced Solutions Lab to create Merlin Video, a computer vision tool that learns dense representations of movie trailers to help predict a specific trailer’s future moviegoing audience

      Merlin Video - computer vision tool to help predict a specific trailer's moviegoing audience

    3. pipeline also includes a distance-based “collaborative filtering” (CF) model and a logistic regression layer that combines all the model outputs together to produce the movie attendance probability

      Other elements of the pipeline:

      • collaborative filtering (CF) model
      • logistic regression layer