227 Matching Annotations
  1. Dec 2020
    1. In addition, for music and movies, we also normalize the resulting scores (akin to "grading on a curve" in college), which prevents scores from clumping together.
    1. Stuart Ritchie [@StuartJRitchie] (2020) This encapsulates the problem nicely. Sure, there’s a paper. But actually read it & what do you find? p-values mostly juuuust under .05 (a red flag) and a sample size that’s FAR less than “25m”. If you think this is in any way compelling evidence, you’ve totally been sold a pup. Twitter. Retrieved from: https://twitter.com/StuartJRitchie/status/1305963050302877697

    1. Inferential statistics are the statistical procedures that are used to reach conclusions about associations between variables. They differ from descriptive statistics in that they are explicitly designed to test hypotheses.

      Inferential statistics, unlike descriptive statistics, are used specifically to test hypotheses.

  2. Nov 2020
  3. Oct 2020
    1. CDC reverses course on testing for asymptomatic people who had Covid-19 contact

      Take Away

      Transmission of viable SARS-CoV-2 RNA can occur even from an infected but asymptomatic individual. Some people never become symptomatic. That group usually becomes non-infectious after 14 days from initial infection. For persons displaying symptoms, the SARS-CoV-2 RNA can be detected for 1 to 2 days prior to symptomatology. (1)

      The Claim

      Asymptomatic people who had SARS-CoV-2 contact should be tested.

      The Evidence

      Yes, this is a reversal of August 2020 advice. What is the importance of asymptomatic testing?

      Studies show that asymptomatic individuals have infected others prior to displaying symptoms. (1)

      According to the CDC’s September 10th, 2020 update, approximately 40% of infected Americans are asymptomatic at time of testing. Those persons are still contagious and are estimated to have already transmitted the virus to some of their close contacts. (2)

      In a report appearing in the July 2020 Journal of Medical Virology, 15.6% of SARS-CoV-2 positive patients in China are asymptomatic at time of testing. (3)

      Asymptomatic infection also varies by age group as older persons often have more comorbidities causing them to be susceptible to displaying symptoms earlier. A larger percentage of children remain asymptomatic but are still able to transmit the virus to their contacts. (1) (3)

      Transmission modes

      Droplet transmission is the primary proven mode of transmission of the SARS-CoV-2 virus, although it is believed that touching a contaminated surface and then touching mucous membranes (for example, the mouth and nose) can also serve to transmit the virus. (1)

      It is still unclear how big or small a dose of exposure to viable viral particles is needed for transmission; more research is needed to elucidate this. (1)


      (1) https://www.who.int/news-room/commentaries/detail/transmission-of-sars-cov-2-implications-for-infection-prevention-precautions

      (2) https://www.cdc.gov/coronavirus/2019-ncov/hcp/planning-scenarios.html

      (3) He J, Guo Y, Mao R, Zhang J. Proportion of asymptomatic coronavirus disease 2019: A systematic review and meta-analysis. J Med Virol. 2020;1–11. https://doi.org/10.1002/jmv.26326

  4. Sep 2020
    1. The lowest value for false positive rate was 0.8%. Allow me to explain the impact of a false positive rate of 0.8% on Pillar 2. We return to our 10,000 people who’ve volunteered to get tested, and the expected ten with virus (0.1% prevalence or 1:1000) have been identified by the PCR test. But now we’ve to calculate how many false positives are to accompanying them. The shocking answer is 80. 80 is 0.8% of 10,000. That’s how many false positives you’d get every time you were to use a Pillar 2 test on a group of that size.

      Take Away: The exact frequency of false positive test results for COVID-19 is unknown. Real world data on COVID-19 testing suggests that rigorous testing regimes likely produce fewer than 1 in 10,000 (<0.01%) false positives, orders of magnitude below the frequency proposed here.

      The Claim: The reported numbers for new COVID-19 cases are overblown due to a false positive rate of 0.8%

      The Evidence: In this opinion article, the author correctly conveys the concern that for large testing strategies, case rates could become inflated if there is (a) a high false positive rate for the test and (b) there is a very low prevalence of the virus within the population. The false positive rate proposed by the author is 0.8%, based on the "lowest value" for similar tests given by a briefing to the UK's Scientific Advisory Group for Emergencies (1).

      In fact, the briefing states that, based on another analysis, among false positive rates for 43 external quality assessments, the interquartile range for false positive rate was 0.8-4.0%. The actual lowest value for false positive rate from this study was 0% (2).

      An upper limit for false positive rate can also be estimated from the number of tests conducted per confirmed COVID-19 case. Countries with low infection rates that have conducted widespread testing, such as Vietnam and New Zealand, have at multiple periods throughout the pandemic achieved over 10,000 tests per positive case (3). Even if every single positive were false, the false positive rate would be below 0.01%.

      The prevalence of the virus within a population being tested can affect the positive predictive value of a test, which is the likelihood that a positive result is due to a true infection. The author here assumes the current prevalence of COVID-19 in the UK is 1 in 1,000 and the expected rate of positive results is 0.1%. Data from the University of Oxford and the Global Change Data Lab show that the current (Sept. 22, 2020) share of daily COVID-19 tests that are positive in the UK is around 1.7% (4). Therefore, based on real world data, the probability that a patient who tests positive does have the disease is 99.4%.
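The arithmetic behind these figures can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, test sensitivity is taken as 100% (as in the article's "expected ten" example), and the 99.4% figure is assumed to come from comparing the 0.01% upper limit on false positives with the 1.7% positive share.

```python
def positive_predictive_value(prevalence, false_positive_rate, sensitivity=1.0):
    """Share of positive results that are true infections (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


def ppv_lower_bound(positive_share, false_positive_rate):
    """Lower bound on PPV when a known share of all tests comes back positive.

    False positives can make up at most `false_positive_rate` of all tests,
    so at most false_positive_rate / positive_share of the positives are false.
    """
    return 1 - false_positive_rate / positive_share


# The article's scenario: 10,000 tested, 0.1% prevalence, 0.8% false positive rate.
tested = 10_000
true_cases = tested * 0.001                      # 10 infected people
false_positives = (tested - true_cases) * 0.008  # about 80 false positives
article_ppv = positive_predictive_value(0.001, 0.008)

# The annotation's scenario: false positive rate at most 0.01%, 1.7% of tests positive.
observed_ppv = ppv_lower_bound(0.017, 0.0001)

print(f"false positives in article scenario: {false_positives:.1f}")
print(f"PPV in article scenario: {article_ppv:.1%}")
print(f"PPV lower bound from observed data: {observed_ppv:.1%}")
```

The gap between the two results comes almost entirely from the assumed false positive rate (0.8% versus 0.01%), which is the point of disagreement between the article and the annotation.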

      Sources: (1) https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/895843/S0519_Impact_of_false_positives_and_negatives.pdf

      (2) https://www.medrxiv.org/content/10.1101/2020.04.26.20080911v3.full.pdf+html

      (3) https://ourworldindata.org/coronavirus-data-explorer?yScale=log&zoomToSelection=true&country=USA~DEU~IND~ITA~AUS~VNM~FIN~NZL~GBR&region=World&testsPerCaseMetric=true&interval=smoothed&aligned=true&smoothing=7&pickerMetric=location&pickerSort=asc

      (4) https://ourworldindata.org/coronavirus-data-explorer?zoomToSelection=true&country=USA~DEU~IND~ITA~AUS~VNM~FIN~NZL~GBR&region=World&positiveTestRate=true&interval=smoothed&aligned=true&smoothing=7&pickerMetric=location&pickerSort=asc

    1. H not

      I'm sorry but this is kind of lazy from the author. Either write H0, \(H_0\) or H naught. H not sounds like you're saying H "not" (negation)

  5. Aug 2020
  6. Jul 2020
    1. Adjiwanou, V., Alam, N., Alkema, L., Asiki, G., Bawah, A., Béguy, D., Cetorelli, V., Dube, A., Feehan, D., Fisker, A. B., Gage, A., Garcia, J., Gerland, P., Guillot, M., Gupta, A., Haider, M. M., Helleringer, S., Jasseh, M., Kabudula, C., … You, D. (2020). Measuring excess mortality during the COVID-19 pandemic in low- and lower-middle income countries: The need for mobile phone surveys [Preprint]. SocArXiv. https://doi.org/10.31235/osf.io/4bu3q

  7. Jun 2020
    1. higher when Ericksen conflict was present (Figure 2A)

      Yeah, in single neurons you can show the detection of general conflict this way, and it was not partitionable into different responses...

    2. G)

      Very clear effect! Suspicious? How exactly did they even select the pseudo-populations? It's not clear to me from the methods.

    3. pseudotrial vector x

      one trial for all different neurons in the current pseudopopulation matrix?

    4. The separating hyperplane for each choice i is the vector (a) that satisfies: Meaning that βi is a vector orthogonal to the separating hyperplane in neuron-dimensional space, along which position is proportional to the log odds of that correct response: this is the coding dimension for that correct response

      Makes sense: if β is proportional to the log-odds of a correct response, a is the hyperplane that provides the best cutoff, which must be orthogonal to it. The dot product of two orthogonal vectors is 0.

    5. X is the trials by neurons pseudopopulation matrix of firing rates

      So these pseudopopulations were random agglomerates of single neurons that were recorded, so many fits for random groups, and the best were kept?

    6. Within each neuron, we calculated the expected firing rate for each task condition, marginalizing over distractors, and for each distractor, marginalizing over tasks.

      Distractor = specific stimulus / location (e.g. '1' or 'left')?

      Task = conflict condition (e.g. Simon or Ericksen)?

    7. condition-averaged within neurons (9 data points per neuron, reflecting all combinations of the 3 correct response, 3 Ericksen distractors, and 3 Simon distractors)

      How do all combinations of 3 responses lead to only 9 data points per neuron? 3x2x2 = 12.

  8. May 2020
    1. For comparisons between 3 or more groups that typically employ analysis of variance (ANOVA) methods, one can use the Cumming estimation plot, which can be considered a variant of the Gardner-Altman plot.

      Cumming estimation plot

    2. Efron developed the bias-corrected and accelerated bootstrap (BCa bootstrap) to account for the skew whilst obtaining the central 95% of the distribution.

      The bias-corrected and accelerated bootstrap (BCa bootstrap) deals with skewed sample distributions. However, it must be noted that it "may not give very accurate coverage in a small-sample non-parametric situation" (simply said, take caution with small datasets).

    3. We can calculate the 95% CI of the mean difference by performing bootstrap resampling.

      Bootstrap - simple but powerful technique that creates multiple resamples (with replacement) from a single set of observations, and computes the effect size of interest on each of these resamples. It can be used to determine the 95% CI (Confidence Interval).

      We can use bootstrap resampling to obtain a measure of precision and confidence about our estimate. It gives us 2 important benefits:

      1. Non-parametric statistical analysis - no need to assume a normal distribution of our observations. Thanks to the Central Limit Theorem, the resampling distribution of the effect size will approach normality
      2. Easy construction of the 95% CI from the resampling distribution. For 1000 bootstrap resamples of the mean difference, sorted in ascending order, the 25th and 975th values can be used as the boundaries of the 95% CI.

      Bootstrap resampling is cheap to compute: computers can easily perform 5000 resamples.
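The percentile bootstrap described in this note can be sketched with the standard library alone. The function name and the two small samples are hypothetical, made up for illustration; this is the plain percentile interval, not the BCa variant mentioned in the previous annotation.

```python
import random
import statistics


def bootstrap_mean_diff_ci(control, test, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(test) - mean(control)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original sizes.
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    # For 1000 resamples these are the 25th and 975th sorted values.
    lower_idx = max(round(n_resamples * alpha / 2) - 1, 0)
    upper_idx = min(round(n_resamples * (1 - alpha / 2)) - 1, n_resamples - 1)
    return diffs[lower_idx], diffs[upper_idx]


# Hypothetical measurements for two groups.
control = [10.1, 9.8, 10.4, 9.5, 10.0, 10.3, 9.9, 10.2]
test = [10.9, 11.2, 10.7, 11.5, 10.8, 11.1, 11.4, 10.6]

observed = statistics.fmean(test) - statistics.fmean(control)
low, high = bootstrap_mean_diff_ci(control, test)
print(f"mean difference = {observed:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With samples this small the interval should be read with caution, for the same reason the BCa note above warns about small datasets.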

  9. Apr 2020
    1. the limitations of the PPS

      Limitations of the PPS:

      1. Slower than correlation
      2. Score cannot be interpreted as easily as the correlation (it doesn't tell you anything about the type of relationship). PPS is better for finding patterns and correlation is better for communicating found linear relationships
      3. You cannot compare the scores for different target variables in a strict math way because they're calculated using different evaluation metrics
      4. There are some limitations of the components used underneath the hood
      5. When using the PPS for feature selection, you still have to perform forward and backward selection in addition
    2. How to use the PPS in your own (Python) project

      Using PPS with Python

      • Download ppscore: pip install ppscore
      • Calculate the PPS for a given pandas dataframe:
        import ppscore as pps
        pps.score(df, "feature_column", "target_column")
      • Calculate the whole PPS matrix:
        pps.matrix(df)
    3. The PPS clearly has some advantages over correlation for finding predictive patterns in the data. However, once the patterns are found, the correlation is still a great way of communicating found linear relationships.


      PPS:

      • good for finding predictive patterns
      • can be used for feature selection
      • can be used to detect information leakage between variables
      • interpret PPS matrix as a directed graph to find entity structures

      Correlation:

      • good for communicating found linear relationships
    4. Let’s compare the correlation matrix to the PPS matrix on the Titanic dataset.

      Comparing correlation matrix and the PPS matrix of the Titanic dataset:

      findings about the correlation matrix:

      1. Correlation matrix is smaller because it doesn't work for categorical data
      2. Correlation matrix shows a negative correlation between TicketPrice and Class. For PPS, TicketPrice is a strong predictor of Class (0.9 PPS), but not the other way around, from Class to TicketPrice (a ticket of $5000-10000 is most likely the highest class, but the highest class alone cannot determine the price)

      findings about the PPS matrix:

      1. First row of the matrix tells you that the best univariate predictor of the column Survived is the column Sex (Sex was dropped for correlation)
      2. TicketID uncovers a hidden pattern as well as its connection with the TicketPrice

    5. Let’s use a typical quadratic relationship: the feature x is a uniform variable ranging from -2 to 2 and the target y is the square of x plus some error.

      In this scenario:

      • we can predict y using x
      • we cannot predict x using y as x might be negative or positive (for y=4, x=2 or -2)
      • the correlation is 0, both from x to y and from y to x, because the correlation is symmetric (more often relationships are asymmetric!). However, the PPS from x to y is 0.88 (not 1 because of existing error)
      • PPS from y to x is 0 because there's no relationship that y can predict if it only knows its own value
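This scenario can be checked with a small simulation. It is a sketch under assumptions that are not in the article: the noise level (standard deviation 0.1), the sample size, and the use of the known rule y = x² in place of a cross-validated decision tree, so the resulting score will not match 0.88 exactly. The `score` helper is hypothetical and only mimics the idea of comparing a model against a naive median guess.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable
xs = [rng.uniform(-2, 2) for _ in range(5000)]
ys = [x * x + rng.gauss(0, 0.1) for x in xs]  # noise level is an assumption


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def mae(actual, predicted):
    """Mean absolute error between two equal-length lists."""
    return statistics.fmean([abs(a - p) for a, p in zip(actual, predicted)])


def score(actual, predicted):
    """1 - MAE_model / MAE_naive, where the naive model predicts the median."""
    naive = [statistics.median(actual)] * len(actual)
    return 1 - mae(actual, predicted) / mae(actual, naive)


r = pearson(xs, ys)                                   # close to 0
x_to_y = score(ys, [x * x for x in xs])               # high: x predicts y
y_to_x = score(xs, [max(y, 0.0) ** 0.5 for y in ys])  # about 0: sign of x is lost

print(f"correlation = {r:.3f}, x->y score = {x_to_y:.2f}, y->x score = {y_to_x:.2f}")
```

Guessing the positive root for x does no better than always guessing the median of x, which is why the score in that direction stays near 0.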

    6. how do you normalize a score? You define a lower and an upper limit and put the score into perspective.

      Normalising a score:

      • you need to put a lower and upper limit
      • upper limit can be F1 = 1, and a perfect MAE = 0
      • lower limit depends on the evaluation metric and your data set. It's the value that a naive predictor achieves
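
      A sketch of how such a normalisation could look, assuming a linear rescaling between the naive baseline and the perfect score (the symbols below are illustrative and not quoted from the article):

      $$\text{score}_{\text{MAE}} = 1 - \frac{\text{MAE}_{\text{model}}}{\text{MAE}_{\text{naive}}}, \qquad \text{score}_{F1} = \frac{F1_{\text{model}} - F1_{\text{naive}}}{1 - F1_{\text{naive}}}$$

      Both expressions are 0 when the model only matches the naive predictor and 1 when the model is perfect.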
    7. For a classification problem, always predicting the most common class is pretty naive. For a regression problem, always predicting the median value is pretty naive.

      What is a naive model:

      • predicting common class for a classification problem
      • predicting median value for a regression problem
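The two naive baselines can be written down directly. The helper names and the toy data below are hypothetical, chosen only to show what each baseline predicts and how well it does on its own data.

```python
from collections import Counter
import statistics


def naive_classifier(labels):
    """Always predict the most common class."""
    return Counter(labels).most_common(1)[0][0]


def naive_regressor(values):
    """Always predict the median value."""
    return statistics.median(values)


# Toy data, made up for illustration.
labels = ["no", "no", "yes", "no", "yes"]
values = [1.0, 2.0, 2.5, 4.0, 100.0]

baseline_class = naive_classifier(labels)
baseline_value = naive_regressor(values)

# How well the naive models do on the data they were built from.
baseline_accuracy = sum(label == baseline_class for label in labels) / len(labels)
baseline_mae = statistics.fmean([abs(v - baseline_value) for v in values])

print(baseline_class, baseline_value, baseline_accuracy, baseline_mae)
```

The median is a natural baseline for MAE because it is barely moved by the outlier (100.0) in the toy data.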
    8. Let’s say we have two columns and want to calculate the predictive power score of A predicting B. In this case, we treat B as our target variable and A as our (only) feature. We can now calculate a cross-validated Decision Tree and calculate a suitable evaluation metric.

      If the target (B) variable is:

      • numeric - we can use a Decision Tree Regressor and calculate the Mean Absolute Error (MAE)
      • categoric - we can use a Decision Tree Classifier and calculate the weighted F1 (or ROC)
    9. More often, relationships are asymmetric

      a column with 3 unique values will never be able to perfectly predict another column with 100 unique values. But the opposite might be true

    10. there are many non-linear relationships that the score simply won’t detect. For example, a sine wave, a quadratic curve or a mysterious step function. The score will just be 0, saying: “Nothing interesting here”. Also, correlation is only defined for numeric columns.


      • doesn't work with non-linear data
      • doesn't work for categorical values


    1. Suppose you have only two rolls of dice. Then your best strategy would be to take the first roll if its outcome is more than its expected value (i.e. 3.5) and to roll again if it is less.

      Expected payoff of a dice game:

      Description: You have the option to throw a die up to three times. You will earn the face value of the die. You have the option to stop after each throw and walk away with the money earned. The earnings are not additive. What is the expected payoff of this game?

      Rolling twice: $$\frac{1}{6}(6+5+4) + \frac{1}{2}3.5 = 4.25.$$

      Rolling three times: $$\frac{1}{6}(6+5) + \frac{2}{3}4.25 = 4 + \frac{2}{3}$$
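
The same backward reasoning can be checked with exact fractions. This is a minimal sketch; the function name is hypothetical, and it assumes the strategy from the quote: keep a roll only if it beats the expected value of continuing.

```python
from fractions import Fraction


def expected_payoff(rolls_left, sides=6):
    """Expected payoff with optimal stopping and `rolls_left` throws available."""
    if rolls_left == 1:
        # With a single throw we must accept whatever comes up.
        return Fraction(sides + 1, 2)
    continuation = expected_payoff(rolls_left - 1, sides)
    # Keep the face if it beats the value of throwing again, otherwise continue.
    total = sum(max(Fraction(face), continuation) for face in range(1, sides + 1))
    return total / sides


for rolls in (1, 2, 3):
    value = expected_payoff(rolls)
    print(rolls, value, float(value))
```

For two and three throws this reproduces the values 4.25 and 4 + 2/3 derived above.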

    1. Therefore, \(E_n = 2^{n+1} - 2 = 2(2^n - 1)\)

      Simplified formula for the expected number of tosses (e) to get n consecutive heads (n≥1):

      $$e_n = 2^{n+1} - 2$$

      For example, to get 5 consecutive heads, we have to toss the coin 62 times on average:

      $$e_5 = 2^{6} - 2 = 62$$

      We can also start with the longer analysis of the 6 scenarios:

      1. If we get a tail immediately (probability 1/2), then the expected number is e+1.
      2. If we get a head then a tail (probability 1/4), then the expected number is e+2.
      3. If we get two heads then a tail (probability 1/8), then the expected number is e+3.
      4. If we get three heads then a tail (probability 1/16), then the expected number is e+4.
      5. If we get four heads then a tail (probability 1/32), then the expected number is e+5.
      6. Finally, if our first 5 tosses are heads (probability 1/32), then the expected number is 5.



      We can also generalise the formula to:

      $$e_n=\frac{1}{2}(e_n+1)+\frac{1}{4}(e_n+2)+\frac{1}{8}(e_n+3)+\frac{1}{16}(e_n+4)+\cdots+\frac{1}{2^n}(e_n+n)+\frac{1}{2^n}(n)$$
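
The generalised equation is linear in the unknown, so it can be solved exactly and compared with the closed form. This is a small sketch with a hypothetical function name; it collects the terms containing the unknown on one side and divides.

```python
from fractions import Fraction


def expected_tosses(n):
    """Solve e = sum_{k=1..n} (e + k) / 2^k + n / 2^n for e."""
    # Move every term containing e to the left-hand side:
    #   e * (1 - sum_k 1/2^k) = sum_k k/2^k + n/2^n
    coefficient = 1 - sum(Fraction(1, 2**k) for k in range(1, n + 1))
    constant = sum(Fraction(k, 2**k) for k in range(1, n + 1)) + Fraction(n, 2**n)
    return constant / coefficient


for n in range(1, 7):
    print(n, expected_tosses(n), 2 ** (n + 1) - 2)
```

For n = 5 both columns give 62, matching the example above.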

    1. Repeated measures involves measuring the same cases multiple times. So, if you measured the chips, then did something to them, then measured them again, etc., it would be repeated measures. Replication involves running the same study on different subjects but under identical conditions. So, if you did the study on n chips, then did it again on another n chips, that would be replication.

      Difference between repeated measures and replication