66 Matching Annotations
 Jan 2020

www.sthda.com www.sthda.com

Logistic regression assumptions


www.sthda.com www.sthda.com

Make sure that the predictor variables are normally distributed. If not, you can use log, root, or Box-Cox transformation.

You can also fit generalized additive models (Chapter @ref(polynomialandsplineregression)), when linearity of the predictor cannot be assumed. This can be done using the mgcv package:

For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor.

If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) versus x is absent (x = 0).
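
A minimal sketch of the coefficient-to-odds-ratio relationship described above, assuming a hypothetical coefficient `b1 = log(2)` and hypothetical baseline odds of 0.25; the function names are illustrative and not taken from the source.

```python
import math

def odds_ratio(beta):
    """Convert a logistic-regression coefficient (a log odds ratio) to an odds ratio."""
    return math.exp(beta)

def probability_from_odds(odds):
    """Convert odds to a probability."""
    return odds / (1.0 + odds)

# Hypothetical coefficient, chosen so that the odds ratio is exactly 2.
b1 = math.log(2)
or_x1 = odds_ratio(b1)

# Hypothetical odds of the event when x = 0 (a probability of 0.2).
baseline_odds = 0.25
# When x = 1 the odds are multiplied by the odds ratio.
odds_with_x = baseline_odds * or_x1
```

Note that doubling the odds does not double the probability: here the probability moves from 0.2 to about 0.33.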


online.stat.psu.edu online.stat.psu.edu

An outlier is a data point whose response y does not follow the general trend of the rest of the data. A data point has high leverage if it has "extreme" predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be "unusual" combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).


jbhender.github.io jbhender.github.io

As shown in the Residuals vs Fitted plot, there is a megaphone shape, which indicates that nonconstant variance is likely to be an issue.



Add this code snippet to the bottom of the backend/settings.py file:
Need to add more ports: https://github.com/adamchainz/django-cors-headers/issues/403

 Dec 2019

www.interviewcake.com www.interviewcake.com

So what's our total time cost? $O(n\log_{2}{n})$. The $\log_{2}{n}$ comes from the number of times we have to cut $n$ in half to get down to sublists of just 1 element (our base case). The additional $n$ comes from the time cost of merging all $n$ items together each time we merge two sorted sublists.
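
A short merge sort sketch to go with the cost argument above. It is a generic textbook version, not the code from the annotated page: the recursion halves the list about log2(n) times, and each level merges all n items once.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list in linear time."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One of the two lists is exhausted; append the remainder of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def merge_sort(items):
    """Sort a list by recursively halving it and merging the sorted halves."""
    if len(items) <= 1:  # base case: sublists of 0 or 1 element are sorted
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```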

 Aug 2019

www.pysurvival.io www.pysurvival.io

so that instead of predicting the time of event, we are predicting the probability that an event happens at a particular time.

 Jul 2019

archimede.mat.ulaval.ca archimede.mat.ulaval.ca

In practice, we found that it is not appropriate to use Aalen’s additive hazards model for all datasets, because when we estimate cumulative regression functions B(t), they are restricted to the time interval where X (X has been defined in Chapter 3) is of full rank, that means X'X is invertible. Sometimes we found that X is not of full rank, which was not a problem with the Cox model.

An overall conclusion is that the two models give different pieces of information and should not be viewed as alternatives to each other, but as complementary methods that may be used together to give a fuller and more comprehensive understanding of data.

The effect of the covariates on survival is to act multiplicatively on some unknown baseline hazard rate, which makes it difficult to model covariate effects that change over time. Secondly, if covariates are deleted from a model or measured with a different level of precision, the proportional hazards assumption is no longer valid. These weaknesses in the Cox model have generated interest in alternative models. One such alternative model is Aalen’s (1989) additive model. This model assumes that covariates act in an additive manner on an unknown baseline hazard rate. The unknown risk coefficients are allowed to be functions of time, so that the effect of a covariate may vary over time.


www.sthda.com www.sthda.com

Note that three often-used transformations can be specified using the argument fun: “log”: log transformation of the survivor function; “event”: plots cumulative events (f(y) = 1 - y), also known as the cumulative incidence; “cumhaz”: plots the cumulative hazard function (f(y) = -log(y)).
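
A small sketch of the three transformations applied to a survival curve, using the standard definitions: cumulative events are 1 - S(t) and cumulative hazard is -log S(t). The function `transform_survival` and the survival values are hypothetical; this is not the plotting library's own code.

```python
import math

def transform_survival(surv, fun):
    """Apply one of the three named transformations to survival probabilities.

    surv: survival probabilities S(t), each in (0, 1].
    fun:  "log", "event" or "cumhaz".
    """
    if fun == "log":
        return [math.log(s) for s in surv]   # log of the survivor function
    if fun == "event":
        return [1.0 - s for s in surv]       # cumulative incidence, 1 - S(t)
    if fun == "cumhaz":
        return [-math.log(s) for s in surv]  # cumulative hazard, -log S(t)
    raise ValueError(f"unknown transformation: {fun!r}")

# Hypothetical survival probabilities at four successive time points.
survival = [1.0, 0.9, 0.75, 0.5]
cumulative_events = transform_survival(survival, "event")
cumulative_hazard = transform_survival(survival, "cumhaz")
```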

Note that the confidence limits are wide at the tail of the curves, making meaningful interpretations difficult. This can be explained by the fact that, in practice, there are usually patients who are lost to follow-up or alive at the end of follow-up. Thus, it may be sensible to shorten plots before the end of follow-up on the x-axis (Pocock et al, 2002).


fs.blog fs.blog

1. Direction Over Speed



summarizing and visualizing a complex data table in which individuals are described by several sets of variables (quantitative and /or qualitative) structured into groups.


www.sthda.com www.sthda.com

it acts as PCA for quantitative variables and as MCA for qualitative variables.


factominer.free.fr factominer.free.fr

MFA is a weighted PCA

Study the similarity between individuals with respect to the whole set of variables AND the relationships between variables


sebastianraschka.com sebastianraschka.com

in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance
Use standardization, not min-max scaling, for clustering and PCA.

As a rule of thumb I’d say: When in doubt, just standardize the data, it shouldn’t hurt.


www.csd.uwo.ca www.csd.uwo.ca

Implication means co-occurrence, not causality!

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
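
A sketch of how such a rule is usually scored, assuming the standard support and confidence definitions (the quoted slide does not spell them out). The transactions are made-up illustrative data. Both measures count co-occurrence only, which is why a high-confidence rule says nothing about causality.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

rule_support = support(transactions, {"milk", "diapers", "beer"})
rule_confidence = confidence(transactions, {"milk", "diapers"}, {"beer"})
```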


www.cs.ubc.ca www.cs.ubc.caL11.pdf2

Cluster features and only consider rules within clusters

Support Set Pruning


spark.apache.org spark.apache.org

Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items. Different from Apriori-like algorithms designed for the same purpose, the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without generating candidate sets explicitly, which are usually expensive to generate. After the second step, the frequent itemsets can be extracted from the FP-tree.
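
A sketch of the first step only: counting item frequencies, keeping the frequent items, and reordering each transaction by descending frequency so it is ready to be inserted into a tree. It does not build the FP-tree or mine itemsets. The helper names and the `min_count` threshold are illustrative assumptions, not Spark's API.

```python
from collections import Counter

def frequent_items(transactions, min_count):
    """Count in how many transactions each item occurs; keep the frequent ones."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_count}

def order_transaction(transaction, freq):
    """Drop infrequent items and sort the rest by descending frequency.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    kept = [item for item in set(transaction) if item in freq]
    return sorted(kept, key=lambda item: (-freq[item], item))

# Hypothetical transactions.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
freq = frequent_items(transactions, min_count=2)
ordered = [order_transaction(t, freq) for t in transactions]
```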


livebook.datascienceheroes.com livebook.datascienceheroes.com

2.1.8 Automatic data frame discretization


lifelines.readthedocs.io lifelines.readthedocs.io

Non-proportional hazards is a case of model misspecification.

The idea behind the model is that the log-hazard of an individual is a linear function of their static covariates and a population-level baseline hazard that changes over time.


livebook.datascienceheroes.com livebook.datascienceheroes.com

However, the gain ratio is the most important metric here, ranged from 0 to 1, with higher being better.

en: entropy measured in bits; mi: mutual information; ig: information gain; gr: gain ratio
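
A sketch of these metrics for a discrete feature and a discrete target, assuming the common textbook definitions: information gain is H(target) minus the conditional entropy given the feature, and gain ratio is information gain divided by the entropy of the feature. The annotated package may compute them differently in detail.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, target):
    """H(target) minus the expected entropy of target within each feature level."""
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    n = len(target)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional

def gain_ratio(feature, target):
    """Information gain normalised by the entropy of the feature itself."""
    split_info = entropy(feature)
    if split_info == 0:  # a constant feature carries no information
        return 0.0
    return information_gain(feature, target) / split_info
```

With this definition a feature that perfectly separates a binary target scores a gain ratio of 1, and a feature unrelated to the target scores 0.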



Balloon plot


rdrr.io rdrr.io

Feature predictive power will be calculated for all features contained in a dataset along with the outcome feature. Works for binary classification, multiclass classification and regression problems. Can also be used when exploring a feature of interest to determine correlations of independent features with the outcome feature. When the outcome feature is continuous (a regression problem), correlation calculations are performed. When the outcome feature is categorical (a classification problem), the Kolmogorov-Smirnov distance measure is used to determine predictive power. For multiclass classification outcomes, a one-vs-all approach is taken, which is then averaged to arrive at the mean KS distance measure. The predictive power is sensitive to the manner in which the data has been prepared and will differ if that preparation changes.


www.scholarpedia.org www.scholarpedia.org

Mutual information is one of many quantities that measure how much one random variable tells us about another. It is a dimensionless quantity with (generally) units of bits, and can be thought of as the reduction in uncertainty about one random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent.
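
A sketch computing mutual information in bits directly from a joint probability table, using the standard formula I(X;Y) = sum over (x, y) of p(x,y) log2( p(x,y) / (p(x) p(y)) ). The two example tables are made up to show the independent and fully dependent extremes.

```python
import math

def mutual_information(joint):
    """Mutual information I(X;Y) in bits.

    joint[i][j] is P(X = i, Y = j); the entries must sum to 1.
    """
    px = [sum(row) for row in joint]        # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]  # marginal distribution of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y exactly
```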


nbviewer.jupyter.org nbviewer.jupyter.org

Sidenote: Visually comparing estimated survival curves in order to assess whether there is a difference in survival between groups is usually not recommended, because it is highly subjective. Statistical tests such as the log-rank test are usually more appropriate.


pedroconcejero.wordpress.com pedroconcejero.wordpress.com

RF is now a standard to effectively analyze a large number of variables, of many different types, with no previous variable selection process. It is not parametric, and in particular for survival target it does not assume the proportional risks assumption.


www.cscu.cornell.edu www.cscu.cornell.edu

The survival function gives, for every time, the probability of surviving (or not experiencing the event) up to that time. The hazard function gives the potential that the event will occur, per time unit, given that an individual has survived up to the specified time.
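
A discrete-time sketch of how the two functions relate, assuming a small hypothetical life table. The hazard in each interval is events divided by the number still at risk, and survival is the running product of (1 - hazard), which is the product-limit idea behind the Kaplan-Meier estimator.

```python
def discrete_hazard(at_risk, events):
    """Hazard per interval: events divided by the number still at risk."""
    return [d / n for n, d in zip(at_risk, events)]

def survival_from_hazard(hazards):
    """Probability of surviving past each interval, as a running product."""
    survival = []
    s = 1.0
    for h in hazards:
        s *= 1.0 - h
        survival.append(s)
    return survival

# Hypothetical life table: subjects at risk at the start of each interval
# and events observed during it. The drop from 6 survivors to 5 at risk
# in the last interval reflects one censored subject.
at_risk = [10, 8, 5]
events = [2, 2, 1]

hazards = discrete_hazard(at_risk, events)
survival = survival_from_hazard(hazards)
```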


www.statisticshowto.datasciencecentral.com www.statisticshowto.datasciencecentral.com

Sturges’ rule works best for continuous data that is normally distributed and symmetrical.
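
A one-function sketch of the rule in its usual form, k = 1 + log2(n) rounded up to a whole number of bins. The annotated page may state it with a base-10 logarithm instead, which is the same formula up to rounding.

```python
import math

def sturges_bins(n):
    """Number of histogram bins suggested by Sturges' rule for n observations."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(math.log2(n)) + 1
```

The bin count grows only logarithmically, so for large or skewed samples the rule tends to give too few bins.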


towardsdatascience.com towardsdatascience.com

how the features are all on the same relative scale. The relative spaces between each feature’s values have been maintained.


www.kaggle.com www.kaggle.com

"in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data."


stats.stackexchange.com stats.stackexchange.com

The Freedman-Diaconis rule is very robust and works well in practice. The bin-width is set to $h = 2 \times \text{IQR} \times n^{-1/3}$. So the number of bins is $(\max - \min)/h$, where $n$ is the number of observations, max is the maximum value and min is the minimum value.
How to determine the number of bins to use in a histogram.
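
A sketch of the Freedman-Diaconis calculation with the standard library only. The bin width is 2 × IQR × n^(-1/3) and the bin count is (max - min) / h rounded up. The quartile method ("inclusive") is an assumption; other quartile conventions give slightly different widths.

```python
import math
import statistics

def freedman_diaconis_bins(values):
    """Return (bin width, number of bins) under the Freedman-Diaconis rule."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the rule is undefined for this sample")
    width = 2 * iqr * n ** (-1 / 3)
    n_bins = math.ceil((max(values) - min(values)) / width)
    return width, n_bins

# Hypothetical sample: the integers 1 to 100.
width, n_bins = freedman_diaconis_bins(list(range(1, 101)))
```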


scikit-learn.org scikit-learn.org

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values.
Binning
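
A minimal equal-width binning sketch in plain Python, to make the idea concrete. It is not the library's discretizer, which offers further strategies; the function name is illustrative.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: everything falls in the first bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```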

many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.


www.willmcginnis.com www.willmcginnis.com

we want to code categorical variables into numbers, but we are concerned about this dimensionality problem


machinelearningmastery.com machinelearningmastery.com

Ensemble Machine Learning Algorithms in Python with scikit-learn
Read on July 4, 2019


www.oreilly.com www.oreilly.com

Machine learning models are basically mathematical functions that represent the relationship between different aspects of data.


jmlr.csail.mit.edu jmlr.csail.mit.edu

Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time.
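
A toy sketch contrasting the two search strategies over the same domain. The `score` function, the two hyperparameters and their ranges are all hypothetical stand-ins for a real validation metric, so the sketch shows only the mechanics and makes no claim about which strategy wins.

```python
import itertools
import math
import random

def score(learning_rate, dropout):
    """Hypothetical validation score; it peaks at learning_rate=0.01, dropout=0.3."""
    return -((math.log10(learning_rate) + 2) ** 2 + (dropout - 0.3) ** 2)

def grid_search(learning_rates, dropouts):
    """Evaluate every combination on a fixed grid and return the best pair."""
    candidates = itertools.product(learning_rates, dropouts)
    return max(candidates, key=lambda params: score(*params))

def random_search(n_trials, seed=0):
    """Evaluate n_trials random points from the same domain and return the best pair."""
    rng = random.Random(seed)
    trials = [
        (10 ** rng.uniform(-4, 0), rng.uniform(0.0, 0.6))  # log-uniform learning rate
        for _ in range(n_trials)
    ]
    return max(trials, key=lambda params: score(*params))

best_grid = grid_search([0.0001, 0.01, 1.0], [0.0, 0.3, 0.6])
best_random = random_search(n_trials=9)
```

The grid spends its nine evaluations on only three distinct values per hyperparameter, while nine random trials try nine distinct values of each.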

 Jun 2019

towardsdatascience.com towardsdatascience.com

To interpret a model, we require the following insights: Features in the model which are most important. For any single prediction from a model, the effect of each feature in the data on that particular prediction. Effect of each feature over a large number of possible predictions.
Machine learning interpretability


christophm.github.io christophm.github.io

Instability means that it is difficult to trust the explanations, and you should be very critical.


imbalanced-learn.readthedocs.io imbalanced-learn.readthedocs.io

When dealing with a mix of continuous and categorical features, SMOTENC is the only method which can handle this case.
SMOTENC


imbalanced-learn.readthedocs.io imbalanced-learn.readthedocs.io

In addition, RandomOverSampler allows to sample heterogeneous data (e.g. containing some strings):
RandomOverSampler

The most naive strategy is to generate new samples by randomly sampling with replacement the current available samples.
Naive random oversampling
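
A plain-Python sketch of naive random oversampling: each minority class is topped up to the size of the largest class by drawing existing samples with replacement. This is an illustration of the idea, not the library's implementation; the function name and the data are hypothetical. Because samples are only copied, mixed numeric and string features pose no problem.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical heterogeneous data: one numeric and one string feature.
X = [[1.0, "a"], [2.0, "b"], [3.0, "a"], [4.0, "c"]]
y = [0, 0, 0, 1]
X_resampled, y_resampled = random_oversample(X, y)
```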


varsellcm.r-forge.r-project.org varsellcm.r-forge.r-project.org

missing values are managed, without any preprocessing, by the model used to cluster with the assumption that values are missing completely at random.
VarSelLCM
package



Success in a data science project comes not from access to any one exotic tool, but from having quantifiable goals, good methodology, cross-discipline interactions, and a repeatable workflow.



Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.

The cos2 values are used to estimate the quality of the representation. The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is to interpret these components). Variables that are close to the center of the plot are less important for the first components.

Taken together, the main purpose of principal component analysis is to: identify hidden patterns in a data set, reduce the dimensionality of the data by removing the noise and redundancy in the data, and identify correlated variables.

the amount of variance retained by each principal component is measured by the so-called eigenvalue.

These new variables correspond to a linear combination of the originals. The number of principal components is less than or equal to the number of original variables.

Principal component analysis is used to extract the important information from a multivariate data table and to express this information as a set of few new variables called principal components.


www.octoberraindrops.com www.octoberraindrops.com

Thus, when we say that PCA can reduce dimensionality, we mean that PCA can compute principal components and the user can choose the smallest number K of them that explain 0.95 of the variance. A subjectively satisfactory result would be when K is small relative to the original number of features D.
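
A sketch of choosing K from the eigenvalues, assuming each eigenvalue measures the variance retained by its component. The eigenvalues below are made up; in practice they would come from a fitted PCA.

```python
def smallest_k(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose eigenvalues reach the threshold.

    threshold is the required fraction of total variance, e.g. 0.95.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, eigenvalue in enumerate(ordered, start=1):
        running += eigenvalue
        if running / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues for D = 5 original features.
eigenvalues = [6.0, 2.0, 1.0, 0.7, 0.3]
k = smallest_k(eigenvalues)
```

Here the cumulative fractions are 0.6, 0.8, 0.9 and 0.97, so four of the five components are needed, which would not count as a satisfying reduction.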


sebastianraschka.com sebastianraschka.com

However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0-1 scale.
Use min-max scaling for image processing & neural networks.

The result of standardization (or Z-score normalization) is that the features will be rescaled so that they’ll have the properties of a standard normal distribution with $\mu = 0$ and $\sigma = 1$, where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean.
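
A side-by-side sketch of the two rescalings on one feature, using the usual formulas z = (x - mean) / standard deviation and x' = (x - min) / (max - min). The population standard deviation is an assumption, and the data values are made up.

```python
import statistics

def standardize(values):
    """Z-score normalization: the result has mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Min-max scaling: map the values linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

# Hypothetical feature with mean 5 and population standard deviation 2.
feature = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(feature)
scaled = min_max_scale(feature)
```

Both functions divide by zero on a constant feature, so a real pipeline would need to guard against that case.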


link.springer.com link.springer.com

Threshold values of 0.8-0.9 are recommended for well separated clusters; to allow for overlapping clusters, we chose a threshold of 0.6.

 Jan 2019

stackoverflow.com stackoverflow.com

Changing chunk background color in RMarkdown
Change the background colour of code chunks in Rmarkdown using CSS.
