Hypothesis

139 Matching Annotations

Apr 2024
lethain.com

Ex-technology companies.

1
1. pyxelr 02 Apr 2024
  
  in Public
  
  I recently chatted with a data science leader who described their company reaching this state. They couldn’t show any business impact from the past two years of their product releases, so the finance team identified a surefire way for R&D to make a business impact: laying off much of the R&D team.
  
  :D
  
  fun work DataScience

Tags

work

DataScience

fun

Annotators

pyxelr

URL

lethain.com/ex-technology-companies/
Mar 2023
medium.com

The best orchestration tool for MLOps: a real story about difficult choices

1
1. pyxelr 20 Mar 2023
  
  in Public
  
  To this day, the field of machine learning does not have a single generally accepted approach to solving problems in terms of practical use of models.
  
  Business ¯\_(ツ)_/¯
  
  DataScience MLOps

Tags

DataScience

MLOps

Annotators

pyxelr

URL

medium.com/exness-blog/the-best-orchestration-tool-for-mlops-a-real-story-about-difficult-choices-5ee6a087c9e3
Dec 2022
ryxcommar.com

Goodbye, Data Science

1
1. pyxelr 26 Dec 2022
  
  in Public
  
  The main reason I soured on data science is that the work felt like it didn’t matter, in multiple senses of the words “didn’t matter”:
  
  The main reasons why Data Science work feels pointless
  
  DataScience

Tags

DataScience

Annotators

pyxelr

URL

ryxcommar.com/2022/11/27/goodbye-data-science/
Aug 2022
metaphysic.ai

To Uncover a Deepfake Video Call, Ask the Caller to Turn Sideways - Metaphysic.ai

1
1. pyxelr 14 Aug 2022
  
  in Public
  
  To Uncover a Deepfake Video Call, Ask the Caller to Turn Sideways
  
  DataScience Deepfakes

Tags

Deepfakes

DataScience

Annotators

pyxelr

URL

metaphysic.ai/to-uncover-a-deepfake-video-call-ask-the-caller-to-turn-sideways/
May 2022
blog.gopheracademy.com

Go and Apache Arrow: building blocks for data science

1
1. SamRose 21 May 2022
  
  in Public
  
  golang datascience arrow

Tags

golang

arrow

datascience

Annotators

SamRose

URL

blog.gopheracademy.com/advent-2018/go-arrow/
Apr 2022
mayt.substack.com

GPT-3 can run code

1
1. pyxelr 05 Apr 2022
  
  in Public
  
  # Input
  Input: 123, Output:
  Input: 121, Output:
  Input: 111, Output:
  Input: 123454321, Output:
  Input 123123, Output:
  # Instruction
  Output true if input is a palindrome
  # Output
  Input: 123, Output: false
  Input: 121, Output: true
  Input: 111, Output: true
  Input: 123454321, Output: true
  Input 123123, Output: false
  
  Example of using GPT-3 for programming
  
  DataScience programming GPT-3
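For comparison, the instruction in the quoted prompt corresponds to a few lines of ordinary code. The following is a minimal sketch in Python; the function name `is_palindrome` is illustrative and does not come from the article.

```python
def is_palindrome(value) -> bool:
    """Return True if the string form of value reads the same in both directions."""
    text = str(value)
    return text == text[::-1]


# The five inputs from the quoted prompt
inputs = [123, 121, 111, 123454321, 123123]
results = {n: is_palindrome(n) for n in inputs}

for n, answer in results.items():
    print(f"Input: {n}, Output: {str(answer).lower()}")
```

The answers match the ones in the quoted output section, which is what makes the prompt a useful sanity check of the model.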

Tags

DataScience

programming

GPT-3

Annotators

pyxelr

URL

mayt.substack.com/p/gpt-3-can-run-code
Jan 2022
commoncog.com

Why Tacit Knowledge is More Important Than Deliberate Practice

1
1. pyxelr 01 Jan 2022
  
  in Public
  
  Gary Klein himself has made a name developing techniques for extracting pieces of tacit knowledge and making it explicit. (The technique is called ‘The Critical Decision Method’, but it is difficult to pull off because it demands expertise in CDM itself).
  
  AI can help with turning tacit knowledge into explicit knowledge
  
  learning ArtificialIntelligence DataScience

Tags

DataScience

ArtificialIntelligence

learning

Annotators

pyxelr

URL

commoncog.com/blog/tacit-knowledge-is-a-real-thing/
Nov 2021
buttondown.email

Interview with Kaggle Expert Mani Sarkar, PyDataGlobal starts Thursday

1
1. pyxelr 15 Nov 2021
  
  in Public
  
  There are a handful of tools that I used to use and now it’s narrowed down to just one or two: pandas-profiling and Dataiku for columnar or numeric data - here’s some getting started tips. I used to also load the data into bamboolib but the purpose of such a tool is different. For text data I have written my own profiler called nlp-profiler.
  
  Tools to help with data exploration:
  
  pandas-profiling
  
  Dataiku
  
  bamboolib
  
  nlp-profiler
  
  DataScience

Tags

DataScience

Annotators

pyxelr

URL

buttondown.email/ianozsvald/archive/interview-with-kaggle-expert-mani-sarkar/
Oct 2021
www.kaggle.com

State of Data Science and Machine Learning 2021

1
1. pyxelr 15 Oct 2021
  
  in Public
  
  State of Data Science and Machine Learning 2021
  
  Survey from Kaggle
  
  data DataScience Kaggle

Tags

DataScience

data

Kaggle

Annotators

pyxelr

URL

kaggle.com/kaggle-survey-2021
Sep 2021
mattturck.com

Red Hot: The 2021 Machine Learning, AI and Data (MAD) Landscape

1
1. pyxelr 29 Sep 2021
  
  in Public
  
  It’s been a hot, hot year in the world of data, machine learning and AI.
  
  Summary of data tools in October 2021: http://46eybw2v1nh52oe80d3bi91u-wpengine.netdna-ssl.com/wp-content/uploads/2021/09/ML-AI-Data-Landscape-2021.pdf
  
  DataScience MachineLearning MLOps

Tags

MachineLearning

DataScience

MLOps

Annotators

pyxelr

URL

mattturck.com/data2021/
Jul 2021
mtszkw.medium.com

Is MLOps Built Upon a Lie?

1
1. pyxelr 29 Jul 2021
  
  in Public
  
  Why do 87% of data science projects never make it into production?
  
  It turns out that this claim doesn't trace back to any existing research. If one goes down the rabbit hole, it all ends in dead links.
  
  MLOps DataScience R&D

Tags

R&D

DataScience

MLOps

Annotators

pyxelr

URL

mtszkw.medium.com/is-mlops-built-upon-a-lie-8282948b41ae
venthur.de

The State of Python Packaging in 2021 | Bastian Venthur's Blog

1
1. pyxelr 26 Jul 2021
  
  in Public
  
  The main selling point for Anaconda back then was that it provided pre-compiled binaries. This was especially useful for data-science related packages which depend on libatlas, -lapack, -openblas, etc. and need to be compiled for the target system.
  
  Reason why Anaconda got so popular
  
  Python Anaconda DataScience

Tags

Anaconda

Python

DataScience

Annotators

pyxelr

URL

venthur.de/2021-06-26-python-packaging.html
Jun 2021
towardsdatascience.com

Why You Should Consider Being a Data Engineer Instead of a Data Scientist.

1
1. elbaso 01 Jun 2021
  
  in Public
  
  Simply put, there can be no data science without data engineering. Data engineering is the foundation for a successful data-driven company.
  
  Why Data Engineering could be more important than Data Science
  
  datascience

Tags

datascience

Annotators

elbaso

URL

towardsdatascience.com/why-you-should-consider-being-a-data-engineer-instead-of-a-data-scientist-2cf4e19dc019
Mar 2021
antonz.org

SQLite is not a toy database

1
1. pyxelr 28 Mar 2021
  
  in Public
  
  The console is a killer SQLite feature for data analysis: more powerful than Excel and more simple than pandas. One can import CSV data with a single command, the table is created automatically
  
  SQLite makes it fairly easy to import and analyse data. For example:
  
  .import --csv city.csv city
  
  select count(*) from city;
  
  SQLite DataScience databases
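The import-and-count workflow above is a feature of the SQLite console. Outside the console, roughly the same steps can be sketched with Python's standard library. In this sketch the CSV rows and the `load_csv` helper are made up for illustration, and every column is created without a declared type, which is an assumption about how the table should look.

```python
import csv
import io
import sqlite3

# Stand-in for city.csv; these rows are made up for illustration.
CITY_CSV = """name,country,population
Vienna,Austria,1900000
Krakow,Poland,780000
Porto,Portugal,230000
"""


def load_csv(conn, table, csv_text):
    """Create `table` from the CSV header row and insert every data row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


conn = sqlite3.connect(":memory:")
load_csv(conn, "city", CITY_CSV)

row_count = conn.execute("SELECT count(*) FROM city").fetchone()[0]
print(row_count)
```

The console's single command does the table creation and the insert in one step; the sketch only makes those two steps visible.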

Tags

databases

DataScience

SQLite

Annotators

pyxelr

URL

antonz.org/sqlite-is-not-a-toy-database/
antonz.org

How to create a 1M record table with a single query

1
1. pyxelr 28 Mar 2021
  
  in Public
  
  This is not a problem if your DBMS supports SQL recursion: lots of data can be generated with a single query. The WITH RECURSIVE clause comes to the rescue.
  
  WITH RECURSIVE can help you quickly generate a series of random data.
  
  SQL DataScience databases
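A minimal sketch of the WITH RECURSIVE idea, run through Python's built-in sqlite3 module so it is self-contained. The table name `random_data`, the column names, and the 0 to 99 value range are assumptions made for this example, and the row count is kept at 1,000 rather than the 1M from the article title.

```python
import sqlite3

ROWS = 1000  # kept small for the sketch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE random_data (id INTEGER PRIMARY KEY, value INTEGER)")

# The recursive CTE counts from 1 to ROWS; each generated id gets a random value.
conn.execute(
    """
    WITH RECURSIVE seq(id) AS (
        SELECT 1
        UNION ALL
        SELECT id + 1 FROM seq WHERE id < ?
    )
    INSERT INTO random_data (id, value)
    SELECT id, abs(random() % 100) FROM seq
    """,
    (ROWS,),
)

row_count, low, high = conn.execute(
    "SELECT count(*), min(value), max(value) FROM random_data"
).fetchone()
print(row_count, low, high)
```

Raising `ROWS` changes only the bound in the recursive step, so the query itself stays the same size however many rows it generates.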

Tags

databases

DataScience

SQL

Annotators

pyxelr

URL

antonz.org/random-table/
Jan 2021
www.tecton.ai

Why We Need DevOps for ML Data - Tecton

1
1. pyxelr 31 Jan 2021
  
  in Public
  
  …Well, deploying ML is still slow and painful
  
  What a typical ML production pipeline may look like:
  
  Unfortunately, it ties the hands of data scientists, and it takes a lot of time to experiment and eventually ship the results to production.
  
  DataScience DevOps

Tags

DataScience

DevOps

Annotators

pyxelr

URL

tecton.ai/blog/devops-ml-data/
www.mihaileric.com

We Don't Need Data Scientists, We Need Data Engineers - Mihail Eric

7
1. pyxelr 30 Jan 2021
  
  in Public
  
  Downloading a pretrained model off the Tensorflow website on the Iris dataset probably is no longer enough to get that data science job. It’s clear, however, with the large number of ML engineer openings that companies often want a hybrid data practitioner: someone that can build and deploy models. Or said more succinctly, someone that can use Tensorflow but can also build it from source.
  
  Who the industry really needs
  
  DataScience career
2. pyxelr 30 Jan 2021
  
  in Public
  
  When machine learning become hot 🔥 5-8 years ago, companies decided they need people that can make classifiers on data. But then frameworks like Tensorflow and PyTorch became really good, democratizing the ability to get started with deep learning and machine learning. This commoditized the data modelling skillset. Today, the bottleneck in helping companies get machine learning and modelling insights to production center on data problems.
  
  Why Data Engineering became more important
  
  DataScience DataEngineering career
3. pyxelr 30 Jan 2021
  
  in Public
  
  Overall the consolidation made the differences even more pronounced! There are ~70% more open data engineer than data scientist positions. In addition, there are ~40% more open ML engineer than data scientist positions. There are also only ~30% as many ML scientist as data scientist positions.
  
  Takeaway from the analysis:
  
  ~70% more open data engineer than data scientist positions
  
  ~40% more open ML engineer than data scientist positions
  
  only ~30% as many ML scientist as data scientist positions
  
  DataScience career
4. pyxelr 30 Jan 2021
  
  in Public
  
  Data scientist: Use various techniques in statistics and machine learning to process and analyse data. Often responsible for building models to probe what can be learned from some data source, though often at a prototype rather than production level.
  
  Data engineer: Develops a robust and scalable set of data processing tools/platforms. Must be comfortable with SQL/NoSQL database wrangling and building/maintaining ETL pipelines.
  
  Machine Learning (ML) Engineer: Often responsible for both training models and productionizing them. Requires familiarity with some high-level ML framework and also must be comfortable building scalable training, inference, and deployment pipelines for models.
  
  Machine Learning (ML) Scientist: Works on cutting-edge research. Typically responsible for exploring new ideas that can be published at academic conferences. Often only needs to prototype new state-of-the-art models before handing off to ML engineers for productionization.
  
  4 different data profiles (and more), consolidated:
  
  DataScience career
5. pyxelr 30 Jan 2021
  
  in Public
  
  I scraped the homepage URLs of every YC company since 2012, producing an initial pool of ~1400 companies. Why stop at 2012? Well, 2012 was the year that AlexNet won the ImageNet competition, effectively kickstarting the machine learning and data-modelling wave we are now living through. It’s fair to say that this birthed some of the earliest generations of data-first companies. From this initial pool, I performed keyword filtering to reduce the number of relevant companies I would have to look through. In particular, I only considered companies whose websites included at least one of the following terms: AI, CV, NLP, natural language processing, computer vision, artificial intelligence, machine, ML, data. I also disregarded companies whose website links were broken.
  
  How data was collected
  
  DataScience career
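The keyword-filtering step in the quoted passage can be sketched as follows. The term list is taken from the quote; everything else is an assumption, since the article does not say how terms were matched. This sketch uses whole-word, case-insensitive matching, and the example sites and page texts are hypothetical.

```python
import re

# Terms listed in the quoted passage
KEYWORDS = [
    "AI", "CV", "NLP", "natural language processing", "computer vision",
    "artificial intelligence", "machine", "ML", "data",
]

# Assumption: whole-word, case-insensitive matching (not stated in the article).
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_relevant(page_text: str) -> bool:
    """Return True if the page text contains at least one keyword as a whole word."""
    return PATTERN.search(page_text) is not None


# Hypothetical homepage texts standing in for scraped pages
pages = {
    "acme.example": "We build computer vision tools for warehouses.",
    "shoes.example": "Handmade leather shoes, shipped worldwide.",
    "mail.example": "Email that maintains itself.",
}
relevant = [site for site, text in pages.items() if is_relevant(text)]
print(relevant)
```

Whole-word matching matters for the short terms: a plain substring search for "AI" would also hit words such as "Email" and "maintains".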
6. pyxelr 30 Jan 2021
  
  in Public
  
  There are 70% more open roles at companies in data engineering as compared to data science. As we train the next generation of data and machine learning practitioners, let’s place more emphasis on engineering skills.
  
  The resulting 70% is based on a real analysis
  
  DataScience DataEngineering career
7. pyxelr 30 Jan 2021
  
  in Public
  
  When you think about it, a data scientist can be responsible for any subset of the following: machine learning modelling, visualization, data cleaning and processing (i.e. SQL wrangling), engineering, and production deployment.
  
  What tasks a data scientist can be responsible for
  
  DataScience career

Tags

DataScience

DataEngineering

career

Annotators

pyxelr

URL

mihaileric.com/posts/we-need-data-engineers-not-data-scientists/
www.forbes.com

A Very Short History Of Data Science

1
1. jswan 24 Jan 2021
  
  in Public
  
  The term that seemed to fit best was data scientist: those who use both data and science to create something new.
  
  Even though data science has a rather fluid definition, it is known that data scientists strive to find what cannot be seen in raw data. Putting in the work to see what the competition cannot is a very valuable asset.
  
  #datascience

Tags

#datascience

Annotators

jswan

URL

forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/
www.nytimes.com

A Different Way to Chart the Spread of Coronavirus

1
1. mromanello 15 Jan 2021
  
  in Public
  
  Why plotting on a log scale matters: a very contemporary example.
  
  datascience
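The point about log scales can be illustrated with a few numbers. The sketch below assumes a hypothetical series that doubles every day (not data from the article) and compares the day-over-day steps on a linear scale and on a log scale.

```python
import math

# Hypothetical case counts that double every day (illustration only)
cases = [100 * 2 ** day for day in range(6)]

# Linear scale: the day-over-day increases keep growing
linear_steps = [b - a for a, b in zip(cases, cases[1:])]

# Log scale: every step has the same height, so the curve becomes a straight line
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(cases, cases[1:])]

print(linear_steps)
print([round(step, 3) for step in log_steps])
```

On the log scale a constant growth rate shows up as a constant slope, so a change in the growth rate is visible as a bend in the line.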

Tags

datascience

Annotators

mromanello

URL

nytimes.com/2020/03/20/health/coronavirus-data-logarithm-chart.html
nulldata.substack.com

Data Science 2020 - Highlights

1
1. pyxelr 04 Jan 2021
  
  in Public
  
  2020 is a strong year for NLP. Companies like Hugging Face 🤗 , spaCy, Rasa became stronger and also more educational which ultimately drove a huge NLP revolution (at even Industry-level which is quite hard usually).
  
  Up till 2020, CV had the most attention. Nowadays, it's NLP.
  
  Other hot topics of 2020:
  
  Data Science / AI / ML Web App Building (streamlit...)
  
  GPT-3
  
  Auto ML
  
  ML Ops (ML Flow, KubeFlow...)
  
  FastAI (Pytorch Library)
  
  Interpretable Machine Learning (fancily known as eXplainable AI)
  
  GANs
  
  First-Order Motion
  
  On-Device ML (tensorflow.js / coreML)
  
  DataScience NLP CV 2020

Tags

CV

2020

DataScience

NLP

Annotators

pyxelr

URL

nulldata.substack.com/p/data-science-2020-highlights
www.hbrtaiwan.com

Network effects are not what you think! Seven tests of data competitiveness

1
1. larryjclai 04 Jan 2021
  
  in Public
  
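  How hard is it to imitate product improvements that are based on customer data? Even if the data is unique or proprietary and yields valuable insights, it is still hard to build a durable competitive advantage if competitors can replicate the same improvements without similar data.
  
  UI is easy to imitate, but UX is harder, so refining the UX is very important.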
  
  datascience

Tags

datascience

Annotators

larryjclai

URL

hbrtaiwan.com/article_content_AR0009347.html
Dec 2020
stackoverflow.blog

How to put machine learning models into production - Stack Overflow Blog

14
1. pyxelr 04 Dec 2020
  
  in Public
  
  TFX and Tensorflow run anywhere Python runs, and that’s a lot of places
  
  You can run your Tensorflow models on:
  
  web pages with Tensorflow.js
  
  mobile/IoT devices with Tensorflow lite
  
  in the cloud
  
  even on-prem
  
  MachineLearning DataScience DataEngineering TensorFlow
2. pyxelr 04 Dec 2020
  
  in Public
  
  After consideration, you decide to use Python as your programming language, Tensorflow for model building because you will be working with a large dataset that includes images, and Tensorflow Extended (TFX), an open-source tool released and used internally at Google, for building your pipelines.
  
  Sample tech stack of an ML project:
  
  Python <--- programming
  
  Tensorflow <--- model building (large with images)
  
  TFX (Tensorflow Extended) <--- pipeline builder (model analysis, monitoring, serving, ...)
  
  DataScience DataEngineering MachineLearning TensorFlow
3. pyxelr 04 Dec 2020
  
  in Public
  
  These components has built-in support for ML modeling, training, serving, and even managing deployments to different targets.
  
  Components of TFX:
  
  DataScience DataEngineering MachineLearning TensorFlow
4. pyxelr 04 Dec 2020
  
  in Public
  
  Most data scientists feel that model deployment is a software engineering task and should be handled by software engineers because the required skills are more closely aligned with their day-to-day work. While this is somewhat true, data scientists who learn these skills will have an advantage, especially in lean organizations. Tools like TFX, Mlflow, Kubeflow can simplify the whole process of model deployment, and data scientists can (and should) quickly learn and use them.
  
  As a data scientist, you should consider practicing with TFX, Mlflow, or Kubeflow
  
  MachineLearning DataEngineering DataScience TensorFlow
5. pyxelr 04 Dec 2020
  
  in Public
  
  TFX Component called TensorFlow Model Analysis (TFMA) allows you to easily evaluate new models against current ones before deployment.
  
  TFMA component of TFX seems to be its core functionality
  
  MachineLearning DataEngineering DataScience TensorFlow
6. pyxelr 04 Dec 2020
  
  in Public
  
  This has its pros and cons and may depend on your use case as well. Some of the pros to consider when considering using managed cloud services are:
  
  Pros of using cloud services:
  
  They are cost-efficient
  
  Quick setup and deployment
  
  Efficient backup and recovery
  
  Cons of using cloud services:
  
  Security issues, especially for sensitive data
  
  Internet connectivity may affect work since everything runs online
  
  Recurring costs
  
  Limited control over tools
  
  DataScience MachineLearning DataEngineering cloud
7. pyxelr 04 Dec 2020
  
  in Public
  
  The data is already in the cloud, so it may be better to build your ML system in the cloud. You’ll get better latency for I/O, easy scaling as data becomes larger (hundreds of gigabytes), and quick setup and configuration for any additional GPUs and TPUs.
  
  If the data for your project is already in the cloud, try to stick to cloud solutions
  
  MachineLearning DataScience DataEngineering
8. pyxelr 04 Dec 2020
  
  in Public
  
  ML projects are never static. This is part of engineering and design that must be considered from the start. Here you should answer questions like:
  
  We need to prepare for these 2 things in ML projects:
  
  How do we get feedback from a model in production?
  
  How do you set up CD?
  
  MachineLearning DataScience DataEngineering
9. pyxelr 04 Dec 2020
  
  in Public
  
  The choice of framework is very important, as it can decide the continuity, maintenance, and use of a model. In this step, you must answer the following questions:
  
  Questions to ask before choosing a particular tool/framework:
  
  What is the best tool for the task at hand?
  
  Are the tools of choice open-source or closed?
  
  How many platforms/targets support the tool?
  
  Try to compare the tools based on:
  
  Efficiency
  
  Popularity
  
  Support
  
  MachineLearning DataScience DataEngineering
10. pyxelr 04 Dec 2020
  
  in Public
  
  These questions are important as they will guide you on what frameworks or tools to use, how to approach your problem, and how to design your ML model.
  
  Critical questions for ML projects:
  
  How is your training data stored?
  
  How large is your data?
  
  How will you retrieve the data for training?
  
  How will you retrieve data for prediction?
  
  MachineLearning DataScience DataEngineering
11. pyxelr 04 Dec 2020
  
  in Public
  
  you should not invest in an ML project if you have no plan to put it in production, except of course when doing pure research.
  
  Tip #1 when applying ML in business
  
  MachineLearning DataScience
12. pyxelr 04 Dec 2020
  
  in Public
  
  The difficulties in model deployment and management have given rise to a new, specialized role: the machine learning engineer. Machine learning engineers are closer to software engineers than typical data scientists, and as such, they are the ideal candidate to put models into production.
  
  Why the Machine Learning Engineer role exists
  
  MachineLearning DataScience
13. pyxelr 04 Dec 2020
  
  in Public
  
  The goal of building a machine learning model is to solve a problem, and a machine learning model can only do so when it is in production and actively in use by consumers. As such, model deployment is as important as model building.
  
  Model deployment is as important as model building
  
  MachineLearning DataScience
14. pyxelr 04 Dec 2020
  
  in Public
  
  Venturebeat reports that 87% of data science projects never make it to production, while redapt claims it is 90%.
  
  data MachineLearning DataScience

Tags

MachineLearning

DataEngineering

TensorFlow

data

DataScience

cloud

Annotators

pyxelr

URL

stackoverflow.blog/2020/10/12/how-to-put-machine-learning-models-into-production/
Oct 2020
towardsdatascience.com

Fraud detection: the problem, solutions and tools

3
1. pyxelr 31 Oct 2020
  
  in Public
  
  Statistical techniques: average, quantiles, probability distribution, association rules
  
  Supervised ML algorithms: logistic regression, neural net, time-series analysis
  
  Unsupervised ML algorithms: Cluster analysis, Bayesian network, Peer group analysis, break point analysis, Benford’s law (law of anomalous numbers)
  
  Typical techniques used in financial fraud classification
  
  fraud MachineLearning DataScience
2. pyxelr 31 Oct 2020
  
  in Public
  
  In machine learning, parlance fraud detection is generally treated as a supervised classification problem, where observations are classified as “fraud” or “non-fraud” based on the features in those observations. It is also an interesting problem in ML research due to imbalanced data — i.e. there’s a very few cases of frauds in an extremely large amount of transactions.
  
  Financial fraud is generally solved as a supervised classification, but we've got the problem of imbalanced data
  
  fraud MachineLearning DataScience
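One common way to handle the imbalance mentioned above is to weight the rare class more heavily during training. The sketch below computes such weights with the standard library only; the label counts are hypothetical, and the formula n_samples / (n_classes * class_count) is the "balanced" heuristic, named here as one option rather than as the method used in the article.

```python
from collections import Counter

# Hypothetical labels: 3 fraud cases among 1,000 transactions
labels = ["fraud"] * 3 + ["non-fraud"] * 997


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


weights = balanced_class_weights(labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

With these weights a single misclassified fraud case costs the model far more than a misclassified normal transaction, which counters the tendency to predict "non-fraud" for everything.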
3. pyxelr 31 Oct 2020
  
  in Public
  
  With ever-increasing online transactions and production of a large volume of customer data, machine learning has been increasingly seen as an effective tool to detect and counter frauds. However, there is no specific tool, the silver bullet, that works for all kinds of fraud detection problems in every single industry. The nature of the problem is different in every case and every industry. Therefore every solution is carefully tailored within the domain of each industry.
  
  Machine learning in fraud detection
  
  fraud MachineLearning DataScience

Tags

MachineLearning

fraud

DataScience

Annotators

pyxelr

URL

towardsdatascience.com/fraud-detection-the-problem-solutions-and-tools-dd8977b435c9
about.fb.com

Introducing the First AI Model That Translates 100 Languages Without Relying on English - About Facebook

1
1. pyxelr 26 Oct 2020
  
  in Public
  
  Facebook AI is introducing M2M-100, the first multilingual machine translation (MMT) model that can translate between any pair of 100 languages without relying on English data. It’s open sourced here. When translating, say, Chinese to French, most English-centric multilingual models train on Chinese to English and English to French, because English training data is the most widely available. Our model directly trains on Chinese to French data to better preserve meaning. It outperforms English-centric systems by 10 points on the widely used BLEU metric for evaluating machine translations. M2M-100 is trained on a total of 2,200 language directions — or 10x more than previous best, English-centric multilingual models. Deploying M2M-100 will improve the quality of translations for billions of people, especially those that speak low-resource languages. This milestone is a culmination of years of Facebook AI’s foundational work in machine translation. Today, we’re sharing details on how we built a more diverse MMT training data set and model for 100 languages. We’re also releasing the model, training, and evaluation setup to help other researchers reproduce and further advance multilingual models.
  
  Summary of the 1st AI model from Facebook that translates directly between languages (not relying on English data)
  
  DataScience AI Facebook

Tags

DataScience

Facebook

AI

Annotators

pyxelr

URL

about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
towardsdatascience.com

Data Science vs Decision Science

1
1. pyxelr 25 Oct 2020
  
  in Public
  
  For Decision Scientists, the business problem comes first. Analysis follows and is dependent on the question or business decision that needs to be made.
  
  Decision Scientists
  
  DataScience

Tags

DataScience

Annotators

pyxelr

URL

towardsdatascience.com/data-science-vs-decision-science-8f8d53ce25da
stats.stackexchange.com

How to choose the number of hidden layers and nodes in a feedforward neural network?

1
1. pyxelr 24 Oct 2020
  
  in Public
  
  The number of hidden neurons should be between the size of the input layer and the size of the output layer. The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer. The number of hidden neurons should be less than twice the size of the input layer.
  
  3 rules of thumb for choosing the number of hidden neurons
  
  DataScience MachineLearning NeuralNetworks
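The three rules of thumb quoted above are simple arithmetic, sketched below for a hypothetical network with 30 inputs and 3 outputs. The function and key names are illustrative. The rules are only starting points for experimentation and can disagree with each other.

```python
def hidden_size_rules(n_inputs: int, n_outputs: int) -> dict:
    """Evaluate the three quoted rules of thumb for one hidden layer."""
    low, high = sorted((n_inputs, n_outputs))
    return {
        # Rule 1: between the size of the input layer and the output layer
        "between_input_and_output": (low, high),
        # Rule 2: 2/3 the size of the input layer, plus the size of the output layer
        "two_thirds_input_plus_output": round(2 * n_inputs / 3 + n_outputs),
        # Rule 3: fewer than twice the size of the input layer
        "upper_bound_exclusive": 2 * n_inputs,
    }


rules = hidden_size_rules(n_inputs=30, n_outputs=3)
print(rules)
```

For this example the second rule gives 23 neurons, which also satisfies the first and third rules.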

Tags

MachineLearning

NeuralNetworks

DataScience

Annotators

pyxelr

URL

stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
blog.prokulski.science

Pipeline in SciKit Learn

3
1. pyxelr 16 Oct 2020
  
  in Public
  
  Let's check which type of model gives the best performance:
  
  sns.boxplot(data=models_df, x='score', y='model')
  
  After comparing the pipelined ML models, we can easily display a comparison boxplot.
  
  Well-performing ML models: 1) XGBoost 2) LightGBM 3) CatBoost.
  
  Python DataScience MachineLearning scikit-learn
2. pyxelr 16 Oct 2020
  
  in Public
  
  With this data it looks like there is practically no major difference (we are not fighting here over a 0.01 percentage point improvement in model accuracy). So maybe training time matters?
  
  sns.boxplot(data=models_df, x='time_elapsed', y='model')
  
  Training time of some popular ML models. After considering the performance, it's worth using XGBoost and LightGBM.
  
  Python scikit-learn DataScience MachineLearning
3. pyxelr 16 Oct 2020
  
  in Public
  
  Now, in nested loops, we can check every combination by swapping the classifiers and transformers (the whole loop takes a while to run):
  
  Example (below) of when creating pipelines with scikit-learn makes sense. Basically, it's convenient to use it while comparing multiple models in a loop
  
  Python scikit-learn DataScience MachineLearning
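The nested-loop comparison described above can be sketched with scikit-learn as follows. This is not the article's code: the dataset is synthetic, and two scalers and two built-in classifiers stand in for the transformers and the boosted models (XGBoost, LightGBM, CatBoost) compared in the post, so the sketch needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data so the sketch is self-contained
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

scalers = {"standard": StandardScaler(), "minmax": MinMaxScaler()}
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}

results = []
for scaler_name, scaler in scalers.items():
    for clf_name, clf in classifiers.items():
        # Same pipeline shape every time; only the two steps are swapped
        pipeline = Pipeline([("scaler", scaler), ("classifier", clf)])
        scores = cross_val_score(pipeline, X, y, cv=5)
        results.append(
            {"scaler": scaler_name, "model": clf_name, "score": scores.mean()}
        )

for row in sorted(results, key=lambda r: r["score"], reverse=True):
    print(f"{row['scaler']} + {row['model']}: accuracy={row['score']:.3f}")
```

Because every combination goes through the same `Pipeline` interface, adding another transformer or model means adding one dictionary entry rather than another copy of the training code.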

Tags

MachineLearning

Python

scikit-learn

DataScience

Annotators

pyxelr

URL

blog.prokulski.science/index.php/2020/10/10/pipeline-w-scikit-learn/
Sep 2020
news.ycombinator.com

Data science interview questions with answers | Hacker News

1
1. pyxelr 26 Sep 2020
  
  in Public
  
  The best data scientists are just people who try to understand the ins and outs of business processes and look at problems with healthy suspicion and curiosity. The ability to explain the nuances of manifolds in SVMs is not something that comes into it outside these contrived interviews. I prefer to ask candidates how they would approach solving a problem I’m facing at that moment rather than these cookie cutter tests which are easy to game and tell me nothing
  
  Interesting approach from an experienced data scientist to interview new professionals
  
  DataScience interviews

Tags

interviews

DataScience

Annotators

pyxelr

URL

news.ycombinator.com/item
www.datamine.com

The Cost of Analytics | Datamine

1
1. SamRose 23 Sep 2020
  
  in Public
  
  food21 cdas21 datascience cost of ownership

Tags

food21

cdas21

cost of ownership

datascience

Annotators

SamRose

URL

datamine.com/how-much-does-data-analytics-and-ai-cost
news.ycombinator.com

Deep learning job postings have collapsed in the past six months | Hacker News

3
1. pyxelr 11 Sep 2020
  
  in Public
  
  What you actually need from an ML/Data Science person:
  
  - Experience with data cleaning (this is most of the gig)
  
  - A solid understanding of linear and logistic regression, along with cross-validation
  
  - Some reasonable coding skills (in both R and Python, with a side of SQL).
  
  Basic skills to look for in data scientists
  
  career DataScience
2. pyxelr 11 Sep 2020
  
  in Public
  
  deep learning is good in some scenarios and not in others, so anyone who's just a deep learning developer is not going to be useful to most companies.
  
  career DataScience
3. pyxelr 11 Sep 2020
  
  in Public
  
  Imagine what it must be like for the senior leadership of an established company to actually become data-driven. All of a sudden the leadership is going to consent to having all of their strategic and tactical decision-making be questioned by a bunch of relatively new hires from way down the org chart, whose entire basis for questioning all that expertise and business acumen is that they know how to fiddle around with numbers in some program called R? And all the while, they're constantly whining that this same data is junk and unreliable and we need to upend a whole bunch of IT systems just so they can rock the boat even harder? Pffft.
  
  Reality of becoming a data-driven company
  
  business DataScience

Tags

business

DataScience

career

Annotators

pyxelr

URL

news.ycombinator.com/item
github.com

tradytics/surpriver

1
1. pyxelr 11 Sep 2020
  
  in Public
  
  This command will give you the top 25 stocks that had the highest anomaly score in the last 14 bars of 60 minute candles.
  
  Surpriver - find high-moving stocks before they move, using anomaly detection and machine learning. Surpriver uses machine learning to look at volume and price action and infer unusual patterns that can result in big moves in stocks.
  
  Python DataScience finance

Tags

Python

finance

DataScience

Annotators

pyxelr

URL

github.com/tradytics/surpriver
Aug 2020
epistasislab.github.io

Home - TPOT

2
1. pyxelr 29 Aug 2020
  
  in Public
  
  Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
  
  In the end, you magically get the right Python snippet (based on scikit-learn)
  
  MachineLearning DataScience Python
2. pyxelr 29 Aug 2020
  
  in Public
  
  TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming
  
  TPOT automates the following:
  
  feature selection
  
  feature preprocessing
  
  feature construction
  
  model selection
  
  parameter optimisation
  
  MachineLearning DataScience Python
Visit annotations in context

Tags

MachineLearning

Python

DataScience

Annotators

pyxelr

URL

epistasislab.github.io/tpot/
www.splitgraph.com www.splitgraph.com

Port 5432 is open: introducing the Splitgraph Data Delivery Network

1
1. pyxelr 29 Aug 2020
  
  in Public
  
  The Splitgraph DDN is a single SQL endpoint that lets you query over 40,000 public datasets hosted on or proxied by Splitgraph. You can connect to it from most PostgreSQL clients and BI tools without having to install anything else. It supports all read-only SQL constructs, including filters and aggregations. It even lets you run joins across distinct datasets.
  
  Splitgraph - efficient DDN (Data Delivery Network):
  
  connect to it from most PostgreSQL clients and BI tools without having to install anything else
  
  you can query 40k+ public datasets hosted on or proxied by Splitgraph
  
  supports all read-only SQL constructs (even joins across distinct datasets)
  
  databases DataScience
Visit annotations in context

Tags

databases

DataScience

Annotators

pyxelr

URL

splitgraph.com/
medium.com medium.com

Configuring Google Colab Like A Pro

1
1. pyxelr 24 Aug 2020
  
  in Public
  
  I learned and did to make it possible to do Automated Speech Recognition research on a Colab instance.
  
  It's possible to do Automated Speech Recognition with Google Colab.
  
  Notebook instance of this post
  
  MachineLearning DataScience Colab
Visit annotations in context

Tags

MachineLearning

DataScience

Colab

Annotators

pyxelr

URL

medium.com/@robertbracco1/configuring-google-colab-like-a-pro-d61c253f7573
Jul 2020
libradocs.github.io libradocs.github.io

About Libra

1
1. pyxelr 31 Jul 2020
  
  in Public
  
  Libra is a machine learning API designed for non-technical users. This means that it assumes that you have no background in ML whatsoever.
  
  With Libra you can write your ML code much faster:
  
  For example, that's how it compares to Keras.
  
  DataScience MachineLearning Python Libra
Visit annotations in context

Tags

MachineLearning

Python

DataScience

Libra

Annotators

pyxelr

URL

libradocs.github.io/
Jun 2020
nabeelqu.co nabeelqu.co

nabeelqu

1
1. pyxelr 26 Jun 2020
  
  in Public
  
  Most of the discourse is about how AI will “replace” humans. I prefer the Licklider school of thought: human-computer symbiosis. AI will make humans vastly more effective by automating tedious tasks. For example, humans can use text AI such as GPT-3 to generate ideas/boilerplate writing to get around the terror of the blank page, and then simply pick the best ones and refine/iterate on those. (AI Dril, which was based on GPT-2, was an early example of this). As AI gets better, “assistive creativity” will become a bigger thing, enabling humans to create sophisticated artifacts (including video games!) easier and better than ever.
  
  Why AI is important for human productivity
  
  DataScience
Visit annotations in context

Tags

DataScience

Annotators

pyxelr

URL

nabeelqu.co/education
May 2020
www.estimationstats.com www.estimationstats.com

Estimation Stats

9
1. pyxelr 09 May 2020
  
  in Public
  
  You can create estimation plots here at estimationstats.com, or with the DABEST packages which are available in R, Python, and Matlab.
  
  You can create estimation plots with:
  
  this website
  
  R package
  
  Python package
  
  MATLAB package
  
  DataScience R Python MATLAB
2. pyxelr 09 May 2020
  
  in Public
  
  Relative to conventional plots, estimation plots offer five key benefits:
  
  Estimation plots > bars-and-stars or boxplot & P.
  
  They:
  
  avoid false dichotomy
  
  display all observed values
  
  focus on intervention effect size
  
  visualise estimate precision
  
  show mean difference distribution
  
  DataScience datavisualisation
3. pyxelr 09 May 2020
  
  in Public
  
  For comparisons between 3 or more groups that typically employ analysis of variance (ANOVA) methods, one can use the Cumming estimation plot, which can be considered a variant of the Gardner-Altman plot.
  
  Cumming estimation plot
  
  DataScience datavisualisation statistics
4. pyxelr 09 May 2020
  
  in Public
  
  Efron developed the bias-corrected and accelerated bootstrap (BCa bootstrap) to account for the skew whilst obtaining the central 95% of the distribution.
  
  Bias-corrected and accelerated bootstrap (BCa bootstrap) deals with skewed sample distributions. However, it must be noted that it "may not give very accurate coverage in a small-sample non-parametric situation" (simply said, take caution with small datasets).
  
  DataScience statistics
5. pyxelr 09 May 2020
  
  in Public
  
  We can calculate the 95% CI of the mean difference by performing bootstrap resampling.
  
  Bootstrap - simple but powerful technique that creates multiple resamples (with replacement) from a single set of observations, and computes the effect size of interest on each of these resamples. It can be used to determine the 95% CI (Confidence Interval).
  
  We can use bootstrap resampling to obtain a measure of precision and confidence about our estimate. It gives us 2 important benefits:
  
  Non-parametric statistical analysis - no need to assume a normal distribution of our observations. Thanks to the Central Limit Theorem, the resampling distribution of the effect size will approach normality
  
  Easy construction of the 95% CI from the resampling distribution. For 1000 bootstrap resamples of the mean difference, sorted in ascending order, the 25th and 975th values can be used as the boundaries of the 95% CI.
  
  Bootstrap resampling can be used for such an example:
  
  Computers can easily perform 5000 resamples:
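
  The resampling procedure described above can be sketched with the standard library alone. This is a minimal illustration, not the DABEST implementation: the function name, the two sample groups and the fixed seed are all made up for the example, and it uses the plain percentile interval (not the BCa correction).

```python
import random
import statistics


def bootstrap_mean_diff_ci(control, test, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(test) - mean(control)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        control_sample = rng.choices(control, k=len(control))
        test_sample = rng.choices(test, k=len(test))
        diffs.append(statistics.fmean(test_sample) - statistics.fmean(control_sample))
    diffs.sort()
    # For 1000 resamples these are the 25th and 975th sorted values (1-based).
    lower_rank = max(1, round(n_resamples * alpha / 2))
    upper_rank = round(n_resamples * (1 - alpha / 2))
    return diffs[lower_rank - 1], diffs[upper_rank - 1]


# Hypothetical observations for two groups.
control = [10, 12, 11, 13, 9, 10, 12, 11]
test = [14, 15, 13, 16, 15, 14, 17, 15]

observed = statistics.fmean(test) - statistics.fmean(control)
ci_low, ci_high = bootstrap_mean_diff_ci(control, test)
print(f"mean difference = {observed:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

  The interval bounds are simply read off the sorted resampled differences, which is why no normality assumption about the raw observations is needed.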
  
  DataScience statistics
6. pyxelr 09 May 2020
  
  in Public
  
  Shown above is a Gardner-Altman estimation plot.
  
  Gardner-Altman estimation plot shows all the relevant information:
  
  Datapoints presented as swarmplot
  
  Effect size is presented as a bootstrap 95% confidence interval (95% CI) on separate but aligned axes
  
  datavisualisation DataScience
7. pyxelr 09 May 2020
  
  in Public
  
  Jitter plots avoid overlapping datapoints (i.e. datapoints with the same y-value) by adding a random factor to each point along the orthogonal x-axes.
  
  Jitter plots display all datapoints, but they might not accurately depict the underlying distribution of the data:
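
  The jittering step itself is tiny. The sketch below is an assumption-laden illustration (the helper name, the offset width and the sample values are invented): each point keeps its y-value and only receives a random horizontal offset, which is also why the horizontal spread carries no information about the data density.

```python
import random


def jitter_x(x_positions, width=0.2, seed=0):
    """Return x positions shifted by a random offset in [-width, width]."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [x + rng.uniform(-width, width) for x in x_positions]


# Three identical y-values would overlap completely at x = 1.0 without jitter.
ys = [5.0, 5.0, 5.0, 6.1]
xs = jitter_x([1.0] * len(ys))

for x, y in zip(xs, ys):
    print(f"x = {x:.3f}, y = {y}")
```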
  
  datavisualisation DataScience
8. pyxelr 09 May 2020
  
  in Public
  
  Unfortunately, the boxplot still doesn't show all our data.
  
  Boxplots may be better than barplots (they introduce medians, quartiles, minima and maxima), but they still don't show all the information:
  
  datavisualisation DataScience
9. pyxelr 09 May 2020
  
  in Public
  
  The barplot has several shortcomings, even though its use in academic journals is endemic.
  
  Barplots are not the best choice for data visualisation:
  
  datavisualisation DataScience
Visit annotations in context

Tags

MATLAB

R

statistics

Python

DataScience

datavisualisation

Annotators

pyxelr

URL

estimationstats.com/
mateuszgrzyb.pl mateuszgrzyb.pl

Segmentacja behawioralna klientów z użyciem metody RFM

5
1. pyxelr 07 May 2020
  
  in Public
  
  The next variable is "Recency", i.e. information about how long ago the customer made a purchase in the store.
  
  To calculate RFM we need the recency value. First, we treat the date of the most recent transaction in the dataset as "today", and then compute how many days before "today" each transaction took place (the minimum per client is taken in the aggregation step):
  
  df['Recency'] = (today - df.InvoiceDate)/np.timedelta64(1,'D')
  
  Calculating frequency and aggregating data of each client may be done with the groupby method:
  
  abt = df.groupby(['CustomerID']).agg({'Recency':'min', 'MonetaryValue':'sum', 'InvoiceNo':'count'})
  
  Lastly, we can update the column names and display the RFM data:
  
  abt = df.groupby(['CustomerID']).agg({'Recency':'min', 'MonetaryValue':'sum', 'InvoiceNo':'count'})
  abt.rename(columns = {'InvoiceNo':'Frequency'}, inplace = True)
  abt = abt[['Recency', 'Frequency', 'MonetaryValue']]
  abt.head()
  
  DataScience Python
2. pyxelr 07 May 2020
  
  in Public
  
  I will evaluate customers using the criteria defined by the RFM method.
  
  RFM is a method used for analyzing customer value:
  
  Recency – How recently did the customer purchase?
  
  Frequency – How often do they purchase?
  
  Monetary Value – How much do they spend?
  
  In the RFM method we usually analyse only the last 12 months, since this is the most relevant data for our products.
  
  We also have other RFM variants:
  
  RFD – Recency, Frequency, Duration
  
  RFE – Recency, Frequency, Engagement
  
  RFM-I – Recency, Frequency, Monetary Value – Interactions
  
  RFMTC – Recency, Frequency, Monetary Value, Time, Churn rate
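
  The three RFM values, restricted to the last 12 months, can be sketched without any third-party library. Everything below is illustrative: the transactions, the `rfm_table` helper and the 365-day window are assumptions made for the example, with the most recent transaction treated as "today".

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical transactions: (customer_id, invoice_date, amount).
transactions = [
    ("A", date(2020, 4, 30), 120.0),
    ("A", date(2020, 1, 15), 80.0),
    ("B", date(2019, 11, 2), 40.0),
    ("B", date(2018, 6, 1), 500.0),  # older than 12 months, so ignored
    ("C", date(2020, 5, 1), 15.0),
]


def rfm_table(rows, window_days=365):
    """Return {customer: {"recency", "frequency", "monetary"}} for recent rows."""
    today = max(day for _, day, _ in rows)  # most recent transaction as "today"
    cutoff = today - timedelta(days=window_days)
    table = defaultdict(lambda: {"recency": None, "frequency": 0, "monetary": 0.0})
    for customer, day, amount in rows:
        if day < cutoff:
            continue  # outside the analysis window
        entry = table[customer]
        days_ago = (today - day).days
        # Recency is the number of days since the customer's latest purchase.
        if entry["recency"] is None or days_ago < entry["recency"]:
            entry["recency"] = days_ago
        entry["frequency"] += 1
        entry["monetary"] += amount
    return dict(table)


rfm = rfm_table(transactions)
for customer, values in sorted(rfm.items()):
    print(customer, values)
```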
  
  DataScience
3. pyxelr 07 May 2020
  
  in Public
  
  I remove the missing values in the "CustomerID" variable.
  
  Deleting rows where value is null:
  
  df = df[~df.CustomerID.isnull()]
  
  Assigning different data types to columns:
  
  df['CustomerID'] = df.CustomerID.astype(int)
  
  Deleting irrelevant columns:
  
  df.drop(['Description', 'StockCode', 'Country'], axis = 1, inplace = True)
  
  DataScience Python
4. pyxelr 07 May 2020
  
  in Public
  
  I check for missing data.
  
  Checking the % of missing data:
  
  print(str(round(df.isnull().any(axis=1).sum()/df.shape[0]*100,2))+'% of observations contain missing data.')
  
  Sample output:
  
  24.93% of observations contain missing data.
  
  DataScience Python
5. pyxelr 07 May 2020
  
  in Public
  
  I build a simple data frame with basic information about the dataset.
  
  Building a simple dataframe with a summary of our columns (data types, sum and % of nulls):
  
  summary = pd.DataFrame(df.dtypes, columns=['Dtype'])
  summary['Nulls'] = pd.DataFrame(df.isnull().any())
  summary['Sum_of_nulls'] = pd.DataFrame(df.isnull().sum())
  summary['Per_of_nulls'] = round((df.apply(pd.isnull).mean()*100),2)
  summary.Dtype = summary.Dtype.astype(str)
  print(summary)
  
  The output:
  
                        Dtype  Nulls  Sum_of_nulls  Per_of_nulls
  InvoiceNo            object  False             0         0.000
  StockCode            object  False             0         0.000
  Description          object   True          1454         0.270
  Quantity              int64  False             0         0.000
  InvoiceDate  datetime64[ns]  False             0         0.000
  UnitPrice           float64  False             0         0.000
  CustomerID          float64   True        135080        24.930
  Country              object  False             0         0.000
  
  DataScience Python
Visit annotations in context

Tags

Python

DataScience

Annotators

pyxelr

URL

mateuszgrzyb.pl/segmentacja-behawioralna-klientow-rfm/
towardsdatascience.com towardsdatascience.com

Why we do machine learning engineering with YAML, not notebooks

1
1. pyxelr 02 May 2020
  
  in Public
  
  For many data scientists, the finished product of a work session is a business analysis. They need to show team members—who oftentimes aren’t technical—how their data became a specific recommendation or insight.
  
  The usual final product in data science is a business analysis, which notebooks explain very well
  
  DataScience career Jupyter
Visit annotations in context

Tags

Jupyter

career

DataScience

Annotators

pyxelr

URL

towardsdatascience.com/why-we-do-machine-learning-engineering-with-yaml-not-notebooks-a2a97f5e04f8
www.datanami.com www.datanami.com

How COVID-19 Is Impacting the Market for Data Jobs

3
1. pyxelr 02 May 2020
  
  in Public
  
  COVID-19 has spurred a shift to analyze things like supply chain disruptions, speech analytics, and filtering out pandemic-related behavior, such as binge shopping, Burtch Works says. Data teams are being asked to create new simulations for the coming recession, to create scorecards to track pandemic-related behavior, and add COVID-19 control variables to marketing attribution models, the company says.
  
  How COVID-19 impacts activities of data positions
  
  COVID-19 DataScience career
2. pyxelr 02 May 2020
  
  in Public
  
  Data scientists and data engineers have built-in job security relative to other positions as businesses transition their operations to rely more heavily on data, data science, and AI. That’s a long-term trend that is not likely to change due to COVID-19, although momentum had started to slow in 2019 as venture capital investments ebbed.
  
  COVID-19 DataScience career
3. pyxelr 02 May 2020
  
  in Public
  
  According to a Dice Tech Jobs report released in February, demand for data engineers was up 50% and demand for data scientists was up 32% in 2019 compared to the prior year.
  
  Need for Data Scientist / Engineers in 2019 vs 2018
  
  DataScience career
Visit annotations in context

Tags

DataScience

COVID-19

career

Annotators

pyxelr

URL

datanami.com/2020/04/03/how-covid-19-is-impacting-the-market-for-data-jobs/
news.ycombinator.com news.ycombinator.com

Falcon is a free, open-source SQL editor with inline data visualization | Hacker News

1
1. pyxelr 02 May 2020
  
  in Public
  
  Truth be told, we found that most companies we worked with preferred to own the analytical backend.
  
  From the experience of Plotly Team
  
  DataScience career
Visit annotations in context

Tags

career

DataScience

Annotators

pyxelr

URL

news.ycombinator.com/item
Apr 2020
towardsdatascience.com towardsdatascience.com

RIP correlation. Introducing the Predictive Power Score

11
1. pyxelr 30 Apr 2020
  
  in Public
  
  the limitations of the PPS
  
  Limitations of the PPS:
  
  Slower than correlation
  
  Score cannot be interpreted as easily as the correlation (it doesn't tell you anything about the type of relationship). PPS is better for finding patterns and correlation is better for communicating found linear relationships
  
  You cannot compare the scores for different target variables in a strict math way because they're calculated using different evaluation metrics
  
  There are some limitations of the components used underneath the hood
  
  You have to perform forward and backward selection in addition to feature selection
  
  DataScience statistics
2. pyxelr 30 Apr 2020
  
  in Public
  
  Although the PPS has many advantages over the correlation, there is some drawback: it takes longer to calculate.
  
  PPS is slower to calculate than the correlation.
  
  single PPS = 10-500 ms
  
  whole PPS matrix for 40 columns = 40*40 = 1600 individual calculations = 1-10 minutes
  
  DataScience
3. pyxelr 30 Apr 2020
  
  in Public
  
  How to use the PPS in your own (Python) project
  
  Using PPS with Python
  
  Download ppscore: pip install ppscore
  
  Calculate the PPS for a given pandas dataframe:
  import ppscore as pps
  pps.score(df, "feature_column", "target_column")
  
  Calculate the whole PPS matrix:
  pps.matrix(df)
  
  DataScience statistics Python
4. pyxelr 30 Apr 2020
  
  in Public
  
  The PPS clearly has some advantages over correlation for finding predictive patterns in the data. However, once the patterns are found, the correlation is still a great way of communicating found linear relationships.
  
  PPS:
  
  good for finding predictive patterns
  
  can be used for feature selection
  
  can be used to detect information leakage between variables
  
  interpret PPS matrix as a directed graph to find entity structures
  
  Correlation:
  
  good for communicating found linear relationships
  
  DataScience statistics
5. pyxelr 30 Apr 2020
  
  in Public
  
  Let’s compare the correlation matrix to the PPS matrix on the Titanic dataset.
  
  Comparing correlation matrix and the PPS matrix of the Titanic dataset:
  
  findings about the correlation matrix:
  
  Correlation matrix is smaller because it doesn't work for categorical data
  
  Correlation matrix shows a negative correlation between TicketPrice and Class. In the PPS matrix, TicketPrice is a strong predictor of Class (0.9 PPS), but Class is not a strong predictor of TicketPrice (a ticket costing $5000-$10000 is most likely in the highest class, but the highest class alone does not determine the price)
  
  findings about the PPS matrix:
  
  First row of the matrix tells you that the best univariate predictor of the column Survived is the column Sex (Sex was dropped for correlation)
  
  TicketID uncovers a hidden pattern as well as its connection with the TicketPrice
  
  DataScience statistics
6. pyxelr 30 Apr 2020
  
  in Public
  
  Let’s use a typical quadratic relationship: the feature x is a uniform variable ranging from -2 to 2 and the target y is the square of x plus some error.
  
  In this scenario:
  
  we can predict y using x
  
  we cannot predict x using y as x might be negative or positive (for y=4, x=2 or -2)
  
  the correlation is 0, both from x to y and from y to x, because the correlation is symmetric (more often, relationships are asymmetric!). However, the PPS from x to y is 0.88 (not 1 because of the error term)
  
  PPS from y to x is 0 because knowing only the value of y does not let you predict x
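
  The zero correlation in this scenario is easy to check by hand. The sketch below is a simplified, noise-free version of the example (the error term from the quote is left out, and the `pearson` helper and the grid of x values are assumptions made for illustration). It only demonstrates the correlation side; it does not compute a PPS.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# x on an evenly spaced grid from -2 to 2, y = x squared (no noise).
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]

r = pearson(xs, ys)

# Knowing x pins down y, but y = 4 is compatible with two different x values.
candidates_for_y4 = sorted(x for x, y in zip(xs, ys) if y == 4.0)

print(f"correlation = {r:.6f}")
print(f"x values with y = 4: {candidates_for_y4}")
```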
  
  DataScience statistics
7. pyxelr 30 Apr 2020
  
  in Public
  
  how do you normalize a score? You define a lower and an upper limit and put the score into perspective.
  
  Normalising a score:
  
  you need to define a lower and an upper limit
  
  the upper limit is the perfect score, e.g. F1 = 1 or MAE = 0
  
  lower limit depends on the evaluation metric and your data set. It's the value that a naive predictor achieves
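
  One way to put a raw metric "into perspective" between those two limits is a linear rescaling, sketched below. The function names are invented for this example, and clipping negative results to 0 is an assumption; this is an illustration consistent with the limits described above, not a copy of the ppscore source.

```python
def normalise_f1(model_f1, naive_f1):
    """Rescale F1 so that the naive baseline maps to 0 and a perfect F1 = 1 maps to 1."""
    if naive_f1 >= 1.0:
        return 0.0  # the baseline is already perfect, nothing left to gain
    return max(0.0, (model_f1 - naive_f1) / (1.0 - naive_f1))


def normalise_mae(model_mae, naive_mae):
    """Rescale MAE so that the naive baseline maps to 0 and a perfect MAE = 0 maps to 1."""
    if naive_mae <= 0.0:
        return 0.0  # the baseline is already perfect, nothing left to gain
    return max(0.0, 1.0 - model_mae / naive_mae)


print(normalise_f1(0.75, 0.5))   # halfway between baseline and perfect
print(normalise_mae(0.5, 2.0))   # error cut to a quarter of the baseline
```

  A model that is no better than the naive predictor scores 0 under both functions, which is what makes scores from different metrics roughly comparable.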
  
  DataScience statistics
8. pyxelr 30 Apr 2020
  
  in Public
  
  For a classification problem, always predicting the most common class is pretty naive. For a regression problem, always predicting the median value is pretty naive.
  
  What is a naive model:
  
  predicting the most common class for a classification problem
  
  predicting the median value for a regression problem
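
  Both naive baselines fit in a few lines of standard-library Python. The helper names and the sample data are made up for this sketch; the regression helper also reports the mean absolute error that the median predictor achieves, which is the kind of value a lower limit would be based on.

```python
import statistics
from collections import Counter


def naive_class_prediction(labels):
    """Always predict the most common class."""
    return Counter(labels).most_common(1)[0][0]


def naive_regression_mae(values):
    """MAE obtained by always predicting the median value."""
    median = statistics.median(values)
    return statistics.fmean([abs(v - median) for v in values])


labels = ["survived", "died", "died", "survived", "died"]
prices = [1, 2, 3, 10]

print(naive_class_prediction(labels))
print(naive_regression_mae(prices))
```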
  
  DataScience statistics
9. pyxelr 30 Apr 2020
  
  in Public
  
  Let’s say we have two columns and want to calculate the predictive power score of A predicting B. In this case, we treat B as our target variable and A as our (only) feature. We can now calculate a cross-validated Decision Tree and calculate a suitable evaluation metric.
  
  If the target (B) variable is:
  
  numeric - we can use a Decision Tree Regressor and calculate the Mean Absolute Error (MAE)
  
  categoric - we can use a Decision Tree Classifier and calculate the weighted F1 (or ROC)
  
  DataScience statistics
10. pyxelr 30 Apr 2020
  
  in Public
  
  More often, relationships are asymmetric
  
  a column with 3 unique values will never be able to perfectly predict another column with 100 unique values. But the opposite might be true
  
  DataScience statistics
11. pyxelr 30 Apr 2020
  
  in Public
  
  there are many non-linear relationships that the score simply won’t detect. For example, a sinus wave, a quadratic curve or a mysterious step function. The score will just be 0, saying: “Nothing interesting here”. Also, correlation is only defined for numeric columns.
  
  Correlation:
  
  doesn't work with non-linear data
  
  doesn't work for categorical values
  
  Examples:
  
  DataScience statistics
Visit annotations in context

Tags

statistics

Python

DataScience

Annotators

pyxelr

URL

towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598
p.migdal.pl p.migdal.pl

How I Learned to Stop Worrying and Love the Types & Tests

1
1. pyxelr 29 Apr 2020
  
  in Public
  
  Creating meticulous tests before exploring the data is a big mistake, and will result in a well-crafted garbage-in, garbage-out pipeline. We need an environment flexible enough to encourage experiments, especially in the initial place.
  
  The overzealous nature of TDD may discourage exploratory data science
  
  DataScience programming testing
Visit annotations in context

Tags

testing

DataScience

programming

Annotators

pyxelr

URL

p.migdal.pl/2020/03/02/types-tests-typescript.html
towardsdatascience.com towardsdatascience.com

Why we do machine learning engineering with YAML, not notebooks

2
1. pyxelr 29 Apr 2020
  
  in Public
  
  The priorities in building a production machine learning pipeline—the series of steps that take you from raw data to product—are not fundamentally different from those of general software engineering.
  
  Your pipeline should be reproducible
  
  Collaborating on your pipeline should be easy
  
  All code in your pipeline should be testable
  
  DataScience programming
2. pyxelr 29 Apr 2020
  
  in Public
  
  A notebook, at a very basic level, is just a bunch of JSON that references blocks of code and the order in which they should be executed. But notebooks prioritize presentation and interactivity at the expense of reproducibility. YAML is the other side of that coin, ignoring presentation in favor of simplicity and reproducibility, making it much better for production.
  
  Summary of the article:
  
  Notebook = presentation + interactivity
  
  YAML = simplicity + reproducibility
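
  To make the "bunch of JSON" remark concrete, the sketch below parses a tiny hand-written notebook and lists its code cells in order. The notebook content is invented for the example and only includes the fields needed here; the `code_cells` helper is likewise hypothetical.

```python
import json

# A minimal, hand-written notebook document (nbformat 4 layout).
notebook_json = """
{
  "nbformat": 4,
  "nbformat_minor": 4,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
    {"cell_type": "code", "metadata": {}, "execution_count": 1, "outputs": [],
     "source": ["x = 2\\n", "y = x * 21"]},
    {"cell_type": "code", "metadata": {}, "execution_count": 2, "outputs": [],
     "source": ["print(y)"]}
  ]
}
"""


def code_cells(notebook):
    """Return the source of every code cell, in document order."""
    return [
        "".join(cell["source"])
        for cell in notebook["cells"]
        if cell["cell_type"] == "code"
    ]


cells = code_cells(json.loads(notebook_json))
for number, source in enumerate(cells, start=1):
    print(f"--- code cell {number} ---")
    print(source)
```

  Outputs, execution counts and presentation metadata live in the same file as the code, which is part of why notebooks are harder to diff and reproduce than a plain configuration file.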
  
  Jupyter YAML programming DataScience
Visit annotations in context

Tags

programming

YAML

Jupyter

DataScience

Annotators

pyxelr

URL

towardsdatascience.com/why-we-do-machine-learning-engineering-with-yaml-not-notebooks-a2a97f5e04f8
stats.stackexchange.com stats.stackexchange.com

Difference between replication and repeated measurements

1
1. pyxelr 27 Apr 2020
  
  in Public
  
  Repeated measures involves measuring the same cases multiple times. So, if you measured the chips, then did something to them, then measured them again, etc it would be repeated measures. Replication involves running the same study on different subjects but identical conditions. So, if you did the study on n chips, then did it again on another n chips that would be replication.
  
  Difference between repeated measures and replication
  
  DataScience statistics
Visit annotations in context

Tags

statistics

DataScience

Annotators

pyxelr

URL

stats.stackexchange.com/questions/62312/difference-between-replication-and-repeated-measurements
github.com github.com

firmai/deltapy

2
1. pyxelr 17 Apr 2020
  
  in Public
  
  To take full advantage of tabular augmentation for time-series you would perform the techniques in the following order: (1) transforming, (2) interacting, (3) mapping, (4) extracting, and (5) synthesising
  
  DataScience TimeSeries
2. pyxelr 17 Apr 2020
  
  in Public
  
  process of modular feature engineering and observation engineering while emphasising the order of augmentation to achieve the best predicted outcome from a given information set
  
  Tabular Data Augmentation
  
  DataScience
Visit annotations in context

Tags

DataScience

TimeSeries

Annotators

pyxelr

URL

github.com/firmai/deltapy
news.ycombinator.com news.ycombinator.com

Falcon is a free, open-source SQL editor with inline data visualization | Hacker News

1
1. pyxelr 17 Apr 2020
  
  in Public
  
  1) Redash and Falcon focus on people that want to do visualizations on top of SQL 2) Superset, Tableau and PowerBI focus on people that want to do visualizations with a UI 3) Metabase and SeekTable focus on people that want to do quick analysis (they are the closest to an Excel replacement)
  
  Comparison of data analysis tools:
  
  1) Redash & Falcon - SQL focus
  
  2) Superset, Tableau & PowerBI - UI workflow
  
  3) Metabase & SeekTable - Excel-like experience
  
  DataScience databases SQL
Visit annotations in context

Tags

databases

DataScience

SQL

Annotators

pyxelr

URL

news.ycombinator.com/item
blog.jupyter.org blog.jupyter.org

A visual debugger for Jupyter

1
1. pyxelr 17 Apr 2020
  
  in Public
  
  JupyterLab project, which enables a richer UI including a file browser, text editors, consoles, notebooks, and a rich layout system.
  
  How JupyterLab differs from a traditional notebook
  
  Python DataScience Jupyter
Visit annotations in context

Tags

Jupyter

Python

DataScience

Annotators

pyxelr

URL

blog.jupyter.org/a-visual-debugger-for-jupyter-914e61716559
code.visualstudio.com code.visualstudio.com

Python and Data Science Tutorial in Visual Studio Code

1
1. pyxelr 10 Apr 2020
  
  in Public
  
  Open an Anaconda command prompt and run conda create -n myenv python=3.7 pandas jupyter seaborn scikit-learn keras tensorflow
  
  Command to quickly create a new Anaconda environment: conda create -n myenv python=3.7 pandas jupyter seaborn scikit-learn keras tensorflow
  
  DataScience Python
Visit annotations in context

Tags

Python

DataScience

Annotators

pyxelr

URL

code.visualstudio.com/docs/python/data-science-tutorial
dfrieds.com dfrieds.com

Data Science: Reality Doesn't Meet Expectations

13
1. pyxelr 10 Apr 2020
  
  in Public
  
  On the job, if you notice poor infrastructure, speak up to your manager early on. Clearly document the problem, and try to incorporate a data engineering, infrastructure, or devops team to help resolve the issue! I'd also encourage you to learn these skills too!
  
  Recommendation to Data Scientists dealing with poor infrastructure
  
  DataScience
2. pyxelr 10 Apr 2020
  
  in Public
  
  When I worked at Target HQ in 2012, employees would arrive to work early - often around 7am - in order to query the database at a time when few others were doing so. They hoped they’d get database results quicker. Yet, they’d still often wait several hours just to get results.
  
  What poor data infrastructure leads to
  
  DataScience
3. pyxelr 10 Apr 2020
  
  in Public
  
  The Data Scientist, in many cases, should be called the Data Janitor. ¯_(ツ)_/¯
  
  :D
  
  DataScience
4. pyxelr 10 Apr 2020
  
  in Public
  
  In regards to quality of data on the job, I’d often compare it to a garbage bag that ripped, had its content spewed all over the ground and your partner has asked you to find a beautiful earring that was accidentally inside.
  
  From my experience, I can only agree
  
  DataScience
5. pyxelr 10 Apr 2020
  
  in Public
  
  On the job, I’d recommend you document your work well and calculate the monetary value of your analyses based on factors like employee salary, capital investments, opportunity cost, etc. These analyses will come in handy for a promotion/review packet later too.
  
  Factors to keep track of as a Data Scientist
  
  DataScience
6. pyxelr 10 Apr 2020
  
  in Public
  
  you as a Data Scientist should try to find a situation to be incredibly valuable on the job! It’s tough to find that from the outskirts of applying to jobs, but internally, you can make inroads supporting stakeholders with evidence for their decisions!
  
  Try showcasing the evidence of why the employer needs you
  
  DataScience
7. pyxelr 10 Apr 2020
  
  in Public
  
  As a Data Scientist in the org, are you essential to the business? Probably not. The business could go on for a while and survive without you. Sales will still be made, features will still get built, customer support will handle customer concerns, etc.
  
  As a Data Scientist, you are more of a "support" to the overall team
  
  DataScience
8. pyxelr 10 Apr 2020
  
  in Public
  
  As the resident Data Scientist, you may become easily inundated with requests from multiple teams at once. Be prepared to ask these teams to qualify and defend their requests, and be prepared to say “no” if their needs fall outside the scope of your actual priority queue. I’d recommend utilizing the RICE prioritization technique for projects.
  
  As the only data person, be sure to prioritise the requests
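
  RICE scores a request as Reach x Impact x Confidence / Effort, so requests can be ranked with very little code. The sketch below is illustrative only: the request names and all the numbers are made up, and the units in the comments are one common convention rather than anything prescribed by the article.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    reach: float       # people affected per period
    impact: float      # relative impact per person (higher is better)
    confidence: float  # 0.0 to 1.0
    effort: float      # e.g. person-months

    @property
    def rice(self):
        """RICE score: reach * impact * confidence / effort."""
        return self.reach * self.impact * self.confidence / self.effort


# Hypothetical incoming requests.
requests = [
    Request("dashboard refresh", reach=500, impact=1.0, confidence=0.8, effort=2.0),
    Request("churn model", reach=2000, impact=2.0, confidence=0.5, effort=4.0),
    Request("one-off export", reach=20, impact=0.5, confidence=1.0, effort=0.5),
]

ranked = sorted(requests, key=lambda request: request.rice, reverse=True)
for request in ranked:
    print(f"{request.name}: {request.rice:.1f}")
```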
  
  DataScience
9. pyxelr 10 Apr 2020
  
  in Public
  
  Common questions are: How many users click this button; what % of users that visit a screen click this button; how many users have signed up by region or account type? However, the data needed to answer those questions may not exist! If the data does exists, it’s likely “dirty” - undocumented, tough to find or could be factually inaccurate. It’ll be tough to work with! You could spend hours or days attempting to answer a single question only to discover that you can’t sufficiently answer it for a stakeholder. In machine learning, you may be asked to optimize some process or experience for consumers. However, there’s uncertainty with how much, if at all, the experience can be improved!
  
  Common types of problems you might be working with in the Data Science / Machine Learning industry
  
  DataScience MachineLearning
10. pyxelr 10 Apr 2020
  
  in Public
  
  In one experience, a fellow researcher spent over a month researching a particular value among our customers through qualitative and quantitative data. She presented a well-written and evidence-backed report. Yet, a few days later, a key head of product outlined a vision for the team and supported it with a claim that was antithetical to the researcher’s findings! Even if a data science project you advocate for is greenlighted, you may be on your own as the rare knowledgeable person to plan and execute it. It’s unlikely leadership will be hands-on to help you research and plan out the project.
  
  Data science leadership is sorely lacking
  
  DataScience
11. pyxelr 10 Apr 2020
  
  in Public
  
  Because people don’t know what data science does, you may have to support yourself with work in devops, software engineering, data engineering, etc.
  
  You might have multiple roles due to lack of clear "data science" terminology
  
  DataScience
12. pyxelr 10 Apr 2020
  
  in Public
  
  In 50+ interviews for data related jobs, I’ve been asked about AB testing, SQL analytics questions, optimizing SQL queries, how to code a game in Python, Logistic Regression, Gradient Boosted Trees, data structures and/or algorithms programming problems!
  
  Data science interviews lack clarity of scope
  
  DataScience
13. pyxelr 10 Apr 2020
  
  in Public
  
  seven most common (and at times flagrant) ways that data science has failed to meet expectations in industry
  
  People don’t know what “data science” does.
  
  Data science leadership is sorely lacking.
  
  Data science can’t always be built to specs.
  
  You’re likely the only “data person.”
  
  Your impact is tough to measure — data doesn’t always translate to value.
  
  Data & infrastructure have serious quality problems.
  
  Data work can be profoundly unethical. Moral courage required.
  
  DataScience
Visit annotations in context

Tags

MachineLearning

DataScience

Annotators

pyxelr

URL

dfrieds.com/articles/data-science-reality-vs-expectations.html
datascience.stackexchange.com datascience.stackexchange.com

How can I provide an answer to Neural Network skeptics?

2
1. pyxelr 09 Apr 2020
  
  in Public
  
  In data science community the performance of the model on the test dataset is one of the most important things people look at. Just look at the competitions on kaggle.com. They are extremely focused on test dataset and the performance of these models is really good.
  
  In data science, the performance of the model on the test dataset is the most important metric for the majority.
  
  It's not always the best measurement, since the best-scoring model can perform poorly when it receives a different type of dataset
  
  DataScience MachineLearning
2. pyxelr 09 Apr 2020
  
  in Public
  
  It's basically a look up table, interpolating between known data points. Except, unlike other interpolants like 'linear', 'nearest neighbour' or 'cubic', the underlying functional form is determined to best represent the kind of data you have.
  
  You can describe AI/ML methods as a lookup table that adjusts to your data points, unlike other interpolants (linear, nearest neighbour or cubic)
  
  DataScience MachineLearning
Visit annotations in context

Tags

MachineLearning

DataScience

Annotators

pyxelr

URL

datascience.stackexchange.com/questions/68893/how-can-i-provide-an-answer-to-neural-network-skeptics
Mar 2020
intellspot.com intellspot.com

10 Interval Data Examples: Interval Scale Definition & Meaning

2
1. pyxelr 15 Mar 2020
  
  in Public
  
  In the interval scale, there is no true zero point or fixed beginning. They do not have a true zero even if one of the values carry the name “zero.” For example, in the temperature, there is no point where the temperature can be zero. Zero degrees F does not mean the complete absence of temperature. Since the interval scale has no true zero point, you cannot calculate Ratios. For example, there is no any sense the ratio of 90 to 30 degrees F to be the same as the ratio of 60 to 20 degrees. A temperature of 20 degrees is not twice as warm as one of 10 degrees.
  
  Interval data:
  
  show not only order and direction, but also the exact differences between the values
  
  the distances between each value on the interval scale are meaningful and equal
  
  no true zero point
  
  no fixed beginning
  
  no possibility of calculating ratios (only adding and subtracting)
  
  e.g.: temperature in Fahrenheit or Celsius (but not Kelvin) or IQ test
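
The points above can be checked with a few lines of arithmetic. This is only a sketch with two example temperatures: differences convert consistently between Celsius and Fahrenheit, while the "ratio" changes with the scale and only becomes meaningful on Kelvin, which has a true zero.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale: a 10 °C gap is always an 18 °F gap.
difference_f = celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cold_c)

# Ratios are not: the result depends on where the scale happens to put its zero.
ratio_c = warm_c / cold_c
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(difference_f, ratio_c, round(ratio_f, 3), round(ratio_k, 3))
```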
  
  DataScience statistics
2. pyxelr 15 Mar 2020
  
  in Public
  
  As the interval scales, Ratio scales show us the order and the exact value between the units. However, in contrast with interval scales, Ratio ones have an absolute zero that allows us to perform a huge range of descriptive statistics and inferential statistics. The ratio scales possess a clear definition of zero. Any types of values that can be measured from absolute zero can be measured with a ratio scale. The most popular examples of ratio variables are height and weight. In addition, one individual can be twice as tall as another individual.
  
  Ratio data is like interval data, but with:
  
  absolute zero
  
  possibility to calculate ratio (e.g. someone can be twice as tall)
  
  possibility to not only add and subtract, but multiply and divide values
  
  e.g.: weight, height, Kelvin scale (50 K is twice as hot as 25 K)
  
  DataScience statistics
Visit annotations in context

Tags

statistics

DataScience

Annotators

pyxelr

URL

intellspot.com/interval-data-examples/
towardsdatascience.com towardsdatascience.com

How to analyse 100s of GBs of data on your laptop with Python

5
1. pyxelr 02 Mar 2020
  
  in Public
  
  If you are interested in exploring the dataset used in this article, it can be used straight from S3 with Vaex. See the full Jupyter notebook to find out how to do this.
  
  Example of EDA in Vaex ---> Jupyter Notebook
  
  Vaex DataScience
2. pyxelr 02 Mar 2020
  
  in Public
  
  Vaex is an open-source DataFrame library which enables the visualisation, exploration, analysis and even machine learning on tabular datasets that are as large as your hard-drive. To do this, Vaex employs concepts such as memory mapping, efficient out-of-core algorithms and lazy evaluations.
  
  Vaex - a library for handling datasets as large as your hard drive, thanks to:
  
  memory mapping
  
  efficient out-of-core algorithms
  
  lazy evaluations.
  
  All wrapped in a Pandas-like API
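
As a rough illustration of the memory-mapping idea only, the sketch below uses Python's standard `mmap` module, not Vaex itself; the file layout and variable names are assumptions. The file is mapped rather than read into memory, and only the pages that are actually touched get loaded.

```python
import mmap
import os
import tempfile
from array import array

# Write one million float64 values to a temporary binary file (a stand-in for a large dataset).
count = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as handle:
    array("d", range(count)).tofile(handle)

with open(path, "rb") as handle:
    # Map the file into the address space; no data is copied at this point.
    with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        raw = memoryview(mapped)
        values = raw.cast("d")  # view the raw bytes as float64 without copying
        # Touch every 1000th value; the OS pages in only what is needed.
        total = sum(values[i] for i in range(0, len(values), 1000))
        last = values[-1]
        # Release the views before the mapping is closed.
        values.release()
        raw.release()

os.remove(path)
print(total, last)
```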
  
  Python DataScience Vaex
3. pyxelr 02 Mar 2020
  
  in Public
  
  The first step is to convert the data into a memory mappable file format, such as Apache Arrow, Apache Parquet, or HDF5
  
  Before opening data with Vaex, we need to convert it into a memory-mappable file format (e.g. Apache Arrow, Apache Parquet or HDF5). This way, 100 GB of data can be opened in Vaex in 0.052 seconds!
  
  Example of converting CSV ---> HDF5.
  
  DataScience Vaex
4. pyxelr 02 Mar 2020
  
  in Public
  
  The describe method nicely illustrates the power and efficiency of Vaex: all of these statistics were computed in under 3 minutes on my MacBook Pro (15", 2018, 2.6GHz Intel Core i7, 32GB RAM). Other libraries or methods would require either distributed computing or a cloud instance with over 100GB to preform the same computations.
  
  Possibilities of Vaex
  
  DataScience Vaex
5. pyxelr 02 Mar 2020
  
  in Public
  
  AWS offers instances with Terabytes of RAM. In this case you still have to manage cloud data buckets, wait for data transfer from bucket to instance every time the instance starts, handle compliance issues that come with putting data on the cloud, and deal with all the inconvenience that come with working on a remote machine. Not to mention the costs, which although start low, tend to pile up as time goes on.
  
  AWS as a solution for analysing data too big for RAM (e.g. in the 30-50 GB range). It is still inconvenient because of:
  
  managing cloud data buckets
  
  waiting for data transfer from bucket to instance every time the instance starts
  
  handling compliance issues that come with putting data in the cloud
  
  dealing with remote machines
  
  costs
  
  aws DataScience
Visit annotations in context

Tags

Vaex

aws

Python

DataScience

Annotators

pyxelr

URL

towardsdatascience.com/how-to-analyse-100s-of-gbs-of-data-on-your-laptop-with-python-f83363dda94
towardsdatascience.com towardsdatascience.com

Understanding AUC - ROC Curve – Towards Data Science

5
1. pyxelr 02 Mar 2020
  
  in Public
  
  when AUC is 0.5, it means model has no class separation capacity whatsoever.
  
  If AUC = 0.5
  
  DataScience statistics
2. pyxelr 02 Mar 2020
  
  in Public
  
  ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes.
  
  ROC & AUC
  
  DataScience statistics
3. pyxelr 02 Mar 2020
  
  in Public
  
  In multi-class model, we can plot N number of AUC ROC Curves for N number classes using One vs ALL methodology. So for Example, If you have three classes named X, Y and Z, you will have one ROC for X classified against Y and Z, another ROC for Y classified against X and Z, and a third one of Z classified against Y and X.
  
  Using AUC ROC curve for multi-class model
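
A small sketch of the one-vs-all relabelling described in the quote, with made-up labels (the helper name is an assumption): each class gets its own binary problem, and one ROC curve would then be drawn per problem.

```python
labels = ["X", "Y", "Z", "X", "Z", "Y"]  # hypothetical true classes


def one_vs_rest(labels):
    """Map each class to a binary label vector: 1 for that class, 0 for all others."""
    return {
        positive: [1 if label == positive else 0 for label in labels]
        for positive in sorted(set(labels))
    }


binary_problems = one_vs_rest(labels)
for positive, targets in binary_problems.items():
    print(f"{positive} vs rest: {targets}")
```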
  
  DataScience statistics
4. pyxelr 02 Mar 2020
  
  in Public
  
  When AUC is approximately 0, model is actually reciprocating the classes. It means, model is predicting negative class as a positive class and vice versa
  
  If AUC = 0
  
  DataScience statistics
5. pyxelr 02 Mar 2020
  
  in Public
  
  AUC near to the 1 which means it has good measure of separability.
  
  If AUC = 1
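
The three cases highlighted in these annotations (AUC of 1, 0.5 and 0) can be reproduced with a short sketch. It relies on the pairwise reading of AUC, the share of positive/negative pairs in which the positive example gets the higher score; the labels and scores are made up for illustration.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
perfect = auc(labels, [0.1, 0.2, 0.8, 0.9])    # classes fully separated
uninformative = auc(labels, [0.5, 0.5, 0.5, 0.5])  # no separation at all
inverted = auc(labels, [0.9, 0.8, 0.2, 0.1])   # classes swapped
print(perfect, uninformative, inverted)
```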
  
  DataScience statistics
Visit annotations in context

Tags

statistics

DataScience

Annotators

pyxelr

URL

towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
www.linkedin.com www.linkedin.com

LinkedIn

1
1. pyxelr 02 Mar 2020
  
  in Public
  
  LR is nothing but the binomial regression with logit link (or probit), one of the numerous GLM cases. As a regression - itself it doesn't classify anything, it models the conditional (to linear predictor) expected value of the Bernoulli/binomially distributed DV.
  
  Logistic Regression - the ultimate definition (it's not a classification algorithm!)
  
  It is used for classification only once we specify a probability threshold (e.g. 50%).
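
A minimal sketch of that distinction, with arbitrary example coefficients (the intercept, slope and function names are assumptions): the regression part outputs a probability, and a class label only appears once a threshold is applied on top of it.

```python
import math


def predict_probability(x, intercept=-4.0, slope=2.0):
    """Logistic regression output: the modelled P(y = 1 | x) via the inverse logit link."""
    linear_predictor = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def classify(x, threshold=0.5):
    """Classification is a separate decision layered on top of the probability."""
    return int(predict_probability(x) >= threshold)


for x in (1.0, 2.0, 3.0):
    print(x, round(predict_probability(x), 3), classify(x), classify(x, threshold=0.9))
```

Changing the threshold changes the labels while the fitted probabilities stay exactly the same.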
  
  DataScience
Visit annotations in context

Tags

DataScience

Annotators

pyxelr

URL

linkedin.com/feed/update/urn:li:activity:6601827374533353472/
victorzhou.com victorzhou.com

A Simple Explanation of the Bag-of-Words Model - victorzhou.com

3
1. pyxelr 02 Mar 2020
  
  in Public
  
  BOW is often used for Natural Language Processing (NLP) tasks like Text Classification. Its strengths lie in its simplicity: it’s inexpensive to compute, and sometimes simpler is better when positioning or contextual info aren’t relevant
  
  Usefulness of BOW:
  
  simplicity
  
  low on computing requirements
  
  DataScience
2. pyxelr 02 Mar 2020
  
  in Public
  
  Notice that we lose contextual information, e.g. where in the document the word appeared, when we use BOW. It’s like a literal bag-of-words: it only tells you what words occur in the document, not where they occurred
  
  The analogy behind the term "bag" in the bag-of-words (BOW) model.
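
A short sketch of the "bag" idea using only the standard library (this is not the Keras approach mentioned in the article, and the sentences are made up): two texts with the same words in a different order produce identical bags, so the positional information is gone.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring case and word order."""
    return Counter(text.lower().split())


first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")
print(first)
print(first == second)
```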
  
  DataScience
3. pyxelr 02 Mar 2020
  
  in Public
  
  Here’s my preferred way of doing it, which uses Keras’s Tokenizer class
  
  Keras's Tokenizer Class - Victor's preferred way of implementing BOW in Python
  
  DataScience Python keras
Visit annotations in context

Tags

Python

DataScience

keras

Annotators

pyxelr

URL

victorzhou.com/blog/bag-of-words/
martinfowler.com martinfowler.com

Continuous Delivery for Machine Learning

1
1. pyxelr 02 Mar 2020
  
  in Public
  
  process that takes input data through a series of transformation stages, producing data as output
  
  Data pipeline
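
One way to picture this definition is a chain of functions where each stage consumes the previous stage's output. The stages and data below are hypothetical and only meant to illustrate the shape of the idea.

```python
from functools import reduce


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda current, stage: stage(current), stages, data)


# Hypothetical transformation stages for illustration.
def drop_missing(rows):
    return [row for row in rows if row is not None]


def to_float(rows):
    return [float(row) for row in rows]


def scale_to_unit_range(rows):
    low, high = min(rows), max(rows)
    return [(row - low) / (high - low) for row in rows]


raw = ["3", None, "5", "7"]
output = run_pipeline(raw, [drop_missing, to_float, scale_to_unit_range])
print(output)
```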
  
  DataScience
Visit annotations in context

Tags

DataScience

Annotators

pyxelr

URL

martinfowler.com/articles/cd4ml.html
victorzhou.com victorzhou.com

A Simple Explanation of the Softmax Function - victorzhou.com

1
1. pyxelr 02 Mar 2020
  
  in Public
  
  Softmax turns arbitrary real values into probabilities
  
  Softmax function -
  
  outputs of the function are in range [0,1] and add up to 1. Hence, they form a probability distribution
  
  the calculation involves e (the mathematical constant) and operates on n numbers: $$s(x_i) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$$
  
  the bigger the value, the higher its probability
  
  lets us answer classification questions with probabilities, which are more useful than simpler answers (e.g. binary yes/no)
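
The properties listed above can be verified with a minimal sketch of the formula. The input values are arbitrary, and subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the annotation.

```python
import math


def softmax(values):
    """Turn arbitrary real values into probabilities that sum to 1."""
    shift = max(values)  # subtracting the maximum avoids overflow and leaves the result unchanged
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([-1.0, 0.0, 3.0, 5.0])
print([round(p, 4) for p in probabilities])
print(sum(probabilities))
```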
  
  DataScience statistics
Visit annotations in context

Tags

statistics

DataScience

Annotators

pyxelr

URL

victorzhou.com/blog/softmax/
www.linkedin.com www.linkedin.com

LinkedIn

1
1. pyxelr 02 Mar 2020
  
  in Public
  
  1. Logistic regression IS a binomial regression (with logit link), a special case of the Generalized Linear Model. It doesn't classify anything *unless a threshold for the probability is set*. Classification is just its application. 2. Stepwise regression is by no means a regression. It's a (flawed) method of variable selection. 3. OLS is a method of estimation (among others: GLS, TLS, (RE)ML, PQL, etc.), NOT a regression. 4. Ridge, LASSO - it's a method of regularization, NOT a regression. 5. There are tens of models for the regression analysis. You mention mainly linear and logistic - it's just the GLM! Learn the others too (link in a comment). STOP with the "17 types of regression every DS should know". BTW, there're 270+ statistical tests. Not just t, chi2 & Wilcoxon
  
  5 clarifications to common misconceptions shared over data science cheatsheets on LinkedIn
  
  DataScience statistics
Visit annotations in context

Tags

statistics

DataScience

Annotators

pyxelr

URL

linkedin.com/posts/adrianolszewski_biostatistics-statistics-rstats-activity-6621020198185111552-jw12/
news.ycombinator.com news.ycombinator.com

I Got More Data, My Model Is More Refined, but My Estimator Is Getting Worse [pdf] | Hacker News

1
1. pyxelr 02 Mar 2020
  
  in Public
  
  400 sized probability sample (a small random sample from the whole population) is often better than a millions sized administrative sample (of the kind you can download from gov sites). The reason is that an arbitrary sample (as opposed to a random one) is very likely to be biased, and, if large enough, a confidence interval (which actually doesn't really make sense except for probability samples) will be so narrow that, because of the bias, it will actually rarely, if ever, include the true value we are trying to estimate. On the other hand, the small, random sample will be very likely to include the true value in its (wider) confidence interval
  
  Summary of Lecture 01 (Data Science Lifecycle, Study Design) - Data 100 Su19
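
The quoted argument can be sketched with a constructed example. The population and the way the large sample is biased are assumptions chosen to make the effect visible; real administrative data would be biased in less obvious ways.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: one value per person, true mean is 499999.5.
population = range(1_000_000)
true_mean = sum(population) / len(population)

# Small probability sample: 400 units drawn uniformly at random.
probability_sample = random.sample(population, 400)

# Huge "administrative" sample: 800,000 units, but it systematically misses the lowest 20%.
administrative_sample = range(200_000, 1_000_000)


def mean(values):
    return sum(values) / len(values)


probability_error = abs(mean(probability_sample) - true_mean)
administrative_error = abs(mean(administrative_sample) - true_mean)
print(round(probability_error), round(administrative_error))
```

The large sample's error is pure bias, so collecting even more of the same kind of data would not shrink it, whereas the random sample's error shrinks as its size grows.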
  
  DataScience
Visit annotations in context

Tags

DataScience

Annotators

pyxelr

URL

news.ycombinator.com/item
www.linkedin.com www.linkedin.com

LinkedIn

1
1. pyxelr 02 Mar 2020
  
  in Public
  
  An exploratory plot is all about you getting to know the data. An explanatory graphic, on the other hand, is about telling a story using that data to a specific audience.
  
  Exploratory vs Explanatory plot
  
  DataScience statistics visualisation
Visit annotations in context

Tags

statistics

visualisation

DataScience

Annotators

pyxelr

URL

linkedin.com/feed/update/urn:li:activity:6632255681946767360/
Dec 2018
www.tandfonline.com www.tandfonline.com

Data Science: A Three Ring Circus or a Big Tent?

2
1. AlastairKelly 06 Dec 2018
  
  in Public
  
  ggplot2 and dplyr are clear intellectual contributions because they provide tools (grammars) for visualization and data manipulation, respectively. The tools make the tasks radically easier by providing clear organizing principles coupled with effective code. Users often remark on the ease of manipulating data with dplyr and it is natural to wonder if perhaps the task itself is trivial. We claim it is not. Many probability challenges become dramatically easier, once you strike upon the “right” notation. In both cases, what feels like a matter of notation or syntax is really a more about exploiting the “right” abstraction.
  
  This is a fascinating example to me of both an important intellectual contribution that is embodied in the code itself, and also how extremely important work can be invisible or taken for granted because of how seamlessly or easily it becomes integrated into other research.
  
  DataScience CodeIsScience
2. AlastairKelly 06 Dec 2018
  
  in Public
  
  This article is one of many commentaries on the Donoho "50 Years of Data Science" paper.
  
  DataScience
Visit annotations in context

Tags

DataScience

DataScience CodeIsScience

Annotators

AlastairKelly

URL

tandfonline.com/doi/full/10.1080/10618600.2017.1389743
