Hypothesis

9 Matching Annotations

Mar 2020
towardsdatascience.com

How to analyse 100s of GBs of data on your laptop with Python

1. pyxelr 02 Mar 2020
  
  in Public
  
  Vaex supports Just-In-Time compilation via Numba (using LLVM) or Pythran (acceleration via C++), giving better performance. If you happen to have an NVIDIA graphics card, you can use CUDA via the jit_cuda method to get even faster performance.
  
  Tools supported by Vaex
  
  Vaex
2. pyxelr 02 Mar 2020
  
  in Public
  
  virtual columns. These columns just house the mathematical expressions, and are evaluated only when required
  
  Virtual columns
  
  Vaex
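
The idea of a virtual column can be sketched in plain Python. This is not Vaex's implementation: the `LazyFrame` class, its `add_virtual_column` and `value` methods, and the column names are all hypothetical, chosen only to show that defining the column stores an expression and that values are computed when a row is actually requested.

```python
from typing import Callable, Dict, List


class LazyFrame:
    """Toy table whose virtual columns store an expression, not data."""

    def __init__(self, columns: Dict[str, List[float]]) -> None:
        self.columns = columns      # real columns: values held in memory
        self.virtual: Dict[str, Callable[["LazyFrame", int], float]] = {}
        self.evaluations = 0        # how many virtual values were computed

    def add_virtual_column(self, name, expression):
        # Only the expression is stored; nothing is computed here.
        self.virtual[name] = expression

    def value(self, name: str, row: int) -> float:
        if name in self.virtual:
            self.evaluations += 1
            return self.virtual[name](self, row)
        return self.columns[name][row]


# Hypothetical sample data.
frame = LazyFrame({"distance_km": [1.0, 2.5, 4.0], "duration_h": [0.1, 0.2, 0.25]})
frame.add_virtual_column(
    "speed_kmh",
    lambda f, i: f.value("distance_km", i) / f.value("duration_h", i),
)

evaluations_after_definition = frame.evaluations  # nothing computed yet
speed_of_last_row = frame.value("speed_kmh", 2)   # computed on demand
```

Because the virtual column holds no values of its own, adding it costs almost no memory regardless of how many rows the table has.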
3. pyxelr 02 Mar 2020
  
  in Public
  
  displaying a Vaex DataFrame or column requires only the first and last 5 rows to be read from disk
  
  Vaex tries to go over the entire dataset with as few passes as possible
  
  Vaex
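
Why a preview only needs a few rows can be illustrated with the standard `mmap` module. This is a stand-in, not what Vaex does internally: the single-column float64 file layout and the `head_and_tail` helper are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

ROW = struct.Struct("<d")   # assumed layout: one little-endian float64 per row
N_ROWS = 100_000

# Build a sample single-column binary file (stand-in for a real dataset).
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    for i in range(N_ROWS):
        f.write(ROW.pack(float(i)))


def head_and_tail(path, n=5):
    """Return the first and last n rows, touching only those byte ranges."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        total = len(m) // ROW.size
        head = [ROW.unpack_from(m, i * ROW.size)[0] for i in range(n)]
        tail = [ROW.unpack_from(m, i * ROW.size)[0] for i in range(total - n, total)]
    return head, tail


head, tail = head_and_tail(path)
```

The rows in the middle of the file are never unpacked, so the cost of the preview does not grow with the number of rows.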
4. pyxelr 02 Mar 2020
  
  in Public
  
  Why is it so fast? When you open a memory mapped file with Vaex, there is actually no data reading going on. Vaex only reads the file metadata
  
  Vaex only reads the file metadata:
  
  location of the data on disk
  
  data structure (number of rows, columns...)
  
  file description
  
  and so on...
  
  Vaex
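
A rough sketch of "opening reads only the metadata", again using the standard library rather than Vaex. The header layout (magic bytes, row count, column count, data offset) and the `open_table` helper are hypothetical; real formats such as HDF5 or Arrow have their own, more elaborate metadata.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical header: 4-byte magic, row count, column count, data offset.
HEADER = struct.Struct("<4sQIQ")

path = os.path.join(tempfile.mkdtemp(), "table.bin")
n_rows, n_cols = 50_000, 2
with open(path, "wb") as f:
    f.write(HEADER.pack(b"TOY1", n_rows, n_cols, HEADER.size))
    f.write(bytes(8 * n_rows * n_cols))   # zero-filled float64 payload


def open_table(path):
    """Map the file and parse only the header; the payload is not read here."""
    f = open(path, "rb")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic, rows, cols, offset = HEADER.unpack_from(m, 0)
    return {"file": f, "map": m, "magic": magic,
            "rows": rows, "cols": cols, "data_offset": offset}


table = open_table(path)
payload_bytes = len(table["map"]) - table["data_offset"]
table["map"].close()
table["file"].close()
```

Opening costs the same whether the payload is a few kilobytes or hundreds of gigabytes, because only the fixed-size header is parsed.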
5. pyxelr 02 Mar 2020
  
  in Public
  
  When filtering a Vaex DataFrame no copies of the data are made. Instead only a reference to the original object is created, on which a binary mask is applied
  
  Filtering a Vaex DataFrame works on a reference to the original data, which saves a lot of RAM
  
  Vaex
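
The reference-plus-mask idea can be sketched as follows. This is a toy model, not Vaex code: `FilteredView`, `filter_column`, and the fare values are hypothetical.

```python
from array import array


class FilteredView:
    """Reference to the original column plus a mask; no values are copied."""

    def __init__(self, source, mask):
        self.source = source   # the same object as the original column
        self.mask = mask       # one byte per row: 1 = keep, 0 = skip

    def __iter__(self):
        return (v for v, keep in zip(self.source, self.mask) if keep)

    def __len__(self):
        return sum(self.mask)


def filter_column(source, predicate):
    # Only the mask is allocated; the column itself is shared.
    mask = bytearray(1 if predicate(v) else 0 for v in source)
    return FilteredView(source, mask)


fares = array("d", [4.5, 120.0, 9.75, 0.0, 33.2])   # hypothetical data
cheap = filter_column(fares, lambda fare: 0.0 < fare < 50.0)

shares_data = cheap.source is fares   # True: the view points at the original
kept = list(cheap)
```

The extra memory used by the filter is the size of the mask, not the size of the selected rows.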
6. pyxelr 02 Mar 2020
  
  in Public
  
  If you are interested in exploring the dataset used in this article, it can be used straight from S3 with Vaex. See the full Jupyter notebook to find out how to do this.
  
  Example of EDA in Vaex ---> Jupyter Notebook
  
  Vaex DataScience
7. pyxelr 02 Mar 2020
  
  in Public
  
  Vaex is an open-source DataFrame library which enables the visualisation, exploration, analysis and even machine learning on tabular datasets that are as large as your hard-drive. To do this, Vaex employs concepts such as memory mapping, efficient out-of-core algorithms and lazy evaluations.
  
  Vaex: a library for working with datasets as large as your hard drive, thanks to:
  
  memory mapping
  
  efficient out-of-core algorithms
  
  lazy evaluations.
  
  All wrapped in a Pandas-like API
  
  Python DataScience Vaex
8. pyxelr 02 Mar 2020
  
  in Public
  
  The first step is to convert the data into a memory mappable file format, such as Apache Arrow, Apache Parquet, or HDF5
  
  Before opening data with Vaex, we need to convert it into a memory mappable file format (e.g. Apache Arrow, Apache Parquet or HDF5). This way, 100 GB of data can be loaded in Vaex in 0.052 seconds!
  
  Example of converting CSV ---> HDF5.
  
  DataScience Vaex
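
Why the one-off conversion pays off can be shown with a deliberately simplified stand-in. The sketch below does not produce HDF5, Arrow, or Parquet and does not use Vaex; it streams one CSV column into raw float64 values (a hypothetical `.f64` file) and then maps that file directly. The `convert_column` helper, file names, and sample rows are assumptions.

```python
import csv
import mmap
import os
import struct
import tempfile
from array import array

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "trips.csv")
bin_path = os.path.join(workdir, "fare.f64")

# Sample CSV standing in for the real dataset.
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["trip_id", "fare"])
    writer.writerows([[1, 7.5], [2, 12.0], [3, 3.25]])


def convert_column(csv_path, column, bin_path, chunk_rows=2):
    """One-off conversion: stream a CSV column to raw float64 values on disk."""
    rows_written = 0
    with open(csv_path, newline="") as src, open(bin_path, "wb") as dst:
        chunk = array("d")
        for record in csv.DictReader(src):
            chunk.append(float(record[column]))
            if len(chunk) == chunk_rows:
                chunk.tofile(dst)
                rows_written += len(chunk)
                chunk = array("d")
        chunk.tofile(dst)              # flush the final partial chunk
        rows_written += len(chunk)
    return rows_written


n = convert_column(csv_path, "fare", bin_path)

# Later sessions map the binary file directly; no text parsing is needed.
with open(bin_path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    second_fare = struct.unpack_from("d", m, 8)[0]   # row 1 starts at byte 8
```

The text parsing happens once, in chunks; afterwards any row can be located by arithmetic on its byte offset.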
9. pyxelr 02 Mar 2020
  
  in Public
  
  The describe method nicely illustrates the power and efficiency of Vaex: all of these statistics were computed in under 3 minutes on my MacBook Pro (15", 2018, 2.6GHz Intel Core i7, 32GB RAM). Other libraries or methods would require either distributed computing or a cloud instance with over 100GB to perform the same computations.
  
  Possibilities of Vaex
  
  DataScience Vaex
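
How summary statistics can be computed without holding the data in RAM is sketched below. This is a generic single-pass, chunked approach (Welford's running variance), not necessarily the algorithm behind Vaex's describe method; `describe_stream`, `read_chunks`, the raw float64 file, and the sample values are assumptions.

```python
import math
import os
import tempfile
from array import array


def read_chunks(path, chunk_rows):
    """Yield fixed-size chunks of float64 values from a raw binary file."""
    with open(path, "rb") as f:
        while True:
            data = f.read(8 * chunk_rows)
            if not data:
                break
            chunk = array("d")
            chunk.frombytes(data)
            yield chunk


def describe_stream(chunks):
    """Single pass over the chunks; only running totals stay in memory."""
    count, mean, m2 = 0, 0.0, 0.0
    lo, hi = math.inf, -math.inf
    for chunk in chunks:
        for x in chunk:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)   # Welford's update for the variance
            lo = min(lo, x)
            hi = max(hi, x)
    # Population standard deviation (divides by count, not count - 1).
    std = math.sqrt(m2 / count) if count else math.nan
    return {"count": count, "mean": mean, "std": std, "min": lo, "max": hi}


# Hypothetical sample column written to disk, then read back 3 rows at a time.
path = os.path.join(tempfile.mkdtemp(), "values.f64")
with open(path, "wb") as f:
    array("d", [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]).tofile(f)

stats = describe_stream(read_chunks(path, chunk_rows=3))
```

Memory use depends on the chunk size rather than on the file size, which is what lets a laptop summarise a dataset larger than its RAM.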

Tags

Vaex

Python

DataScience

Annotators

pyxelr

URL

towardsdatascience.com/how-to-analyse-100s-of-gbs-of-data-on-your-laptop-with-python-f83363dda94
