- Mar 2020
Vaex supports Just-In-Time compilation via Numba (using LLVM) or Pythran (acceleration via C++), giving better performance. If you happen to have an NVIDIA graphics card, you can use CUDA via the jit_cuda method to get even faster performance.
Tools supported by Vaex
Virtual columns: these columns just hold the mathematical expressions and are evaluated only when required
displaying a Vaex DataFrame or column requires only the first and last 5 rows to be read from disk
Vaex tries to go over the entire dataset with as few passes as possible
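The two ideas above (expressions stored without being computed, and as few passes over the data as possible) can be sketched in plain Python. This is only an illustration of the concept, not Vaex's implementation or API: the VirtualColumn class, the sample rows and the statistics chosen are all hypothetical.

```python
import math


class VirtualColumn:
    """Hypothetical stand-in for a lazy column: it stores only an expression."""

    def __init__(self, expression):
        self.expression = expression  # callable that takes one row (a dict)
        self.evaluations = 0          # counts how often the expression actually ran

    def evaluate(self, row):
        self.evaluations += 1
        return self.expression(row)


# Made-up sample data for the sketch.
rows = [{"x": 3.0, "y": 4.0}, {"x": 6.0, "y": 8.0}, {"x": 5.0, "y": 12.0}]

# Defining the column only records the expression; nothing is computed yet.
r = VirtualColumn(lambda row: math.sqrt(row["x"] ** 2 + row["y"] ** 2))
evaluations_after_definition = r.evaluations

# A single pass over the rows produces several statistics at once,
# evaluating the expression on the fly instead of materialising a new column.
count, total, lo, hi = 0, 0.0, math.inf, -math.inf
for row in rows:
    value = r.evaluate(row)
    count += 1
    total += value
    lo = min(lo, value)
    hi = max(hi, value)
mean = total / count
```

The expression costs no memory until a statistic asks for it, and count, mean, minimum and maximum all come out of the same loop rather than four separate scans.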
Why is it so fast? When you open a memory-mapped file with Vaex, no data is actually read.
Vaex only reads the file metadata:
- location of the data on disk
- data structure (number of rows, columns...)
- file description
- and so on...
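A rough sketch of why opening a memory-mapped file is cheap, using only Python's standard mmap and struct modules. The file layout here (an 8-byte tag plus a row count, followed by float64 values) is invented for the example and is not the HDF5, Arrow or Parquet format that Vaex actually reads.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<8sQ")  # hypothetical metadata: magic tag + number of rows
ROW = struct.Struct("<d")       # one float64 value per row


def write_demo_file(path, n_rows):
    """Write the made-up file format: header first, then the row values."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(b"DEMOFILE", n_rows))
        for i in range(n_rows):
            f.write(ROW.pack(float(i)))


def read_row(mm, index):
    """Decode a single row straight from the mapping, touching only its bytes."""
    offset = HEADER.size + index * ROW.size
    return ROW.unpack_from(mm, offset)[0]


path = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_demo_file(path, 1000)

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    # "Opening" the dataset: only the small header is decoded.
    magic, n_rows = HEADER.unpack_from(mm, 0)

    # Displaying it: only the first and last 5 rows are decoded.
    head = [read_row(mm, i) for i in range(5)]
    tail = [read_row(mm, i) for i in range(n_rows - 5, n_rows)]
```

Mapping the file does not load its contents; the operating system pages in only the regions that are actually accessed, so the cost of opening and previewing stays small however large the file is.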
When filtering a Vaex DataFrame no copies of the data are made. Instead only a reference to the original object is created, on which a binary mask is applied
Filtering a Vaex DataFrame works on a reference to the original data, saving lots of RAM
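The reference-plus-mask idea can be illustrated with a small plain-Python sketch. FilteredView and the fares list are hypothetical names made up for this note; Vaex's real filtering machinery is more involved.

```python
class FilteredView:
    """Hypothetical filtered view: a reference to the original data plus a mask."""

    def __init__(self, data, mask):
        self.data = data  # same object as the original column, not a copy
        self.mask = mask  # one byte per row: 1 = keep, 0 = drop

    def __iter__(self):
        # Rows are selected on the fly while iterating.
        return (value for value, keep in zip(self.data, self.mask) if keep)

    def __len__(self):
        return sum(self.mask)


# Made-up sample column.
fares = [4.5, 120.0, 9.0, 310.0, 15.5]

# The filter "fare < 100" is stored as a compact binary mask.
mask = bytearray(value < 100 for value in fares)
view = FilteredView(fares, mask)

kept = list(view)
```

The only extra memory is the mask itself (one byte per row in this sketch), while the values are never duplicated.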
If you are interested in exploring the dataset used in this article, it can be used straight from S3 with Vaex. See the full Jupyter notebook to find out how to do this.
Example of EDA in Vaex ---> Jupyter Notebook
Vaex is an open-source DataFrame library which enables the visualisation, exploration, analysis and even machine learning on tabular datasets that are as large as your hard-drive. To do this, Vaex employs concepts such as memory mapping, efficient out-of-core algorithms and lazy evaluations.
Vaex - a library for managing datasets as large as your HDD, thanks to:
- memory mapping
- efficient out-of-core algorithms
- lazy evaluations.
All wrapped in a Pandas-like API
The first step is to convert the data into a memory mappable file format, such as Apache Arrow, Apache Parquet, or HDF5
The describe method nicely illustrates the power and efficiency of Vaex: all of these statistics were computed in under 3 minutes on my MacBook Pro (15", 2018, 2.6GHz Intel Core i7, 32GB RAM). Other libraries or methods would require either distributed computing or a cloud instance with over 100GB to perform the same computations.
Possibilities of Vaex