Python is a popular programming language that is easy to learn and efficient, and that enjoys the support of a large and active community. It is a general-purpose language with libraries specialized for various areas, including web development, scripting, data science, and DevOps.
Its primary data analysis library, Pandas, has gained popularity among data scientists and data engineers. It follows Python’s principles, so it seems easy to learn and read, and it allows rapid development… at least based on the textbook examples. But what happens if we leave the safe and convenient world of those examples? Is Pandas still an easy-to-use tool for querying tabular data? How does it perform compared to other professional tools like R and kdb+?
In this article, I take an example that goes just one step beyond the simplest use cases: performing an aggregation based on multiple columns. On a five-level complexity scale, this use case sits around level 2. Anybody who analyzes data tables will bump into this problem, probably within their second week. For comparison, I also cover other popular data analysis tools.
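To make the task concrete, here is a minimal Pandas sketch of the kind of query I mean. The table and column names (`bucket`, `qty`, `price`) are illustrative placeholders, not the actual dataset benchmarked in the article:

```python
import pandas as pd

# A toy stand-in for a large table: one row per observation.
df = pd.DataFrame({
    "bucket": ["a", "a", "b", "b", "b"],
    "qty":    [10, 20, 30, 40, 50],
    "price":  [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Aggregation based on multiple columns: group by a key, then compute
# several named statistics over several value columns in one pass.
result = df.groupby("bucket").agg(
    total_qty=("qty", "sum"),
    avg_price=("price", "mean"),
    max_price=("price", "max"),
)
print(result)
```

Even this small step beyond a single-column `sum` already forces a choice between several `agg` syntaxes, which is roughly where the textbook examples stop.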
- First of all, the problem can be solved in ANSI SQL, so all traditional RDBMS products like PostgreSQL, MySQL, etc. can enter the game. In the experiments, I will use BigQuery, a serverless, highly scalable data warehouse solution by Google (a Python sketch of this approach follows the list).
- The R programming language is designed for statistical analysis. It natively supports tables via its data.frame class. Using multiple aggregations with it is quite inconvenient due to the limitations of the core aggregate function. The R community developed the plyr library to simplify working with data.frame. Package plyr was later retired and package dplyr was introduced with the promise of an improved API and faster execution. Package dplyr is part of the tidyverse collection, which is designed for professional data science work. It provides an abstract query layer and decouples the query from the data storage, be it a data.frame or an external database that supports ANSI SQL. Package data.table is an alternative to dplyr and is famous for its speed and concise syntax. A data.table can also be queried with the dplyr syntax.
- In the q/kdb+ programming language, tables are also first-class citizens, and speed was a primary design goal of the language. Since its birth in 2004, kdb+ has made use of memory mapping and multicore processors, and it employs map-reduce if the data is partitioned on disk. From version 4.0 (released in March 2020), most primitives (such as sum, avg, dev) use secondary threads and are executed in parallel even if the table is in memory. Productivity was the other design consideration: any redundant programming element that does not contribute to understanding (even a parenthesis) is regarded as visual noise. kdb+ is a strong contender among data analysis tools.
- Vaex is another Python library for analyzing big tabular datasets. It is a young competitor that has attracted interest recently. It supports both in-memory and on-disk (Apache Arrow, HDF5, etc.) data exploration. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computation. Since Vaex supports tables that do not fit into memory and employs parallel processing, it is a good candidate to push Pandas off the throne in the future (see the second sketch after this list).
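For the SQL route, here is a minimal sketch of how the same aggregation could be run from Python with the google-cloud-bigquery client. The project, dataset, and table names are placeholders, and the snippet assumes credentials are already configured:

```python
from google.cloud import bigquery

# Assumes Google Cloud credentials are available in the environment.
client = bigquery.Client()

# Placeholder table name; the GROUP BY mirrors the Pandas example above.
query = """
SELECT bucket,
       SUM(qty)   AS total_qty,
       AVG(price) AS avg_price,
       MAX(price) AS max_price
FROM `my_project.my_dataset.trades`
GROUP BY bucket
"""

# The query runs server-side in BigQuery; only the small aggregated
# result is fetched back into a pandas DataFrame.
result = client.query(query).to_dataframe()
print(result)
```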
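And a comparable Vaex sketch, again with placeholder file and column names. Note that `vaex.open` memory-maps the file rather than loading it, and expressions stay lazy until an aggregation is actually evaluated:

```python
import vaex

# Hypothetical HDF5 file; Vaex memory-maps it instead of reading it into RAM.
df = vaex.open("trades.hdf5")

# The groupby is evaluated lazily and executed in parallel across CPU cores.
result = df.groupby("bucket", agg={
    "total_qty": vaex.agg.sum("qty"),
    "avg_price": vaex.agg.mean("price"),
    "max_price": vaex.agg.max("price"),
})
print(result)
```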
I will consider the elegance, simplicity, and speed of the various solutions. I will also investigate how to tune performance and how to leverage multicore processors or clusters of computers through parallel computation.
Carry on and read the full article on LinkedIn.