Suppose you want to analyze a data set using your favorite tools (Pandas/NumPy).
My previous article explains how important it is to do as much data processing as possible at the database SQL engine layer, so that you get a data set suitable for further analysis.
Once you finish all aggregations, filtering, grouping and windowing in SQL, you can import the resulting data set into Pandas.
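As a minimal sketch of this idea (the `sales` table and an in-memory SQLite database stand in for a real database and schema), the aggregation runs inside the SQL engine and Pandas receives only the reduced result set:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for a real database engine.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (prod_id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(13, 10.0), (13, 5.0), (148, 2.5), (42, 7.0)])

# Grouping and aggregation happen in SQL; Pandas only loads the small result.
sql = """
SELECT prod_id, SUM(amount) AS total_amount, COUNT(*) AS n_sales
FROM sales
GROUP BY prod_id
"""
df = pd.read_sql(sql, con)
print(df)
```

The DataFrame that reaches Pandas has one row per product, not one row per sale, which is exactly the kind of reduction that keeps memory usage under control.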
You'll probably start full of enthusiasm with the familiar set of Python packages (Pandas), but soon you'll notice your Linux laptop becoming unresponsive.
The problem is that the SQL result set is too large to fit in memory/swap, and Pandas cannot perform the data wrangling tasks at a satisfactory speed.
There are several issues with the Pandas framework I want to cover:
- Single thread execution
The vanilla Pandas package is not capable of running in parallel, and for that reason it is unable to scale up.
There are several ways to get parallelism (like cuDF, splitting a Pandas data set into chunks, …), but in this article I'll describe a distributed approach that covers the majority of cases.
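To illustrate the "splitting a Pandas data set" option mentioned above, here is a minimal sketch (the column name and the per-chunk function are hypothetical) that splits a DataFrame into chunks and processes them on multiple CPU cores with the standard library's process pool:

```python
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk: pd.DataFrame) -> float:
    # Any CPU-bound per-chunk computation goes here.
    return chunk["value"].sum()

# Toy DataFrame; in practice this would be your SQL result set.
df = pd.DataFrame({"value": np.arange(1_000_000, dtype="float64")})
chunks = np.array_split(df, 8)  # one chunk per worker

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        # Partial results are combined after the parallel map.
        total = sum(pool.map(partial_sum, chunks))
    print(total)
</antml>```

This works for embarrassingly parallel, reducible operations, but you still pay the cost of pickling chunks between processes, and the whole DataFrame must already fit in memory, which is why a distributed framework is the more general answer.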
- Memory (in)efficiency
Pandas defaults to the widest integer type.
Even if the data coming from the relational database would fit within a single byte, Pandas will use int64, as can be seen in the following example:
sql = 'select prod_id from sales'
df_sales = pd.read_sql(sql, con)
print("PROD_ID min value: ", df_sales['PROD_ID'].min())
print("PROD_ID max value: ", df_sales['PROD_ID'].max())
print("Number of records: ", len(df_sales.values))
print(df_sales.info())

PROD_ID min value:  13
PROD_ID max value:  148
Number of records:  918843
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918843 entries, 0 to 918842
Data columns (total 1 columns):
PROD_ID    918843 non-null int64
dtypes: int64(1)
memory usage: 7.0 MB
As you can observe, loading roughly 1 million records into Pandas takes 7 MB of memory for just one column.
If I select all columns (e.g. 50), that's 50 × 7 = 350 MB for just 1 million records.
In the next example I'm going to use the most efficient type for this case: uint8, which ranges from 0 to 255.
#df_sales = pd.read_sql(sql, con)
df_sales = pd.DataFrame(np.random.randint(low=13, high=148, size=918843, dtype='uint8'), columns=['PROD_ID'])
print("PROD_ID min value: ", df_sales['PROD_ID'].min())
print("PROD_ID max value: ", df_sales['PROD_ID'].max())
print("Number of records: ", len(df_sales.values))
print(df_sales.info())

PROD_ID min value:  13
PROD_ID max value:  147
Number of records:  918843
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918843 entries, 0 to 918842
Data columns (total 1 columns):
PROD_ID    918843 non-null uint8
dtypes: uint8(1)
memory usage: 897.4 KB
For a DataFrame of the same size, memory consumption drops from 7.0 MB to 897.4 KB, roughly an 8x reduction.
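You don't always know the value range up front, so instead of specifying the dtype when constructing the DataFrame, you can let Pandas pick the smallest type after loading. A minimal sketch using `pd.to_numeric` with the `downcast` parameter (the simulated result set stands in for a real `read_sql` call):

```python
import numpy as np
import pandas as pd

# Simulated result set: int64 by default, just as with read_sql.
df = pd.DataFrame({"PROD_ID": np.random.randint(13, 148, size=918843)})
before = df["PROD_ID"].memory_usage(deep=True)

# Downcast to the smallest unsigned integer type that fits the actual data.
df["PROD_ID"] = pd.to_numeric(df["PROD_ID"], downcast="unsigned")
after = df["PROD_ID"].memory_usage(deep=True)

print(df["PROD_ID"].dtype)          # uint8, since all values fit in 0..255
print(round(before / after, 1))     # memory reduction factor
```

The same loop can be run over every numeric column of a freshly loaded DataFrame, which often shrinks the whole data set several-fold with no loss of information.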
- Inability to process the larger data sets needed for machine learning and similar tasks
If you have a larger data set to analyze (e.g. 100 million rows in my case on my laptop, but that can be many billions of rows), the large memory consumption means you'll end up with an error once memory and swap are exhausted, as the Linux Out Of Memory killer kicks in and kills your apps.
In the previous example you can observe how rapidly Pandas, when connecting to the database, consumes all available memory (64 GB in my case) and swap.
As a consequence, you can't scale up and analyze the bigger data sets needed for data science tasks, since the data set size is limited by the available RAM and swap space.
On the other hand, even if I had a huge amount of memory available (e.g. 3 TB of RAM), any kind of data wrangling and analysis would still be slow, because single-threaded Pandas uses only one CPU core.
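One partial workaround within plain Pandas is the `chunksize` parameter of `read_sql`, which streams the result set instead of materializing it all at once. A minimal sketch (the toy SQLite table stands in for the real database); note this only helps for computations that can be done incrementally per chunk:

```python
import sqlite3
import pandas as pd

# Toy database stands in for the real one.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (prod_id INTEGER)")
con.executemany("INSERT INTO sales VALUES (?)",
                [(i % 135 + 13,) for i in range(10_000)])

# Stream the result set in fixed-size chunks; only one chunk
# is held in memory at a time.
total_rows = 0
for chunk in pd.read_sql("SELECT prod_id FROM sales", con, chunksize=2_000):
    total_rows += len(chunk)   # any incremental aggregation goes here

print(total_rows)
```

This keeps memory bounded, but each chunk is still processed on a single core, so it addresses the memory problem and not the speed problem.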
But Pandas has so many great features that it's no wonder it enjoys huge popularity among Python users.
To get the best of both worlds, one possible solution is to use one of the available distributed frameworks.
Within the Python ecosystem there is more than one solution that provides distributed computing, but Dask is probably the most popular one.
The following picture shows what happens when I try to process 100 million records with Pandas on my laptop.
If I replace Pandas with the Dask distributed framework, I'm able to analyze the same amount of data successfully.
You'll find more on Dask in one of my future posts.
In the following figures you can see an 8-node Dask cluster in action.