Python offers many different ways to fetch the data needed for processing.
Although the majority of examples you can find around use CSV file loads, fetching data from a database is still the most common approach in practice.
For all tests I’ll use the SALES table from the Oracle SH sample schema.
In this example, a classic iteration over a cx_Oracle cursor result set is used:
import cx_Oracle
import datetime
import os
import pandas as pd

os.environ["ORACLE_HOME"] = "/path/to/oracle/home"

con = cx_Oracle.connect(
    user='sh',
    password='xxx',
    dsn='(DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = tcp)(HOST = localhost)(PORT = 9999))) (CONNECT_DATA = (SID = orcl)))')

sql = 'select * from sales'
cur = con.cursor()
cur.arraysize = 1000
counter = 0
beginTime = datetime.datetime.now()

try:
    cur.execute(sql)
    for row in cur:
        counter += 1
except cx_Oracle.DatabaseError as ora_exception:
    error, = ora_exception.args
    print("Oracle Error Code: ", error.code)
    print("Oracle Error Message: ", error.message)

print('elapsed_time = ', datetime.datetime.now() - beginTime)
print(counter)

cur.close()
con.close()

elapsed_time = 0:00:01.591582
918843
Elapsed time for processing approximately 1 million records (918,843) is 1.59 seconds.
In this example, instead of iterating over the result set, the fetchall() method is used.
This is the part of the code that I’ve changed:
r = cur.execute(sql).fetchall()
counter = len(r)

elapsed_time = 0:00:01.588170
918843
Elapsed time is almost identical to the first case.
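Between row-by-row iteration and fetchall() there is also fetchmany(), which is part of the same DB-API that cx_Oracle implements and pulls rows in fixed-size batches, keeping memory bounded for large result sets. Here is a minimal sketch of the batching pattern; the standard library’s sqlite3 module and a small made-up table stand in for the Oracle connection so it runs without a database instance:

```python
import sqlite3

# In-memory stand-in for the Oracle SALES table (sqlite3 follows the same
# DB-API as cx_Oracle, so the fetch pattern is identical).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table sales (prod_id integer)")
cur.executemany("insert into sales values (?)", [(i,) for i in range(10000)])

# fetchmany() is a middle ground between row-by-row iteration and fetchall():
# it returns batches of rows until the result set is exhausted.
cur.execute("select * from sales")
counter = 0
while True:
    rows = cur.fetchmany(1000)   # batch size, analogous to cur.arraysize
    if not rows:
        break
    counter += len(rows)

print(counter)
con.close()
```

With cx_Oracle the same loop applies unchanged; only the connect() call differs.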
In the third example I’ll use Pandas instead of a cursor. There are several variations, labeled as subcases a) – c).
a) Getting the number of records by using the PROD_ID column:
df_sales = pd.read_sql(sql, con)
counter = df_sales['PROD_ID'].size

elapsed_time = 0:00:04.065327
918843
Elapsed time to process the same amount of data is 4.06 sec, which is roughly 2.5 times slower than the cursor approach (cases 1 & 2).
b) Several other ways to get the number of records that run at roughly the same speed as case 3a:
# uncomment one of the following options
# counter = len(df_sales.index)
# counter = df_sales.shape[0]
# counter = len(df_sales.axes[0])
c) Here is the slowest way to get the number of records, since it goes through the values instead of the index:
counter = len(df_sales.values)

elapsed_time = 0:00:06.027399
918843
Pandas case – selecting only the first column (instead of all columns) from the SALES table:
sql = 'select prod_id from sales'
df_sales = pd.read_sql(sql, con)
counter = df_sales['PROD_ID'].size

elapsed_time = 0:00:03.744749
918843
Elapsed time is 3.74 sec, which is by far the best result I could get using the Pandas framework for client-side processing.
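Besides narrowing the column list, Pandas can also bound memory use on large result sets: read_sql accepts a chunksize parameter that turns the call into an iterator of DataFrames, so only one chunk is materialized at a time. A sketch of the pattern, again with sqlite3 and made-up data standing in for the Oracle connection:

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the SALES table (hypothetical data; the measurements
# in this article were taken against the Oracle SH sample schema).
con = sqlite3.connect(":memory:")
con.execute("create table sales (prod_id integer)")
con.executemany("insert into sales values (?)", [(i % 70,) for i in range(10000)])

# With chunksize, read_sql yields DataFrames of up to 2000 rows each
# instead of loading the whole result set at once.
counter = 0
for chunk in pd.read_sql("select prod_id from sales", con, chunksize=2000):
    counter += len(chunk)

print(counter)
con.close()
```

This does not make Pandas faster than the cursor, but it keeps the client’s memory footprint predictable.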
Leaving it to the SQL engine to do its job of calculating the number of records – example with a cursor:
sql = 'select count(*) as cnt from sales'
cur = con.cursor()
cur.execute(sql)
counter = cur.fetchone()[0]

elapsed_time = 0:00:00.004216
918843
Leaving it to the SQL engine to do its job of calculating the number of records – example with Pandas:
sql = 'select count(*) as cnt from sales'
df_sales = pd.read_sql(sql, con)

a)
counter = df_sales.iloc[0]['CNT']

elapsed_time = 0:00:00.013347
918843

b)
counter = df_sales.iat[0, 0]

elapsed_time = 0:00:00.009238
918843

c)
counter = df_sales.at[0, 'CNT']

elapsed_time = 0:00:00.002658
918843
When you need to process data from a database, the best approach is to let the database SQL engine do the job.
Results with the SQL engine (cases 5 & 6) are many times (600 times or more) faster compared with procedural logic.
That is expected, since relational databases are built for exactly that purpose. Still, you can very often find code where client-side processing is used to do the database engine’s job, which is the wrong approach.
Another interesting point is that the Pandas package is about 2.5 times slower than the cursor (result set) approach. In cases where a cursor covers all the needed functionality, you shouldn’t use the Pandas framework.
If you need to use Pandas, the best you can do is leave all heavy processing to the SQL database engine and fetch and manipulate only the necessary data volume.
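The same principle extends beyond counting rows. As a sketch, an aggregation pushed into SQL means only the aggregated rows cross the wire, instead of every detail row; sqlite3 and a tiny made-up table stand in for the Oracle setup here:

```python
import sqlite3
import pandas as pd

# Tiny in-memory stand-in: 9 detail rows across 3 products.
con = sqlite3.connect(":memory:")
con.execute("create table sales (prod_id integer, amount_sold real)")
con.executemany("insert into sales values (?, ?)",
                [(i % 3, float(i)) for i in range(9)])

# The GROUP BY runs inside the SQL engine, so only 3 aggregated rows
# are fetched into the DataFrame instead of all 9 detail rows.
df = pd.read_sql(
    "select prod_id, sum(amount_sold) as total from sales group by prod_id",
    con,
)
print(len(df))
con.close()
```

Against the real SALES table the difference is the same in kind: the client receives one row per product rather than nearly a million detail rows.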