Imagine you need to create an ETL workflow that uses the Oracle database as one of its sources or even more common situation where you need to synchronize local table with table from the remote Oracle database via dblink.
The most common scenario is to apply the following steps:
- I have a local table for tracking the last execution time of a synchronization with remote table, or I’m using dictionary view in case of job / scheduler.
Let’s call that value LAST_EXECUTION_DATE (it’s a scalar value).
- On the remote database my source table has a column named LAST_UPDATED_DATE or similar, which keeps a time of the last update for each record (usually you’ll have CREATED_DATE column, LAST_UPDATED_DATE and a row level trigger that will update both columns at the time of row creation, and only the later one in case of future updates).
- Now you only need to implement query similar to the following one:
SELECT * from remote_table@dblink_name where LAST_UPDATED_DATE > LAST_EXECUTION_DATE;
What majority of developers might thing is that you will fetch only changes (INSERTS & UPDATES) after the last execution of the job (LAST_EXECUTION_DATE), and that a MERGE statement will take care to populate my local table correctly.
Unfortunately that is not correct approach when your source is the Oracle database, as you will miss some changes for sure.
But what is the reason for such behavior and why the same logic works correctly with some other databases?
To answer that question, again you need to know how Oracle database works.
Namely the previous logic works for all other databases where reads are blocked by writes (and vice versa).
Oracle is never blocking readers due to its multiversion read consistency.
I’ll provide an example.
Let’s assume I need to fetch all changes after the 11:04:35 PM.
In case where there are at least one open (not committed) transaction that started before that time I won’t be able to see its effect.
Instead I’ll get the last committed version of the remote table as of 11:04:35 PM.
This technique is called Multiversioning and it ensures that you’ll always get the correct results at the time when my query started.
OK, but then how to get correct result?
First we need to take a look at V$TRANSACTION dictionary view:
There you will find a two columns of interest:
You should use START_DATE as START_TIME is of Varchar2 type that you need to convert and adjust to your local NLS settings.
The following SQL shows transactions that will not be included if you use the standard workflow described above.
Note that I’m using SYSDATE where I would normally use LAST_EXECUTION_DATE variable instead.
TABLE_NAME in your source table name, since I want to avoid going through all open transactions.
SELECT t.start_date, SYSDATE, s.* FROM v$lock l ,dba_objects o ,v$session s , v$transaction t WHERE l.id1 = o.object_id AND s.sid = l.sid AND o.owner = '<SCHEMA_OWNER>’ AND o.object_name = '<TABLE_NAME>' AND s.saddr = t.ses_addr;
Note that START_DATE column is taken from the V$TRANSACTION view (START_TIME column is VARCHAR2 type), while SYSDATE is the current time.
As you can clearly see, by implementing the same logic that works for some other non-Oracle databases (where readers block writers), I would loose two transactions (row 2 & 3) as they started before 11:04:35 PM, but not completed before the LAST_EXECUTION_DATE (column SYSDATE – 11:04:35 PM).
To include all transactions, you need to shift LAST_EXECUTION_DATE (11:04:35 PM) 4 seconds ahead (11:04:31 PM).
Here is the SQL that will return the correct start date:
WITH min_transaction_date as ( SELECT min(t.start_date) as min_tran_date FROM v$lock l ,dba_objects o ,v$session s , v$transaction t WHERE l.id1 = o.object_id AND s.sid = l.sid AND o.owner = '<SCHEMA_OWNER>' AND o.object_name = '<TABLE_NAME>' AND s.saddr = t.ses_addr UNION ALL select SYSDATE from dual ) SELECT min(min_tran_date) from min_transaction_date;
If you replace SYSDATE with the LAST_EXECUTION_DATE variable, you’ll get the correct time to start your fetching job.
Don’t forget to grant direct permissions to the user that will execute synchronization job as V$s are views, not tables.