Recently I was assigned to help resolve an issue reported as “name_of_the_critical_app slowing down” between 2 and 3 AM.
As the system architecture is fairly complex, I decided to start by analyzing OS data collected by Oracle’s OS Watcher utility.
From the generated report, it’s clear that during the one-hour period several disks had large service times and a high busy percentage (see the picture below). All other components (CPU, network, …) were normal.
All overloaded disks belong to the DATA diskgroup (ASM instance).
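To confirm which diskgroup the busy devices belong to, a query along these lines (run against the ASM instance) maps OS device paths to ASM diskgroups; the path filter is only an example and should match the devices flagged by OS Watcher:

```
-- Run against the ASM instance: map OS device paths to diskgroups.
-- The LIKE filter below is a placeholder; adjust it to your device names.
SELECT dg.name AS diskgroup,
       d.path,
       d.name  AS disk_name
FROM   v$asm_disk d
       JOIN v$asm_diskgroup dg
         ON dg.group_number = d.group_number
WHERE  d.path LIKE '/dev/%'
ORDER  BY dg.name, d.path;
```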
As the Oracle database is the only client of the ASM disks, the next step is to analyze an AWR report for the same one-hour period.
An AWR report contains a wealth of information, but we only need a tiny part of it: the IO Stats part of the main section.
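For reference, the AWR report for the problem window can be generated with the standard script shipped with the database; you are prompted for the begin and end snapshot IDs, which you can look up first:

```
-- In SQL*Plus, as a suitably privileged user:
-- list recent snapshots to find the pair bounding the 2-3 AM window
SELECT snap_id, begin_interval_time, end_interval_time
FROM   dba_hist_snapshot
ORDER  BY snap_id;

-- then generate the report interactively
@?/rdbms/admin/awrrpt.sql
```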
First, let’s see the amount of reads and writes during the issue.
Notice the amount of data read from data files (1.3 TB).
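The same per-filetype numbers the report shows can be pulled straight from the workload repository. A sketch, assuming the `DBA_HIST_IOSTAT_FILETYPE` view is available (11g+) and that `:begin_snap`/`:end_snap` are the snapshot IDs bounding the window; the counters are cumulative, hence the end-minus-begin delta:

```
-- Megabytes read/written per file type between two AWR snapshots
SELECT e.filetype_name,
       (e.small_read_megabytes  + e.large_read_megabytes)
     - (b.small_read_megabytes  + b.large_read_megabytes)  AS read_mb,
       (e.small_write_megabytes + e.large_write_megabytes)
     - (b.small_write_megabytes + b.large_write_megabytes) AS write_mb
FROM   dba_hist_iostat_filetype b
       JOIN dba_hist_iostat_filetype e
         ON  e.filetype_name   = b.filetype_name
         AND e.instance_number = b.instance_number
WHERE  b.snap_id = :begin_snap
AND    e.snap_id = :end_snap
ORDER  BY read_mb DESC;
```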
To see who is responsible for that much reading, you can take a look at the IOStat by Function summary report (there is an even more detailed IOStat by Function/Filetype summary).
Now we know that the RMAN backup process is responsible for the majority of reads.
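The IOStat by Function data can be queried the same way; a sketch against `DBA_HIST_IOSTAT_FUNCTION` (again cumulative counters, so we take the delta between the two snapshots), which is where functions such as `RMAN` and `Buffer Cache Reads` show up:

```
-- Megabytes read/written per database function between two AWR snapshots
SELECT e.function_name,
       (e.small_read_megabytes  + e.large_read_megabytes)
     - (b.small_read_megabytes  + b.large_read_megabytes)  AS read_mb,
       (e.small_write_megabytes + e.large_write_megabytes)
     - (b.small_write_megabytes + b.large_write_megabytes) AS write_mb
FROM   dba_hist_iostat_function b
       JOIN dba_hist_iostat_function e
         ON  e.function_name   = b.function_name
         AND e.instance_number = b.instance_number
WHERE  b.snap_id = :begin_snap
AND    e.snap_id = :end_snap
ORDER  BY read_mb DESC;
```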
To see what else was going on, it’s time to look at the Tablespace IO Stats.
The tablespace with the most reads (masked) and most writes holds indexes for one application.
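The Tablespace IO Stats section is built from per-datafile counters; a sketch that aggregates them per tablespace from `DBA_HIST_FILESTATXS` (cumulative `PHYRDS`/`PHYWRTS`, so the per-file delta is summed over each tablespace):

```
-- Physical reads/writes per tablespace between two AWR snapshots
SELECT b.tsname,
       SUM(e.phyrds  - b.phyrds)  AS physical_reads,
       SUM(e.phywrts - b.phywrts) AS physical_writes
FROM   dba_hist_filestatxs b
       JOIN dba_hist_filestatxs e
         ON  e.filename        = b.filename
         AND e.instance_number = b.instance_number
WHERE  b.snap_id = :begin_snap
AND    e.snap_id = :end_snap
GROUP  BY b.tsname
ORDER  BY physical_reads DESC;
```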
As the last step, it’s time to take a look at the SQL statistics ordered by reads (some columns, such as SQL Module and SQL Text, are hidden).
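The same top-SQL-by-reads list can be obtained from `DBA_HIST_SQLSTAT`; a sketch using the pre-computed `_DELTA` columns over the snapshot range (the `FETCH FIRST` clause is 12c+ syntax; use a `ROWNUM` filter on older releases):

```
-- Top 10 statements by disk reads within the snapshot range
SELECT s.sql_id,
       SUM(s.disk_reads_delta) AS disk_reads,
       SUM(s.executions_delta) AS executions,
       MAX(s.module)           AS module
FROM   dba_hist_sqlstat s
WHERE  s.snap_id >  :begin_snap
AND    s.snap_id <= :end_snap
GROUP  BY s.sql_id
ORDER  BY disk_reads DESC
FETCH FIRST 10 ROWS ONLY;
```

The full statement text for any interesting `SQL_ID` can then be looked up in `DBA_HIST_SQLTEXT`.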
To conclude this discussion, this is what I found:
- An RMAN backup was running and was responsible for most of the reads during the one-hour period.
- An external ETL process was running at the same time (loading data into a data warehouse).
- Two batch processes, responsible for the majority of writes (see the Tablespace IO Stats picture above), were running at the same time, both writing into one large globally partitioned index.
The next steps will be to re-schedule the ETL process (responsible for reads) and the batch processes (responsible for writes), and to optimize the SQL statements and program logic.