Recently I got an interesting new issue which clearly shows the traps, and the breadth of knowledge, required to solve this kind of problem.
The title of the issue is “Unable to open promotion component”. In the description of the issue I found the following:
Users are unable to enter promotion component when it has more than a few (15+) items on zone level. RPM becomes unresponsive for a long period (almost an hour)
In the incident, another DBA had already analyzed the issue and concluded that the problem was one long-running INSERT SQL statement:
INSERT INTO RPM_PROMO_ITEM_LOC_EXPL_GTT
To resolve the issue, you have to verify the diagnostics that were already done, and you need knowledge from different fields (RPM architecture, database, WebLogic, etc.).
I knew that RPM uses complex Java data structures (collection types) to handle its application logic, so I first checked the memory parameters on the WebLogic side and on the client side (Java Web Start – the RPM UI).
The WebLogic part was OK, but the rpmconfig.jnlp and rpm_jnlp_template.vm files were misconfigured.
After replacing the following line:

<j2se version="1.7*" href="http://java.sun.com/products/autodl/j2se" initial-heap-size="256M" max-heap-size="256M"/>

with:

<j2se version="1.7*" href="http://java.sun.com/products/autodl/j2se" initial-heap-size="256M" max-heap-size="1024M"/>
users are able to enter very large components in less than 10 seconds.
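A quick way to confirm that the new max-heap-size from the JNLP file actually reached the client JVM is to ask the runtime for its heap limits. This is a minimal sketch (the class name is my own, not part of RPM):

```java
// Minimal sketch: print the heap limits the running JVM actually received.
// Launched via the fixed JNLP, max heap should report roughly 1024 MB.
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        System.out.println("max heap:   " + rt.maxMemory() / mb + " MB");   // -Xmx / max-heap-size
        System.out.println("total heap: " + rt.totalMemory() / mb + " MB"); // currently committed
        System.out.println("free heap:  " + rt.freeMemory() / mb + " MB");  // free within committed
    }
}
```

The same three calls can be logged from inside any Java client at startup, which is handy when you cannot easily attach monitoring tools to end-user machines.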
Let’s dive deeper into how to determine the optimal heap size.
Heap size is a matter of balance. If the heap is too small, the app (RPM in this case) will spend too much time performing garbage collection (GC) and less time executing application logic.
A very large heap is not good either: GC pauses will occur less frequently, but each pause will last longer.
That is not the only danger of a very large heap. Operating systems use virtual memory to manage the machine’s physical memory (on Linux, for example, you have to take swap into account).
If you allocate too much memory to the heap, the OS will start swapping process pages between memory and disk and back, and this is a very expensive operation, especially during GC.
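The frequency-versus-duration trade-off above can be measured rather than guessed: every HotSpot JVM exposes per-collector counts and cumulative pause times through the standard management API. A minimal sketch (class name and allocation sizes are my own, for illustration):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: churn some short-lived allocations, then read GC
// statistics. Running this with different -Xmx values shows how a larger
// heap trades fewer collections for longer individual pauses.
public class GcStats {
    public static void main(String[] args) {
        List<byte[]> junk = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            junk.add(new byte[10_240]);          // ~100 MB allocated in total
            if (junk.size() > 100) junk.clear(); // keep almost all of it garbage
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", total pause ms=" + gc.getCollectionTime());
        }
    }
}
```

In production you would normally get the same numbers from GC logs (-Xlog:gc on modern JVMs) rather than from code, but the MXBean route is useful for a quick in-process check.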
Lessons learned from this issue:
1. Always check everything before drawing conclusions. In more than 50% of cases, the incident description and the diagnostics done by others were wrong.
2. A wide area of knowledge (not only performance tuning) is required to solve issues in complex enterprise environments.