Performance Troubleshooting
We are running LSF 901.9 on AIX 6.1 and are encountering performance issues on our production system.
Typically, after the system has been rebooted and running for 1-1.5 weeks, the users (check LS=Y) start to report slow performance in Portal. Specifically, accessing multi-step jobs, printing, comments and, less specifically, just overall slowness. They do not, however, seem to have any slowness when authenticating/logging in.
When this occurs, commands run in LID by (check LS=N) users are also slow (rngdbdump, jobschd, etc. take 30-45 seconds to return). The odd part is this: if we launch two LID sessions and execute a simple 'rngdbdump' (no arguments, just to return usage) in session 1, then run the same command in session 2 while session 1 is still waiting (30-45 seconds), both session 1 and session 2 return the usage information for the command. It seems as if the second session forces the command to complete in both sessions.
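To pin down when the delay starts after a reboot, one option is to log how long a trivial command takes at regular intervals (for example from cron). This is only a minimal sketch, assuming Python 3.6 or later is available on the server, which is not confirmed for our AIX host; the helper names time_command and format_sample are hypothetical and not part of any Lawson tooling.

```python
import subprocess
import time
from datetime import datetime


def time_command(argv, timeout=120):
    """Run argv once and return (elapsed_seconds, return_code).

    return_code is None if the command timed out or could not be started.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,  # only the timing matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = result.returncode
    except (subprocess.TimeoutExpired, OSError):
        code = None
    return time.monotonic() - start, code


def format_sample(argv, elapsed, code):
    """Build one tab-separated log line: timestamp, command, duration, exit code."""
    stamp = datetime.now().isoformat(timespec="seconds")
    return f"{stamp}\t{' '.join(argv)}\t{elapsed:.2f}s\trc={code}"


if __name__ == "__main__":
    # 'rngdbdump' with no arguments only prints usage, as described above.
    command = ["rngdbdump"]
    elapsed, code = time_command(command)
    print(format_sample(command, elapsed, code))
```

Appending each output line to a file would give a timeline showing whether the slowdown builds gradually over the 1-1.5 weeks or appears suddenly.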
We've reviewed the lase coredumps and lase logs without finding any indication of a problem with lase. We've started reviewing Tivoli (v6.2) by configuring audit logging and running 'perfanalyze_audit.pl', but the script doesn't find any commands that run longer than 0.1 seconds. We do not encounter this slowness on our test server, but the load on test is definitely not the same. Our next step is to implement the recommendations from ICS for ibmslapd.conf (basically decreasing the idle timeout and increasing the limits for db connections, transactions and paged results).
Has anyone else encountered these symptoms? Any troubleshooting recommendations or experience with Lawson system performance degradation would be greatly appreciated!
Thank you,
Anna