I know this may be hard to follow, but I wanted to try.
We went live in May 16 with the LTM and S3 payroll apps (and a few others as well).
Employees and managers go to an HTML5 page/URL on the LTM side.
We are federated, so that page then redirects automatically to the S3 side URL for authentication.
Once in July, and then CONSISTENTLY starting in November, we have had several downtimes affecting only the employee/manager space and the external job board, all of which use the web server on the LTM side.
We are working diligently with Support/AMS but not making much headway. All we know is that hung threads seem to be the issue; we have no way of knowing what is hanging or why.
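One way to get more out of the hung-thread symptom, sketched here under assumptions: WebSphere's thread monitor normally writes a WSVR0605W warning to SystemOut.log when a thread may be hung, so counting those warnings by hour of day and by thread pool might show whether the hangs cluster at night or in a single pool. The function name and the assumed timestamp layout are illustrative only and would need adjusting to the actual log format.

```python
import re
from collections import Counter

# Assumed line shape (US-locale timestamp), for example:
# [11/20/16 2:14:05:123 EST] 0000002f ThreadMonitor W   WSVR0605W: Thread "WebContainer : 3" ...
# Adjust the pattern if your SystemOut.log uses a different timestamp layout.
HUNG = re.compile(
    r'^\[\S+ (?P<hour>\d{1,2}):\d{2}:\d{2}:\d{3} \S+\].*'
    r'WSVR0605W: Thread "(?P<thread>[^"]+)"'
)

def summarize_hung_threads(lines):
    """Count hung-thread warnings by hour of day and by thread pool name."""
    by_hour = Counter()
    by_pool = Counter()
    for line in lines:
        m = HUNG.search(line)
        if not m:
            continue  # not a hung-thread warning
        by_hour[int(m.group("hour"))] += 1
        # Thread names look like "WebContainer : 3"; keep only the pool part.
        pool = m.group("thread").split(" : ")[0]
        by_pool[pool] += 1
    return by_hour, by_pool
```

Feeding it the lines of a SystemOut.log file (for example `summarize_hung_threads(open(path))`, where `path` is wherever the server writes its logs) returns two counters that can be compared against the outage times.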
We recently added 20 GB and adjusted the heap on LmrkListBatch. After that we went almost 2 weeks without an issue, but then it went down on both Sunday and Monday for about 20 minutes each time and resolved on its own.
90% of the time it resolves on its own: 20 minutes, then it's fine. Sometimes it's over an hour (we can see this after the fact because a third party runs web checks against our URL and reports to us if the page fails to load successfully).
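To line the outages up against patch dates, batch jobs, or log warnings, it may help to turn the raw web check results into outage windows with durations. This is a minimal sketch, assuming the third-party checks can be exported as (timestamp, success) pairs sorted by time; the function name and the input shape are hypothetical, not something the monitoring service is known to provide.

```python
from datetime import datetime, timedelta

def outage_windows(checks):
    """Group consecutive failed checks into (start, end, duration) windows.

    checks: iterable of (timestamp, ok) pairs sorted by timestamp.
    An outage starts at the first failed check and ends at the next
    successful one (or at the last check seen if it never recovers).
    """
    windows = []
    start = None
    last_seen = None
    for ts, ok in checks:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            windows.append((start, ts, ts - start))
            start = None                    # outage ended
        last_seen = ts
    if start is not None:
        # Still failing at the end of the data.
        windows.append((start, last_seen, last_seen - start))
    return windows
```

Because the end of a window is the first successful check, the reported duration is only as precise as the check interval.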
I have gone through everything I can find to see whether anything done on our systems coincides with when this started happening consistently in November.
The only 2 things I can find are the application of Microsoft server patches at the end of October and the implementation/use of Proxy roles to allow Coordinators to do things on behalf of Managers. In late November we updated the environments and application levels. We are on v10 with Windows servers, a SQL db, and IBM WebSphere. AMS has added RAM/adjusted heap twice. This consistently occurs at night but also occurs at random during the day.
I'm not sure that anyone can really help here, since we have been working with Support, AMS and Development for months, but I thought it was worth a try to see if anyone has any insight.
Thanks in advance for any feedback.