| In this set of pages on Investigating A Slow Confluence Installation - Version 3.3.1, I document the steps I took to try and identify slow running operations on my copy of Confluence 3.3.1. I recently added about 10K pages and labels for an EGW topical index, roughly doubling the number of pages (not not total size of content). This should account for the recent slowdowns and non-responsive Confluence that I have seen.
In this set of topics I discuss how I examined the database, stack dumps, dealt with Confluence hangs, logged bugs with the vendor, examined robot activity, and took other defensive measures to keep my copy of Confluence running. Please let me know if these topics helped you run your copy of Conflunece. You can use the "Add comment" on any of these pages to let me know. Please include your email addres in your comment, I will keep a copy of your email so we can correspond about Confluence, but quickly remove it from the comment you place here. |
| My support request about the rogue CPU hogging thread in Confluence was met with a response by a technician claiming out of memory issues were the cause of my problems. He noted several other things from the Hercules logs also. There was no response about the specific thread stack trace I was asking about. They were also happy I disabled the global usage plugin, which is known not to scale well. (I get a max of 10K hits a day, that doesn't seem like a lot to me, but tech support reflexively recommends getting rid of that plugin. So I already disabled it, and am still seeing performance problems.) So I logged a response, with some more logs from yesterdays 3:00pm restart, which are lacking in any out of memory errors.
Atlassian does give a lot of support compared to the cost of their products. They talk about having legendary support, and I really can't disagree with that. However, sometimes you have to push to get any analysis beyond the simple reflexive responses to problems. It does help to make sure all the low hanging fruit is taken care of first, but I think I've already done that, and I tried to be proactive in taking care of some of the things that I thought support would comment on, and letting them know that. As you can see, the first response missed some of that. So I push a bit to get the answers I need. Here are some excerpts from the support ticket. I continued annotating the ticket with information about long running threads. I finally realized that the big problem was having documentation theme for the EGW space, and it having a page tree in the left hand pane that needed to generate 11K links on every access. I removed the page tree from the theme, and made sure there was a "search this space" box instead, but that discouraged me immensely. It is too easy to stumble on this kind of problem with Confluence. I also found a page, Confluence Page Profiling - Operations Taking Most Time, that needed {noformat} wrapped around the text, confluence doesn't like when cut and paste content has hundreds of things wrapped in [] characters, it has to go through and check every one of them whether it is a real link, does it have proper permissions for the user, etc. As a result, my confluence is performing better today, but I am still seeing proxy errors. There are some known problems in this area:
On 9/13, the ticket was escalated to an engineer who really believes the database performance is an issue. With Confluence hammering the database, it is no wonder. I had tried to figure out the query throughput with perfmon, and didn't see anywhere near as many queries as Confluence was showing in its logs. So I am going to have to work through the engineer's specific suggestions and see if I can learn more about SQL Server 2005 database monitoring. |
Original report:
I have been having very high CPU usage on my copy of Confluence. It seems to be related to a thread getting stuck in a thread that looks like this:
"http-8080-15" daemon prio=10 tid=0x096ef000 nid=0x44c7 runnable [0x60f00000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.size(HashMap.java:268)
at org.apache.commons.collections.SequencedHashMap.size(SequencedHashMap.java:237)
at net.sf.hibernate.util.IdentityMap.entryList(IdentityMap.java:170)
at net.sf.hibernate.util.IdentityMap.entries(IdentityMap.java:58)
at net.sf.hibernate.impl.SessionImpl.flushCollections(SessionImpl.java:2822
....
(See Analysis of thread dumps to figure out why Confluence CPU load suddenly decreased for example thread stack trace.
The ASTDirective stuff in the middle is a common element..... A series of thread dumps every ten minutes will show the same thread stuck in this for as long as two hours.
Analysis of time frames, notes on this issue, here: Analysis of thread dumps to figure out why Confluence CPU load suddenly decreased
Yesterday Confluence was repeatedly going into a nonresponsive mode, sometimes after just a couple of hours.
Hide
Permalink Edit Garnet R. Chaney added a comment - 12/Sep/10 1:11 PM
confluence_support_2010-09-12-11-09-42.zip - Current logs
Show Garnet R. Chaney added a comment - 12/Sep/10 1:11 PM confluence_support_2010-09-12-11-09-42.zip - Current logs
.Hide
Permalink Edit Garnet R. Chaney added a comment - 12/Sep/10 1:20 PM
Logs prior to several restarts required yesterday:
•logs-2010-09-11-22-10.zip
•2010-09-12-16-10.zip
•2010-09-12-18-20.zip
Show Garnet R. Chaney added a comment - 12/Sep/10 1:20 PM Logs prior to several restarts required yesterday: logs-2010-09-11-22-10.zip2010-09-12-16-10.zip2010-09-12-18-20.zip
.Hide
Permalink Edit Garnet R. Chaney added a comment - 12/Sep/10 1:24 PM
One more from yesterday, I think this was the first downtime.
Each time Confluence process was still running, eating up lots of CPU, but not responding to web requests.
2010-09-11-11-32.zip
Garnet R. Chaney added a comment - 12/Sep/10 1:39 PM
Actions taken:
- Disabled usage plugin: http://confluence.atlassian.com/display/CONFKB/Poor+Performance+due+to+Confluence+Usage+Plugin+-+Space+Activity
Last night rebuilt all indexes for all tables in database.
Ran database tuning analyzer which made recommendations for additional indexes on some very frequent queries
My bandana table has these indexes:
- band_context_idx
- band_key_idx - already existed, on bandanakey asc.
- pk_bandana_07020F21
- Administration >> General Configuration page loads as quickly as any other page.
- Settings persistence is at 99% effectiveness, but only 26% used
Gave example of Cache, and System info. See Analysis of thread dumps to figure out why Confluence CPU load suddenly decreased
Garnet R. Chaney added a comment - 12/Sep/10 2:50 PM
What are these get permitted entities? These seem to be exactly what is chewing up my CPU!
2010-09-12 11:02:49,733 DEBUG [http-8080-11] [atlassian.util.profiling.UtilTimerStack] log [13122ms] - Rendering macro: {tip}
2010-09-12 11:04:57,577 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [123992ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:05:04,626 DEBUG [http-8080-14] [atlassian.util.profiling.UtilTimerStack] log [12935ms] - /display/egw/Flagon,+Flagons
2010-09-12 11:06:00,824 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [63243ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:09:56,018 DEBUG [pool-8-thread-1] [atlassian.util.profiling.UtilTimerStack] log [12010ms] - com.atlassian.bonnie.LoggingReentrantLock.indexWriteLock
2010-09-12 11:09:56,872 DEBUG [http-8080-18] [atlassian.util.profiling.UtilTimerStack] log [16016ms] - /admin/docreatesupportzip.action
2010-09-12 11:10:48,693 DEBUG [http-8080-7] [atlassian.util.profiling.UtilTimerStack] log [42976ms] - /download/temp/confluence_support_2010-09-12-11-09-42.zip
2010-09-12 11:12:30,698 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [386254ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:17:49,156 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [318447ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:18:45,835 DEBUG [http-8080-5] [atlassian.util.profiling.UtilTimerStack] log [15657ms] - /display/ibs/I+Respect+Your+Opinion+I+Just+Think+It's+Stupid
2010-09-12 11:19:55,234 DEBUG [http-8080-18] [atlassian.util.profiling.UtilTimerStack] log [217170ms] - /display/bobsgear/Confluence+Page+Profiling+-+Operations+Taking+Most+Time
2010-09-12 11:32:27,637 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [878455ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:34:12,986 DEBUG [DefaultQuartzScheduler_Worker-7] [atlassian.util.profiling.UtilTimerStack] log [10737ms] - MailAccountManager.pollAllSpaces()
2010-09-12 11:44:32,417 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [724740ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:45:00,472 DEBUG [http-8080-20] [atlassian.util.profiling.UtilTimerStack] log [10783ms] - /pages/editpage.action
2010-09-12 11:48:31,755 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [239324ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:49:53,284 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [81518ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:52:35,944 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [162653ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:53:25,421 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [49472ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:54:53,551 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [88107ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:55:16,073 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [22491ms] - PermissionManager.getPermittedEntities()
2010-09-12 11:55:32,574 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [16481ms] - PermissionManager.getPermittedEntities()
2010-09-12 12:03:22,046 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [468789ms] - PermissionManager.getPermittedEntities()
2010-09-12 12:07:10,695 DEBUG [DefaultQuartzScheduler_Worker-9] [atlassian.util.profiling.UtilTimerStack] log [10680ms] - MailAccountManager.pollAllSpaces()
2010-09-12 12:09:24,670 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [362607ms] - PermissionManager.getPermittedEntities()
Right now, at 12:20, http-8080-4 seems to be stuck in the same kind of thread. It appears it may have started before 12:00....
Lets see if there are any recent entries from this thread: grep -i http-8080-4 atlassian-confluence.log
2010-09-12 12:38:55,218 DEBUG [http-8080-4] [atlassian.util.profiling.UtilTimerStack] log [1770510ms] - PermissionManager.getPermittedEntities()
There is no detail in atlassian-confluence.log for these items!
Hello Garnet,
Thank you for contacting Atlassian Support and for providing all this information.
I'm glad you have already disabled the Confluence Usage Plugin, because it is one of the main factors for Performance issues.
I believe your main problem is regarding Out Of Memory errors. Your log files are filled with it, which makes much sense because your system information is showing only 14% of available memory for your system. My recommendation for the moment is for you to increase the JVM Memory.
Some of the problems found in your log files and the respective KB articles:
- How to Fix Out of Memory errors by Increasing Available Memory.
- How to Manage Application Server Memory Settings.
Some database deadlocks also have occurred, mainly because of the OutOfMemory errors:
Hope the suggested resolutions in the above articles help you solve the problem.
If you have any questions, please let me know.
Kind regards,
Atlassian Support
Garnet R. Chaney added a comment - 13/Sep/10 10:02 AM
I agree out of memory errors don't help Confluence. I have already allocated 2.3 GB to this virtual machine, of which 1.5GB is allocated to Confluence, which is about the max to ensure no disk swapping. I think this is a huge amount of ram to devote to a copy of confluence with only 25K current pages, and usage rate of maybe 100 hits a minute, or less.
Here's my setenv.sh:
JAVA_HOME=/usr/java/jdk1.6.0_18
export JAVA_HOME
- For 3.3.1:
JAVA_OPTS="-Xms1056m -Xmx1056m -XX:NewSize=930m -XX:+UseParallelOldGC -XX:MaxPermSize=156m $JAVA_OPTS -Djava.awt.headless=true -XX:CompileCommand=exclude,org/apache/velocity/runtime/directive/Foreach,render "
I am still experiencing severe performance problems even after disabling the global usage plugin.
I didn't see any comment about the specific thread that I asked for information on. I also saw no response to my question
- Garnet R. Chaney added a comment - 12/Sep/10 2:50 PM - What are these get permitted entities? These seem to be exactly what is chewing up my CPU!
This logs "confluence_support_2010-09-12-11-09-42.zip - Current logs" has a series of thread dumps that show the same thread being stuck in a the same stack trace that I quoted, for hours at a time. When this goes away, then performance is good. Then it comes back on another thread and another can be stuck for hours at a time. While that is happening my performance is ruined. What is this thread doing?
Right now when no threads are stuck in this mode, the CPU usage is very low, and memory usage is reasonable:
Java VM Memory Statistics
Heap Memory
43 % Free
Maximum Heap Size 864 MB
Total Allocated Memory 864 MB
Used Memory 496 MB
Free Allocated Memory 368 MB
Total Free Memory 368 MB
Permanent Generation Memory
48 % Free
Maximum PermGen 156 MB
Used PermGen 81 MB
Please see my blog post about this for more details: http://www.bobsgear.com/display/bobsgear/Analysis+of+thread+dumps+to+figure+out+why+Confluence+CPU+load+suddenly+decreased
I also mentioned above that I have looked at all recommendations regarding bandana table. Did I miss anything?
Garnet R. Chaney added a comment - 13/Sep/10 10:20am
This set of logs is after global usage plugin disabled. There are no out of memory errors in it. Yet before 3:00pm yesterday, as you can see from apache access log, Confluence was running, but not responding to requests, causing 502 "proxy" errors for Apache.
Garnet R. Chaney added a comment - 13/Sep/10 10:13pm
My confluence performance is horrible right now. Logs attached: confluence_support_2010-09-13-20-00-46.zip
i have enabled both page profiling and SQL logging, and most of the work is being caused by http-8080-10. At 18:57 was the last time there ws a UtilTimerStack showing /json/startheartbeatactivity.action (786 ms). After that, it starts making most of the queries against the database, the same query over and over.
At 19:06:41 http-8080-10 and http-8080-12 start hitting the database, and for next 40 minutes all they do is hit the database.
The thread http-8080-10 shows this query over and over, over 3000 times so far, with nothing else logged from it:
Using this grep, shows that for the last 2397 queries against the database, 1082 of them were from this thread. That's over a period of approx 19:23:01 to 19:31:38
Occassionally it makes a couple of queries queries for labels.
tail -n 10000 atlassian-confluence.log | grep "hibernate.SQL" | nl | grep http-8080-10 | nl | tail
We're up to 4651 queries in the last 30 minutes to http-8080-10
http-8080-12 doesn't put as much load, but it is also doing a query of its own over and over:
Including http-8080-12.... Last 3384 queries, 249 were http-8080-12, 1641 were http-8080-10
What are http-8080 threads doing for such a long time without creating page logging entries in the log?
I give thumbs down to bandana queries issue from hercules. Bandana query is in the atlassian-confluence.log about 40 times total. http-8080-10 & http-8080-12 have made 10,596 queries against the database, out of a total of 20146 SQL queries.
For more, see CSP-51459 Slow Confluence - Escalation Engineer Response