| In this set of pages on Investigating A Slow Confluence Installation - Version 3.3.1, I document the steps I took to try and identify slow running operations on my copy of Confluence 3.3.1. I recently added about 12000 new pages and 11000 labels for the Ellen G. White topical index, and I suspect that extra content is pushing Confluence into bad performance. The content has been there for a few days, so things like indexing should be fine, but robots are noticing the content, and coming in to index it. None of this should account for the recent slowdowns and non-responsive Confluence that I have seen.
Atlassian does have a lot of documentation about tuning Confluence performance for high content, or high load scenarios, but my installation really shouldn't fall into that category yet. I know about most of these techniques from applying them to tuning Adobe's wiki installations, one of which has hundreds of gigabytes of content, thousands of spaces, and thousands of users. Adobe runs some external copies of Confluence also, such as at http://learn.adobe.com Having my own copy of Confluence on the internet gives me the perfect playground to see how it performs, and test various fixes without inconveniencing thousands of users of wiki at a multi-billion dollar company. I can apply tweaks without any red tape. I talk in this set of topics how I examined the database, stack dumps, dealt with Confluence hangs, logged bugs with the vendor, examined robot activity, and took other defensive measures to keep my copy of Confluence running. Please let me know if these topics helped you run your copy of Conflunece. You can use the "Add comment" on any of these pages to let me know. Please include your email addres in your comment, I will keep a copy of your email so we can correspond about Confluence, but quickly remove it from the comment you place here. |
I have some pages in Confluence that appear to be taking 10 to 20 or more seconds to load. Bobsgear is currently hosted within a VMware server image running Linux on a Windows 2003 host. The linux guest is configured to have two cpus, and the host has four CPU cores, and plenty of ram to not be swapping to disk. The host runs SQL 2005 Server which is used for the storage of Bobsgear database. All attachments and pages are in the database.
Taskmgr on the host is showing 40%-60% CPU usage, but all the windows seem responsive. The host has 8GB of ram, and 1.7GB of physical memory is available.
top - 17:13:38 up 1 day, 2:53, 5 users, load average: 3.51, 3.68, 3.66
Tasks: 139 total, 3 running, 136 sleeping, 0 stopped, 0 zombie
Cpu(s): 74.1%us, 2.3%sy, 0.0%ni, 23.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2301952k total, 2221576k used, 80376k free, 65496k buffers
Swap: 1081336k total, 112k used, 1081224k free, 560984k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4213 root 18 0 1481m 1.3g 19m S 151.2 59.0 1456:09 java
2621 root 34 19 2580 1184 996 S 0.3 0.1 0:02.83 gam_server
7692 root 15 0 9940 2920 2352 R 0.3 0.1 0:02.91 sshd
7827 apache 15 0 27656 6860 1672 S 0.3 0.3 0:01.05 httpd
1 root 15 0 2088 672 576 S 0.0 0.0 0:10.99 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.34 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.11 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.02 watchdog/0
5 root RT -5 0 0 0 S 0.0 0.0 0:00.22 migration/1
6 root 34 19 0 0 0 S 0.0 0.0 0:00.05 ksoftirqd/1
7 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
8 root 10 -5 0 0 0 S 0.0 0.0 0:00.34 events/0
9 root 10 -5 0 0 0 S 0.0 0.0 0:00.07 events/1
10 root 10 -5 0 0 0 S 0.0 0.0 0:00.04 khelper
11 root 10 -5 0 0 0 S 0.0 0.0 0:00.03 kthread
13 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenwatch
I wanted to record the information from system information, and advanced cache, but forgot to grab them.
I used the admin interface in Confluence to enable page performance logging for a while, and generated about 21MB of atlassian-confluence.log in a couple of hours.
[20100910-speed]# ls -al
total 86768
drwxr-xr-x 2 root root 4096 Sep 10 17:12 .
drwxr-xr-x 16 root root 4096 Sep 10 16:28 ..
rw-rr- 1 root root 65292733 Sep 10 16:28 access_log
rw-rr- 1 root root 21660370 Sep 10 16:29 atlassian-confluence.log
rw-rr- 1 root root 11772 Sep 10 17:00 slow_access_example_10seconds
rw-rr- 1 root root 1742204 Sep 10 17:12 slow_performance.txt
Here is an example of a slow loading page: Slow Access Example - 10 second page load
And using this query I was able to filter out all the tasks that don't take much time:
grep -v "\[[0-9]ms\]" atlassian-confluence.log | grep -v "\[[1-2][0-9]ms\]" > slow_performance.txt
See the results of this query for confluence operations taking the most time.
I am using the Confluence-cli to upload some things from the filesystem of the Bobsgear VM into Bobsgear. I copied confluence.sh to bobsgear-cli.sh, and customized it with login params for the wiki, and placed the bobsgear-cli.sh into my path.
bobsgear-cli.sh --help shows the help screen.
This command uploaded a file into the wiki:
bobsgear-cli.sh --action storepage --space "Bobsgear" --title "Slow Access Example - 10 second page load" --parent "Investigating A Slow Confluence Installation" --file "slow_access_example_10seconds"
(Note:Add extra blank lines within a noformat if the text may scroll off the screen, because the right left scrollbar is blocking the text if you only have one line)
The slowness has gotten really bad. Looked at the http access_log, and noticed stuff like
66.249.68.77 - - [10/Sep/2010:17:55:10 -0700] "GET /pages/createpage.action?spaceKey=egw&title=Souls%2C&linkCreation=true&fromPageId=19673782 HTTP/1.1" 200 76251 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Decided to Create robots.txt for Confluence by analysis of http logs
Confluence finally just was giving proxy errors to every request, so I killed it with a -8.
Restarted confluence. It took
- INFO: Server startup in 160522 ms
Things were a lot better after the restart:
top - 19:26:36 up 1 day, 5:06, 5 users, load average: 0.52, 1.53, 2.45
Tasks: 141 total, 1 running, 140 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2301952k total, 2215816k used, 86136k free, 51496k buffers
Swap: 1081336k total, 176k used, 1081160k free, 738552k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7722 root 15 0 2352 1084 812 R 0.3 0.0 1:15.64 top
1 root 15 0 2088 672 576 S 0.0 0.0 0:11.01 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.37 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.12 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.02 watchdog/0
5 root RT -5 0 0 0 S 0.0 0.0 0:00.26 migration/1
6 root 34 19 0 0 0 S 0.0 0.0 0:00.10 ksoftirqd/1
7 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
8 root 10 -5 0 0 0 S 0.0 0.0 0:00.37 events/0
9 root 10 -5 0 0 0 S 0.0 0.0 0:00.09 events/1
10 root 10 -5 0 0 0 S 0.0 0.0 0:00.04 khelper
Host was showing only 4% - 8% usage, but msnbot had gone away too. Yandex took up the slack and started coming in for more content, but CPU usage in guest 1% for java process, and host 2% was much better.
Going to have to [study the thread dumps to see what was happening before hung Confluence had to be restarted].
After the restart, Confluence kept spinning back out of control after about an hour or two.
After finishing the first draft of the Confluence watchdog for linux, and getting lots of chances to let it to decide to restart Confluence, I started looking into issues with the database.....
More details to come about the database
Sunday Morning 9am.... Since I started at this task Friday evening, I have made good progress on fortifying my Confluence. My Confluence watchdog for linux caught Confluence returning a lot of 502 or 503 errors in the middle of the night, so it restarted it.
Looking at SQL Server 2005 report of most read intensive queries for Confluence, I was able to find a few queries that benefited from having additional indexes created. I noticed some queries using the "lower" operator on titles in the WHERE clause. I don't know if a lowered version of titles can be indexed, but I would look into adding a second copy of the the title that was lower case, and making an index on that field. I am concerned that a dumb SQL query planner would do a table scan instead of using an index. And database tuning advisor did recommend additional indexes for the queries I found that were doing this, but I wasn't able to really figure out if they were addressing this issue or not.
For the other queries I looked at, the database tuning advisor could not recommend any improvements.
The next thing I did was rebuild all the database indexes to get rid of fragmentation. Still no big improvement in speed. Many of them continued to be fragmented, not sure if that is an issue, I need to go back and look at the size of the indexes to see if it matters. Wasn't sure whether to rebuild or reorganize the indexes, either task finished very quickly. Also not sure if rebuilding the indexes improved performance, but average load on the host did seem to decrease by a few percent. I also did some database tuning analysis which claimed 50% speed up on some queries, and did seem to be doing that much or even better. I also figured out how to get SQL Server Profiler to show me some of the queries being executed for about an hour, but I haven't fully figured out what to do with that info. Database tuning advisor is not coming up with any recommendation for a lot of the queries I've been trying.
I've already given attention to things like memory and JVM optimization, and the only thing I could do better is run Confluence in it's own dedicated machine, instead of a VM on the same machine the database server is running on. The vendor likes to blaim the database for problems with the app, and the database could definitely be a source of problems for all kinds of reasons, probably the chief of which might be connection pooling not being able to make enough database connections, lack of indexing in the database, etc. But I've already paid attention to all those things.
In my experience with the Confluence app, assuming the database is properly setup, a SQL Server 2005 database usually accounts for only 10-20% of the overall performance of this application. So even a 10% speedup in the database probably wouldn't be noticed as it would be only a 2% speedup in the application itself.
This morning I have developed mod_rewrite rules to deflect a lot of yandex robot queries for triple lable combinations, since most triple label combinations (caused by adding on related labels, and yandex following those links despite the nofollow Confluence puts on all the links) dont actually select any documents. I am redirecting to a "Sorry, here are the best of our links!" page.
I also decided to use Apache, which sits in front of the confluence requests, to deflect the queries for a full list of labels for spaces, and instead serve this result from a daily static created file. One of the spaces here has over 10K labels in it, and the web page for all the labels in the space is a monstrous 808K! With judicious use of search and replace to get rid of the nofollow and class tags, and duplicate spaces, this page can shrink to "svelt" 479K. So I wrote a script to pull the labels for the biggest spaces, do a lot of search and replace to trim them down, and am letting Apache serve them instead of calling Confluence. There really isn't a lot of traffic to those pages.
I've been keeping an eye on TOP, and all morning Confluence java process has been the top CPU cycle user, coming in at between 100% and 150% CPU. That's better than the 180% it was getting to yesterday before crashing. I'm also keeping a watch on the HTTP access_log, and yandex is still busy looking for three word label combinations, and a little bit of googlebot action, and an occassional real user. No excuse at all for Confluence to be chewing up so much processor.
Then suddently, maybe 30 minutes ago, Confluence java process disappeared from TOP! Now I'm seeing it occassionally reappear using 10% or less of the cPU, and then going back away. It just showed up in the list at 100%, then next update 47%, next update 5%, and then gone again. Need to try and find a reason for this.
Around 8:00am Confluence was still very busy. Fortunately my watchdog is asking for a thread dump every ten minutes, so I have something to compare.
As often happens with this app, the bad performance is coming from within the app itself. As I mentioned, I noticed that suddenly the performance of Confluence increased, and the load on the CPU went way down, so I did an Analysis of thread dumps to figure out why Confluence CPU load suddenly decreased, and also Examing Confluence Page Profiling, and discovered there is often an http processing thread stuck in a long running operation. This thread can remain stuck in this operation for 1 or 2 hours.
Monday morning, as I look at the server, the overall load is at 2%. It's a safe bet that Confluence is not in the middle of that operation. It's still early Monday morning, no response yet from the vendor on the bug report I filed yesterday.
While adding {include:About my investigations of slow performance in Confluence 3.3.1} to 7 pages, I received 3 proxy timeouts. Load is back up to 31% on the host, and Confluence is back to chewing up 150% of the 2 core guest. The copy of Internet Explorer I was using just died in the middle of trying to click to another page, so now I am in a new copy of IE trying to finish this edit. It is a good thing that Confluence frequently saves drafts of edits. I just got a bad gateway response trying to click to this page. I never used to see these problems before the recent extra content that I added. Whatever the high load operation was in Confluence it has gone away and Java is showing 5% cpu utilization.
The watchdog restart Confluence yesterday around 3pm while I was away from the keyboard. But in it's most recent health report it did not mention the 502 errors I saw in the last few minutes (09/13/10 07:00-07:30). So I'm going to have to figure out what is going on with that.