| In this set of pages on Investigating A Slow Confluence Installation - Version 3.3.1, I document the steps I took to identify slow-running operations on my copy of Confluence 3.3.1. I recently added about 10K pages and labels for an EGW topical index, roughly doubling the number of pages (but not the total size of the content). This should account for the recent slowdowns and the non-responsive Confluence that I have seen.
In this set of topics I discuss how I examined the database and stack dumps, dealt with Confluence hangs, logged bugs with the vendor, examined robot activity, and took other defensive measures to keep my copy of Confluence running. Please let me know if these topics helped you run your copy of Confluence. You can use the "Add comment" link on any of these pages to let me know. Please include your email address in your comment. I will keep a copy of your email so we can correspond about Confluence, but will quickly remove it from the comment you place here. |
While Investigating A Slow Confluence Installation - Version 3.3.1, I looked at the HTTP access_log and noticed entries like this:
66.249.68.77 - - [10/Sep/2010:17:55:10 -0700] "GET /pages/createpage.action?spaceKey=egw&title=Souls%2C&linkCreation=true&fromPageId=19673782 HTTP/1.1" 200 76251 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
That's not good. I don't need Google trying to create pages in the wiki. The page Prevent search engine indexing using robots.txt has details.
This found all the googlebot hits for the last two days, and counted them:
grep -i googlebot access_log | nl | tail
From 05/Sep 4am to 10/Sep 18:28 - 10163 hits. Not too terrible. About 1 a minute right now. Not bad at all.
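The count above comes from numbering the matching lines and reading the last number. A minimal sketch of a more direct way to get the same figures follows. The function names count_bot_hits and bot_time_span are hypothetical helpers I made up for illustration, and the sketch assumes the log uses the combined format shown in the sample entry above.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers,
# not part of Confluence or the web server.

# Print the number of log lines that mention Googlebot (case-insensitive).
count_bot_hits() {
    grep -ci googlebot "$1"
}

# Print the timestamps of the first and last Googlebot hits.
# Assumes the combined log format, where field 4 looks like
# "[10/Sep/2010:17:55:10".
bot_time_span() {
    grep -i googlebot "$1" |
        awk 'NR == 1 { first = $4 } { last = $4 }
             END { print substr(first, 2), "to", substr(last, 2) }'
}

# Usage (uncomment and point at a real log):
# count_bot_hits access_log
# bot_time_span access_log
```

Dividing the count by the number of minutes in the reported span gives the hits-per-minute rate.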
I don't need Google looking at page sources, so block this:
- /pages/viewpagesrc.action
These returned a page not found; they could be URLs from previous versions of Confluence. So we'll block them:
- /pages/favourites/pagefavourites.action
- /pages/watchers/pagewatchers.action
- /pages/worddav
- /plugins/servlet/editinword
- /s/1814/3 - don't need bots looking at style stuff
- /pages/editpage.action - Obviously!
- /login.action
- /s/1724/1/1.0/
- /spaces/flyingpdf - I don't need bots bogging me down asking for PDF versions of my pages.
- /wiki/
- /s/1923
- /exportword
These are all 404! No need to carefully consider them, just try to block them all!
Questionable:
- Do I really want search engines hot linking to my attachments? Probably not, so block /pages/viewpageattachments.action
To find more, I used this command:
grep -i googlebot access_log | grep -v -i display | grep -v pagefavourites.action | grep -v viewpagesrc.action | grep -v pagewatchers.action | grep -v /spaces/flyingpdf | grep -v /exportword | grep -v /s/1814 | grep -v /s/1923 | grep -v viewpageattachments.action | grep -v worddav | grep -v login.action | grep -v editpage.action | grep 404 | less
I used this command to determine how many googlebot references were left:
grep -i googlebot access_log | grep -v /pages/createpage.action | grep -v viewpagesrc.action | grep -v pagefavourites.action | grep -v pagewatchers.action | grep -v /pages/worddav | grep -v editinword | grep -v /s/1814/3 | grep -v editpage.action | grep -v login.action |grep -v /s/1724 | grep -v viewpageattachments.action |grep -v flyingpdf | grep -v exportword | grep -v /s/1923 | grep -v /wiki/ | grep -v viewpreviousversions.action | grep -v copypage.action | grep -v /download/tmp | nl | tail
Total is 5495. So a good robots.txt exclusion list can knock down roughly half of the traffic from robots (assuming they obey it!). Of the 5495, 929 are for things other than /display pages.
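The long chain of grep -v calls above gets hard to maintain as the list grows. A minimal sketch of one alternative is to keep the fragments in a list and pass them to a single grep with -F (fixed strings), -f (read patterns from a file), and -v (invert). The function names excluded_patterns and remaining_bot_hits are hypothetical. Note one small difference from the chain: with -F the dot in "login.action" matches only a literal dot, where the original treated it as a regex wildcard.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical helpers.

# The URL fragments to exclude, one per line
# (the same strings used in the grep -v chain).
excluded_patterns() {
    cat <<'EOF'
/pages/createpage.action
viewpagesrc.action
pagefavourites.action
pagewatchers.action
/pages/worddav
editinword
/s/1814/3
editpage.action
login.action
/s/1724
viewpageattachments.action
flyingpdf
exportword
/s/1923
/wiki/
viewpreviousversions.action
copypage.action
/download/tmp
EOF
}

# Print Googlebot log lines that match none of the excluded fragments.
remaining_bot_hits() {
    patterns=$(mktemp)
    excluded_patterns > "$patterns"
    grep -i googlebot "$1" | grep -v -F -f "$patterns"
    rm -f "$patterns"
}

# Usage (uncomment and point at a real log):
# remaining_bot_hits access_log | wc -l
```

Adding a new exclusion is then a one-line change to the list instead of another pipe stage.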
URLs like this
/plugins/recently-updated/changes.action?theme=sidebar&pageSize=5&startHandle=com.atlassian.confluence.pages.Comment-18612989&authors=admin&spaceKeys=*&contentType=-mail,page,comment,blogpost,attachment,userinfo,spacedesc,personalspacedesc,status
seem to always return "No recent updates found".
With this query,
grep -i googlebot access_log | grep -v /pages/createpage.action | grep -v viewpagesrc.action | grep -v pagefavourites.action | grep -v pagewatchers.action | grep -v /pages/worddav | grep -v editinword | grep -v /s/1814/3 | grep -v editpage.action | grep -v login.action |grep -v /s/1724 | grep -v viewpageattachments.action |grep -v flyingpdf | grep -v exportword | grep -v /s/1923 | grep -v /wiki/ | grep -v viewpreviousversions.action | grep -v copypage.action | grep -v /download/tmp |grep -v /display/ | grep /plugins/recently-updated | grep -v 64
I found some that returned more content, such as 2696 or 3079 characters. But when I checked them, they showed just the same small phrase.
Each one has a different startHandle, but the rest of the parameters seem to be the same. Not too useful, so I am blocking these too.
That got me down to 852. Excluding requests for robots.txt itself got rid of another 19 hits, so Googlebot checks for the exclusions a few times a day.
/pages/diffpagesbyversion.action URLs return some interesting unique content, so I am not excluding them.
URLs like this could be just duplicates of display pages, or they could be for pages that have certain characters in the title:
- /pages/viewpage.action?pageId=20227890
I had 200 of them. Leaving them in.
/pages/viewinfo.action?pageId=19675346 type URLs just seem to go back to viewpage if the ID is not the latest revision of the page. There are 210 of these.
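Counts like the 200 and 210 above can be pulled for every URL type at once. This is a minimal sketch, assuming the combined log format (where field 7 is the request URL). The function name bot_hits_by_path is hypothetical. It strips the query string so that all pageId variants of one action are counted together.

```shell
#!/bin/sh
# Sketch only: bot_hits_by_path is a hypothetical helper name.

# Print a count of Googlebot requests per URL path, most frequent first.
# Assumes the combined log format, where field 7 is the request URL.
bot_hits_by_path() {
    grep -i googlebot "$1" |
        awk '{ split($7, parts, "?"); print parts[1] }' |
        sort | uniq -c | sort -rn
}

# Usage (uncomment and point at a real log):
# bot_hits_by_path access_log | head -n 20
```

The top of that list is a quick way to spot the next candidates for the exclusion list.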
I found this in http://confluence.atlassian.com/robots.txt:
User-agent: *
crawl-delay: 60
Disallow: /display/TEST
Disallow: /display/TESTING
Disallow: /label
Disallow: /pages/doexportpage.action
Disallow: /exportword
Disallow: /spaces/flyingpdf
Disallow: /pages/downloadallattachments.action
Disallow: /download/temp/
I added it to mine, except for /label: I want robots to crawl all the label pages.
So here is the robots.txt I came up with so far:
User-agent: *
crawl-delay: 30
Disallow: /pages/createpage.action
Disallow: /pages/viewpagesrc.action
Disallow: /pages/favourites/pagefavourites.action
Disallow: /pages/watchers/pagewatchers.action
Disallow: /pages/worddav
Disallow: /plugins/servlet/editinword
Disallow: /s/1814/3
Disallow: /pages/editpage.action
Disallow: /login.action
Disallow: /s/1724/1/1.0/
Disallow: /pages/viewpageattachments.action
Disallow: /spaces/flyingpdf
Disallow: /exportword
Disallow: /s/1923
Disallow: /wiki/
Disallow: /pages/viewpreviousversions.action
Disallow: /pages/copypage.action
Disallow: /download/temp
Disallow: /plugins/recently-updated/changes.action
Disallow: /display/TEST
Disallow: /display/TESTING
Disallow: /label
Disallow: /pages/doexportpage.action
Disallow: /pages/downloadallattachments.action
Disallow: /download/temp/
Comments on the Prevent search engine indexing page included a very extensive robots.txt meant to tame the Google Search Appliance on an intranet, so I added most of those disallows to mine.
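As the exclusion list grows, it helps to check whether a given URL path would be covered by it. The following is a minimal sketch, not a full robots.txt parser. The function name is_disallowed is hypothetical, and it assumes a robots.txt file with one directive per line. It treats each Disallow value as a plain path prefix and ignores User-agent groups, Allow lines, and wildcards.

```shell
#!/bin/sh
# Sketch only: is_disallowed is a hypothetical helper name.

# Return 0 if the URL path starts with any Disallow prefix in the
# given robots.txt file, 1 otherwise.
# Simplification (assumption): ignores User-agent groups, Allow lines,
# and wildcard characters.
is_disallowed() {
    robots_file=$1
    url_path=$2
    awk -v path="$url_path" '
        tolower($1) == "disallow:" && $2 != "" {
            # index() returns 1 when the prefix sits at the start of path.
            if (index(path, $2) == 1) { found = 1 }
        }
        END { exit (found ? 0 : 1) }
    ' "$robots_file"
}

# Usage (uncomment and point at a real file):
# is_disallowed robots.txt "/pages/editpage.action?pageId=1" && echo blocked
```

Because matching is by prefix, a rule such as /display/TEST also covers /display/TESTING, which is worth keeping in mind when choosing short prefixes like /s/1923.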