Hi Garnet,
My name is Vincent and I am the escalation engineer taking an indepth look into your logs and all the information you provided:
Firstly, the database performance does not look healthy at all. I note from your latest logs you have attached:Database Latency:5279
Database Latency:4
An optimal value for normal running confluence is between 0 and 3. 4-10 means the database is under load. We rarely see 5.2 seconds to read one row from the 'ClusterSaftey' table.
The thread dumps you generated: we're going to focus on why thread: http-8080-x threads are doing hibernate or database related activities.
Thread[http-8080-25,5,main] java.net.SocketInputStream.socketRead0(Native Method) java.net.SocketInputStream.read(SocketInputStream.java:129) java.io.DataInputStream.readFully(DataInputStream.java:178) java.io.DataInputStream.readFully(DataInputStream.java:152) net.sourceforge.jtds.jdbc.SharedSocket.readPacket(SharedSocket.java:841) net.sourceforge.jtds.jdbc.SharedSocket.getNetPacket(SharedSocket.java:722) net.sourceforge.jtds.jdbc.ResponseStream.getPacket(ResponseStream.java:466) net.sourceforge.jtds.jdbc.ResponseStream.read(ResponseStream.java:103) ... Thread[http-8080-34,5,main] java.net.SocketInputStream.socketRead0(Native Method) java.net.SocketInputStream.read(SocketInputStream.java:129) java.io.DataInputStream.readFully(DataInputStream.java:178) java.io.DataInputStream.readFully(DataInputStream.java:152) net.sourceforge.jtds.jdbc.SharedSocket.readPacket(SharedSocket.java:841) ... Thread[http-8080-12,5,main] net.sf.hibernate.util.IdentityMap.get(IdentityMap.java:125) net.sf.hibernate.impl.SessionImpl.getEntry(SessionImpl.java:709) net.sf.hibernate.impl.SessionImpl.getEntityIdentifier(SessionImpl.java:2749) net.sf.hibernate.impl.SessionImpl.updateReachableCollection(SessionImpl.java:2911) net.sf.hibernate.impl.FlushVisitor.processCollection(FlushVisitor.java:32) net.sf.hibernate.impl.AbstractVisitor.processValue(AbstractVisitor.java:69) Thread[http-8080-20,5,main] com.mchange.v2.resourcepool.BasicResourcePool.checkinResource(BasicResourcePool.java:646) com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$ConnectionEventListenerImpl.doCheckinResource(C3P0PooledConnectionPool.java:636) com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$ConnectionEventListenerImpl.connectionClosed(C3P0PooledConnectionPool.java:630) com.mchange.v2.c3p0.util.ConnectionEventSupport.fireConnectionClosed(ConnectionEventSupport.java:55) com.mchange.v2.c3p0.impl.NewPooledConnection.fireConnectionClosed(NewPooledConnection.java:510) com.mchange.v2.c3p0.impl.NewPooledConnection.markClosedProxyConnection(NewPooledConnection.java:381) com.mchange.v2.c3p0.impl.NewProxyConnection.close(NewProxyConnection.java:1246) net.sf.hibernate.connection.C3P0ConnectionProvider.closeConnection(C3P0ConnectionProvider.java:40) net.sf.hibernate.impl.BatcherImpl.closeConnection(BatcherImpl.java:311) net.sf.hibernate.impl.SessionImpl.disconnect(SessionImpl.java:3397) Thread[http-8080-30,5,main] java.lang.Object.hashCode(Native Method) java.util.HashMap.put(HashMap.java:372) java.util.HashSet.add(HashSet.java:200) net.sf.hibernate.util.IdentityMap.entrySet(IdentityMap.java:164) net.sf.hibernate.impl.SessionImpl.postFlush(SessionImpl.java:2854)
As Jean has pointed out in the previous comment you have situations where your instance has run out of connections from the pool (hibernate.c3p0.max_size = 80 - which is high and you still getting threads waiting for pool resources) , and have threads waiting, which leads to you and your users experiencing non responsiveness.
(The problem is exhabibated when users try to refresh...then it uses more of your http-8080 threads, until the thread limit for tomcat is reached, in your cases 75 processes..
At this stage it looks like there maybe database performance issues. Would it be possible to check with your Microsoft SQL performance and see how long each query is taking to execute: (ie turn on SQL logging on sql side)
We understand you next question is why does it appear that Confluence is Spamming the database (and probably causing it to slow down,)
Confluence does require intensive communication with the backend.
In most cases we find that customer that use 'documentation layout' and those that utilize 'page tree macro' / 'natural children' macro will tend to make Confluence poll the database more often because these require the code to find children of the current page to display + as well as for these pages it finds assert the current user has permission to view those pages....'naturally'
2010-09-13 20:01:00,745 DEBUG [http-8080-10] [net.sf.hibernate.SQL] log select page0_.CONTENTID as CONTENTID, page0_.CHILD_POSITION as CHILD_P13_, page0_.PARENTID as PARENTID, page0_.SPACEID as SPACEID, page0_.TITLE as TITLE, page0_.VERSION as VERSION, page0_.CREATOR as CREATOR, page0_.CREATIONDATE as CREATION6_, page0_.LASTMODIFIER as LASTMODI7_, page0_.LASTMODDATE as LASTMODD8_, page0_.VERSIONCOMMENT as VERSIONC9_, page0_.PREVVER as PREVVER, page0_.CONTENT_STATUS as CONTENT11_ from CONTENT page0_ where page0_.CONTENTTYPE='PAGE' and ((page0_.SPACEID=? )and(lower(page0_.TITLE)=? )and(page0_.PREVVER is null )and(page0_.CONTENT_STATUS='current' )) 2010-09-13 20:01:00,749 DEBUG [http-8080-10] [net.sf.hibernate.SQL] log select page0_.CONTENTID as CONTENTID, page0_.CHILD_POSITION as CHILD_P13_, page0_.PARENTID as PARENTID, page0_.SPACEID as SPACEID, page0_.TITLE as TITLE, page0_.VERSION as VERSION, page0_.CREATOR as CREATOR, page0_.CREATIONDATE as CREATION6_, page0_.LASTMODIFIER as LASTMODI7_, page0_.LASTMODDATE as LASTMODD8_, page0_.VERSIONCOMMENT as VERSIONC9_, page0_.PREVVER as PREVVER, page0_.CONTENT_STATUS as CONTENT11_ from CONTENT page0_ where page0_.CONTENTTYPE='PAGE' and ((page0_.SPACEID=? )and(lower(page0_.TITLE)=? )and(page0_.PREVVER is null )and(page0_.CONTENT_STATUS='current' )) 2010-09-13 20:01:00,753 DEBUG [http-8080-10] [net.sf.hibernate.SQL] log select page0_.CONTENTID as CONTENTID, page0_.CHILD_POSITION as CHILD_P13_, page0_.PARENTID as PARENTID, page0_.SPACEID as SPACEID, page0_.TITLE as TITLE, page0_.VERSION as VERSION, page0_.CREATOR as CREATOR, page0_.CREATIONDATE as CREATION6_, page0_.LASTMODIFIER as LASTMODI7_, page0_.LASTMODDATE as LASTMODD8_, page0_.VERSIONCOMMENT as VERSIONC9_, page0_.PREVVER as PREVVER, page0_.CONTENT_STATUS as CONTENT11_ from CONTENT page0_ where page0_.CONTENTTYPE='PAGE' and ((page0_.SPACEID=? )and(lower(page0_.TITLE)=? )and(page0_.PREVVER is null )and(page0_.CONTENT_STATUS='current' )) 2010-09-13 20:01:00,758 DEBUG [http-8080-10] [net.sf.hibernate.SQL] log select page0_.CONTENTID as CONTENTID, page0_.CHILD_POSITION as CHILD_P13_, page0_.PARENTID as PARENTID, page0_.SPACEID as SPACEID, page0_.TITLE as TITLE, page0_.VERSION as VERSION, page0_.CREATOR as CREATOR, page0_.CREATIONDATE as CREATION6_, page0_.LASTMODIFIER as LASTMODI7_, page0_.LASTMODDATE as LASTMODD8_, page0_.VERSIONCOMMENT as VERSIONC9_, page0_.PREVVER as PREVVER, page0_.CONTENT_STATUS as CONTENT11_ from CONTENT page0_ where page0_.CONTENTTYPE='PAGE' and ((page0_.SPACEID=? )and(lower(page0_.TITLE)=? )and(page0_.PREVVER is null )and(page0_.CONTENT_STATUS='current' )) 2010-09-13 20:01:00,762 DEBUG [http-8080-10] [net.sf.hibernate.SQL] log select page0_.CONTENTID as CONTENTID, page0_.CHILD_POSITION as CHILD_P13_, page0_.PARENTID as PARENTID, page0_.SPACEID as SPACEID, page0_.TITLE as TITLE, page0_.VERSION as VERSION, page0_.CREATOR as CREATOR, page0_.CREATIONDATE as CREATION6_, page0_.LASTMODIFIER as LASTMODI7_, page0_.LASTMODDATE as LASTMODD8_, page0_.VERSIONCOMMENT as VERSIONC9_, page0_.PREVVER as PREVVER, page0_.CONTENT_STATUS as CONTENT11_ from CONTENT page0_ where page0_.CONTENTTYPE='PAGE' and ((page0_.SPACEID=? )and(lower(page0_.TITLE)=? )and(page0_.PREVVER is null )and(page0_.CONTENT_STATUS='current' ))
Security checks: -> child pages.
2010-09-13 19:31:44,984 DEBUG [http-8080-12] [net.sf.hibernate.SQL] log select contentper3_.ID as ID0_, contentper4_.ID as ID1_, contentper3_.CONT_PERM_TYPE as CONT_PER2_0_, contentper3_.CONTENT_ID as CONTENT_ID0_, contentper3_.CREATIONDATE as CREATION4_0_, contentper3_.LASTMODDATE as LASTMODD5_0_, contentper4_.CP_TYPE as CP_TYPE1_, contentper4_.USERNAME as USERNAME1_, contentper4_.GROUPNAME as GROUPNAME1_, contentper4_.CPS_ID as CPS_ID1_, contentper4_.CREATOR as CREATOR1_, contentper4_.CREATIONDATE as CREATION7_1_, contentper4_.LASTMODIFIER as LASTMODI8_1_, contentper4_.LASTMODDATE as LASTMODD9_1_, contentper4_.CPS_ID as CPS_ID__, contentper4_.ID as ID__ from CONTENT page0_ inner join CONFANCESTORS ancestors1_ on page0_.CONTENTID=ancestors1_.DESCENDENTID inner join CONTENT contentent2_ on ancestors1_.ANCESTORID=contentent2_.CONTENTID inner join CONTENT_PERM_SET contentper3_ on contentent2_.CONTENTID=contentper3_.CONTENT_ID inner join CONTENT_PERM contentper4_ on contentper3_.ID=contentper4_.CPS_ID where page0_.CONTENTTYPE='PAGE' and ((page0_.CONTENTID=? )and(contentper3_.CONT_PERM_TYPE=? ))
Now the above hibernate out indicates that lower (SQL function) is being used: Have you checked you have added in your sql database an index for lower int he content table? If not, please see this KB article which explains how you could improve performance.
Our Final thing that we notice is you're running Zen VM Linux: May I ask what mode of zen you are running in: Paravirtualization or Harware Assisted or Emulation - We suggestion (and it is your option and call) to try to replicate your test data in a test database and get an evaluation version of Confluence on a dedicated server, and see the performance difference. The reason we ask to compare is that even though we do support VM configuration on non clustered editions, we have not tested this against Zen VM, so we are unable fully support your environment.
Just to put perspective to this matter. I just finished dealing with another customer, who was running mysql (and it's a fast database, but it was running in VM), and their instance was slow on a clean instance of Confluence (no spaces and no pages - fresh install).
We requested the customer to run his database in non VM and that did NOT work. Then we requested the the user to setup confluence run it as an in memory HSQL confluence (evaluation mode), and the performance was much faster. The customer in the end put his VM confluence on dedicated machine and the case was closed.
We look forward to your feedback in the results of different environment configurations.
Hope our response explains the current situation.
Thank you.
Regards
Vincent Choy