CMS垃圾收集运行后,WebLogic服务器似乎冻结

我们使用带有10.3.6 WebLogic Cluster的Java Hotspot 1.6,其中两个节点在物理负载平衡器后面运行OSB 11.1.1.7:

java version "1.6.0_91" Java(TM) SE Runtime Environment (build 1.6.0_91-b13) Java HotSpot(TM) 64-Bit Server VM (build 20.91-b07, mixed mode) 

我们已经调整了JVM以使用CMS并避免长时间运行的垃圾回收:

 -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:NewRatio=4 -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60 -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark 

大多数GC的暂停阶段(初始标记和备注)只需不到一秒钟,我们没有看到任何失败的CMS运行,并且我们在GC日志中看不到完整的GC。

但是,在完成CMS GC执行后几秒钟(特别是在具有大量调用OSB的组合的Oracle SOA安装中)时,我们正在尝试连接到OSB的客户机上收到超时错误:

OSB 1 gc.log:

 > 2016-07-28T08:31:52.656-0300: 492247.689: [GC > 2016-07-28T08:31:52.656-0300: 492247.690: [ParNew: > 1027265K->22014K(1132416K), 0.0412613 secs] > 4046262K->3043322K(6165632K), 0.0420791 secs] [Times: user=0.25 > sys=0.00, real=0.04 secs] 2016-07-28T08:31:52.701-0300: 492247.735: > [GC [1 CMS-initial-mark: 3021307K(5033216K)] 3043844K(6165632K), > 0.0683781 secs] [Times: user=0.08 sys=0.00, real=0.07 secs] 2016-07-28T08:31:52.770-0300: 492247.804: [CMS-concurrent-mark-start] > 2016-07-28T08:31:56.971-0300: 492252.005: [CMS-concurrent-mark: > 4.201/4.201 secs] [Times: user=10.55 sys=0.16, real=4.20 secs] 2016-07-28T08:31:56.971-0300: 492252.005: > [CMS-concurrent-preclean-start] 2016-07-28T08:31:57.064-0300: > 492252.098: [CMS-concurrent-preclean: 0.090/0.093 secs] [Times: user=0.12 sys=0.00, real=0.09 secs] 2016-07-28T08:31:57.064-0300: > 492252.098: [CMS-concurrent-abortable-preclean-start] 2016-07-28T08:32:01.198-0300: 492256.231: [GC > 2016-07-28T08:32:01.198-0300: 492256.232: [ParNew: > 1028128K->30711K(1132416K), 0.0502326 secs] > 4049435K->3053383K(6165632K), 0.0508313 secs] [Times: user=0.34 > sys=0.00, real=0.05 secs] CMS: abort preclean due to time > 2016-07-28T08:32:02.128-0300: 492257.161: > [CMS-concurrent-abortable-preclean: 4.027/5.063 secs] [Times: > user=7.62 sys=0.24, real=5.06 secs] 2016-07-28T08:32:02.134-0300: > 492257.167: [GC[YG occupancy: 186361 K (1132416 K)]2016-07-28T08:32:02.134-0300: 492257.167: [GC > 2016-07-28T08:32:02.134-0300: 492257.167: [ParNew: > 186361K->25052K(1132416K), 0.0231613 secs] > 3209034K->3049538K(6165632K), 0.0236069 secs] [Times: user=0.17 > sys=0.00, real=0.02 secs] 2016-07-28T08:32:02.158-0300: 492257.191: > [Rescan (parallel) , 0.0289939 secs]2016-07-28T08:32:02.187-0300: > 492257.220: [weak refs processing, 0.3412320 secs] [1 CMS-remark: 3024485K(5033216K)] 3049538K(6165632K), 0.4179878 secs] [Times: > user=0.67 sys=0.00, real=0.42 secs] 2016-07-28T08:32:02.552-0300: > 492257.586: [CMS-concurrent-sweep-start] 2016-07-28T08:32:07.144-0300: 492262.177: [CMS-concurrent-sweep: 4.591/4.591 secs] [Times: user=14.05 sys=0.22, real=4.59 secs] 2016-07-28T08:32:07.146-0300: > 492262.180: [CMS-concurrent-reset-start] 2016-07-28T08:32:07.283-0300: 492262.316: [CMS-concurrent-reset: 0.137/0.137 secs] [Times: user=0.36 sys=0.01, real=0.14 secs] 2016-07-28T08:32:36.201-0300: 492291.234: > [GC 2016-07-28T08:32:36.202-0300: 492291.234: [ParNew: > 1030955K->45449K(1132416K), 0.0671313 secs] > 2269248K->1285059K(6165632K), 0.0676485 secs] [Times: user=0.41 > sys=0.00, real=0.07 secs] 

在OSB WebLogic访问日志中,我们注意到这种奇怪的行为,通常在CMS启动后5或10秒,访问日志中的请求之间有20多秒的间隔,通常我们每秒看到几十个调用。

OSB1访问日志:

 2016-07-28 08:32:03 POST /service1/ 200 0.043 2016-07-28 08:32:03 POST /service1/ 200 0.074 2016-07-28 08:32:03 POST /service1/ 200 0.045 2016-07-28 08:32:03 POST /service1/ 200 0.053 2016-07-28 08:32:35 POST /service1/ 200 0.076 

与此同时,SOA服务器的一些调用失败:

SOA日志错误:https://esburl.mycompany.com/service1“成功归因于:无法成功调用端点URI” https://esburl.mycompany.com/service1 “,原因是:javax.xml.soap.SOAPException: javax.xml.soap.SOAPException:消息发送失败:读取超时>

在OSB weblogic输出日志中没有报告错误。 GC运行时,我们已经采取了一些措施,但一直没有find问题的答案。 我们认为,如果初始标记和注释足够快,我们不会看到这样的问题,但是我们仍然怀疑这是JVM,WebLogic还是应用程序问题(OSB)。 任何人在WebLogic或其他应用程序服务器上看到类似的东西?

谢谢!

编辑:

这似乎是问题是SSL相关的:

 ~$ curl -v -k https://mycompany.com:9001/service1/ -o /dev/null * Hostname was NOT found in DNS cache % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying xxxx.. * Connected to mycompany.com (xxxx) port 9001 (#0) * successfully set certificate verify locations: * CAfile: none CApath: /etc/ssl/certs * SSLv3, TLS handshake, **Client hello** (1): } [data not shown] 0 0 0 0 0 0 0 0 --:--:-- **0:00:16** --:--:-- 0* SSLv3, TLS handshake, **Server hello** (2): { [data not shown] * SSLv3, TLS handshake, CERT (11): { [data not shown] * SSLv3, TLS handshake, Server key exchange (12): { [data not shown] * SSLv3, TLS handshake, Server finished (14): { [data not shown] * SSLv3, TLS handshake, Client key exchange (16): } [data not shown] * SSLv3, TLS change cipher, Client hello (1): } [data not shown] * SSLv3, TLS handshake, Finished (20): } [data not shown] * SSLv3, TLS change cipher, Client hello (1): { [data not shown] * SSLv3, TLS handshake, Finished (20): { [data not shown] * SSL connection using EDH-RSA-DES-CBC3-SHA * Server certificate: * subject: C=xx; ST=xx; L=xx; O=xx; OU=Tech; CN=mycompany.com * start date: 2016-02-25 13:17:10 GMT * expire date: 2021-02-23 13:17:10 GMT * issuer: DC=xx; DC=xx; DC=xx; CN=xx * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. > GET /service1 HTTP/1.1 > User-Agent: curl/7.35.0 > Host: mycompany.com:9001 > Accept: */* > < HTTP/1.1 200 OK < Date: Wed, 03 Aug 2016 15:46:41 GMT < Transfer-Encoding: chunked < Content-Type: text/xml; charset=UTF-8 < X-Powered-By: Servlet/2.5 JSP/2.1 

看起来,虽然做GC,它需要服务器很多时间来完成SSL握手(在这个例子中是16秒),但我不知道为什么。 有谁知道有关GC和SSL协议的任何Java问题?

问候。

经过漫长的调查,我们终于find了问题的原因,而CMS GC正在运行,我们采取了几个线程转储,发现这个显眼的堆栈在5或6个线程:

 "[ACTIVE] ExecuteThread: '35' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=3 tid=0x000000001333c000 nid=0xa1 runnable [0xfffffd7e23dd1000] java.lang.Thread.State: RUNNABLE at sun.security.pkcs11.wrapper.PKCS11.C_DestroyObject(Native Method) at sun.security.pkcs11.SessionKeyRef.dispose(P11Key.java:1043) at sun.security.pkcs11.SessionKeyRef.drainRefQueueBounded(P11Key.java:1019) at sun.security.pkcs11.SessionKeyRef. (P11Key.java:1034) at sun.security.pkcs11.P11Key. (P11Key.java:98) at sun.security.pkcs11.P11Key$P11SecretKey. (P11Key.java:379) at sun.security.pkcs11.P11Key.secretKey(P11Key.java:271) at sun.security.pkcs11.P11TlsRsaPremasterSecretGenerator.engineGenerateKey(P11TlsRsaPremasterSecretGenerator.java:84) at javax.crypto.KeyGenerator.generateKey(DashoA13*..) at com.sun.net.ssl.internal.ssl.RSAClientKeyExchange. (RSAClientKeyExchange.java:68) at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverHelloDone(ClientHandshaker.java:807) at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:286) at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:608) at com.sun.net.ssl.internal.ssl.Handshaker$1.run(Handshaker.java:548) at java.security.AccessController.doPrivileged(Native Method) at com.sun.net.ssl.internal.ssl.Handshaker$DelegatedTask.run(Handshaker.java:941) - locked <0xfffffd7e402b5ca0> (a com.sun.net.ssl.internal.ssl.SSLEngineImpl) 

这段代码的执行时间是相同的线程,间隔20-25秒,所以它似乎是一个非常好的嫌疑犯。

在阅读Oracle支持文档和这些链接之后:

https://bugs.openjdk.java.net/browse/JDK-8059337 https://stackoverflow.com/questions/31188663/disabling-pkcs11-solaris-implementation

我们决定使用“-Dsun.security.pkcs11.enable-solaris = false”来禁用Solaris pkcs11实现。 一旦我们这样做了,所有的超时消失了,GC之后访问日志中的差距就不再存在了。

我不确定使用sun pcks11比其他提供商有什么优势,但禁用它似乎解决了我们的问题,迄今为止还没有发现这个变化产生的新问题,所以我们将坚持下去。