我们的一个应用程序使用Elasticsearch(1.4.4)作为内存中的caching。 该应用程序是使用Oracle 1.7在Tomcat 7上部署的Java webapp。 elasticsearch实例是部署在同一台服务器上的单节点安装程序。
由于elasticsearch 1.3.3在应用程序和Elasticsearch节点之间的环回接口中使用空闲应用程序进出约40 MBit / s。
这不是那么多,但会导致一个明显的负载,否则平抑系统。 我没有这个应用程序的生产系统,所以我不能真正说出它是如何演变的。
通过tcpdump抓取stream量并在Wireshark中分析stream量,这表明应用中的Elasticsearch-Client不断要求节点获取每次产生10k答案的cluster/node/info 。
也许完全不相关,但启用服务器和客户端日志logging给我们:
Elasticsearch服务器日志:
[2015-05-12 14:45:01,600][INFO ][node ] [Illyana Rasputin] initializing ... [2015-05-12 14:45:01,608][INFO ][plugins ] [Illyana Rasputin] loaded [], sites [] [2015-05-12 14:45:06,666][INFO ][node ] [Illyana Rasputin] initialized [2015-05-12 14:45:06,667][INFO ][node ] [Illyana Rasputin] starting ... [2015-05-12 14:45:06,828][INFO ][transport ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.24.1.128:9300]} [2015-05-12 14:45:06,851][INFO ][discovery ] [Illyana Rasputin] bkbo_index/TITPDFdtR6SXX5EeOXaidg [2015-05-12 14:45:09,892][INFO ][cluster.service ] [Illyana Rasputin] new_master [Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[/10.24.1.128:9300]], reason: zen-disco-join (elected_as_master) [2015-05-12 14:45:09,943][INFO ][http ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.24.1.128:9200]} [2015-05-12 14:45:09,944][INFO ][node ] [Illyana Rasputin] started [2015-05-12 14:45:11,283][INFO ][gateway ] [Illyana Rasputin] recovered [2] indices into cluster_state
Elasticsearch客户端:
2015-05-12 14:46:40,683 INFO [localhost-startStop-1] PluginsService:<init>:151 [Antiphon the Overseer] loaded [], sites [] 2015-05-12 14:46:41,548 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Antiphon the Overseer] node_sampler_interval[5ms] 2015-05-12 14:46:41,594 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Antiphon the Overseer] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] 2015-05-12 14:46:41,625 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] 2015-05-12 14:46:41,655 INFO [localhost-startStop-1] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting... org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [0] timed out after [6ms] at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-05-12 14:46:41,658 DEBUG [localhost-startStop-1] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call 2015-05-12 14:46:41,661 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] 2015-05-12 14:46:41,669 INFO [elasticsearch[Antiphon the Overseer][generic][T#1]] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting... org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [1] timed out after [5ms] at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-05-12 14:46:41,670 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call 2015-05-12 14:46:41,676 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] 2015-05-12 14:46:41,677 WARN [elasticsearch[Antiphon the Overseer][transport_client_worker][T#2]{New I/O worker #2}] TransportService$Adapter:remove:280 [Antiphon the Overseer] Received response for a request that has timed out, sent [14ms] ago, timed out [9ms] ago, action [cluster:monitor/nodes/info], node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]], id [1] 2015-05-12 14:46:41,682 INFO [localhost-startStop-1] PluginsService:<init>:151 [Ricochet] loaded [], sites [] 2015-05-12 14:46:41,722 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Ricochet] node_sampler_interval[5ms] 2015-05-12 14:46:41,733 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Ricochet] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] 2015-05-12 14:46:41,734 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] 2015-05-12 14:46:41,759 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]] 2015-05-12 14:46:41,760 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]]
是的,这个应用程序有两个客户端连接,应该是可以的(根据开发者)。 这些断开/重新连接周期大约每隔一分钟发生一次。
任何线索是怎么回事? 我已经通过discovery.zen.ping.multicast.enabled: false禁用了多播。
你的客户,似乎join了集群(这很好,虽然如果你使用Kibana 4,你可能会从Kibana(不知道这些投诉是否使它不在4testing版)
从您的客户端日志中:
2015-05-12 14:46:41,548 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Antiphon the Overseer] node_sampler_interval[5ms]
5ms看起来相当积极是集群中的采样节点。 我还没有看到默认情况下,这是什么,但我猜测,什么是毫秒configuration时,预计秒?
此时,您需要考虑客户端API的设置,尽pipe客户端可能会从集群中select此设置(因为它正在成为集群的一部分)
据推测你使用的是elastic.co提供的Java API?
你是否可以在任何地方configurationclient.transport.nodes_sampler_interval ?
您是否按照Java客户端API的文档使用兼容的客户端/服务器版本
请注意,我们鼓励您在客户端和集群端使用相同的版本。 混合主要版本时,您可能会遇到一些不兼容问题
如果一个度量单位在版本之间有变化,我不会感到惊讶,尽pipe文档确实说默认值是5s
检查你的node_sampler_interval和你的代码中node_sampler_interval实例。 也许你需要用5sreplace裸体5 ?