Nginx as a load balancer: frequent "upstream timed out (110: Connection timed out) while connecting to upstream" errors

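I am trying to use nginx on a CentOS 7 virtual machine as a load balancer to replace an aging Coyote Point hardware appliance. However, for one of our web applications we are seeing frequent and persistent upstream timeout errors in the logs, and clients are reporting session problems while using the system.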

Here are the relevant bits of our nginx.conf:

    user nginx;
    worker_processes 4;
    error_log /var/log/nginx/error.log warn;
    pid /var/run/nginx.pid;

    events {
        worker_connections 1024;
    }

    upstream farm {
        ip_hash;
        server www1.domain.com:8080;
        server www2.domain.com:8080 down;
        server www3.domain.com:8080;
        server www4.domain.com:8080;
    }

    server {
        listen 192.168.1.87:80;
        server_name host.domain.com;
        return 301 https://$server_name$request_uri;
    }

    server {
        listen 192.168.1.87:443 ssl;
        server_name host.domain.com;

        ## Compression
        gzip on;
        gzip_buffers 16 8k;
        gzip_comp_level 4;
        gzip_http_version 1.0;
        gzip_min_length 1280;
        gzip_types text/plain text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript image/x-icon image/bmp;
        gzip_vary on;

        tcp_nodelay on;
        tcp_nopush on;
        sendfile off;

        location / {
            proxy_connect_timeout 10;
            proxy_send_timeout 180;
            proxy_read_timeout 180; #to allow for large managers reports
            proxy_buffering off;
            proxy_buffer_size 128k;
            proxy_buffers 4 256k;
            proxy_busy_buffers_size 256k;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_pass http://farm;

            location ~* \.(css|jpg|gif|ico|js)$ {
                proxy_cache mypms_cache;
                add_header X-Proxy-Cache $upstream_cache_status;
                proxy_cache_valid 200 60m;
                expires 60m;
                proxy_pass http://farm;
            }
        }

        location /basic_status {
            stub_status;
        }

        error_page 502 502 = /maintenance.html;
        location = /maintenance.html {
            root /www/;
        }
    }

In the logs, I frequently see entries like these:

    2015/03/13 15:22:58 [error] 4482#0: *557390 upstream timed out (110: Connection timed out) while connecting to upstream, client: 72.160.92.101, server: host.domain.com, request: "GET /tapechart.php HTTP/1.1", upstream: "http://192.168.1.50:8080/tapechart.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:23:14 [error] 4481#0: *557663 upstream timed out (110: Connection timed out) while connecting to upstream, client: 174.53.144.4, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=3105543&show=0 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=3105543&show=0", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=3105543&show=0"
    2015/03/13 15:23:19 [error] 4481#0: *557550 upstream timed out (110: Connection timed out) while connecting to upstream, client: 50.134.133.213, server: host.domain.com, request: "GET /tbltapechart.php?numNights=30&startDate=1-Aug-2015&roomTypeID=-1&hideNav=N&bookingID=&roomFilter=-1 HTTP/1.1", upstream: "http://192.168.1.50:8080/tbltapechart.php?numNights=30&startDate=1-Aug-2015&roomTypeID=-1&hideNav=N&bookingID=&roomFilter=-1", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:23:37 [error] 4483#0: *561705 upstream timed out (110: Connection timed out) while connecting to upstream, client: 74.223.167.14, server: host.domain.com, request: "GET /js/multiselect/jquery.multiselect.filter.css HTTP/1.1", upstream: "http://192.168.1.55:8080/js/multiselect/jquery.multiselect.filter.css", host: "host.domain.com", referrer: "https://host.domain.com/fdhome.php"
    2015/03/13 15:23:40 [error] 4481#0: *561099 upstream timed out (110: Connection timed out) while connecting to upstream, client: 74.223.167.14, server: host.domain.com, request: "GET /img/tabs_left_bc.jpg HTTP/1.1", upstream: "http://192.168.1.55:8080/img/tabs_left_bc.jpg", host: "host.domain.com", referrer: "https://host.domain.com/fdhome.php"
    2015/03/13 15:23:45 [error] 4481#0: *557214 upstream timed out (110: Connection timed out) while connecting to upstream, client: 75.37.141.182, server: host.domain.com, request: "GET /tapechart.php HTTP/1.1", upstream: "http://192.168.1.50:8080/tapechart.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:23:52 [error] 4482#0: *557330 upstream timed out (110: Connection timed out) while connecting to upstream, client: 173.164.149.18, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=658108460B&show=1&toFolioID=3361434 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=658108460B&show=1&toFolioID=3361434", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=658108460B&show=1&toFolioID=3361434"
    2015/03/13 15:24:14 [error] 4481#0: *557663 upstream timed out (110: Connection timed out) while connecting to upstream, client: 174.53.144.4, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=3105543&show=0 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=3105543&show=0", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=3105543&show=0"
    2015/03/13 15:24:15 [error] 4481#0: *557752 upstream timed out (110: Connection timed out) while connecting to upstream, client: 24.158.4.70, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=2070569 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=2070569", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:24:15 [error] 4482#0: *558613 upstream timed out (110: Connection timed out) while connecting to upstream, client: 199.102.121.3, server: host.domain.com, request: "GET /rptlanding.php HTTP/1.1", upstream: "http://192.168.1.50:8080/rptlanding.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:24:17 [error] 4482#0: *557353 upstream timed out (110: Connection timed out) while connecting to upstream, client: 174.53.144.4, server: host.domain.com, request: "GET /js/multiselect/demo/assets/prettify.js HTTP/1.1", upstream: "http://192.168.1.50:8080/js/multiselect/demo/assets/prettify.js", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=3186044"
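With many entries like these, it can help to tally the connect timeouts per backend before guessing at a cause. The following is a minimal Python sketch, not part of the original setup: the function name `count_connect_timeouts` and the regular expression are illustrative assumptions, and the two sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Pulls the host:port out of the nginx error-log field
#   upstream: "http://192.168.1.50:8080/tapechart.php"
UPSTREAM_RE = re.compile(r'upstream: "https?://([^/"]+)')


def count_connect_timeouts(lines):
    """Count 'upstream timed out ... while connecting' errors per upstream host:port."""
    counts = Counter()
    for line in lines:
        if "upstream timed out" not in line:
            continue
        if "while connecting to upstream" not in line:
            continue
        match = UPSTREAM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two entries taken verbatim from the log excerpt, plus one unrelated line.
sample = [
    '2015/03/13 15:22:58 [error] 4482#0: *557390 upstream timed out (110: Connection timed out) while connecting to upstream, client: 72.160.92.101, server: host.domain.com, request: "GET /tapechart.php HTTP/1.1", upstream: "http://192.168.1.50:8080/tapechart.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"',
    '2015/03/13 15:23:37 [error] 4483#0: *561705 upstream timed out (110: Connection timed out) while connecting to upstream, client: 74.223.167.14, server: host.domain.com, request: "GET /js/multiselect/jquery.multiselect.filter.css HTTP/1.1", upstream: "http://192.168.1.55:8080/js/multiselect/jquery.multiselect.filter.css", host: "host.domain.com", referrer: "https://host.domain.com/fdhome.php"',
    "a line that is not an upstream timeout",
]

for upstream, hits in count_connect_timeouts(sample).most_common():
    print(upstream, hits)
```

Against the real file, something like `count_connect_timeouts(open("/var/log/nginx/error.log"))` would produce the same kind of tally. In the excerpt above, 9 of the 11 entries name 192.168.1.50:8080 and the other 2 name 192.168.1.55:8080, so a skew toward one backend is worth checking over a longer window.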

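I originally found that I had to set proxy_read_timeout this high because we have one very large report that takes at least 20 seconds to render for users with medium-sized data sets. For users with the largest data sets, the report can take up to 2 minutes to render. However, it is rarely run, typically less than once a day, and it is never the URL in the GET string in the logs.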

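The four backend servers are identical Apache servers, all running httpd 2.2.29 and PHP 5.5.22, all on the same version of CentOS and fully up to date. Because I originally saw MaxClients being hit in the logs, I have defined the following on each Apache host: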

    <IfModule mpm_prefork_module>
        StartServers 10
        MinSpareServers 10
        MaxSpareServers 20
        MaxClients 200
        MaxRequestsPerChild 300
    </IfModule>

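The nginx server and the Apache servers are all in the same data center, on the same subnet and VLAN, and I see nothing in the error_log on the Apache side that would indicate a cause for the timeouts.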

Other things we have tried in order to resolve this include:

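  • Increasing proxy_read_timeout to 300.
  • Removing the gzip settings.
  • Removing the location block that caches CSS, images, and JavaScript.
  • Enabling proxy_buffering. It was disabled because of the large report, so that nginx could start serving the rendering (which includes a JavaScript progress indicator for building the report) instead of showing a blank page for 20 to 120 seconds.
  • Adding keepalive 8/16/32/64 to the upstream.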

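At this point I would suspect a networking problem or a backend problem, but since I moved the web application back to the Coyote Point load balancer the complaints have dropped off.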
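Since the logged errors occur "while connecting to upstream", one way to separate a network problem from a backend problem is to time plain TCP connects from the nginx host to each backend. The following is a rough Python sketch under stated assumptions: `time_connect` is a hypothetical helper, and the 10-second default simply mirrors the proxy_connect_timeout value in the config above.

```python
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Return the seconds needed to open a TCP connection, or None if it fails or times out."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return None
```

Calling it in a loop, for example `time_connect("www1.domain.com", 8080)` for each server in the `farm` upstream, and logging any result above a second or so (or `None`) would show whether connects stall for one backend only or for all of them. It measures only the handshake, not how long Apache takes to answer a request.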

I would really like to get this figured out, but I am at a bit of a loss. Any advice?

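I ran into a problem like this in an nginx <-> apache2 setup. It turned out that Apache was taking too long under load because MySQL was getting into deadlocks. To find out how long Apache was taking, I changed its log format to: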

 LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %DµSEC" timed 

and the nginx log format to:

 log_format timed_combined '$remote_addr - $remote_user [$time_local] ' 

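With that it was much easier to see that, although Apache was completing all of the requests, it was handing the data back to nginx late (by several seconds).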
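To make those late responses stand out, the Apache access log can be filtered by the duration that `%D` writes at the end of each line. The following is a minimal Python sketch, assuming the `timed` LogFormat shown above is in use. The function name `slow_requests`, the one-second threshold, and the sample lines are all hypothetical.

```python
import re

# With the `timed` LogFormat, every line ends in "<microseconds>µSEC" (from %D).
DURATION_RE = re.compile(r"(\d+)µSEC\s*$")
# First quoted field is the request line (%r); capture method and URL only.
REQUEST_RE = re.compile(r'"([A-Z]+ \S+)[^"]*"')


def slow_requests(lines, threshold_seconds=1.0):
    """Return (seconds, request) pairs at or above the threshold, slowest first."""
    slow = []
    for line in lines:
        duration = DURATION_RE.search(line)
        if not duration:
            continue  # not written with the `timed` format
        seconds = int(duration.group(1)) / 1_000_000
        if seconds >= threshold_seconds:
            request = REQUEST_RE.search(line)
            slow.append((seconds, request.group(1) if request else "?"))
    return sorted(slow, reverse=True)


# Hypothetical log lines in the `timed` format, for illustration only.
sample = [
    '203.0.113.7 - - [13/Mar/2015:15:22:58 -0500] "GET /tapechart.php HTTP/1.1" 200 5120 "https://host.domain.com/tapechart.php" "Mozilla/5.0" 4210000µSEC',
    '203.0.113.9 - - [13/Mar/2015:15:23:01 -0500] "GET /img/tabs_left_bc.jpg HTTP/1.1" 200 812 "https://host.domain.com/fdhome.php" "Mozilla/5.0" 35000µSEC',
]

for seconds, request in slow_requests(sample):
    print(f"{seconds:.2f}s  {request}")
```

Running this per backend and comparing the results would also show whether one Apache server is consistently slower than the rest.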

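I don't know why haproxy helped in your case, unless one of the Apache servers is slower than the others. That can happen when a machine is having recoverable disk errors. Those errors should show up in the system log.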