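nginx upstream timeouts on multiple servers at the same time

I have several servers serving a single site.

The main server runs nginx and php-fpm. All the other servers run php-fpm only. nginx talks to the php-fpm on its own host through a unix socket and to the others over TCP.

About once an hour (not exactly, sometimes more often), something strange happens: every connection from nginx to the php-fpm servers times out. nginx cannot establish a connection at all.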

 2014/03/24 04:59:09 [error] 2123#0: *925153 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
 2014/03/24 04:59:09 [error] 2124#0: *926742 connect() to unix:/tmp/php-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://unix:/tmp/php-fpm.sock:", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
 2014/03/24 04:59:09 [error] 2123#0: *925159 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.2:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
 2014/03/24 04:59:09 [error] 2123#0: *923874 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
 2014/03/24 04:59:09 [error] 2123#0: *925164 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
 2014/03/24 04:59:09 [error] 2124#0: *909392 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
 2014/03/24 04:59:09 [error] 2124#0: *923098 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
 2014/03/24 04:59:09 [error] 2125#0: *923309 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"

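Since this is a fairly busy site, the log fills up with entries like the ones above very quickly.

This lasts about 10 to 15 seconds, and then everything returns to normal. Apart from the connection timeout errors posted here, there don't seem to be any other errors.

I suspect the problem lies with nginx, because it happens on all the php-fpm servers at the same time.

What could cause this? How can it be fixed?

My nginx configuration is: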

 user nginx;
 worker_processes 4;
 worker_rlimit_nofile 30000;
 error_log /var/log/nginx/error.log warn;
 pid /var/run/nginx.pid;
 events {
     worker_connections 4096;
 }
 http {
     include /etc/nginx/mime.types;
     default_type application/octet-stream;
     log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                     '$status $body_bytes_sent "$http_referer" '
                     '"$http_user_agent" "$http_x_forwarded_for"';
     access_log /var/log/nginx/access.log main;
     sendfile on;
     keepalive_timeout 5;
     fastcgi_buffers 256 4k;
     gzip on;
     gzip_disable "msie6";
     fastcgi_cache_path /dev/shm/caches/ levels=1:2 keys_zone=zoneone:4000m max_size=4000m inactive=30m;
     fastcgi_temp_path /var/www/tmp 1 2;
     fastcgi_cache_key "$scheme$proxy_host$request_uri";
     fastcgi_connect_timeout 3s;
     limit_req_zone $binary_remote_addr zone=limitone:200m rate=1r/s;
     limit_req_zone $binary_remote_addr zone=limitcomic:500m rate=40r/m;
     upstream partone {
         server unix:/tmp/php-fpm.sock;
     }
     upstream parttwo {
         server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
         server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
         server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
     }
     upstream parttre {
         server 192.168.1.2:9000 weight=8 max_fails=0 fail_timeout=2s;
         server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
         server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
         server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
     }
     ... stuff with server, locations and such...
 }

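As you can see, I'm not even using all 5 servers in the same context.

nginx version: nginx/1.4.5

This is an educated guess. The problem may be caused by exhaustion of the local TCP ports that nginx uses to connect to the upstream servers.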
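One way to test this guess is to count how many sockets sit in TIME-WAIT toward each upstream around the time of a stall. Below is a minimal sketch, assuming `ss` from iproute2 is installed and prints its usual `-tan` columns (State, Recv-Q, Send-Q, local address, peer address). The function name `count_time_wait` is made up for illustration.

```shell
# Count TIME-WAIT sockets per remote address:port, busiest peer first.
# Reads `ss -tan` output on stdin; assumes the peer address is column 5.
count_time_wait() {
    awk '$1 == "TIME-WAIT" { n[$5]++ }
         END { for (peer in n) print n[peer], peer }' | sort -rn
}
```

Run it on the nginx host as `ss -tan | count_time_wait`. If the counts for the `:9000` peers climb toward the size of the local port range just before the errors appear, that would support the port exhaustion theory; if they stay low, the cause is probably elsewhere.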

You can check the allowed port range with:

 sysctl net.ipv4.ip_local_port_range

The default on my Debian installation is 32768 to 61000.
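To put that range in perspective, here is a rough back-of-the-envelope calculation. It assumes every request opens a fresh TCP connection to the same upstream address and port, that both ends of the range are usable, and that a closed connection keeps its local port in TIME-WAIT for 60 seconds (the usual Linux value) with no reuse. The helper name `max_conn_rate` is hypothetical, and real limits depend on kernel version and settings.

```shell
# Estimate how many new connections per second one upstream can take
# before the local port range runs dry.
# Usage: max_conn_rate LOW HIGH [TIME_WAIT_SECONDS]
max_conn_rate() {
    low=$1 high=$2 tw=${3:-60}
    ports=$(( high - low + 1 ))          # range is treated as inclusive
    echo "$ports ports, about $(( ports / tw )) new connections/s"
}

max_conn_rate 32768 61000   # the Debian default mentioned in this answer
max_conn_rate 1024 65535    # a wider range
```

Under these assumptions the default range allows roughly 470 new connections per second per upstream, and the wider range roughly 1075, so a short traffic burst on a busy site could plausibly use up the default.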

You can widen the range by running the following command as root:

 echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range

If you are running Debian or a derived distribution, you can make this setting persist across reboots by editing /etc/sysctl.d/99-local.conf and adding the following line to it:

 net.ipv4.ip_local_port_range = 1024 65535
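After a reboot it is worth confirming that the kernel really picked the value up. The following is a small sketch of such a check; `check_port_range` is a hypothetical helper, and it assumes the procfs entry holds the two numbers separated by whitespace.

```shell
# Compare the live port range against the expected one.
# Usage: check_port_range LOW HIGH [FILE]
# FILE defaults to the kernel's procfs entry for the setting.
check_port_range() {
    want_low=$1 want_high=$2
    file=${3:-/proc/sys/net/ipv4/ip_local_port_range}
    read -r low high < "$file" || return 2
    if [ "$low" -eq "$want_low" ] && [ "$high" -eq "$want_high" ]; then
        echo "ok: $low-$high"
    else
        echo "mismatch: live range is $low-$high"
        return 1
    fi
}
```

Call it as `check_port_range 1024 65535`. To load the file without rebooting, `sysctl -p /etc/sysctl.d/99-local.conf` run as root should apply it immediately.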
