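I have several servers serving a single site.
The main server runs nginx and php-fpm. All the other servers run only php-fpm. nginx reaches the php-fpm on its own server through a unix socket and the others over TCP.
Roughly once an hour (not exactly, sometimes more often), something strange happens. All connections from nginx to the php-fpm servers time out. nginx cannot establish a connection at all.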
2014/03/24 04:59:09 [error] 2123#0: *925153 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *926742 connect() to unix:/tmp/php-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://unix:/tmp/php-fpm.sock:", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *925159 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.2:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *923874 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *925164 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *909392 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *923098 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2125#0: *923309 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
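Because this is a fairly busy site, the log fills up quickly with entries like the ones above.
This lasts about 10 to 15 seconds, and then everything returns to normal. Apart from the connection errors posted here, there do not seem to be any other errors.
I suspect the problem lies with nginx, because it happens on all the php-fpm servers at the same time.
What could cause this? How can it be fixed?
My nginx configuration is: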
user nginx;
worker_processes 4;
worker_rlimit_nofile 30000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections 4096;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;

    sendfile on;
    keepalive_timeout 5;
    fastcgi_buffers 256 4k;
    gzip on;
    gzip_disable "msie6";

    fastcgi_cache_path /dev/shm/caches/ levels=1:2 keys_zone=zoneone:4000m max_size=4000m inactive=30m;
    fastcgi_temp_path /var/www/tmp 1 2;
    fastcgi_cache_key "$scheme$proxy_host$request_uri";
    fastcgi_connect_timeout 3s;

    limit_req_zone $binary_remote_addr zone=limitone:200m rate=1r/s;
    limit_req_zone $binary_remote_addr zone=limitcomic:500m rate=40r/m;

    upstream partone {
        server unix:/tmp/php-fpm.sock;
    }

    upstream parttwo {
        server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
    }

    upstream parttre {
        server 192.168.1.2:9000 weight=8 max_fails=0 fail_timeout=2s;
        server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
    }

    ... stuff with server, locations and such...
}
As you can see, I am not even using all 5 servers in the same context.
nginx version: nginx/1.4.5
This is an educated guess. The problem may be caused by exhaustion of the local TCP ports that nginx uses to connect to the upstream servers.
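One rough way to test this guess is to count, per TCP state, the sockets whose peer port is 9000 while an incident is in progress. A large number of sockets in TIME-WAIT or SYN-SENT would be consistent with port exhaustion, though it would not prove it. The sketch below assumes the `ss` tool from iproute2 is installed; `count_by_state` is a hypothetical helper name, and 9000 is the php-fpm port from the question.

```shell
#!/bin/sh
# count_by_state PORT: read `ss -tan` output on stdin and print the number of
# sockets per TCP state whose peer (remote) port is PORT.
# Assumes the usual column order: State Recv-Q Send-Q Local Peer.
count_by_state() {
    awk -v port="$1" '
        NR > 1 && $5 ~ (":" port "$") { n[$1]++ }   # skip the header line
        END { for (s in n) print s, n[s] }
    ' | sort
}

# Run against the live socket table only if ss is available.
if command -v ss >/dev/null 2>&1; then
    ss -tan | count_by_state 9000
fi
```

Running this on the nginx server every second or so around the time of an incident should show whether one state spikes.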
You can check the allowed port range with:
sysctl net.ipv4.ip_local_port_range
The default on my Debian installation is 32768 to 61000.
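To put a number on that range, here is a minimal sketch that counts how many local ports it provides. `port_count` is a hypothetical helper name; the script reads the live range from /proc only when that file is readable, so it assumes a Linux host.

```shell
#!/bin/sh
# port_count LOW HIGH: print how many ports the inclusive range LOW..HIGH holds.
port_count() {
    echo $(( $2 - $1 + 1 ))
}

# Show the range currently configured in the kernel, if it can be read.
if [ -r /proc/sys/net/ipv4/ip_local_port_range ]; then
    read -r low high < /proc/sys/net/ipv4/ip_local_port_range
    echo "current range: $low-$high ($(port_count "$low" "$high") ports)"
fi
```

For the Debian default above, `port_count 32768 61000` gives 28233 ports, and `port_count 1024 65535` gives 64512.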
You can extend the range by entering the following command as root:
echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range
If you are running Debian or a derived distribution, you can make this setting persist across reboots by editing /etc/sysctl.d/99-local.conf and putting the following in that file:
net.ipv4.ip_local_port_range = 1024 65535
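After editing the file, it may be worth confirming that the running kernel value matches what was persisted. The following is only a sketch: `conf_range` is a hypothetical helper, and `CONF` defaults to the path mentioned above but can be overridden.

```shell
#!/bin/sh
# Sketch: compare the persisted port range with the value the kernel is using.
CONF="${CONF:-/etc/sysctl.d/99-local.conf}"

# conf_range FILE: print "LOW HIGH" from the last ip_local_port_range line in FILE.
conf_range() {
    sed -n 's/^[[:space:]]*net\.ipv4\.ip_local_port_range[[:space:]]*=[[:space:]]*\([0-9][0-9]*\)[[:space:]][[:space:]]*\([0-9][0-9]*\).*$/\1 \2/p' "$1" | tail -n 1
}

if [ -r "$CONF" ] && [ -r /proc/sys/net/ipv4/ip_local_port_range ]; then
    read -r low high < /proc/sys/net/ipv4/ip_local_port_range
    if [ "$(conf_range "$CONF")" = "$low $high" ]; then
        echo "running value matches $CONF"
    else
        # To load the file without rebooting, run as root: sysctl -p "$CONF"
        echo "running value ($low $high) differs from $CONF"
    fi
fi
```

If the values differ, loading the file with `sysctl -p` as root applies it without waiting for a reboot.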