I have a webserver container running in App Engine that serves a REST API. I went for a fairly standard setup: nginx + PHP-FPM over a TCP socket (I couldn't get a unix socket working for some reason). The database connection is also a TCP socket, routed over Google Cloud VPN.
I'm getting roughly 25% availability on the API. Most requests end in a 504 Gateway Timeout after the maximum time (App Engine's nginx proxy is set to 60 seconds). Sometimes, when PHP-FPM hits its own timeout (request_terminate_timeout), the result is a 502 Bad Gateway instead.
I'm trying to figure out whether this is a misconfiguration of App Engine's nginx proxy, of my nginx, or of my PHP-FPM. Nginx should be closing the sockets or reusing them, but it doesn't seem to be doing either.
When I siege any given endpoint (25 users) for a few minutes, I see:
HTTP/1.1 504 60.88 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.23 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.68 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 60.76 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.06 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 33.35 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 32.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 36.61 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 39.00 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 42.47 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 48.51 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 56.45 secs: 13143 bytes ==> GET /path/to/rest

# Another run
HTTP/1.1 200 7.65 secs: 13143 bytes ==> GET /path/to/rest
...10 lines...
HTTP/1.1 200 8.20 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 8.30 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 61.15 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.46 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.33 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.25 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 53.63 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 48.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.31 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.27 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.26 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.08 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.56 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.83 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.85 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 58.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.21 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.61 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.65 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.13 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.96 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.48 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.81 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.89 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.26 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.80 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.44 secs: 166 bytes ==> GET /path/to/rest
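For reference, the load was generated with a siege invocation roughly like the following; the exact flags are an assumption, the test simply ran 25 concurrent users against one endpoint for a few minutes:

# hypothetical siege invocation: 25 concurrent users for 3 minutes
siege -c 25 -t 3M "https://api.example.com/path/to/rest"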
This also happens with just one user:
HTTP/1.1 502 55.43 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.71 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.21 secs: 166 bytes ==> GET /path/to/rest
Nginx logs for each case:
# 200
Normal logging ie [notice] GET /path/to/rest (param1, param2) ...

# 502
[error] 1059#0: *1395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"

# 504
[error] 34#0: *326 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"
This is what netstat -t looks like:
# Before starting
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED

# During the siege
tcp 0 0 localhost:56144 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53073 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 localhost:56148 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53071 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 localhost:56106 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56191 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56113 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56150 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53102 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:53051 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56146 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56117 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56179 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56160 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56168 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56170 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56111 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56115 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56123 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56109 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56113 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56140 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56181 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56121 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56191 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56119 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56142 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56106 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56110 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56144 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56148 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56150 ESTABLISHED

# A minute or so after ending the siege
tcp 0 0 5c2ad0938ce9:53319 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34578 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34576 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34570 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56396 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56396 ESTABLISHED
My nginx configuration:

user www-data;
worker_processes auto;
worker_cpu_affinity auto;

events {
    worker_connections 512;
}

http {
    server_tokens off;
    fastcgi_ignore_client_abort off;
    keepalive_timeout 650;
    keepalive_requests 10000;
    gzip on;
    ..more gzip settings..

    server {
        charset utf-8;
        client_max_body_size 512M;
        listen 8080;
        rewrite_log on;
        root /app/web;
        index index.php;

        location / {
            try_files $uri /index.php?$args;
        }

        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            include /etc/nginx/fastcgi_params;
            fastcgi_keep_conn off;
            fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
        }
    }

    include /etc/nginx/conf.d/*.conf; # There are no extra conf files
}
My PHP-FPM pool configuration:

[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = ondemand
pm.process_idle_timeout = 10s
request_terminate_timeout = 45
Disabling keepalive is a bad idea, because App Engine constantly polls the container with health checks, which leaves a lot of dead TIME_WAIT sockets behind (I tried it).
Before request_terminate_timeout kicks in, there are lots of CLOSE_WAIT sockets rather than TIME_WAIT ones. Setting request_terminate_timeout = 45 does help in a sense: the worker process gets killed, and once it respawns it serves 200s again. A lower terminate timeout just produces more 502s and fewer 504s.
process_idle_timeout is ignored because the sockets are technically not idle.
Setting fastcgi_keep_conn on had no measurable effect on nginx's behaviour.
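A quick way to watch the FastCGI sockets pile up by state is a generic netstat one-liner like this (nothing specific to this setup, just counting TCP states on port 9000):

# count sockets to/from the PHP-FPM port, grouped by TCP state
netstat -ant | grep ':9000' | awk '{print $6}' | sort | uniq -c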
It turned out the problem was related to the container configuration, not to the application. After setting the MTU to a value appropriate for Google Cloud networking (lowering it from 1500 to 1430), querying the application no longer had any issues.
This was discovered by isolating the problem to requests that open a socket to the database over Google Cloud VPN (see the postgresql entries in the netstat log above). We happened to have a route through to a second VPN whose database connections worked perfectly; only the first hop was carrying traffic at the high MTU.
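For completeness, this is roughly what the change looks like; the interface name eth0 and the Docker daemon.json approach are assumptions about the container setup, not taken from the original configuration:

# inspect the current MTU of the container's network interface
ip link show eth0

# lower it to match Google Cloud networking (1430 instead of the default 1500)
ip link set dev eth0 mtu 1430

# with Docker, the same value can be applied permanently via /etc/docker/daemon.json:
# { "mtu": 1430 }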
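One way to confirm a path-MTU problem like this is to send non-fragmenting pings of increasing size towards the database host (192.168.2.29 in the netstat output above); the payload sizes below assume 28 bytes of ICMP/IP headers on top of the payload:

# 1402-byte payload + 28-byte headers = 1430-byte packet: should go through
ping -M do -s 1402 -c 3 192.168.2.29

# 1472-byte payload + 28-byte headers = 1500-byte packet: fails or stalls if the path MTU is 1430
ping -M do -s 1472 -c 3 192.168.2.29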