How can I debug memc / memcached timeouts with Nginx?

Problem

Getting periodic timeouts from Nginx's memcached backend (see example errors below).

Things already tested:

  • It happens with both local and remote memcached servers, so the network should not be the problem
  • It happens with both the memc and memcached backends
  • It happens with and without the upstream keepalive module
  • It happens on nginx 0.8, 1.0.x and 1.1.x
  • Buffering the logs seems to reduce the number of errors slightly, but they still occur; that may just be a coincidence
  • Logging to another host via UDP (i.e. non-blocking) makes no difference
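For context, a memc-nginx-module location of the kind that produces the "/memc_get" subrequests seen in the logs typically looks like the sketch below; the key variable and the timeout values are assumptions for illustration, not the actual configuration in use:

```nginx
# Sketch of a memc-nginx-module location (illustrative values only).
location /memc_get {
    internal;                       # reachable only via subrequests
    set $memc_cmd get;              # memcached command to issue
    set $memc_key $arg_key;         # key supplied by the parent request
    memc_connect_timeout 100ms;     # the defaults are 60s; failing fast
    memc_send_timeout    100ms;     # makes timeout patterns easier to see
    memc_read_timeout    100ms;
    memc_pass 127.0.0.1:11211;
}
```

Lowering the timeouts does not fix the underlying problem, but it turns a 60-second stall into a quick, loggable failure.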

Other information:

  • The memcached server handles about 500 requests per second from this and other (remote) clients
  • There are no errors/reports in the memcached logs
  • memcached is started with the following settings:
    • Both TCP and UDP enabled (currently only TCP is used)
    • 16 GB of memory
    • 16 threads (on a 32-core machine)
    • Limited to 8192 connections
    • Started in non-daemonic mode via supervisord
  • The server runs FreeBSD 8.2
  • Timeouts occur under both low and high load (currently only about 50 requests per second)
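Assuming standard memcached command-line flags, the setup described above would correspond roughly to the following supervisord entry; the binary path and port are placeholders:

```ini
; Sketch of a supervisord entry matching the settings above (hypothetical paths).
[program:memcached]
; -m 16384: 16 GB of memory, -t 16: 16 threads, -c 8192: connection limit,
; -p/-U 11211: TCP and UDP ports. No -d flag, so memcached stays in the
; foreground, as supervisord requires.
command=/usr/local/bin/memcached -m 16384 -t 16 -c 8192 -p 11211 -U 11211
autorestart=true
```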

The question

So my question is… how can I debug these problems? Does anyone have an idea what the root cause might be? The errors do not occur constantly, but they do show up in the logs every few minutes.


Example errors from the log:

 2011/08/30 17:23:34 [error] 13921#0: *38602 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:23:34 [error] 13921#0: *38591 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:23:34 [error] 13921#0: *38601 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:23:34 [error] 13921#0: *38918 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:23:35 [error] 13921#0: *38595 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:24:32 [error] 13921#0: *41768 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:24:36 [error] 13921#0: *38599 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:25:12 [error] 13921#0: *42489 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:25:30 [error] 13922#0: *39444 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:25:30 [error] 13922#0: *39452 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"

netstat -m output

 # netstat -m
 3404/25531/28935 mbufs in use (current/cache/total)
 3318/25004/28322/229376 mbuf clusters in use (current/cache/total/max)
 1161/20599 mbuf+clusters out of packet secondary zone in use (current/cache)
 26/4420/4446/192000 4k (page size) jumbo clusters in use (current/cache/total/max)
 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
 7591K/74070K/81661K bytes allocated to network (current/cache/total)
 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
 0/0/0 sfbufs in use (current/peak/max)
 0 requests for sfbufs denied
 0 requests for sfbufs delayed
 4925075 requests for I/O initiated by sendfile
 0 calls to protocol drain routines

From the error log output, it looks like nginx is waiting on network communication, both when trying to establish new connections and when reading data from existing ones. You mention it happens both locally and remotely, and that this rules out the network for you. However, even the local case still goes over a TCP socket connection, correct? Have you checked that you are not running out of mbufs on your BSD box:

 # netstat -m
 16387/4613/21000 mbufs in use (current/cache/total)
 16385/3957/20342/25600 mbuf clusters in use (current/cache/total/max)
 16384/2176 mbuf+clusters out of packet secondary zone in use (current/cache)
 0/403/403/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
 36866K/10679K/47546K bytes allocated to network (current/cache/total)
 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
 0/0/0 sfbufs in use (current/peak/max)
 0 requests for sfbufs denied
 0 requests for sfbufs delayed
 0 requests for I/O initiated by sendfile
 0 calls to protocol drain routines
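One quick way to see how close you are to the cluster limit is to compute the ratio from the "mbuf clusters in use" line. A small sketch using the sample line from the output above (on a live box, pipe `netstat -m` in instead of the `echo`):

```shell
# Extract current and max mbuf clusters from a netstat -m line and print
# the utilisation; the sample line is copied from the output above.
line='16385/3957/20342/25600 mbuf clusters in use (current/cache/total/max)'
echo "$line" | awk -F'[/ ]' '{ printf "%d%% of mbuf clusters in use (%d/%d)\n", $1*100/$4, $1, $4 }'
```

For the example output above this reports roughly 64% of the 25600-cluster limit in use; the questioner's own output (3318 of 229376) works out to about 1%, which makes mbuf exhaustion look unlikely in their case.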

If you see that mbufs are the problem, you need to set the following in /boot/loader.conf and reboot (sadly it is not a runtime-tunable setting):

 kern.ipc.nmbclusters="128000" 
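After the reboot you can verify that the new limit took effect; the sysctl name matches the loader.conf tunable:

```shell
# Read the current mbuf-cluster limit (FreeBSD); should report 128000
# after the reboot if the loader.conf change was picked up.
sysctl kern.ipc.nmbclusters
```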

If you don't see anything there, then you should first confirm that nginx is not to blame. I would probably do that by tcpdumping the connection from nginx to memcached and verifying that nginx really does send the connection setup or the request when it claims it is timing out waiting. If it does, then I would run ktrace on the memcached process for a few minutes while the errors are occurring, and check with kdump whether any system calls (such as send or accept) are failing.
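On FreeBSD, those steps might look like the following sketch; the interface name, file paths, sleep duration, and grep pattern are assumptions to adapt, and the commands need root:

```shell
# 1. Capture nginx <-> memcached traffic on loopback (port 11211) so you can
#    check whether nginx actually sends the SYN/request before it times out.
tcpdump -i lo0 -s 0 -w /tmp/memc.pcap 'tcp port 11211'

# 2. Attach ktrace to the running memcached for a few minutes while the
#    errors occur, then detach (writes ktrace.out in the current directory).
ktrace -p "$(pgrep -x memcached)"
sleep 300
ktrace -C

# 3. Dump the trace and look for system calls returning -1 (e.g. send/accept).
kdump -f ktrace.out | grep 'RET.*-1'
```

Open the pcap afterwards (e.g. with `tcpdump -r /tmp/memc.pcap` or Wireshark) and line the timestamps up against the nginx error log entries.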