How can I debug memc / memcached timeouts with Nginx?

Problem

Getting periodic timeouts from Nginx's memcached backend (see example errors below).

Things already tested:

  • It happens with both local and remote memcached servers, so the network should not be the problem
  • It happens with both the memc and memcached backends
  • It happens with and without the upstream keepalive module
  • It happens on nginx 0.8, 1.0.x and 1.1.x
  • Buffering the logs seems to reduce the number of errors slightly, but they still occur; that may just be a coincidence
  • Logging to another host via UDP (i.e. non-blocking) makes no difference
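For context, a memc-nginx-module location of the kind that produces the "/memc_get" subrequests seen in the logs typically looks like the sketch below; the key variable and the timeout values are assumptions for illustration, not the actual configuration in use:

```nginx
# Sketch of a memc-nginx-module location (illustrative values only).
location /memc_get {
    internal;                       # reachable only via subrequests
    set $memc_cmd get;              # memcached command to issue
    set $memc_key $arg_key;         # key supplied by the parent request
    memc_connect_timeout 100ms;     # the defaults are 60s; failing fast
    memc_send_timeout    100ms;     # makes timeout patterns easier to see
    memc_read_timeout    100ms;
    memc_pass 127.0.0.1:11211;
}
```

Lowering the timeouts does not fix the underlying problem, but it turns a 60-second stall into a quick, loggable failure.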

Other information:

  • The memcached server handles about 500 requests per second from this and other (remote) clients
  • There are no errors/reports in the memcached logs
  • memcached is started with the following settings:
    • Both TCP and UDP enabled (currently only TCP is used)
    • 16 GB of memory
    • 16 threads (on a 32-core machine)
    • Limited to 8192 connections
    • Started in non-daemonic mode via supervisord
  • The server runs FreeBSD 8.2
  • Timeouts occur under both low and high load (currently only about 50 requests per second)
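Assuming standard memcached command-line flags, the setup described above would correspond roughly to the following supervisord entry; the binary path and port are placeholders:

```ini
; Sketch of a supervisord entry matching the settings above (hypothetical paths).
[program:memcached]
; -m 16384: 16 GB of memory, -t 16: 16 threads, -c 8192: connection limit,
; -p/-U 11211: TCP and UDP ports. No -d flag, so memcached stays in the
; foreground, as supervisord requires.
command=/usr/local/bin/memcached -m 16384 -t 16 -c 8192 -p 11211 -U 11211
autorestart=true
```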

The question

So my question is… how can I debug these problems? Does anyone have an idea what the root cause might be? The errors do not occur constantly, but they do show up in the logs every few minutes.


Example errors from the log:

 2011/08/30 17:23:34 [error] 13921#0: *38602 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:23:34 [error] 13921#0: *38591 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:23:34 [error] 13921#0: *38601 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:23:34 [error] 13921#0: *38918 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:23:35 [error] 13921#0: *38595 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:24:32 [error] 13921#0: *41768 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:24:36 [error] 13921#0: *38599 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:25:12 [error] 13921#0: *42489 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:25:30 [error] 13922#0: *39444 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
 2011/08/30 17:25:30 [error] 13922#0: *39452 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"

netstat -m output

 # netstat -m
 3404/25531/28935 mbufs in use (current/cache/total)
 3318/25004/28322/229376 mbuf clusters in use (current/cache/total/max)
 1161/20599 mbuf+clusters out of packet secondary zone in use (current/cache)
 26/4420/4446/192000 4k (page size) jumbo clusters in use (current/cache/total/max)
 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
 7591K/74070K/81661K bytes allocated to network (current/cache/total)
 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
 0/0/0 sfbufs in use (current/peak/max)
 0 requests for sfbufs denied
 0 requests for sfbufs delayed
 4925075 requests for I/O initiated by sendfile
 0 calls to protocol drain routines

From the error log output, it looks like nginx is waiting on network communication, both when trying to establish new connections and when reading data from existing ones. You mention it happens both locally and remotely, and that this rules out the network for you. However, even the local case still goes over a TCP socket connection, correct? Have you checked that you are not running out of mbufs on your BSD box:

 # netstat -m
 16387/4613/21000 mbufs in use (current/cache/total)
 16385/3957/20342/25600 mbuf clusters in use (current/cache/total/max)
 16384/2176 mbuf+clusters out of packet secondary zone in use (current/cache)
 0/403/403/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
 36866K/10679K/47546K bytes allocated to network (current/cache/total)
 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
 0/0/0 sfbufs in use (current/peak/max)
 0 requests for sfbufs denied
 0 requests for sfbufs delayed
 0 requests for I/O initiated by sendfile
 0 calls to protocol drain routines
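One quick way to see how close you are to the cluster limit is to compute the ratio from the "mbuf clusters in use" line. A small sketch using the sample line from the output above (on a live box, pipe `netstat -m` in instead of the `echo`):

```shell
# Extract current and max mbuf clusters from a netstat -m line and print
# the utilisation; the sample line is copied from the output above.
line='16385/3957/20342/25600 mbuf clusters in use (current/cache/total/max)'
echo "$line" | awk -F'[/ ]' '{ printf "%d%% of mbuf clusters in use (%d/%d)\n", $1*100/$4, $1, $4 }'
```

For the example output above this reports roughly 64% of the 25600-cluster limit in use; the questioner's own output (3318 of 229376) works out to about 1%, which makes mbuf exhaustion look unlikely in their case.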

If you see that mbufs are the problem, you need to set the following in /boot/loader.conf and reboot (sadly it is not a runtime-tunable setting):

 kern.ipc.nmbclusters="128000" 
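After the reboot you can verify that the new limit took effect; the sysctl name matches the loader.conf tunable:

```shell
# Read the current mbuf-cluster limit (FreeBSD); should report 128000
# after the reboot if the loader.conf change was picked up.
sysctl kern.ipc.nmbclusters
```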

If you don't see anything there, then you should first confirm that nginx is not to blame. I would probably do that by tcpdumping the connection from nginx to memcached and verifying that nginx really does send the connection setup or the request when it claims it is timing out waiting. If it does, then I would run ktrace on the memcached process for a few minutes while the errors are occurring, and check with kdump whether any system calls (such as send or accept) are failing.
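On FreeBSD, those steps might look like the following sketch; the interface name, file paths, sleep duration, and grep pattern are assumptions to adapt, and the commands need root:

```shell
# 1. Capture nginx <-> memcached traffic on loopback (port 11211) so you can
#    check whether nginx actually sends the SYN/request before it times out.
tcpdump -i lo0 -s 0 -w /tmp/memc.pcap 'tcp port 11211'

# 2. Attach ktrace to the running memcached for a few minutes while the
#    errors occur, then detach (writes ktrace.out in the current directory).
ktrace -p "$(pgrep -x memcached)"
sleep 300
ktrace -C

# 3. Dump the trace and look for system calls returning -1 (e.g. send/accept).
kdump -f ktrace.out | grep 'RET.*-1'
```

Open the pcap afterwards (e.g. with `tcpdump -r /tmp/memc.pcap` or Wireshark) and line the timestamps up against the nginx error log entries.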