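Periodic timeouts from nginx to a memcached backend (see the sample errors below)
So my question is: how can I debug these problems? Does anyone have an idea what the root cause might be? The errors do not occur continuously, but they show up in the logs frequently, every few minutes.
Sample errors from the log: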
2011/08/30 17:23:34 [error] 13921#0: *38602 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:34 [error] 13921#0: *38591 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:34 [error] 13921#0: *38601 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:34 [error] 13921#0: *38918 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:35 [error] 13921#0: *38595 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:24:32 [error] 13921#0: *41768 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:24:36 [error] 13921#0: *38599 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:25:12 [error] 13921#0: *42489 upstream timed out (60: Operation timed out) while connecting to upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:25:30 [error] 13922#0: *39444 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:25:30 [error] 13922#0: *39452 upstream timed out (60: Operation timed out) while reading response header from upstream, client: xxxx, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
Output of netstat -m:
# netstat -m
3404/25531/28935 mbufs in use (current/cache/total)
3318/25004/28322/229376 mbuf clusters in use (current/cache/total/max)
1161/20599 mbuf+clusters out of packet secondary zone in use (current/cache)
26/4420/4446/192000 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
7591K/74070K/81661K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
4925075 requests for I/O initiated by sendfile
0 calls to protocol drain routines
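From the error log output, it looks as though nginx is waiting on network communication, both when trying to establish new connections and when reading data from existing ones. You mentioned that it happens both locally and remotely, so you say you have ruled out the network. But even the local case still uses a TCP socket connection, correct? Have you checked that you are not running out of mbufs on the BSD box: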
# netstat -m
16387/4613/21000 mbufs in use (current/cache/total)
16385/3957/20342/25600 mbuf clusters in use (current/cache/total/max)
16384/2176 mbuf+clusters out of packet secondary zone in use (current/cache)
0/403/403/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
36866K/10679K/47546K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
If you see that mbufs are the problem, you need to update /boot/loader.conf and reboot (sadly, this is not a runtime tunable):
kern.ipc.nmbclusters="128000"
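To judge whether mbufs are actually the problem, it can help to pull the cluster counters out of the netstat -m output and compare them with the limit. The following is a minimal sketch, not part of the original answer: the function names are hypothetical, the sample text is copied from the output quoted above, and it assumes that a nonzero "denied" counter or current use close to max is the warning sign to look for.

```python
import re

# Sample lines copied from the `netstat -m` output quoted in the answer.
SAMPLE = """\
16387/4613/21000 mbufs in use (current/cache/total)
16385/3957/20342/25600 mbuf clusters in use (current/cache/total/max)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
"""

CLUSTERS_RE = re.compile(r"(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use")
DENIED_RE = re.compile(r"(\d+)/(\d+)/(\d+) requests for mbufs denied")


def parse_mbuf_clusters(text):
    """Return (current, cache, total, max) for mbuf clusters, or None if absent."""
    match = CLUSTERS_RE.search(text)
    return tuple(int(n) for n in match.groups()) if match else None


def parse_denied(text):
    """Return the (mbufs, clusters, mbuf+clusters) denied counters, or None."""
    match = DENIED_RE.search(text)
    return tuple(int(n) for n in match.groups()) if match else None


def mbuf_report(text):
    """Summarise cluster usage; the choice of current/max as the ratio is an assumption."""
    clusters = parse_mbuf_clusters(text)
    denied = parse_denied(text)
    if clusters is None or denied is None:
        return None
    current, _cache, _total, maximum = clusters
    return {
        "current": current,
        "max": maximum,
        "usage": current / maximum,
        "denied": sum(denied),
    }
```

Feeding it the real output of netstat -m (for example captured with subprocess.run) instead of SAMPLE gives the same summary for a live machine; what counts as "too full" is a judgment call this sketch does not make.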
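If you don't see anything there, you should first confirm that nginx is not to blame. I would probably do that by tcpdumping the connection from nginx to memcached and verifying that nginx really does send the connection setup or the request when it claims to have timed out waiting. If it does, I would then run ktrace on the memcached process for a few minutes while the errors are occurring, and look through the kdump output for any failing system calls (such as send or accept).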
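As a lighter complementary check (not a replacement for tcpdump or ktrace, and not something the answer itself proposes), one could time connections to memcached from a client other than nginx. If this independent probe also stalls, nginx is less likely to be the culprit. The sketch below assumes memcached's text protocol, where the `version` command is answered with a line starting with `VERSION`; the function name, the default timeout and the returned field names are hypothetical.

```python
import socket
import time


def probe_memcached(host="127.0.0.1", port=11211, timeout=1.0):
    """Time one TCP connect plus one `version` round trip to memcached.

    Returns a dict; "ok" is False if the connect or read failed or timed out.
    """
    start = time.monotonic()
    try:
        # The timeout applies to the connect and to later send/recv calls.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            connected = time.monotonic()
            sock.sendall(b"version\r\n")
            reply = sock.recv(128)
            done = time.monotonic()
    except OSError as exc:  # covers refused connections and timeouts
        return {
            "ok": False,
            "error": repr(exc),
            "elapsed_s": time.monotonic() - start,
        }
    return {
        "ok": reply.startswith(b"VERSION"),
        "connect_s": connected - start,
        "reply_s": done - connected,
        "reply": reply,
    }
```

Calling probe_memcached() once a second and logging any result where "ok" is False, or where connect_s is unusually large, would give timestamps to line up against the nginx error log and the tcpdump capture.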