Apache stops communicating with memcache after creating too many virtual hosts

I have run into a very strange problem with Apache. I have a lot of virtual hosts, about 501.

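The problem shows up right after host number 493. The first 493 hosts work as expected, but as soon as I add host number 494, PHP stops communicating with memcache and every read or write access times out.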

I use memcache as the session storage backend, so the PHP function:

session_start(); 

simply times out after 30 seconds.

If I remove any one virtual host at random from the 494 and restart Apache, it starts working again.

I have made sure that ulimit is very high (65k), but that did not help. I have also tried disabling the ulimits entirely, with no luck.
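One thing that can be checked here is whether the running httpd children really got the limit that was configured, and how many descriptors each child actually holds. A minimal sketch for Linux, assuming /proc is mounted; the helper names are made up for illustration, and reading another user's /proc/<pid>/fd normally requires root:

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    None stands for "unlimited".
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")


def open_fd_numbers(pid):
    """Return the sorted descriptor numbers currently open in process `pid`."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/fd"))


def report(pid):
    """Print the limit and descriptor usage of one process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        soft, hard = parse_max_open_files(fh.read())
    fds = open_fd_numbers(pid)
    highest = max(fds, default=None)
    print(f"pid {pid}: soft={soft} hard={hard} open={len(fds)} highest={highest}")
```

Calling `report(pid)` with the PID of one httpd child shows the highest descriptor number in use, which matters more here than the limit itself.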

Do you have any ideas about what else I could try?

I attached strace to the httpd process that was serving the browser connection while it sat in its 30-second wait.

This is the strace output:

    select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
    select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
    select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
    select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
    select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})

So basically Apache is stuck in select(), and that is all it does: it repeats the select() system call endlessly.
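Looking at the strace output above, the descriptors being waited on are 1024 and 1169. An fd_set built with the usual glibc FD_SETSIZE of 1024 can only represent descriptors 0 through 1023. A small sketch that pulls the read set out of an strace line and flags the descriptors that do not fit; the function name and the regular expression are my own and only handle the select() format shown above:

```python
import re

FD_SETSIZE = 1024  # usual glibc value; assumed to match the build in question

_SELECT_RE = re.compile(r"select\((\d+), \[([\d ]*)\]")


def fds_beyond_fd_set(strace_line, limit=FD_SETSIZE):
    """Return read-set descriptors from an strace select() line that are >= limit."""
    match = _SELECT_RE.search(strace_line)
    if match is None:
        return []
    read_fds = [int(tok) for tok in match.group(2).split()]
    return [fd for fd in read_fds if fd >= limit]


line = "select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})"
print(fds_beyond_fd_set(line))  # prints [1024, 1169]
```

Both descriptors in the read set are out of range for a 1024-bit fd_set, which is a hint that the descriptor numbers, not the ulimit, are what matters.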

The next thing I thought of was tcpdump, to see whether the packets really do go out from Apache, and they do:

    22:11:28.366677 IP6 ::1.51404 > ::1.11914: Flags [S], seq 2899674987, win 32752, options [mss 16376,sackOK,TS val 1384759049 ecr 0,nop,wscale 9], length 0
    22:11:28.366697 IP6 ::1.11914 > ::1.51404: Flags [S.], seq 2034630080, ack 2899674988, win 32728, options [mss 16376,sackOK,TS val 1384759049 ecr 1384759049,nop,wscale 9], length 0
    22:11:28.366709 IP6 ::1.51404 > ::1.11914: Flags [.], ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
    22:11:28.366752 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 1:41, ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 40
    22:11:28.366758 IP6 ::1.11914 > ::1.51404: Flags [.], ack 41, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
    22:11:28.366768 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 41:90, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759049], length 49
    22:11:28.366772 IP6 ::1.11914 > ::1.51404: Flags [.], ack 90, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
    22:11:28.366779 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 90:122, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 32
    22:11:28.366783 IP6 ::1.11914 > ::1.51404: Flags [.], ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
    22:11:28.367063 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 1:12, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 11
    22:11:28.367070 IP6 ::1.51404 > ::1.11914: Flags [.], ack 12, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
    22:11:28.367266 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 12:20, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 8
    22:11:28.367275 IP6 ::1.51404 > ::1.11914: Flags [.], ack 20, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
    22:11:28.367477 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 20:25, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 5
    22:11:28.367489 IP6 ::1.51404 > ::1.11914: Flags [.], ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
    22:11:28.367629 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 122:181, ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 59
    22:11:28.367859 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 25:33, ack 181, win 64, options [nop,nop,TS val 1384759051 ecr 1384759050], length 8
    22:11:28.367869 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 181:230, ack 33, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 49
    22:11:28.368102 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 33:41, ack 230, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 8
    22:11:28.368138 IP6 ::1.51404 > ::1.11914: Flags [F.], seq 230, ack 41, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
    22:11:28.368195 IP6 ::1.11914 > ::1.51404: Flags [F.], seq 41, ack 231, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
    22:11:28.368206 IP6 ::1.51404 > ::1.11914: Flags [.], ack 42, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0

The next thing I did was attach GDB to the Apache process. When I issued a curl request to a page containing session_start(), this was the output:

    232 *(*new)->local_addr = *sock->local_addr;
    241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
    238 (*new)->local_addr->pool = connection_context;
    241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
    238 (*new)->local_addr->pool = connection_context;
    241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
    245 else if (sock->local_addr->sa.sin.sin_family == AF_INET6) {
    246 (*new)->local_addr->ipaddr_ptr = &(*new)->local_addr->sa.sin6.sin6_addr;
    249 (*new)->remote_addr->port = ntohs((*new)->remote_addr->sa.sin.sin_port);
    250 if (sock->local_port_unknown) {
    256 if (apr_is_option_set(sock, APR_TCP_NODELAY) == 1) {
    257 apr_set_option(*new, APR_TCP_NODELAY, 1);
    266 if (sock->local_interface_unknown ||
    267 !memcmp(sock->local_addr->ipaddr_ptr,
    266 if (sock->local_interface_unknown ||
    276 (*new)->local_interface_unknown = 1;
    293 apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
    292 (*new)->inherit = 0;
    293 apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
    296 }
    unixd_accept (accepted=0x7fff14ecddf0, lr=0x7fe93a905aa8, ptrans=<value optimized out>) at /usr/src/debug/httpd-2.2.15/os/unix/unixd.c:507
    507 if (status == APR_SUCCESS) {
    508 *accepted = csd;
    649 }
    child_main (child_num_arg=<value optimized out>) at /usr/src/debug/httpd-2.2.15/server/mpm/prefork/prefork.c:650
    650 SAFE_ACCEPT(accept_mutex_off()); /* unlock after "accept" */
    652 if (status == APR_EGENERAL) {
    656 else if (status != APR_SUCCESS) {
    665 current_conn = ap_run_create_connection(ptrans, ap_server_conf, csd, my_child_num, sbh, bucket_alloc);
    666 if (current_conn) {
    667 ap_process_connection(current_conn, csd);

At this point there is a long pause (~30 seconds) until PHP times out. After that, I get this:

    668 ap_lingering_close(current_conn);
    676 if (ap_mpm_pod_check(pod) == APR_SUCCESS) { /* selected as idle? */
    680 ap_scoreboard_image->global->running_generation) { /* restart? */
    679 else if (ap_my_generation !=
    680 ap_scoreboard_image->global->running_generation) { /* restart? */
    679 else if (ap_my_generation !=
    551 while (!die_now && !shutdown_pending) {
    559 apr_pool_clear(ptrans);
    562 && requests_this_child++ >= ap_max_requests_per_child)) {
    561 if ((ap_max_requests_per_child > 0
    562 && requests_this_child++ >= ap_max_requests_per_child)) {
    561 if ((ap_max_requests_per_child > 0
    562 && requests_this_child++ >= ap_max_requests_per_child)) {
    561 if ((ap_max_requests_per_child > 0
    566 (void) ap_update_child_status(sbh, SERVER_READY, (request_rec *) NULL);
    573 SAFE_ACCEPT(accept_mutex_on());
    575 if (num_listensocks == 1) {

The strangest part is that I cannot reproduce this on another machine. Same OS, same packages, same configuration (Puppet), same kernel, different hardware.

After a few weeks of debugging and watching the problem, I finally stumbled on this message:

 You MUST recompile PHP with a larger value of FD_SETSIZE. It is set to 1024, but you have descriptors numbered at least as high as 1073. --enable-fd-setsize=2048 is recommended, but you may want to set it to equal the maximum number of open files supported by your system, in order to avoid seeing this error again at a later date. 
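The reason this limit exists is that select() works on an fd_set, a fixed-size bitmap with one bit per descriptor number, and its size is fixed at compile time by FD_SETSIZE. A toy model of that bitmap, not PHP's or glibc's actual code; the class is purely illustrative, and unlike the C FD_SET macro, which traditionally performs no bounds check, it refuses out-of-range descriptors instead of writing past the end of the array:

```python
FD_SETSIZE = 1024  # compile-time constant in C; 1024 is the value named in the message


class FdSet:
    """Toy model of a C fd_set: a bitmap with `size` bits, one per descriptor."""

    def __init__(self, size=FD_SETSIZE):
        self.size = size
        self.bits = bytearray(size // 8)

    def add(self, fd):
        # The real macro would simply index the array; here we fail loudly.
        if not 0 <= fd < self.size:
            raise ValueError(f"fd {fd} does not fit in an fd_set of {self.size} bits")
        self.bits[fd // 8] |= 1 << (fd % 8)

    def __contains__(self, fd):
        return 0 <= fd < self.size and bool(self.bits[fd // 8] & (1 << (fd % 8)))
```

With the default size, `FdSet().add(1023)` works and `FdSet().add(1073)` raises, while `FdSet(2048).add(1073)` works. The last case corresponds to what rebuilding with --enable-fd-setsize=2048 is meant to achieve.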

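I will try this fix, but oh boy, why do the PHP people do this? It is so ugly; a hard-coded limit is a completely broken design. On top of that, if this is the solution, having to recompile PHP for every minor version and security patch and maintain my own packages is a big hassle.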
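For what it is worth, the limit belongs to select() specifically. poll() takes a list of descriptors rather than a bitmap, so the descriptor number itself does not matter. A minimal sketch of a wait-until-readable helper built on poll(), assuming a Unix-like system; `wait_readable` is a hypothetical name, and this only illustrates the interface difference, it is not a patch for PHP or the memcache extension:

```python
import select
import socket


def wait_readable(sock, timeout_s):
    """Return True if `sock` becomes readable within `timeout_s` seconds."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    events = poller.poll(int(timeout_s * 1000))  # poll() takes milliseconds
    return any(mask & select.POLLIN for _fd, mask in events)
```

A quick way to try it is with `socket.socketpair()`: before anything is sent the helper returns False after the timeout, and after the peer sends a byte it returns True.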

Edit: after more extensive debugging, it seems that not only is PHP "broken by design", there are also a bunch of problems in the memcache extension itself.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629896

https://bugs.php.net/bug.php?id=59876

The bugs have been open for a long time with no response. I guess I should just dump the memcache extension and find a solution that does not depend on it :-/