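I have one machine (test-server) running a RabbitMQ server and 4 Celery workers, and another machine (test-worker) running 240 Celery workers that connect to the RabbitMQ server on test-server.
All queues are currently empty.
With this setup, beam.smp (which I gather is a RabbitMQ-related process) sits at 200-250% CPU and uses a few hundred MB of RAM (which may be fine, I don't know).
If I stop the workers on the remote machine, it goes back to normal. If I start only, say, 40 workers instead of 240, it is more or less OK: it still uses CPU, but only around 50%.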
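The main beam.smp thread is blocked in select, which I think is fine since it is just waiting on its child threads. Below is strace output from a subset of the child threads. There are some epoll_wait calls with a zero timeout, and a lot of futex calls.
I also found this bug, filed against oslo (I'm not sure what that is): https://bugs.launchpad.net/oslo.messaging/+bug/1518430 . It also mentions zero-timeout epoll_wait calls, and it mentions RabbitMQ.
Any idea whether this is expected behaviour for RabbitMQ under these conditions? Where should I look for the cause?
Thanks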
test-server$ sudo strace -p 26866 2>&1 | head -n 50
Process 26866 attached
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 785829269}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
writev(473, [{NULL, 0}, {"\1\0\3\0\0\0-\0<\0<\5None3\0\0\0\0\0\0\5\326\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\3\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 6) = 445
clock_gettime(CLOCK_MONOTONIC, {87999, 786592082}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 787427449}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 788308663}) = 0
writev(201, [{NULL, 0}, {"\1\0\2\0\0\0-\0<\0<\5None2\0\0\0\0\0\0\35\245\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\2\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 6) = 445
clock_gettime(CLOCK_MONOTONIC, {87999, 789017598}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 789278489}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
writev(392, [{NULL, 0}, {"\1\0\3\0\0\0-\0<\0<\5None3\0\0\0\0\0\0\16\270\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\3\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 6) = 445
clock_gettime(CLOCK_MONOTONIC, {87999, 792374556}) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 792553480}) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 792796024}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 793154206}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 793493003}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 793842449}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 794054061}) = 0
writev(318, [{NULL, 0}, {"\1\0\2\0\0\0-\0<\0<\5None2\0\0\0\0\0\0\25\370\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\2\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316\1\0\2\0\0\0-\0<\0<\5None2\0\0\0\0\0\0\25\371\0\10cele"..., 73}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\2\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 10) = 890
clock_gettime(CLOCK_MONOTONIC, {87999, 794411001}) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 795090977}) = 0
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 796129182}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
Another excerpt:
Process 26867 attached
clock_gettime(CLOCK_MONOTONIC, {88350, 863599878}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x82e500, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {88350, 865231792}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 865436250}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 865776903}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 872757864}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 872984686}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 873209787}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 873382297}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 873578979}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {88350, 875428570}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 875624976}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 875847357}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 876478262}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x82e500, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
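To see at a glance which calls dominate excerpts like these, the strace output can be tallied per syscall. What follows is only a rough sketch: it assumes the output was saved with one call per line (as strace normally prints it), and the helper name tally_syscalls and the sample lines are illustrative, not part of any tool.

```python
import re
from collections import Counter

# Syscall name at the start of a strace line, e.g. "futex(" or "epoll_wait(".
# An optional "[pid N]" prefix (printed by strace -f) is skipped.
SYSCALL_RE = re.compile(r"^\s*(?:\[pid\s+\d+\]\s+)?(\w+)\(")

# An epoll_wait call whose last argument (the timeout) is 0.
ZERO_TIMEOUT_EPOLL_RE = re.compile(r"epoll_wait\([^)]*,\s*0\)\s*=")


def tally_syscalls(lines):
    """Return (Counter of syscall names, number of zero-timeout epoll_wait calls)."""
    counts = Counter()
    zero_timeout_polls = 0
    for line in lines:
        match = SYSCALL_RE.match(line)
        if not match:
            continue  # e.g. "Process 26866 attached"
        counts[match.group(1)] += 1
        if ZERO_TIMEOUT_EPOLL_RE.search(line):
            zero_timeout_polls += 1
    return counts, zero_timeout_polls


# A few lines copied from the first excerpt, used here as sample input.
sample = [
    "Process 26866 attached",
    "futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1",
    "epoll_wait(3, {}, 256, 0) = 0",
    "clock_gettime(CLOCK_MONOTONIC, {87999, 785829269}) = 0",
    "futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1",
]

counts, zero_timeout_polls = tally_syscalls(sample)
for name, n in counts.most_common():
    print(name, n)
print("zero-timeout epoll_wait calls:", zero_timeout_polls)
```

Run over a longer capture, a high share of zero-timeout epoll_wait and futex wake-ups would point to threads busy-polling rather than sleeping, but the tally alone does not say why.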
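I didn't manage to solve this properly, but I worked around it by reducing the number of workers and raising the concurrency of each one. There seems to be a per-worker overhead in RabbitMQ...
So, instead of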
celery multi start -A proj 240 -c2
I now run
celery multi start -A proj 20 -c24
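The arithmetic behind the two commands can be sketched as follows. This assumes that each node started by celery multi runs a pool of -c processes, and that the broker-side cost scales with the number of nodes rather than with the number of pool processes; the second point is only my guess from what I observed, not something I have confirmed. The function name layout is made up for the illustration.

```python
def layout(nodes, concurrency):
    """Return (worker nodes, total pool processes) for
    `celery multi start <nodes> -c<concurrency>`."""
    return nodes, nodes * concurrency


before = layout(240, 2)   # celery multi start -A proj 240 -c2
after = layout(20, 24)    # celery multi start -A proj 20 -c24

print("before: %d nodes, %d pool processes" % before)
print("after:  %d nodes, %d pool processes" % after)
```

Both layouts give the same number of pool processes, so task capacity should be roughly unchanged, while the number of worker nodes talking to RabbitMQ drops by a factor of 12.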
FWIW