Environment
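I have a small dual-core Intel Atom server running CentOS 5.5 x64 with a tiny custom Xen kernel. It has an onboard 10/100 NIC and an additional 3-port 10/100 NIC. On this server I also run a single Xen domU that acts as a firewall, DHCP server and caching DNS forwarder. That domU also runs CentOS 5.5 x64, but with a stock Xen kernel.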
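I am using the pciback kernel module to hide the 3-port NIC from dom0 and assign it to my virtual firewall. eth1 is my public interface. The onboard NIC (eth0) is my private interface; it sits on a Xen bridge and is shared between dom0 and the domU.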
Problem
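The problem is that eth1 (the public interface on my virtual firewall) stops working several times a day. It seems to be related to usage: if I push very little traffic through the interface, it may last for days, while heavy web browsing kills it within a few hours. When it dies, this is the error in /var/log/messages on my firewall: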
Jul 30 14:17:48 fw kernel: irq 18: nobody cared (try booting with the "irqpoll" option)
Jul 30 14:17:48 fw kernel:
Jul 30 14:17:48 fw kernel: Call Trace:
Jul 30 14:17:48 fw kernel: <IRQ> [<ffffffff802b3d60>] __report_bad_irq+0x30/0x7d
Jul 30 14:17:48 fw kernel: [<ffffffff802b3f97>] note_interrupt+0x1ea/0x22b
Jul 30 14:17:48 fw kernel: [<ffffffff802b348f>] __do_IRQ+0xbd/0x103
Jul 30 14:17:48 fw kernel: [<ffffffff80290319>] _local_bh_enable+0x61/0xc5
Jul 30 14:17:48 fw kernel: [<ffffffff8026df48>] do_IRQ+0xe7/0xf5
Jul 30 14:17:48 fw kernel: [<ffffffff803b3eca>] evtchn_do_upcall+0x13b/0x1fb
Jul 30 14:17:48 fw kernel: [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
Jul 30 14:17:48 fw kernel: <EOI> [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
Jul 30 14:17:48 fw kernel: [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
Jul 30 14:17:48 fw kernel: [<ffffffff8026f4eb>] raw_safe_halt+0x84/0xa8
Jul 30 14:17:48 fw kernel: [<ffffffff8024ad2e>] cpu_idle+0x4a/0xba
Jul 30 14:17:48 fw kernel: [<ffffffff8026ca80>] xen_idle+0x38/0x4a
Jul 30 14:17:48 fw kernel: [<ffffffff8024ad7b>] cpu_idle+0x97/0xba
Jul 30 14:17:48 fw kernel: [<ffffffff8064cb0f>] start_kernel+0x21f/0x224
Jul 30 14:17:48 fw kernel: [<ffffffff8064c1e5>] _sinittext+0x1e5/0x1eb
Jul 30 14:17:48 fw kernel:
Jul 30 14:17:48 fw kernel: handlers:
Jul 30 14:17:48 fw kernel: [<ffffffff8811b8dd>] (rtl8139_interrupt+0x0/0x421 [8139too])
Jul 30 14:17:48 fw kernel: Disabling IRQ #18
Jul 30 14:18:02 fw kernel: NETDEV WATCHDOG: eth1: transmit timed out
Jul 30 14:18:05 fw kernel: eth1: link up, 100Mbps, full-duplex, lpa 0x41E1
Jul 30 14:18:17 fw kernel: NETDEV WATCHDOG: eth1: transmit timed out
Jul 30 14:18:20 fw kernel: eth1: link up, 100Mbps, full-duplex, lpa 0x41E1
Jul 30 14:18:32 fw kernel: NETDEV WATCHDOG: eth1: transmit timed out
I see a similar story in the dom0's logs. As you can see, though, it disables the NIC's IRQ entirely and shuts the interface down.
Jul 30 13:46:54 server kernel: irq 18: nobody cared (try booting with the "irqpoll" option)
Jul 30 13:46:54 server kernel:
Jul 30 13:46:54 server kernel: Call Trace:
Jul 30 13:46:54 server kernel: <IRQ> [<ffffffff802b3e13>] __report_bad_irq+0x30/0x7d
Jul 30 13:46:54 server kernel: [<ffffffff802b404a>] note_interrupt+0x1ea/0x22b
Jul 30 13:46:54 server kernel: [<ffffffff802b3542>] __do_IRQ+0xbd/0x103
Jul 30 13:46:54 server kernel: [<ffffffff8029044e>] _local_bh_enable+0x61/0xc5
Jul 30 13:46:54 server kernel: [<ffffffff8026df5a>] do_IRQ+0xe7/0xf5
Jul 30 13:46:54 server kernel: [<ffffffff803b3993>] evtchn_do_upcall+0x13b/0x1fb
Jul 30 13:46:54 server kernel: [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
Jul 30 13:46:54 server kernel: <EOI> [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
Jul 30 13:46:54 server kernel: [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
Jul 30 13:46:54 server kernel: [<ffffffff8026f4fd>] raw_safe_halt+0x84/0xa8
Jul 30 13:46:54 server kernel: [<ffffffff8026ca92>] xen_idle+0x38/0x4a
Jul 30 13:46:54 server kernel: [<ffffffff8024b0b6>] cpu_idle+0x97/0xba
Jul 30 13:46:54 server kernel: [<ffffffff8064cb0f>] start_kernel+0x21f/0x224
Jul 30 13:46:54 server kernel: [<ffffffff8064c1e5>] _sinittext+0x1e5/0x1eb
Jul 30 13:46:54 server kernel:
Jul 30 13:46:54 server kernel: handlers:
Jul 30 13:46:54 server kernel: [<ffffffff803e7b6c>] (usb_hcd_irq+0x0/0x55)
Jul 30 13:46:54 server kernel: Disabling IRQ #18
Jul 30 14:26:06 server kernel: xenbr0: port 3(vif1.0) entering disabled state
Jul 30 14:26:06 server kernel: device vif1.0 left promiscuous mode
Jul 30 14:26:06 server kernel: xenbr0: port 3(vif1.0) entering disabled state
Jul 30 14:26:06 server kernel: ACPI: PCI interrupt for device 0000:02:04.0 disabled
Jul 30 14:26:06 server kernel: ACPI: PCI interrupt for device 0000:02:06.0 disabled
Jul 30 14:26:06 server kernel: ACPI: PCI interrupt for device 0000:02:07.0 disabled
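For anyone who wants to compare the two reports side by side, here is a rough Python sketch that pulls the IRQ number and the registered handler names out of syslog text like the above. It is not part of my setup. The function name find_unhandled_irqs and both regular expressions are my own assumptions, written against the line format shown in these logs (one syslog entry per line), and may need adjusting for other kernels.

```python
import re

# Assumed patterns, derived only from the log excerpts in this post.
IRQ_RE = re.compile(r"irq (\d+): nobody cared")
# Matches e.g. "(rtl8139_interrupt+0x0/0x421 [8139too])" or "(usb_hcd_irq+0x0/0x55)".
HANDLER_RE = re.compile(r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[\w+\])?\)")


def find_unhandled_irqs(log_text):
    """Return a list of (irq, [handler names]), one per 'nobody cared' report."""
    reports = []
    current = None        # the report currently being filled in
    in_handlers = False   # True while reading lines after "handlers:"
    for line in log_text.splitlines():
        match = IRQ_RE.search(line)
        if match:
            current = (int(match.group(1)), [])
            reports.append(current)
            in_handlers = False
            continue
        if current is None:
            continue
        if "handlers:" in line:
            in_handlers = True
            continue
        if in_handlers:
            handler = HANDLER_RE.search(line)
            if handler:
                current[1].append(handler.group(1))
            else:
                # First non-handler line (e.g. "Disabling IRQ #18") ends the list.
                in_handlers = False
    return reports
```

Feeding it the domU log and the dom0 log separately makes it easy to see that both complain about IRQ 18, but with different handlers registered on each side (rtl8139_interrupt in the domU, usb_hcd_irq in dom0).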
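The obvious answer is to do what the error message suggests and boot with the "irqpoll" option. However, this makes no difference, whether I boot the dom0 or the domU with "irqpoll". Does anyone have any suggestions? I'm getting a bit desperate here.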
Other technical details
Truncated "lspci -vv" output on dom0:
02:04.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
        Subsystem: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Interrupt: pin A routed to IRQ 18
        Region 0: I/O ports at de00 [disabled] [size=256]
        Region 1: Memory at fdeff000 (32-bit, non-prefetchable) [disabled] [size=256]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
cat /proc/interrupts on dom0:
           CPU0       CPU1       CPU2       CPU3
  1:          8          0          0          0  Phys-irq     i8042
  4:         10          0          0          0  Phys-irq     serial
  8:          0          0          0          0  Phys-irq     rtc
  9:          0          0          0          0  Phys-irq     acpi
 12:          4          0          0          0  Phys-irq     i8042
 14:     162844          0       1910          0  Phys-irq     ata_piix
 15:          0          0          0          0  Phys-irq     ata_piix
 16:          0          0          0          0  Phys-irq     uhci_hcd:usb5
 17:          0          0          0          0  Phys-irq     uhci_hcd:usb3
 18:     200001          0          0          0  Phys-irq     uhci_hcd:usb4
 19:          2          0          0          0  Phys-irq     ehci_hcd:usb1, uhci_hcd:usb2
254:     515869          0          0         33  Phys-irq     peth0
256:   26214795          0          0          0  Dynamic-irq  timer0
257:      26047          0          0          0  Dynamic-irq  resched0
258:         54          0          0          0  Dynamic-irq  callfunc0
259:          0      15252          0          0  Dynamic-irq  resched1
260:          0        176          0          0  Dynamic-irq  callfunc1
261:          0     768956          0          0  Dynamic-irq  timer1
262:          0          0      96066          0  Dynamic-irq  resched2
263:          0          0        175          0  Dynamic-irq  callfunc2
264:          0          0    2193136          0  Dynamic-irq  timer2
265:          0          0          0      30317  Dynamic-irq  resched3
266:          0          0          0        132  Dynamic-irq  callfunc3
267:          0          0          0     904610  Dynamic-irq  timer3
268:        371          0        512          0  Dynamic-irq  xenbus
NMI:          0          0          0          0
LOC:          0          0          0          0
ERR:          0
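To check which devices are registered on a given IRQ without reading the table by eye, something like the following Python sketch could be used. The function name parse_interrupts is hypothetical. It assumes the layout shown in my output: a header row of CPU names, one count per CPU, a single-token interrupt type such as Phys-irq, and then the device names. Other kernels print extra columns, so treat this as an illustration only.

```python
def parse_interrupts(text):
    """Map each numeric IRQ to (total count across CPUs, device string).

    Assumes the /proc/interrupts layout shown in this post; the type
    column (e.g. "Phys-irq") is assumed to be exactly one token.
    """
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue                      # skip summary rows such as NMI, LOC, ERR
        fields = rest.split()
        counts = [int(value) for value in fields[:ncpu]]
        # fields[ncpu] is the interrupt type; everything after it names the devices.
        devices = " ".join(fields[ncpu + 1:])
        table[int(label)] = (sum(counts), devices)
    return table


if __name__ == "__main__":
    with open("/proc/interrupts") as handle:
        for irq, (total, devices) in sorted(parse_interrupts(handle.read()).items()):
            print(irq, total, devices)
```

On the dom0 output above, the entry for IRQ 18 lists only uhci_hcd:usb4. The NIC hidden by pciback does not show up there, even though lspci reports it as routed to IRQ 18.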
Truncated "/boot/grub/grub.conf" on dom0:
title CentOS (2.6.18-194.3.1.el5.sb_iq1xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-194.3.1.el5.sb_iq1
        module /vmlinuz-2.6.18-194.3.1.el5.sb_iq1xen ro root=/dev/VolGroup00/LogVol00 xencons=off console=ttyS0,38400 irqpoll
        module /initrd-2.6.18-194.3.1.el5.sb_iq1xen.img
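If this is a traffic-volume issue, have you considered simply setting up traffic-shaping rules to keep the load below the threshold that triggers it?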