* Why the number of /proc/interrupts doesn't change when nic is under heavy workload?
@ 2012-01-15 20:53 UTC Yuehai Xu
From: Yuehai Xu
To: netdev; +Cc: linux-kernel, yhxu

Hi All,

The NIC in my server is an Intel Corporation 80003ES2LAN Gigabit Ethernet Controller, the driver is e1000e, and my Linux version is 3.1.4. I have a memcached server running on this 8-core box. The weird thing is that when the server is under heavy workload, the counts in /proc/interrupts don't change at all. Some details:

=======
cat /proc/interrupts | grep eth0
 68:  330887  330861  331432  330544  330346  330227  330830  330575  PCI-MSI-edge  eth0
=======
cat /proc/irq/68/smp_affinity
ff

I know that when the network is under heavy load, NAPI disables the NIC interrupt and polls the ring buffer instead. My question is: when is the NIC interrupt enabled again? It seems it is never re-enabled as long as the heavy workload continues, since the counts shown in /proc/interrupts don't change at all. In my case, one core is saturated by ksoftirqd, because lots of softirqs are pending on that core. I want to distribute those softirqs to the other cores, but even with RPS enabled, that core is still occupied by ksoftirqd, at nearly 100%.

I dove into the code and found these statements:

__napi_schedule ==>
        local_irq_save(flags);
        ____napi_schedule(&__get_cpu_var(softnet_data), n);
        local_irq_restore(flags);

Here local_irq_save() actually issues "cli", which disables interrupts on the local core. Is this what NAPI uses to disable the NIC interrupt? Personally I don't think so, because it only affects the local CPU.

I also found enable_irq/disable_irq/e1000_irq_enable/e1000_irq_disable under drivers/net/e1000e. Are these what NAPI uses to disable the NIC interrupt? I have failed to find any clue that they are called on the NAPI code path.

My current situation is that the other 7 cores are idle almost 60% of the time, while the single core occupied by ksoftirqd is 100% busy.

Thanks very much!
Yuehai
* Re: Why the number of /proc/interrupts doesn't change when nic is under heavy workload?
@ 2012-01-15 22:09 UTC Eric Dumazet
From: Eric Dumazet
To: Yuehai Xu; +Cc: netdev, linux-kernel, yhxu

On Sunday, January 15, 2012 at 15:53 -0500, Yuehai Xu wrote:
> Hi All,
>
> The NIC in my server is an Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller, the driver is e1000e, and my Linux version is
> 3.1.4. I have a memcached server running on this 8-core box. The weird
> thing is that when the server is under heavy workload, the counts in
> /proc/interrupts don't change at all.
>
> [...]
>
> I also found enable_irq/disable_irq/e1000_irq_enable/e1000_irq_disable
> under drivers/net/e1000e. Are these what NAPI uses to disable the NIC
> interrupt? I have failed to find any clue that they are called on the
> NAPI code path.

This is done in the device driver itself, not in generic NAPI code.

When the NAPI poll() gets fewer packets than its budget, it re-enables chip interrupts.

> My current situation is that the other 7 cores are idle almost 60% of
> the time, while the single core occupied by ksoftirqd is 100% busy.

You could post some info, like "cat /proc/net/softnet_stat".

If you use RPS under a very high workload, on a single-queue NIC, the best approach is to dedicate one cpu (for example cpu0) to packet dispatching, and the other cpus to IP/UDP handling:

echo 01 >/proc/irq/68/smp_affinity
echo fe >/sys/class/net/eth0/queues/rx-0/rps_cpus

Please keep in mind that if your memcached uses a single UDP socket, you probably hit a lot of contention on the socket spinlock and various counters. So maybe it would be better to _reduce_ the number of cpus handling network load, to reduce false sharing:

echo 0e >/sys/class/net/eth0/queues/rx-0/rps_cpus

Really, if you have a single UDP queue, the best would be not to use RPS at all and only have:

echo 01 >/proc/irq/68/smp_affinity

Then you could post the result of "perf top -C 0" so that we can spot obvious problems on the hot path for this particular cpu.
* Re: Why the number of /proc/interrupts doesn't change when nic is under heavy workload?
@ 2012-01-15 22:27 UTC Yuehai Xu
From: Yuehai Xu
To: Eric Dumazet; +Cc: netdev, linux-kernel, yhxu

Thanks for replying! Please see below:

On Sun, Jan 15, 2012 at 5:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This is done in the device driver itself, not in generic NAPI code.
>
> When the NAPI poll() gets fewer packets than its budget, it re-enables
> chip interrupts.

So you mean that if NAPI poll() gets at least as many packets as the budget, it will not re-enable chip interrupts, right? In that case, one core still bears the entire heavy workload. Could you briefly point me to where this control logic is in the kernel source? I have been looking for it for several days without luck.

> You could post some info, like "cat /proc/net/softnet_stat".
>
> If you use RPS under a very high workload, on a single-queue NIC, the
> best approach is to dedicate one cpu (for example cpu0) to packet
> dispatching, and the other cpus to IP/UDP handling:
>
> echo 01 >/proc/irq/68/smp_affinity
> echo fe >/sys/class/net/eth0/queues/rx-0/rps_cpus
>
> Please keep in mind that if your memcached uses a single UDP socket,
> you probably hit a lot of contention on the socket spinlock and various
> counters. So maybe it would be better to _reduce_ the number of cpus
> handling network load, to reduce false sharing.

My memcached uses 8 different UDP sockets (8 different UDP ports), so there should be no lock contention on a single UDP rx queue.

> Then you could post the result of "perf top -C 0" so that we can spot
> obvious problems on the hot path for this particular cpu.

Thanks!
Yuehai
* Re: Why the number of /proc/interrupts doesn't change when nic is under heavy workload?
@ 2012-01-15 22:45 UTC Yuehai Xu
From: Yuehai Xu
To: Eric Dumazet; +Cc: netdev, linux-kernel, yhxu

On Sun, Jan 15, 2012 at 5:27 PM, Yuehai Xu <yuehaixu@gmail.com> wrote:
>> This is done in the device driver itself, not in generic NAPI code.
>>
>> When the NAPI poll() gets fewer packets than its budget, it re-enables
>> chip interrupts.
>
> So you mean that if NAPI poll() gets at least as many packets as the
> budget, it will not re-enable chip interrupts, right? In that case,
> one core still bears the entire heavy workload. Could you briefly
> point me to where this control logic is in the kernel source? I have
> been looking for it for several days without luck.

I went through the code again. NAPI poll() actually invokes e1000_clean() (my driver is e1000e), and that routine does:

        ...
        adapter->clean_rx(adapter, &work_done, budget);
        ...
        /* If budget not fully consumed, exit the polling mode */
        if (work_done < budget) {
                ...
                e1000_irq_enable(adapter);
                ...
        }

I think this is what you described. Correct me if I am wrong: work_done is the number of packets NAPI has polled, while budget is a parameter the administrator can set. So if work_done is always greater than or equal to budget, the chip interrupt is never re-enabled. That makes sense. However, it also means that one particular core has to handle all the softirqs, because the irq smp_affinity doesn't take effect here. Even though RPS can offload some softirqs to other cores, it doesn't solve the problem completely.
* Re: Why the number of /proc/interrupts doesn't change when nic is under heavy workload?
@ 2012-01-15 23:10 UTC Eric Dumazet
From: Eric Dumazet
To: Yuehai Xu; +Cc: netdev, linux-kernel, yhxu

On Sunday, January 15, 2012 at 17:45 -0500, Yuehai Xu wrote:
> However, it also means that one particular core has to handle all the
> softirqs, because the irq smp_affinity doesn't take effect here.

You are missing something here. You really should not ask for hardware irqs to be distributed across all cores. That conflicts with RPS, and also with cache efficiency. In any case, if the load is high enough, only one core is calling the nic poll() from its NAPI handler.

One cpu handles the nic poll() and, thanks to RPS, distributes the packets to the other cpus, which then handle the IP stack plus the TCP/UDP stack in their own softirq handlers.

> Even though RPS can offload some softirqs to other cores, it doesn't
> solve the problem completely.

The problem is that we don't know yet what 'the problem' is, as you gave little info. (You didn't tell us whether your memcached uses TCP or UDP transport.)

If your workload consists of many short-lived TCP connections, RPS won't help that much, because the three-way handshake needs to hold the listener lock.
* Re: Why the number of /proc/interrupts doesn't change when nic is under heavy workload?
@ 2012-01-16 6:53 UTC Eric Dumazet
From: Eric Dumazet
To: Yuehai Xu; +Cc: netdev, linux-kernel, yhxu

On Sunday, January 15, 2012 at 17:27 -0500, Yuehai Xu wrote:
> Thanks for replying! Please see below:
>
> My memcached uses 8 different UDP sockets (8 different UDP ports), so
> there should be no lock contention on a single UDP rx queue.

Ah, I missed this mail. Then you really should post the result of "perf top -C 0" here, after making sure your NIC interrupts are handled by cpu 0.

Also, what speed is your link, and how many UDP messages per second do you receive?
* Re: Why the number of /proc/interrupts doesn't change when nic is under heavy workload?
@ 2012-01-16 7:01 UTC Eric Dumazet
From: Eric Dumazet
To: Yuehai Xu; +Cc: netdev, linux-kernel, yhxu

On Monday, January 16, 2012 at 07:53 +0100, Eric Dumazet wrote:
> Ah, I missed this mail. Then you really should post the result of
> "perf top -C 0" here, after making sure your NIC interrupts are
> handled by cpu 0.
>
> Also, what speed is your link, and how many UDP messages per second do
> you receive?

RPS is not good for you, because the generic rxhash computation will spread messages for UDP port XX across many different cpus: the rxhash takes the complete tuple (src ip, dst ip, src port, dst port) into account, not only the dst port.

It would be better for your workload to hash only the dst port, to avoid false sharing on the socket structure.

I guess we could extend the rxhash computation to use a pluggable BPF filter, now that we have fast filters.