* Why don't the counts in /proc/interrupts change when the nic is under heavy workload?
@ 2012-01-15 20:53 Yuehai Xu
  2012-01-15 22:09 ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Yuehai Xu @ 2012-01-15 20:53 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, yhxu

Hi All,

The NIC on my server is an Intel Corporation 80003ES2LAN Gigabit
Ethernet Controller, the driver is e1000e, and my Linux version is
3.1.4. I have a memcached server running on this 8-core box. The
weird thing is that when the server is under heavy workload, the
counts in /proc/interrupts don't change at all. Below are some
details:
=======
cat /proc/interrupts | grep eth0
68:     330887     330861     331432     330544     330346     330227
   330830     330575   PCI-MSI-edge      eth0
=======
cat /proc/irq/68/smp_affinity
ff

I know that when the network is under heavy load, NAPI disables the
NIC interrupt and polls the ring buffer in the NIC. My question is:
when is the NIC interrupt enabled again? It seems it is never
re-enabled as long as the heavy workload continues, simply because the
counts shown in /proc/interrupts don't change at all. In my case, one
of the cores is saturated by ksoftirqd, because lots of softirqs are
pending on that core. I just want to distribute these softirqs to the
other cores. Even with RPS enabled, that core is still occupied by
ksoftirqd, nearly 100%.

I dug into the code and found these statements:
__napi_schedule ==>
   local_irq_save(flags);
   ____napi_schedule(&__get_cpu_var(softnet_data), n);
   local_irq_restore(flags);

here "local_irq_save" actually invokes "cli" which disable interrupt
for the local core, is this the one that used in NAPI to disable nic
interrupt? Personally I don't think it is because it just disables
local cpu.

I also found "enable_irq/disable_irq/e1000_irq_enable/e1000_irq_disable"
under drivers/net/e1000e. Are these what NAPI uses to disable the NIC
interrupt? I can't find any evidence that they are called in the NAPI
code path.

My current situation is that the other 7 cores are idle almost 60% of
the time, while the one core occupied by ksoftirqd is 100% busy.

Thanks very much!
Yuehai


* Re: Why don't the counts in /proc/interrupts change when the nic is under heavy workload?
  2012-01-15 20:53 Why don't the counts in /proc/interrupts change when the nic is under heavy workload? Yuehai Xu
@ 2012-01-15 22:09 ` Eric Dumazet
  2012-01-15 22:27   ` Yuehai Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2012-01-15 22:09 UTC (permalink / raw)
  To: Yuehai Xu; +Cc: netdev, linux-kernel, yhxu

On Sunday 15 January 2012 at 15:53 -0500, Yuehai Xu wrote:
> Hi All,
> 
> The NIC on my server is an Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller, the driver is e1000e, and my Linux version is
> 3.1.4. I have a memcached server running on this 8-core box. The
> weird thing is that when the server is under heavy workload, the
> counts in /proc/interrupts don't change at all. Below are some
> details:
> =======
> cat /proc/interrupts | grep eth0
> 68:     330887     330861     331432     330544     330346     330227
>    330830     330575   PCI-MSI-edge      eth0
> =======
> cat /proc/irq/68/smp_affinity
> ff
> 
> I know that when the network is under heavy load, NAPI disables the
> NIC interrupt and polls the ring buffer in the NIC. My question is:
> when is the NIC interrupt enabled again? It seems it is never
> re-enabled as long as the heavy workload continues, simply because the
> counts shown in /proc/interrupts don't change at all. In my case, one
> of the cores is saturated by ksoftirqd, because lots of softirqs are
> pending on that core. I just want to distribute these softirqs to the
> other cores. Even with RPS enabled, that core is still occupied by
> ksoftirqd, nearly 100%.
> 
> I dug into the code and found these statements:
> __napi_schedule ==>
>    local_irq_save(flags);
>    ____napi_schedule(&__get_cpu_var(softnet_data), n);
>    local_irq_restore(flags);
> 
> here "local_irq_save" actually invokes "cli" which disable interrupt
> for the local core, is this the one that used in NAPI to disable nic
> interrupt? Personally I don't think it is because it just disables
> local cpu.
> 
> I also found "enable_irq/disable_irq/e1000_irq_enable/e1000_irq_disable"
> under drivers/net/e1000e. Are these what NAPI uses to disable the NIC
> interrupt? I can't find any evidence that they are called in the NAPI
> code path.

This is done in the device driver itself, not in generic NAPI code.

When NAPI poll() gets fewer packets than the budget, it re-enables chip
interrupts.
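
Roughly, every NAPI driver's poll() follows the same shape. Here is a
schematic sketch (not the actual e1000e code; the mydrv_* names are
placeholders for driver-specific helpers):

static int mydrv_poll(struct napi_struct *napi, int budget)
{
        struct mydrv_adapter *adapter =
                container_of(napi, struct mydrv_adapter, napi);
        int work_done;

        /* pull at most 'budget' packets out of the RX ring */
        work_done = mydrv_clean_rx_ring(adapter, budget);

        if (work_done < budget) {
                /* ring is (nearly) empty: leave polling mode and let
                 * the chip raise interrupts again for the next packet */
                napi_complete(napi);
                mydrv_irq_enable(adapter);
        }
        /* if work_done == budget, we stay on the poll list and the
         * softirq calls poll() again -- no hardware interrupt needed */
        return work_done;
}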


> 
> My current situation is that the other 7 cores are idle almost 60% of
> the time, while the one core occupied by ksoftirqd is 100% busy.
> 

You could post some info, like "cat /proc/net/softnet_stat"

If you use RPS under a very high workload on a mono-queue NIC, the best
setup is to dedicate one cpu (for example cpu0) to packet dispatching,
and the other cpus to IP/UDP handling.

echo 01 >/proc/irq/68/smp_affinity
echo fe >/sys/class/net/eth0/queues/rx-0/rps_cpus

Please keep in mind that if your memcached uses a single UDP socket, you
probably hit a lot of contention on the socket spinlock and various
counters. So it might be better to _reduce_ the number of cpus handling
network load, to reduce false sharing.

echo 0e >/sys/class/net/eth0/queues/rx-0/rps_cpus

Really, if you have a single UDP queue, the best would be to not use RPS
at all and only have:

echo 01 >/proc/irq/68/smp_affinity

Then you could post the result of "perf top -C 0" so that we can spot
obvious problems on the hot path for this particular cpu.





* Re: Why don't the counts in /proc/interrupts change when the nic is under heavy workload?
  2012-01-15 22:09 ` Eric Dumazet
@ 2012-01-15 22:27   ` Yuehai Xu
  2012-01-15 22:45     ` Yuehai Xu
  2012-01-16  6:53     ` Eric Dumazet
  0 siblings, 2 replies; 7+ messages in thread
From: Yuehai Xu @ 2012-01-15 22:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, yhxu

Thanks for replying! Please see below:

On Sun, Jan 15, 2012 at 5:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sunday 15 January 2012 at 15:53 -0500, Yuehai Xu wrote:
>> Hi All,
>>
>> The NIC on my server is an Intel Corporation 80003ES2LAN Gigabit
>> Ethernet Controller, the driver is e1000e, and my Linux version is
>> 3.1.4. I have a memcached server running on this 8-core box. The
>> weird thing is that when the server is under heavy workload, the
>> counts in /proc/interrupts don't change at all. Below are some
>> details:
>> =======
>> cat /proc/interrupts | grep eth0
>> 68:     330887     330861     331432     330544     330346     330227
>>    330830     330575   PCI-MSI-edge      eth0
>> =======
>> cat /proc/irq/68/smp_affinity
>> ff
>>
>> I know that when the network is under heavy load, NAPI disables the
>> NIC interrupt and polls the ring buffer in the NIC. My question is:
>> when is the NIC interrupt enabled again? It seems it is never
>> re-enabled as long as the heavy workload continues, simply because the
>> counts shown in /proc/interrupts don't change at all. In my case, one
>> of the cores is saturated by ksoftirqd, because lots of softirqs are
>> pending on that core. I just want to distribute these softirqs to the
>> other cores. Even with RPS enabled, that core is still occupied by
>> ksoftirqd, nearly 100%.
>>
>> I dug into the code and found these statements:
>> __napi_schedule ==>
>>    local_irq_save(flags);
>>    ____napi_schedule(&__get_cpu_var(softnet_data), n);
>>    local_irq_restore(flags);
>>
>> here "local_irq_save" actually invokes "cli" which disable interrupt
>> for the local core, is this the one that used in NAPI to disable nic
>> interrupt? Personally I don't think it is because it just disables
>> local cpu.
>>
>> I also found "enable_irq/disable_irq/e1000_irq_enable/e1000_irq_disable"
>> under drivers/net/e1000e. Are these what NAPI uses to disable the NIC
>> interrupt? I can't find any evidence that they are called in the NAPI
>> code path.
>
> This is done in the device driver itself, not in generic NAPI code.
>
> When NAPI poll() gets fewer packets than the budget, it re-enables chip
> interrupts.
>
>

So you mean that if NAPI poll() gets at least as many packets as the
budget, it will not re-enable chip interrupts, right? In that case, one
core still suffers the whole heavy workload. Could you briefly show me
where this control statement is in the kernel source code? I have been
looking for it for several days without luck.


>>
>> My current situation is that the other 7 cores are idle almost 60% of
>> the time, while the one core occupied by ksoftirqd is 100% busy.
>>
>
> You could post some info, like "cat /proc/net/softnet_stat"
>
> If you use RPS under a very high workload on a mono-queue NIC, the best
> setup is to dedicate one cpu (for example cpu0) to packet dispatching,
> and the other cpus to IP/UDP handling.
>
> echo 01 >/proc/irq/68/smp_affinity
> echo fe >/sys/class/net/eth0/queues/rx-0/rps_cpus
>
> Please keep in mind that if your memcached uses a single UDP socket, you
> probably hit a lot of contention on the socket spinlock and various
> counters. So it might be better to _reduce_ the number of cpus handling
> network load, to reduce false sharing.

My memcached uses 8 different UDP sockets (8 different UDP ports), so
there should be no lock contention for a single UDP rx-queue.

>
> echo 0e >/sys/class/net/eth0/queues/rx-0/rps_cpus
>
> Really, if you have a single UDP queue, the best would be to not use RPS
> at all and only have:
>
> echo 01 >/proc/irq/68/smp_affinity
>
> Then you could post the result of "perf top -C 0" so that we can spot
> obvious problems on the hot path for this particular cpu.
>
>
>

Thanks!
Yuehai


* Re: Why don't the counts in /proc/interrupts change when the nic is under heavy workload?
  2012-01-15 22:27   ` Yuehai Xu
@ 2012-01-15 22:45     ` Yuehai Xu
  2012-01-15 23:10       ` Eric Dumazet
  2012-01-16  6:53     ` Eric Dumazet
  1 sibling, 1 reply; 7+ messages in thread
From: Yuehai Xu @ 2012-01-15 22:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, yhxu

On Sun, Jan 15, 2012 at 5:27 PM, Yuehai Xu <yuehaixu@gmail.com> wrote:
> Thanks for replying! Please see below:
>
> On Sun, Jan 15, 2012 at 5:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Sunday 15 January 2012 at 15:53 -0500, Yuehai Xu wrote:
>>> Hi All,
>>>
>>> The NIC on my server is an Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller, the driver is e1000e, and my Linux version is
>>> 3.1.4. I have a memcached server running on this 8-core box. The
>>> weird thing is that when the server is under heavy workload, the
>>> counts in /proc/interrupts don't change at all. Below are some
>>> details:
>>> =======
>>> cat /proc/interrupts | grep eth0
>>> 68:     330887     330861     331432     330544     330346     330227
>>>    330830     330575   PCI-MSI-edge      eth0
>>> =======
>>> cat /proc/irq/68/smp_affinity
>>> ff
>>>
>>> I know that when the network is under heavy load, NAPI disables the
>>> NIC interrupt and polls the ring buffer in the NIC. My question is:
>>> when is the NIC interrupt enabled again? It seems it is never
>>> re-enabled as long as the heavy workload continues, simply because the
>>> counts shown in /proc/interrupts don't change at all. In my case, one
>>> of the cores is saturated by ksoftirqd, because lots of softirqs are
>>> pending on that core. I just want to distribute these softirqs to the
>>> other cores. Even with RPS enabled, that core is still occupied by
>>> ksoftirqd, nearly 100%.
>>>
>>> I dug into the code and found these statements:
>>> __napi_schedule ==>
>>>    local_irq_save(flags);
>>>    ____napi_schedule(&__get_cpu_var(softnet_data), n);
>>>    local_irq_restore(flags);
>>>
>>> here "local_irq_save" actually invokes "cli" which disable interrupt
>>> for the local core, is this the one that used in NAPI to disable nic
>>> interrupt? Personally I don't think it is because it just disables
>>> local cpu.
>>>
>>> I also found "enable_irq/disable_irq/e1000_irq_enable/e1000_irq_disable"
>>> under drivers/net/e1000e. Are these what NAPI uses to disable the NIC
>>> interrupt? I can't find any evidence that they are called in the NAPI
>>> code path.
>>
>> This is done in the device driver itself, not in generic NAPI code.
>>
>> When NAPI poll() gets fewer packets than the budget, it re-enables chip
>> interrupts.
>>
>>
>
> So you mean that if NAPI poll() gets at least as many packets as the
> budget, it will not re-enable chip interrupts, right? In that case, one
> core still suffers the whole heavy workload. Could you briefly show me
> where this control statement is in the kernel source code? I have been
> looking for it for several days without luck.
>

I went through the code again. NAPI's poll() actually invokes
e1000_clean() (my driver is e1000e), and this routine contains:
....
adapter->clean_rx(adapter, &work_done, budget);
....
/* If budget not fully consumed, exit the polling mode */
if (work_done < budget) {
     ....
     e1000_irq_enable(adapter);
     .....
}

I think the above is what you were describing. Correct me if I am wrong:
"work_done" is the number of packets that NAPI has polled, while budget
is a parameter the administrator can set. So, if work_done is always
larger than or equal to budget, the chip interrupt will never be
re-enabled. This makes sense. However, it also means that one particular
core has to handle all the softirqs, simply because my irq smp_affinity
doesn't take effect here. Even though RPS can offload some softirqs to
other cores, it doesn't solve the problem 100%.
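
For reference, the loop that keeps calling poll() on one cpu is
net_rx_action() in net/core/dev.c. As far as I can tell it looks
roughly like this (heavily simplified, details and error handling
elided):

static void net_rx_action(struct softirq_action *h)
{
        struct softnet_data *sd = &__get_cpu_var(softnet_data);
        int budget = netdev_budget;  /* /proc/sys/net/core/netdev_budget */
        unsigned long time_limit = jiffies + 2;

        while (!list_empty(&sd->poll_list)) {
                struct napi_struct *n = list_first_entry(&sd->poll_list,
                                        struct napi_struct, poll_list);

                /* each device may consume at most its weight per round */
                budget -= n->poll(n, n->weight);

                if (budget <= 0 || time_after_eq(jiffies, time_limit)) {
                        /* out of budget/time: re-raise NET_RX_SOFTIRQ on
                         * *this* cpu, so ksoftirqd here keeps running */
                        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                        break;
                }
        }
}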

>
>>>
>>> My current situation is that the other 7 cores are idle almost 60% of
>>> the time, while the one core occupied by ksoftirqd is 100% busy.
>>>
>>
>> You could post some info, like "cat /proc/net/softnet_stat"
>>
>> If you use RPS under a very high workload on a mono-queue NIC, the best
>> setup is to dedicate one cpu (for example cpu0) to packet dispatching,
>> and the other cpus to IP/UDP handling.
>>
>> echo 01 >/proc/irq/68/smp_affinity
>> echo fe >/sys/class/net/eth0/queues/rx-0/rps_cpus
>>
>> Please keep in mind that if your memcached uses a single UDP socket, you
>> probably hit a lot of contention on the socket spinlock and various
>> counters. So it might be better to _reduce_ the number of cpus handling
>> network load, to reduce false sharing.
>
>> My memcached uses 8 different UDP sockets (8 different UDP ports), so
> there should be no lock contention for a single UDP rx-queue.
>
>>
>> echo 0e >/sys/class/net/eth0/queues/rx-0/rps_cpus
>>
>> Really, if you have a single UDP queue, the best would be to not use RPS
>> at all and only have:
>>
>> echo 01 >/proc/irq/68/smp_affinity
>>
>> Then you could post the result of "perf top -C 0" so that we can spot
>> obvious problems on the hot path for this particular cpu.
>>
>>
>>
>
> Thanks!
> Yuehai


* Re: Why don't the counts in /proc/interrupts change when the nic is under heavy workload?
  2012-01-15 22:45     ` Yuehai Xu
@ 2012-01-15 23:10       ` Eric Dumazet
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2012-01-15 23:10 UTC (permalink / raw)
  To: Yuehai Xu; +Cc: netdev, linux-kernel, yhxu

On Sunday 15 January 2012 at 17:45 -0500, Yuehai Xu wrote:

> However, it also means that one particular core has to handle all the
> softirqs, simply because my irq smp_affinity doesn't take effect here.

You are missing something here. You really should not try to distribute
the hardware irqs over all the cores. That conflicts with RPS, and also
with cache efficiency.

And anyway, if the load is high enough, only one core is calling the nic
poll() from its NAPI handler.

One cpu handles the nic poll() and, thanks to RPS, distributes packets
to the other cpus so that they handle (in their softirq handler) the IP
stack plus the TCP/UDP stack.
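
Schematically, the RPS hand-off in net/core/dev.c looks like this
(simplified from memory; the real code has RCU locking and flow-table
handling around it):

int netif_receive_skb(struct sk_buff *skb)
{
        struct rps_dev_flow voidflow, *rflow = &voidflow;
        int cpu;

        /* hash the packet and pick a target cpu from rx-0/rps_cpus */
        cpu = get_rps_cpu(skb->dev, skb, &rflow);
        if (cpu >= 0)
                /* queue on that cpu's backlog and kick it with an IPI;
                 * it then runs the IP/UDP stack in its own softirq */
                return enqueue_to_backlog(skb, cpu, &rflow->last_qtail);

        /* no RPS target: process on the current cpu */
        return __netif_receive_skb(skb);
}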

> Even though RPS can offload some softirqs to other cores, it doesn't
> solve the problem 100%.

The problem is that we don't yet know what 'the problem' is, as you have
given little info.

(You didn't tell us whether your memcached is using TCP or UDP
transport.)

If your workload consists of many short-lived tcp connections, RPS won't
help that much, because the three-way handshake needs to hold the
listener lock.





* Re: Why don't the counts in /proc/interrupts change when the nic is under heavy workload?
  2012-01-15 22:27   ` Yuehai Xu
  2012-01-15 22:45     ` Yuehai Xu
@ 2012-01-16  6:53     ` Eric Dumazet
  2012-01-16  7:01       ` Eric Dumazet
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2012-01-16  6:53 UTC (permalink / raw)
  To: Yuehai Xu; +Cc: netdev, linux-kernel, yhxu

On Sunday 15 January 2012 at 17:27 -0500, Yuehai Xu wrote:
> Thanks for replying! Please see below:

> My memcached uses 8 different UDP sockets (8 different UDP ports), so
> there should be no lock contention for a single UDP rx-queue.

Ah, I missed this mail. In that case, you really should post the result
of "perf top -C 0" here, after making sure your NIC interrupts are
handled by cpu 0.

Also, what speed is your link, and how many UDP messages per second do
you receive?





* Re: Why don't the counts in /proc/interrupts change when the nic is under heavy workload?
  2012-01-16  6:53     ` Eric Dumazet
@ 2012-01-16  7:01       ` Eric Dumazet
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2012-01-16  7:01 UTC (permalink / raw)
  To: Yuehai Xu; +Cc: netdev, linux-kernel, yhxu

On Monday 16 January 2012 at 07:53 +0100, Eric Dumazet wrote:
> On Sunday 15 January 2012 at 17:27 -0500, Yuehai Xu wrote:
> > Thanks for replying! Please see below:
> 
> > My memcached uses 8 different UDP sockets (8 different UDP ports), so
> > there should be no lock contention for a single UDP rx-queue.
> 
> Ah, I missed this mail. In that case, you really should post the result
> of "perf top -C 0" here, after making sure your NIC interrupts are
> handled by cpu 0.
>
> Also, what speed is your link, and how many UDP messages per second do
> you receive?
> 

RPS is not good for you, because the generic rxhash computation will
spread the messages for a given UDP port XX over many different cpus
(the rxhash computation takes into account the complete tuple (src ip,
dst ip, src port, dst port), not only the dst port).

It would be better for your workload to hash only the dst port, to avoid
false sharing on the socket structure.

I guess we could extend the rxhash computation to use a pluggable BPF
filter, now that we have fast filters.
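
To see the effect, here is a toy userspace illustration (the hash below
is just a stand-in for the kernel's jhash-based rxhash, and the
addresses/ports are made up): a 4-tuple hash scatters one destination
port over many cpus, while a dst-port-only hash keeps it on one cpu.

#include <stdio.h>
#include <stdint.h>

/* toy mixing function, NOT the kernel's hash */
static uint32_t mix(uint32_t h, uint32_t v)
{
        h ^= v;
        h *= 0x9e3779b1u;
        return h ^ (h >> 16);
}

int main(void)
{
        uint32_t saddr = 0x0a000001, daddr = 0x0a000002; /* 10.0.0.1 -> 10.0.0.2 */
        uint32_t dport = 11211;                          /* one memcached port */
        uint32_t ncpus = 8;
        uint32_t sport;

        for (sport = 40000; sport < 40008; sport++) {
                uint32_t full = mix(mix(mix(mix(0, saddr), daddr), sport), dport);
                uint32_t dst_only = mix(0, dport);

                printf("sport %u: 4-tuple hash -> cpu %u, dport-only -> cpu %u\n",
                       (unsigned)sport, (unsigned)(full % ncpus),
                       (unsigned)(dst_only % ncpus));
        }
        return 0;
}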





Thread overview: 7+ messages
2012-01-15 20:53 Why don't the counts in /proc/interrupts change when the nic is under heavy workload? Yuehai Xu
2012-01-15 22:09 ` Eric Dumazet
2012-01-15 22:27   ` Yuehai Xu
2012-01-15 22:45     ` Yuehai Xu
2012-01-15 23:10       ` Eric Dumazet
2012-01-16  6:53     ` Eric Dumazet
2012-01-16  7:01       ` Eric Dumazet
