linux-kernel.vger.kernel.org archive mirror
* questions on NAPI processing latency and dropped network packets
@ 2008-01-10 17:24 Chris Friesen
  2008-01-10 17:37 ` Kok, Auke
                   ` (5 more replies)
  0 siblings, 6 replies; 36+ messages in thread
From: Chris Friesen @ 2008-01-10 17:24 UTC (permalink / raw)
  To: netdev, linux-kernel

Hi all,

I've got an issue that's popped up with a deployed system running 
2.6.10.  I'm looking for some help figuring out why incoming network 
packets aren't being processed fast enough.

After a recent userspace app change, we've started seeing packets being 
dropped by the ethernet hardware (e1000, NAPI is enabled).  The 
error/dropped/fifo counts are going up in ethtool:

      rx_packets: 32180834
      rx_bytes: 5480756958
      rx_errors: 862506
      rx_dropped: 771345
      rx_length_errors: 0
      rx_over_errors: 0
      rx_crc_errors: 0
      rx_frame_errors: 0
      rx_fifo_errors: 91161
      rx_missed_errors: 91161

This link is receiving roughly 13K packets/sec, and we're dropping 
roughly 51 packets/sec due to fifo errors.

Increasing the rx descriptor ring size from 256 up to around 3000 or so 
seems to make the problem stop, but it seems to me that this is just a 
workaround for the latency in processing the incoming packets.
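
(As a rough sanity check: at ~13K packets/sec the default 256-descriptor
ring represents about 256/13000, or roughly 20 ms, of traffic, so any
stretch of that length where the ring isn't serviced will overflow it;
~3000 descriptors pushes that out to around a quarter of a second.)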

So, I'm looking for some suggestions on how to fix this or to figure out 
where the latency is coming from.

Some additional information:


1) Interrupts are being processed on both cpus:

root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
            CPU0       CPU1
  30:    1703756    4530785  U3-MPIC Level     eth0




2) "top" shows a fair amount of time processing softirqs, but very 
little time in ksoftirqd (or is that a sampling artifact?).


Tasks: 79 total, 1 running, 78 sleeping, 0 stopped, 0 zombie
Cpu0: 23.6% us, 30.9% sy, 0.0% ni, 36.9% id, 0.0% wa, 0.3% hi, 8.3% si
Cpu1: 30.4% us, 24.1% sy, 0.0% ni, 5.9% id, 0.0% wa, 0.7% hi, 38.9% si
Mem:  4007812k total, 2199148k used,  1808664k free,     0k buffers
Swap:   0k total,       0k used,      0k free,   219844k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  5375 root      15   0 2682m 1.8g 6640 S 99.9 46.7  31:17.68 SigtranServices
  7696 root      17   0  6952 3212 1192 S  7.3  0.1   0:15.75 schedmon.ppc210
  7859 root      16   0  2688 1228  964 R  0.7  0.0   0:00.04 top
  2956 root       8  -8 18940 7436 5776 S  0.3  0.2   0:01.35 blademtc
     1 root      16   0  1660  620  532 S  0.0  0.0   0:30.62 init
     2 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/0
     3 root      15   0     0    0    0 S  0.0  0.0   0:00.55 ksoftirqd/0
     4 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/1
     5 root      15   0     0    0    0 S  0.0  0.0   0:00.43 ksoftirqd/1


3) /proc/sys/net/core/netdev_max_backlog is set to the default of 300


So...anyone have any ideas/suggestions?

Thanks,

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 17:24 questions on NAPI processing latency and dropped network packets Chris Friesen
@ 2008-01-10 17:37 ` Kok, Auke
  2008-01-10 18:12   ` Chris Friesen
  2008-01-10 18:25 ` James Chapman
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Kok, Auke @ 2008-01-10 17:37 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel

Chris Friesen wrote:
> Hi all,
> 
> I've got an issue that's popped up with a deployed system running
> 2.6.10.  I'm looking for some help figuring out why incoming network
> packets aren't being processed fast enough.
> 
> After a recent userspace app change, we've started seeing packets being
> dropped by the ethernet hardware (e1000, NAPI is enabled).  The
> error/dropped/fifo counts are going up in ethtool:
> 
>      rx_packets: 32180834
>      rx_bytes: 5480756958
>      rx_errors: 862506
>      rx_dropped: 771345
>      rx_length_errors: 0
>      rx_over_errors: 0
>      rx_crc_errors: 0
>      rx_frame_errors: 0
>      rx_fifo_errors: 91161
>      rx_missed_errors: 91161
> 
> This link is receiving roughly 13K packets/sec, and we're dropping
> roughly 51 packets/sec due to fifo errors.
> 
> Increasing the rx descriptor ring size from 256 up to around 3000 or so
> seems to make the problem stop, but it seems to me that this is just a
> workaround for the latency in processing the incoming packets.
> 
> So, I'm looking for some suggestions on how to fix this or to figure out
> where the latency is coming from.
> 
> Some additional information:
> 
> 
> 1) Interrupts are being processed on both cpus:
> 
> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
>            CPU0       CPU1
>  30:    1703756    4530785  U3-MPIC Level     eth0
> 
> 
> 
> 
> 2) "top" shows a fair amount of time processing softirqs, but very
> little time in ksoftirqd (or is that a sampling artifact?).
> 
> 
> Tasks: 79 total, 1 running, 78 sleeping, 0 stopped, 0 zombie
> Cpu0: 23.6% us, 30.9% sy, 0.0% ni, 36.9% id, 0.0% wa, 0.3% hi, 8.3% si
> Cpu1: 30.4% us, 24.1% sy, 0.0% ni, 5.9% id, 0.0% wa, 0.7% hi, 38.9% si
> Mem:  4007812k total, 2199148k used,  1808664k free,     0k buffers
> Swap:   0k total,       0k used,      0k free,   219844k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5375 root      15   0 2682m 1.8g 6640 S 99.9 46.7  31:17.68
> SigtranServices
>  7696 root      17   0  6952 3212 1192 S  7.3  0.1   0:15.75
> schedmon.ppc210
>  7859 root      16   0  2688 1228  964 R  0.7  0.0   0:00.04 top
>  2956 root       8  -8 18940 7436 5776 S  0.3  0.2   0:01.35 blademtc
>     1 root      16   0  1660  620  532 S  0.0  0.0   0:30.62 init
>     2 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/0
>     3 root      15   0     0    0    0 S  0.0  0.0   0:00.55 ksoftirqd/0
>     4 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/1
>     5 root      15   0     0    0    0 S  0.0  0.0   0:00.43 ksoftirqd/1
> 
> 
> 3) /proc/sys/net/core/netdev_max_backlog is set to the default of 300
> 
> 
> So...anyone have any ideas/suggestions?

You're using 2.6.10... you can always replace the e1000 module with the
out-of-tree version from e1000.sf.net, this might help a bit - the version in the
2.6.10 kernel is very very old.

it also appears that your app is eating up CPU time. perhaps setting the app to a
nicer nice level might mitigate things a bit. Also turn off the in-kernel irq
mitigation, it just causes cache misses and you really need the network irq to sit
on a single cpu most (if not all) of the time to get the best performance. Use the
userspace irqbalance daemon instead to achieve this.
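
(for a quick manual test you don't even need irqbalance: assuming irq 30
from your /proc/interrupts output, something like
"echo 1 > /proc/irq/30/smp_affinity" should keep eth0's interrupt on cpu0)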

Auke


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 17:37 ` Kok, Auke
@ 2008-01-10 18:12   ` Chris Friesen
  2008-01-10 18:26     ` Kok, Auke
  0 siblings, 1 reply; 36+ messages in thread
From: Chris Friesen @ 2008-01-10 18:12 UTC (permalink / raw)
  To: Kok, Auke; +Cc: netdev, linux-kernel

Kok, Auke wrote:

> You're using 2.6.10... you can always replace the e1000 module with the
> out-of-tree version from e1000.sf.net, this might help a bit - the version in the
> 2.6.10 kernel is very very old.

Do you have any reason to believe this would improve things?  It seems 
like the problem lies in the NAPI/softirq code rather than in the e1000 
driver itself, no?

> it also appears that your app is eating up CPU time. perhaps setting the app to a
> nicer nice level might mitigate things a bit.

If we're not handling the softirq work from ksoftirqd, how would changing 
scheduler settings affect anything?

> Also turn off the in-kernel irq
> mitigation, it just causes cache misses and you really need the network irq to sit
> on a single cpu at most (if not all) the time to get the best performance. Use the
> userspace irqbalance daemon instead to achieve this.

Using userspace irqbalance would be some effort to test and deploy 
properly.  However, as a quick test I tried setting the irq affinity for 
this device and it didn't help.

One thing that might be of interest is that it seems to be bursty rather 
than gradual.  Here are some timestamps (in seconds) along with the 
number of overruns on eth0:

6552.15  overruns:260097
6552.69  overruns:260097
6553.32  overruns:260097
6553.83  overruns:260097
6554.35  overruns:260097
6554.87  overruns:260097
6555.41  overruns:260097
6555.94  overruns:260097
6556.51  overruns:260097
6557.07  overruns:260282
6557.58  overruns:260282
6558.23  overruns:260282


Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 17:24 questions on NAPI processing latency and dropped network packets Chris Friesen
  2008-01-10 17:37 ` Kok, Auke
@ 2008-01-10 18:25 ` James Chapman
  2008-01-10 21:29   ` Chris Friesen
  2008-01-10 18:41 ` Rick Jones
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: James Chapman @ 2008-01-10 18:25 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel

Chris Friesen wrote:
> Hi all,
> 
> I've got an issue that's popped up with a deployed system running
> 2.6.10.  I'm looking for some help figuring out why incoming network
> packets aren't being processed fast enough.
> 
> After a recent userspace app change, we've started seeing packets being
> dropped by the ethernet hardware (e1000, NAPI is enabled).

What's changed in your application? Any real-time threads in there?

From the top output below, looks like SigtranServices is consuming all
your CPU...

> The
> error/dropped/fifo counts are going up in ethtool:
> 
>      rx_packets: 32180834
>      rx_bytes: 5480756958
>      rx_errors: 862506
>      rx_dropped: 771345
>      rx_length_errors: 0
>      rx_over_errors: 0
>      rx_crc_errors: 0
>      rx_frame_errors: 0
>      rx_fifo_errors: 91161
>      rx_missed_errors: 91161
> 
> This link is receiving roughly 13K packets/sec, and we're dropping
> roughly 51 packets/sec due to fifo errors.
> 
> Increasing the rx descriptor ring size from 256 up to around 3000 or so
> seems to make the problem stop, but it seems to me that this is just a
> workaround for the latency in processing the incoming packets.
> 
> So, I'm looking for some suggestions on how to fix this or to figure out
> where the latency is coming from.
> 
> Some additional information:
> 
> 
> 1) Interrupts are being processed on both cpus:
> 
> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
>            CPU0       CPU1
>  30:    1703756    4530785  U3-MPIC Level     eth0
> 
> 
> 
> 
> 2) "top" shows a fair amount of time processing softirqs, but very
> little time in ksoftirqd (or is that a sampling artifact?).
> 
> 
> Tasks: 79 total, 1 running, 78 sleeping, 0 stopped, 0 zombie
> Cpu0: 23.6% us, 30.9% sy, 0.0% ni, 36.9% id, 0.0% wa, 0.3% hi, 8.3% si
> Cpu1: 30.4% us, 24.1% sy, 0.0% ni, 5.9% id, 0.0% wa, 0.7% hi, 38.9% si
> Mem:  4007812k total, 2199148k used,  1808664k free,     0k buffers
> Swap:   0k total,       0k used,      0k free,   219844k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5375 root      15   0 2682m 1.8g 6640 S 99.9 46.7  31:17.68
> SigtranServices
>  7696 root      17   0  6952 3212 1192 S  7.3  0.1   0:15.75
> schedmon.ppc210
>  7859 root      16   0  2688 1228  964 R  0.7  0.0   0:00.04 top
>  2956 root       8  -8 18940 7436 5776 S  0.3  0.2   0:01.35 blademtc
>     1 root      16   0  1660  620  532 S  0.0  0.0   0:30.62 init
>     2 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/0
>     3 root      15   0     0    0    0 S  0.0  0.0   0:00.55 ksoftirqd/0
>     4 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/1
>     5 root      15   0     0    0    0 S  0.0  0.0   0:00.43 ksoftirqd/1
> 
> 
> 3) /proc/sys/net/core/netdev_max_backlog is set to the default of 300
> 
> 
> So...anyone have any ideas/suggestions?
> 
> Thanks,
> 
> Chris

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 18:12   ` Chris Friesen
@ 2008-01-10 18:26     ` Kok, Auke
  0 siblings, 0 replies; 36+ messages in thread
From: Kok, Auke @ 2008-01-10 18:26 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel

Chris Friesen wrote:
> Kok, Auke wrote:
> 
>> You're using 2.6.10... you can always replace the e1000 module with the
>> out-of-tree version from e1000.sf.net, this might help a bit - the
>> version in the
>> 2.6.10 kernel is very very old.
> 
> Do you have any reason to believe this would improve things?  It seems
> like the problem lies in the NAPI/softirq code rather than in the e1000
> driver itself, no?

your real issue is that your userspace app is hogging the CPU. While network is
not really cpu intensive, it does require that the CPU regularly gets enough
time to run cleanups and prevent FIFO issues.

alternatively, you can increase your rx/tx ring descriptor count (with ethtool),
which basically allows the hardware to go unserviced for a longer period, since
there are more buffers available and the card can keep going longer while
userspace is hogging the CPU.
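
(that's just "ethtool -g eth0" to see the current and maximum ring sizes,
and something like "ethtool -G eth0 rx 1024" to bump it, up to whatever the
hardware supports)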

>> it also appears that your app is eating up CPU time. perhaps setting
>> the app to a
>> nicer nice level might mitigate things a bit.
> 
> If we're not handling the softirq work from ksoftirqd how would changing
> scheduler settings affect anything?

correct, it might not.

>> Also turn off the in-kernel irq
>> mitigation, it just causes cache misses and you really need the
>> network irq to sit
>> on a single cpu at most (if not all) the time to get the best
>> performance. Use the
>> userspace irqbalance daemon instead to achieve this.
> 
> Using userspace irqbalance would be some effort to test and deploy
> properly.  However, as a quick test I tried setting the irq affinity for
> this device and it didn't help.

irqbalance is a simple userspace app that drops into any system seamlessly and
does the best job all around - often it even beats manual tuning of smp_affinity ;)

Auke

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 17:24 questions on NAPI processing latency and dropped network packets Chris Friesen
  2008-01-10 17:37 ` Kok, Auke
  2008-01-10 18:25 ` James Chapman
@ 2008-01-10 18:41 ` Rick Jones
  2008-01-10 19:01   ` Kok, Auke
  2008-01-11  1:20 ` David Miller
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Rick Jones @ 2008-01-10 18:41 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel

> 1) Interrupts are being processed on both cpus:
> 
> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
>            CPU0       CPU1
>  30:    1703756    4530785  U3-MPIC Level     eth0

IIRC none of the e1000 driven cards are multi-queue, so while the above 
shows that interrupts from eth0 have been processed on both CPUs at 
various points in the past, it doesn't necessarily mean that they are 
being processed on both CPUs at the same time right?

rick jones

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 18:41 ` Rick Jones
@ 2008-01-10 19:01   ` Kok, Auke
  0 siblings, 0 replies; 36+ messages in thread
From: Kok, Auke @ 2008-01-10 19:01 UTC (permalink / raw)
  To: Rick Jones; +Cc: Chris Friesen, netdev, linux-kernel

Rick Jones wrote:
>> 1) Interrupts are being processed on both cpus:
>>
>> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
>>            CPU0       CPU1
>>  30:    1703756    4530785  U3-MPIC Level     eth0
> 
> IIRC none of the e1000 driven cards are multi-queue

the pci-express variants are, but the functionality is almost always disabled (and
relatively new anyway).

even with multiqueue, you can still have only a single irq line (which of
course mostly defeats the purpose).

>, so while the above 
> shows that interrupts from eth0 have been processed on both CPUs at
> various points in the past, it doesn't necessarily mean that they are
> being processed on both CPUs at the same time right?

never will; an irq can only be processed on one cpu at a time anyway. obviously
the irq here has been migrated ONCE from one of the cpus to the other.
unfortunately you can't see from /proc/interrupts whether this happens frequently
or not, or how many times it happened before.

Auke

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 18:25 ` James Chapman
@ 2008-01-10 21:29   ` Chris Friesen
  0 siblings, 0 replies; 36+ messages in thread
From: Chris Friesen @ 2008-01-10 21:29 UTC (permalink / raw)
  To: James Chapman; +Cc: netdev, linux-kernel

James Chapman wrote:

> What's changed in your application? Any real-time threads in there?
> 
> From the top output below, looks like SigtranServices is consuming all
> your CPU...

There are two cpus, and SigtranServices is multithreaded with many 
threads.  Most of these threads are affined to cpu0, a couple to cpu1. 
None of the threads are realtime.

Top is showing 37% idle on cpu0, and 6% idle on cpu1, so not all the cpu 
is being consumed.  However, I'm wondering if we're hitting bursty bits 
and we're just running out of time.

I'm going to try a system with MAX_SOFTIRQ_RESTART bumped up a bit, and 
also enable profiling.
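
For anyone not familiar with it, MAX_SOFTIRQ_RESTART is the loop bound in
kernel/softirq.c that decides how many back-to-back passes __do_softirq()
makes before deferring whatever is still pending to ksoftirqd.  Roughly, as
a paraphrased sketch rather than the literal source:

	int max_restart = MAX_SOFTIRQ_RESTART;	/* 10 by default */
restart:
	/* ... run each pending softirq handler (NET_RX_SOFTIRQ etc) ... */
	if (local_softirq_pending() && --max_restart)
		goto restart;
	if (local_softirq_pending())
		wakeup_softirqd();	/* leftovers go to ksoftirqd */

The hope is that a bigger bound keeps NET_RX_SOFTIRQ draining the rx ring
in softirq context instead of handing the leftovers to ksoftirqd, which then
has to compete with our userspace threads for cpu time.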

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 17:24 questions on NAPI processing latency and dropped network packets Chris Friesen
                   ` (2 preceding siblings ...)
  2008-01-10 18:41 ` Rick Jones
@ 2008-01-11  1:20 ` David Miller
  2008-01-11 14:59   ` Chris Friesen
  2008-01-12  5:37 ` Ray Lee
  2008-01-21 19:53 ` Chris Friesen
  5 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2008-01-11  1:20 UTC (permalink / raw)
  To: cfriesen; +Cc: netdev, linux-kernel

From: "Chris Friesen" <cfriesen@nortel.com>
Date: Thu, 10 Jan 2008 11:24:19 -0600

> I've got an issue that's popped up with a deployed system running 
> 2.6.10.
 ...
> So...anyone have any ideas/suggestions?

You have to be kidding, coming here for help with a nearly
4 year old kernel.

The networking stack as well as the e1000 driver has been
practically rewritten several times over since then.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-11  1:20 ` David Miller
@ 2008-01-11 14:59   ` Chris Friesen
  2008-01-11 22:29     ` Herbert Xu
  2008-01-12  1:53     ` David Miller
  0 siblings, 2 replies; 36+ messages in thread
From: Chris Friesen @ 2008-01-11 14:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel

David Miller wrote:

> You have to be kidding, coming here for help with a nearly
> 4 year old kernel.

I figured it couldn't hurt to ask...if I can't ask the original authors, 
who else is there?

I'd love to work on newer kernels, but we have a commitment to our 
customers to support multiple releases for a significant amount of time.

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-11 14:59   ` Chris Friesen
@ 2008-01-11 22:29     ` Herbert Xu
  2008-01-12  1:53     ` David Miller
  1 sibling, 0 replies; 36+ messages in thread
From: Herbert Xu @ 2008-01-11 22:29 UTC (permalink / raw)
  To: Chris Friesen; +Cc: davem, netdev, linux-kernel

Chris Friesen <cfriesen@nortel.com> wrote:
>
> I'd love to work on newer kernels, but we have a commitment to our 
> customers to support multiple releases for a significant amount of time.

Since you've made the commitment, you should stick to it and resolve
the issues without asking us to contribute.  After all we haven't made
that commitment to you or your customers.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-11 14:59   ` Chris Friesen
  2008-01-11 22:29     ` Herbert Xu
@ 2008-01-12  1:53     ` David Miller
  2008-01-14 15:58       ` Chris Friesen
  1 sibling, 1 reply; 36+ messages in thread
From: David Miller @ 2008-01-12  1:53 UTC (permalink / raw)
  To: cfriesen; +Cc: netdev, linux-kernel

From: "Chris Friesen" <cfriesen@nortel.com>
Date: Fri, 11 Jan 2008 08:59:26 -0600

> I'd love to work on newer kernels, but we have a commitment to our 
> customers to support multiple releases for a significant amount of time.

And by asking here for people to dig into it for you, you are asking
people for free help providing that support.

That's why there is such negative backlash to asking questions about
such an ancient kernel here; you're asking us to do your work, for free.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 17:24 questions on NAPI processing latency and dropped network packets Chris Friesen
                   ` (3 preceding siblings ...)
  2008-01-11  1:20 ` David Miller
@ 2008-01-12  5:37 ` Ray Lee
  2008-01-14 15:49   ` Chris Friesen
  2008-01-21 19:53 ` Chris Friesen
  5 siblings, 1 reply; 36+ messages in thread
From: Ray Lee @ 2008-01-12  5:37 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel

On Jan 10, 2008 9:24 AM, Chris Friesen <cfriesen@nortel.com> wrote:
> After a recent userspace app change, we've started seeing packets being
> dropped by the ethernet hardware (e1000, NAPI is enabled).  The
> error/dropped/fifo counts are going up in ethtool:

(These are perhaps too obvious, but I didn't see the questions or
answers in the thread.)

Can you reproduce it with a simple userspace cpu hog? (Two, really,
one per cpu.)
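
Something as dumb as this should do (untested sketch; pin one copy to each
cpu with taskset or similar):

	/* trivial busy-loop cpu hog, run one instance per cpu */
	int main(void)
	{
		for (;;)
			;
		return 0;
	}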

Can you reproduce it with the newer e1000?

Can you reproduce it with git head?

If the answer to the first one is yes, the last no, then bisect until
you get a kernel that doesn't show the problem. Backport the fix,
unless the fix happens to be CFS. However, I suspect that your
userspace app is just starving the system from time to time.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-12  5:37 ` Ray Lee
@ 2008-01-14 15:49   ` Chris Friesen
  2008-01-14 16:56     ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Chris Friesen @ 2008-01-14 15:49 UTC (permalink / raw)
  To: Ray Lee; +Cc: netdev, linux-kernel

Ray Lee wrote:
> On Jan 10, 2008 9:24 AM, Chris Friesen <cfriesen@nortel.com> wrote:

>>After a recent userspace app change, we've started seeing packets being
>>dropped by the ethernet hardware (e1000, NAPI is enabled).  The
>>error/dropped/fifo counts are going up in ethtool:

> Can you reproduce it with a simple userspace cpu hog? (Two, really,
> one per cpu.)
> Can you reproduce it with the newer e1000?

Hmm...good questions and I haven't checked either.  The first one is 
relatively straightforward.  The second is a bit trickier...last time I 
tried the latest e1000 driver the card wouldn't boot (we use netboot).

> Can you reproduce it with git head?

Unfortunately, I don't think I'll be able to try this.  We require 
kernel mods for our userspace to run, and I doubt I'd be able to get the 
time to port all the changes forward to git head.

> If the answer to the first one is yes, the last no, then bisect until
> you get a kernel that doesn't show the problem. Backport the fix,
> unless the fix happens to be CFS. However, I suspect that your
> userspace app is just starving the system from time to time.

It's conceivable that userspace is starving the kernel, but we do have 
about 45% idle on one cpu, and 7-10% idle on the other.

We also have an odd situation where on an initial test run after bootup 
we have 18-24% idle on cpu1, but resetting the test tool drops that to 
the 7-10% I mentioned above.

Based on profiling and instrumentation it seems like the cost of 
sctp_endpoint_lookup_assoc() more than triples, which means that the 
amount of time that bottom halves are disabled in that function also 
triples.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-12  1:53     ` David Miller
@ 2008-01-14 15:58       ` Chris Friesen
  2008-01-15  7:19         ` Jarek Poplawski
  0 siblings, 1 reply; 36+ messages in thread
From: Chris Friesen @ 2008-01-14 15:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel

David Miller wrote:
> From: "Chris Friesen" <cfriesen@nortel.com>
> Date: Fri, 11 Jan 2008 08:59:26 -0600

>>I'd love to work on newer kernels, but we have a commitment to our 
>>customers to support multiple releases for a significant amount of time.

> And by asking here for people to dig into it for you, you are asking
> people for free help providing that support.

I'm not asking people to spend significant amounts of time...more like 
if anyone has any ideas "off the top of their heads".

> That's why there is such negative backlash to asking questions about
> such ancient kernel here, you're asking us to do your work, for free.

I hadn't realized that you felt this strongly about asking questions 
regarding old kernels.

How close to bleeding edge do we need to be for it to be considered 
acceptable to ask questions on netdev?

Given that the embedded space tends to be perpetually stuck on older 
kernels (our "current" release is based on 2.6.14) do you have any 
suggestion on how we can obtain information that would be available on 
netdev if we were using newer kernels?

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-14 15:49   ` Chris Friesen
@ 2008-01-14 16:56     ` Eric Dumazet
  2008-01-14 19:25       ` Chris Friesen
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2008-01-14 16:56 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Ray Lee, netdev, linux-kernel

Chris Friesen a écrit :
> Ray Lee wrote:
>> On Jan 10, 2008 9:24 AM, Chris Friesen <cfriesen@nortel.com> wrote:
>
>>> After a recent userspace app change, we've started seeing packets being
>>> dropped by the ethernet hardware (e1000, NAPI is enabled).  The
>>> error/dropped/fifo counts are going up in ethtool:
>
>> Can you reproduce it with a simple userspace cpu hog? (Two, really,
>> one per cpu.)
>> Can you reproduce it with the newer e1000?
>
> Hmm...good questions and I haven't checked either.  The first one is 
> relatively straightforward.  The second is a bit trickier...last time 
> I tried the latest e1000 driver the card wouldn't boot (we use netboot).
>
>> Can you reproduce it with git head?
>
> Unfortunately, I don't think I'll be able to try this.  We require 
> kernel mods for our userspace to run, and I doubt I'd be able to get 
> the time to port all the changes forward to git head.
>
>> If the answer to the first one is yes, the last no, then bisect until
>> you get a kernel that doesn't show the problem. Backport the fix,
>> unless the fix happens to be CFS. However, I suspect that your
>> userspace app is just starving the system from time to time.
>
> It's conceivable that userspace is starving the kernel, but we do have 
> about 45% idle on one cpu, and 7-10% idle on the other.
>
> We also have an odd situation where on an initial test run after 
> bootup we have 18-24% idle on cpu1, but resetting the test tool drops 
> that to the 7-10% I mentioned above.
>
> Based on profiling and instrumentation it seems like the cost of 
> sctp_endpoint_lookup_assoc() more than triples, which means that the 
> amount of time that bottom halves are disabled in that function also 
> triples.
Any idea of the sctp hash size you have?
(your dmesg probably includes a message starting with "SCTP: Hash tables
configured...")

How many concurrent sctp sockets are handled ?

Maybe sctp_assoc_hashfn() is too weak for your use, and some chains are 
*really* long.





^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-14 16:56     ` Eric Dumazet
@ 2008-01-14 19:25       ` Chris Friesen
  2008-01-14 19:33         ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Chris Friesen @ 2008-01-14 19:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ray Lee, netdev, linux-kernel

Eric Dumazet wrote:
> Chris Friesen a écrit :

>> Based on profiling and instrumentation it seems like the cost of 
>> sctp_endpoint_lookup_assoc() more than triples, which means that the 
>> amount of time that bottom halves are disabled in that function also 
>> triples.
> 
> Any idea of the size of sctp hash size you have ?
> (your dmesg probably includes a message starting with SCTP: Hash tables 
> configured...
> How many concurrent sctp sockets are handled ?

Our lab is currently rebooting, but I'll try and get this once it's back up.

> Maybe sctp_assoc_hashfn() is too weak for your use, and some chains are 
> *really* long.

Based on the profiling information we're spending time in 
sctp_endpoint_lookup_assoc() which doesn't actually use hashes, so I 
can't see how the hash would be related.  I'm pretty new to SCTP though, 
so I may be missing something.

Here are the top results from readprofile; unfortunately these are 
aggregated across both cpus so they don't really show what's going on. 
The key thing is that sctp_endpoint_lookup_assoc() is the most expensive 
kernel routine on this entire system.

   3147 .power4_idle                              22.4786
   1712 .native_idle                              20.3810
   1234 .sctp_endpoint_lookup_assoc                2.1725
   1212 ._spin_unlock_irqrestore                   6.4468
    778 .do_futex                                  0.3791
    447 ._spin_unlock_irq                          4.2981
    313 .fget                                      1.7784
    277 .fput                                      3.8472
    275 .kfree                                     0.7473
    234 .__kmalloc                                 0.5571
    131 SystemCall_common                          0.3411
    130 .sctp_assoc_is_match                       0.6373
    123 .lock_sock                                 0.4155
    119 .find_vma                                  0.6919
    116 .kmem_cache_alloc                          0.3580
    111 .kmem_cache_free                           0.3343
    106 .skb_release_data                          0.4907
    102 .__copy_tofrom_user                        0.0724
    100 .exit_elf_binfmt                           1.9231
    100 .do_select                                 0.0820
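
(For reference, that's the output of something like
"readprofile -m System.map | sort -rn | head"; the first column is raw
profiling ticks and the last column is those ticks normalized by the size
of the function.)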


Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-14 19:25       ` Chris Friesen
@ 2008-01-14 19:33         ` Eric Dumazet
  2008-01-14 20:02           ` Chris Friesen
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2008-01-14 19:33 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Ray Lee, netdev, linux-kernel

Chris Friesen a écrit :
> Eric Dumazet wrote:
>> Chris Friesen a écrit :
>
>>> Based on profiling and instrumentation it seems like the cost of 
>>> sctp_endpoint_lookup_assoc() more than triples, which means that the 
>>> amount of time that bottom halves are disabled in that function also 
>>> triples.
>>
>> Any idea of the size of sctp hash size you have ?
>> (your dmesg probably includes a message starting with SCTP: Hash 
>> tables configured...
>> How many concurrent sctp sockets are handled ?
>
> Our lab is currently rebooting, but I'll try and get this once it's 
> back up.
>
>> Maybe sctp_assoc_hashfn() is too weak for your use, and some chains 
>> are *really* long.
>
> Based on the profiling information we're spending time in 
> sctp_endpoint_lookup_assoc() which doesn't actually use hashes, so I 
> can't see how the hash would be related.  I'm pretty new to SCTP 
> though, so I may be missing something.
Well, it does use hashes :)

        hash = sctp_assoc_hashfn(ep->base.bind_addr.port, rport);
        head = &sctp_assoc_hashtable[hash];
        read_lock(&head->lock);
        sctp_for_each_hentry(epb, node, &head->chain) {
            /* maybe your machine is traversing here a *really* long chain */
            }

>
> Here's the top results from readprofile, unfortunately these are 
> aggregated across both cpus so they don't really show what's going on. 
> The key thing is that sctp_endpoint_lookup_assoc() is the most 
> expensive kernel routine on this entire system.
>
>   3147 .power4_idle                              22.4786
>   1712 .native_idle                              20.3810
>   1234 .sctp_endpoint_lookup_assoc                2.1725

>   1212 ._spin_unlock_irqrestore                   6.4468
>    778 .do_futex                                  0.3791
>    447 ._spin_unlock_irq                          4.2981
>    313 .fget                                      1.7784
>    277 .fput                                      3.8472
>    275 .kfree                                     0.7473
>    234 .__kmalloc                                 0.5571
>    131 SystemCall_common                          0.3411
>    130 .sctp_assoc_is_match                       0.6373
>    123 .lock_sock                                 0.4155
>    119 .find_vma                                  0.6919
>    116 .kmem_cache_alloc                          0.3580
>    111 .kmem_cache_free                           0.3343
>    106 .skb_release_data                          0.4907
>    102 .__copy_tofrom_user                        0.0724
>    100 .exit_elf_binfmt                           1.9231
>    100 .do_select                                 0.0820
>
>
> Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-14 19:33         ` Eric Dumazet
@ 2008-01-14 20:02           ` Chris Friesen
  2008-01-15 15:09             ` Vlad Yasevich
  0 siblings, 1 reply; 36+ messages in thread
From: Chris Friesen @ 2008-01-14 20:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ray Lee, netdev, linux-kernel

Eric Dumazet wrote:
> Chris Friesen a écrit :

>> Based on the profiling information we're spending time in 
>> sctp_endpoint_lookup_assoc() which doesn't actually use hashes, so I 
>> can't see how the hash would be related.  I'm pretty new to SCTP 
>> though, so I may be missing something.
> 
> Well, it does use hashes :)
> 
>        hash = sctp_assoc_hashfn(ep->base.bind_addr.port, rport);
>        head = &sctp_assoc_hashtable[hash];
>        read_lock(&head->lock);
>        sctp_for_each_hentry(epb, node, &head->chain) {
>            /* maybe your machine is traversing here a *really* long 
> chain */
>            }


The latest released kernel doesn't have this code; it was only added in 
November.  The SCTP maintainer just pointed me to the patch, and made 
some other suggestions as well.

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-14 15:58       ` Chris Friesen
@ 2008-01-15  7:19         ` Jarek Poplawski
  2008-01-15 14:47           ` Chris Friesen
  0 siblings, 1 reply; 36+ messages in thread
From: Jarek Poplawski @ 2008-01-15  7:19 UTC (permalink / raw)
  To: Chris Friesen; +Cc: David Miller, netdev, linux-kernel

On 14-01-2008 16:58, Chris Friesen wrote:
...
> How close to bleeding edge do we need to be for it to be considered 
> acceptable to ask questions on netdev?
> 
> Given that the embedded space tends to be perpetually stuck on older 
> kernels (our "current" release is based on 2.6.14) do you have any 
> suggestion on how we can obtain information that would be available on 
> netdev if we were using newer kernels?

IMHO, checking this with a current stable, which you are probably going
to do some day anyway, should be 100% acceptable: giving some input to
netdev, while still working for yourself.

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-15  7:19         ` Jarek Poplawski
@ 2008-01-15 14:47           ` Chris Friesen
  2008-01-15 15:17             ` Radoslaw Szkodzinski
  2008-01-15 20:29             ` Jarek Poplawski
  0 siblings, 2 replies; 36+ messages in thread
From: Chris Friesen @ 2008-01-15 14:47 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: David Miller, netdev, linux-kernel

Jarek Poplawski wrote:

> IMHO, checking this with a current stable, which probably you are going
> to do some day, anyway, should be 100% acceptable: giving some input to
> netdev, while still working for yourself.

While I would love to do this, it's not that simple.

Some of our hardware is not supported on mainline, so we need per-kernel 
version patches to even bring up the blade.  The blades netboot via a 
jumbo-frame network, so kernel extensions are needed to handle setting 
the MTU before mounting the rootfs.  The blade in question uses CKRM 
which doesn't exist for newer kernels so the problem may simply be 
hidden by scheduling differences.  The userspace application uses other 
kernel features that are not in mainline (and likely some of them won't 
ever be in mainline--I know because I've tried).

Basically, the number of changes required for our environment makes it 
difficult to just boot up the latest kernel.

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-14 20:02           ` Chris Friesen
@ 2008-01-15 15:09             ` Vlad Yasevich
  0 siblings, 0 replies; 36+ messages in thread
From: Vlad Yasevich @ 2008-01-15 15:09 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Eric Dumazet, Ray Lee, netdev, linux-kernel

Chris Friesen wrote:
> Eric Dumazet wrote:
>> Chris Friesen a écrit :
> 
>>> Based on the profiling information we're spending time in 
>>> sctp_endpoint_lookup_assoc() which doesn't actually use hashes, so I 
>>> can't see how the hash would be related.  I'm pretty new to SCTP 
>>> though, so I may be missing something.
>>
>> Well, it does use hashes :)
>>
>>        hash = sctp_assoc_hashfn(ep->base.bind_addr.port, rport);
>>        head = &sctp_assoc_hashtable[hash];
>>        read_lock(&head->lock);
>>        sctp_for_each_hentry(epb, node, &head->chain) {
>>            /* maybe your machine is traversing here a *really* long 
>> chain */
>>            }
> 
> 
> The latest released kernel doesn't have this code, it was only added in 
> November.  The SCTP maintainer just pointed me to the patch, and made 
> some other suggestions as well.
> 

Yes,  the hash code only got added to 2.6.24.  Before that, it was
a linear list traversal which sucked.
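
Roughly, the old lookup did something like this (paraphrased sketch, not
the literal 2.6.10 source):

	/* walk every association hanging off the endpoint, with bottom
	 * halves disabled for the whole traversal */
	list_for_each(pos, &ep->asocs) {
		asoc = list_entry(pos, struct sctp_association, asocs);
		if (rport == asoc->peer.port) {
			*transport = sctp_assoc_lookup_paddr(asoc, paddr);
			if (*transport)
				return asoc;
		}
	}
	return NULL;

so the cost grows linearly with the number of associations on the endpoint,
which would line up with the profile numbers earlier in the thread.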

I need to take a look at the locking to see if we can further reduce
the time that bottom half is disabled.

-vlad

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-15 14:47           ` Chris Friesen
@ 2008-01-15 15:17             ` Radoslaw Szkodzinski
  2008-01-15 17:14               ` Chris Friesen
  2008-01-15 20:29             ` Jarek Poplawski
  1 sibling, 1 reply; 36+ messages in thread
From: Radoslaw Szkodzinski @ 2008-01-15 15:17 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Jarek Poplawski, David Miller, netdev, linux-kernel


On Tue, 15 Jan 2008 08:47:07 -0600
"Chris Friesen" <cfriesen@nortel.com> wrote:

> Jarek Poplawski wrote:
> 
> > IMHO, checking this with a current stable, which probably you are going
> > to do some day, anyway, should be 100% acceptable: giving some input to
> > netdev, while still working for yourself.
> 
> While I would love to do this, it's not that simple.
> 
> Some of our hardware is not supported on mainline, so we need per-kernel 
> version patches to even bring up the blade.  The blades netboot via a 
> jumbo-frame network, so kernel extensions are needed to handle setting 
> the MTU before mounting the rootfs.

Why? Can't you use a small initramfs to set it up?

> The blade in question uses CKRM 
> which doesn't exist for newer kernels so the problem may simply be 
> hidden by scheduling differences.

Current spiritual successor is Ingo's realtime patchset I guess.

> The userspace application uses other 
> kernel features that are not in mainline (and likely some of them won't 
> ever be in mainline--I know because I've tried).

What would these be? Some proc or sysfs files that were removed or
renamed?
Maybe they can be worked around w/o changing the application at all, or
very minor changes.

Also, be sure to check if there are pauses with other CPU hogs.

> Basically, the number of changes required for our environment makes it 
> difficult to just boot up the latest kernel.

Yes, adding an initramfs is non-trivial (some busybox + shell
scripting), but not all that hard either.
And maintaining one is easier than patching the kernel.
(usually 0 maintenance, maybe minor fixes sometimes)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-15 15:17             ` Radoslaw Szkodzinski
@ 2008-01-15 17:14               ` Chris Friesen
  2008-01-15 17:23                 ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Chris Friesen @ 2008-01-15 17:14 UTC (permalink / raw)
  To: Radoslaw Szkodzinski (AstralStorm); +Cc: Jarek Poplawski, netdev, linux-kernel

Radoslaw Szkodzinski (AstralStorm) wrote:
> On Tue, 15 Jan 2008 08:47:07 -0600
> "Chris Friesen" <cfriesen@nortel.com> wrote:

>>Some of our hardware is not supported on mainline, so we need per-kernel 
>>version patches to even bring up the blade.  The blades netboot via a 
>>jumbo-frame network, so kernel extensions are needed to handle setting 
>>the MTU before mounting the rootfs.

> Why? Can't you use a small initramfs to set it up?

Sure, if we changed our build system, engineered a suitable small 
userspace, etc.  At this point it's easier to maintain a small patch to 
the kernel dhcp parsing code so that we can specify mtu.

>>The blade in question uses CKRM 
>>which doesn't exist for newer kernels so the problem may simply be 
>>hidden by scheduling differences.

> Current spiritual successor is Ingo's realtime patchset I guess.

I think the group scheduling stuff for CFS is closer, but there are 
design and API differences that would require us to rework our system.

>>The userspace application uses other 
>>kernel features that are not in mainline (and likely some of them won't 
>>ever be in mainline--I know because I've tried).

> What would these be? Some proc or sysfs files that were removed or
> renamed?
> Maybe they can be worked around w/o changing the application at all, or
> very minor changes.

No, more than proc/sysfs.  Things like notification of process state 
change (think like SIGCHLD but sent to arbitrary processes), additional 
messaging protocol families, oom-killer protection, dirty page 
monitoring, backwards compatibility for "dcbz" on the ppc970, nested 
alternate signal stacks, and many others.  Between our kernel vendor's 
patches and our own, there are something like 600 patches applied to the 
mainline kernel.

> Also, be sure to check if there are pauses with other CPU hogs.

With the sctp hash patch applied we're now sitting with 45% cpu free on 
one cpu, and about 25% free on the other, and we're still seeing 
periodic bursts of rx fifo loss.  It's weird.  Still trying to figure 
out what's going on.

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-15 17:14               ` Chris Friesen
@ 2008-01-15 17:23                 ` Eric Dumazet
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2008-01-15 17:23 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Radoslaw Szkodzinski (AstralStorm),
	Jarek Poplawski, netdev, linux-kernel

On Tue, 15 Jan 2008 11:14:25 -0600
"Chris Friesen" <cfriesen@nortel.com> wrote:

> Radoslaw Szkodzinski (AstralStorm) wrote:
> > On Tue, 15 Jan 2008 08:47:07 -0600
> > "Chris Friesen" <cfriesen@nortel.com> wrote:
> 
> >>Some of our hardware is not supported on mainline, so we need per-kernel 
> >>version patches to even bring up the blade.  The blades netboot via a 
> >>jumbo-frame network, so kernel extensions are needed to handle setting 
> >>the MTU before mounting the rootfs.
> 
> > Why? Can't you use a small initramfs to set it up?
> 
> Sure, if we changed our build system, engineered a suitable small 
> userspace, etc.  At this point it's easier to maintain a small patch to 
> the kernel dhcp parsing code so that we can specify mtu.
> 
> >>The blade in question uses CKRM 
> >>which doesn't exist for newer kernels so the problem may simply be 
> >>hidden by scheduling differences.
> 
> > Current spiritual successor is Ingo's realtime patchset I guess.
> 
> I think the group scheduling stuff for CFS is closer, but there are 
> design and API differences that would require us to rework our system.
> 
> >>The userspace application uses other 
> >>kernel features that are not in mainline (and likely some of them won't 
> >>ever be in mainline--I know because I've tried).
> 
> > What would these be? Some proc or sysfs files that were removed or
> > renamed?
> > Maybe they can be worked around w/o changing the application at all, or
> > very minor changes.
> 
> No, more than proc/sysfs.  Things like notification of process state 
> change (think like SIGCHLD but sent to arbitrary processes), additional 
> messaging protocol families, oom-killer protection, dirty page 
> monitoring, backwards compatibility for "dcbz" on the ppc970, nested 
> alternate signal stacks, and many others.  Between our kernel vendor's 
> patches and our own, there are something like 600 patches applied to the 
> mainline kernel.
> 
> > Also, be sure to check if there are pauses with other CPU hogs.
> 
> With the sctp hash patch applied we're now sitting with 45% cpu free on 
> one cpu, and about 25% free on the other, and we're still seeing 
> periodic bursts of rx fifo loss.  It's wierd.  Still trying to figure 
> out what's going on.

On old kernels, some softirq functions can hog the cpu for quite a long time.

NIC RX rings can fill up and eventually losses occur.

Recent kernels have tried to move some time-consuming tasks from softirq to
workqueue context.

Examples:

http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.25.git;a=commit;h=29b7e29480029c896e7782ce2a45843abeea2f7a

http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.25.git;a=commit;h=d90bf5a976793edfa88d3bb2393f0231eb8ce1e5

http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.25.git;a=commit;h=39c90ece7565f5c47110c2fa77409d7a9478bd5b

http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.25.git;a=commit;h=86bba269d08f0c545ae76c90b56727f65d62d57f

...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-15 14:47           ` Chris Friesen
  2008-01-15 15:17             ` Radoslaw Szkodzinski
@ 2008-01-15 20:29             ` Jarek Poplawski
  2008-01-16  0:17               ` Herbert Xu
  1 sibling, 1 reply; 36+ messages in thread
From: Jarek Poplawski @ 2008-01-15 20:29 UTC (permalink / raw)
  To: Chris Friesen; +Cc: David Miller, netdev, linux-kernel

On Tue, Jan 15, 2008 at 08:47:07AM -0600, Chris Friesen wrote:
> Jarek Poplawski wrote:
>
>> IMHO, checking this with a current stable, which probably you are going
>> to do some day, anyway, should be 100% acceptable: giving some input to
>> netdev, while still working for yourself.
>
> While I would love to do this, it's not that simple.

...Hmm... As a matter of fact, I expected you'd treat my point less
literally... Of course, I know it can sometimes be very hard to get
something working even after upgrading one version, let alone several
at once.

So, it was more a rhetorical trick (sorry!) to suggest that such a
business model of always being late with kernels might be quite
practical and reasonable for many companies, but looks like the
worst possible development model for Linux.

On the other hand, it seems there are not so many, nor such expensive,
changes needed (a bit more forward thinking?) to make everybody
happy...

Jarek P.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-15 20:29             ` Jarek Poplawski
@ 2008-01-16  0:17               ` Herbert Xu
  2008-01-16  6:58                 ` Jarek Poplawski
  0 siblings, 1 reply; 36+ messages in thread
From: Herbert Xu @ 2008-01-16  0:17 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: cfriesen, davem, netdev, linux-kernel

Jarek Poplawski <jarkao2@gmail.com> wrote:
>
> So, it was more a rhetorical trick (sorry!) to suggest, that such a
> business model of being always late with kernels might be quite
> practical and reasonable for many companies, but looks like the
> worst possible development model for Linux.

Well people are always going to operate on this model for commercial
reasons.  FWIW I used to work for a company that stuck to a specific
version of the Linux kernel, and I suppose I still do even now :)

But the important thing is that if you're going to do that, then the
cost that comes with it should be borne by the company and not the
community.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-16  0:17               ` Herbert Xu
@ 2008-01-16  6:58                 ` Jarek Poplawski
  2008-01-16 20:04                   ` Willy Tarreau
  0 siblings, 1 reply; 36+ messages in thread
From: Jarek Poplawski @ 2008-01-16  6:58 UTC (permalink / raw)
  To: Herbert Xu; +Cc: cfriesen, davem, netdev, linux-kernel

On Wed, Jan 16, 2008 at 11:17:08AM +1100, Herbert Xu wrote:
...
> Well people are always going to operate on this model for commercial
> reasons.  FWIW I used to work for a company that stuck to a specific
> version of the Linux kernel, and I suppose I still do even now :)
> 
> But the important thing is that if you're going to do that, then the
> cost that comes with it should be borne by the company and not the
> community.

Sure. But the saddest thing is there don't seem to be that many savings
in this (unless a company isn't sure of its near future). Trying to
upgrade and test current products with current kernels, even if not
strictly necessary, should always be useful and make development of new
products faster and a better fit (and of course, BTW, make the kernel
better on time).

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-16  6:58                 ` Jarek Poplawski
@ 2008-01-16 20:04                   ` Willy Tarreau
  2008-01-16 22:42                     ` Jarek Poplawski
  0 siblings, 1 reply; 36+ messages in thread
From: Willy Tarreau @ 2008-01-16 20:04 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Herbert Xu, cfriesen, davem, netdev, linux-kernel

On Wed, Jan 16, 2008 at 07:58:36AM +0100, Jarek Poplawski wrote:
> On Wed, Jan 16, 2008 at 11:17:08AM +1100, Herbert Xu wrote:
> ...
> > Well people are always going to operate on this model for commercial
> > reasons.  FWIW I used to work for a company that stuck to a specific
> > version of the Linux kernel, and I suppose I still do even now :)
> > 
> > But the important thing is that if you're going to do that, then the
> > cost that comes with it should be borne by the company and not the
> > community.
> 
> Sure. But the most sad thing is there seems to be not so much savings
> in this (unless a company isn't sure of its near future). Trying to
> upgrade and test current products with current kernels, even if not
> necessary, should be always useful and make developing of new products
> faster and better fit (and of course, BTW, make the kernel better on
> time).

you can work with the latest release provided that you always have a fallback
to an earlier one. That way, you don't bet too much on something you don't
completely control. If it works, it tells you you'll be able to fully
exploit its new possibilities in the next product release, and if it breaks,
it's easy to issue a fix to get back to the earlier, well-tested version.

Regards,
Willy


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-16 20:04                   ` Willy Tarreau
@ 2008-01-16 22:42                     ` Jarek Poplawski
  0 siblings, 0 replies; 36+ messages in thread
From: Jarek Poplawski @ 2008-01-16 22:42 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Herbert Xu, cfriesen, davem, netdev, linux-kernel

On Wed, Jan 16, 2008 at 09:04:58PM +0100, Willy Tarreau wrote:
...
> you can work with latest release provided that you always have a fallback
> to an earlier one. That way, you don't bet too much on something you don't
> completely control. If it works, it tells you you'll be able to completely
> exploit its new possibilities in next product release, and if it breaks,
> it's easy to issue a fix to get back to earlier, well-tested version.

Of course, this way looks preferable, but it is sometimes maybe too costly,
especially with some complex systems. Actually, I don't even think
this has to be fully (production-ready) implemented or workable.
It would probably even be enough to maintain some simplified kind
of test checking how current kernel changes could affect such a system,
and how new versions of this system are better, BTW.

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-10 17:24 questions on NAPI processing latency and dropped network packets Chris Friesen
                   ` (4 preceding siblings ...)
  2008-01-12  5:37 ` Ray Lee
@ 2008-01-21 19:53 ` Chris Friesen
  2008-01-21 21:11   ` Ben Greear
  2008-01-21 21:31   ` Eric Dumazet
  5 siblings, 2 replies; 36+ messages in thread
From: Chris Friesen @ 2008-01-21 19:53 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

I've done some further digging, and it appears that one of the problems 
we may be facing is very high instantaneous traffic rates.

Instrumentation showed up to 222K packets/sec for short periods (at 
least 1.1 ms, possibly longer), although the long-term average is down 
around 14-16K packets/sec.

If I bump the rx ring size up to 4096, we can handle all the packets and 
we still have 44% idle on cpu0 and 27% idle on cpu1.
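
(Back-of-the-envelope: at 222K packets/sec the default 256-descriptor ring
only covers 256/222000, or roughly 1.2 ms, of traffic, while 4096
descriptors cover somewhere around 18 ms, which presumably is why the
bigger ring rides out the bursts.)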

Is there anything else we can do to minimize the latency of network 
packet processing and avoid having to crank the rx ring size up so high?

Thanks,

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-21 19:53 ` Chris Friesen
@ 2008-01-21 21:11   ` Ben Greear
  2008-01-21 23:15     ` Chris Friesen
  2008-01-21 21:31   ` Eric Dumazet
  1 sibling, 1 reply; 36+ messages in thread
From: Ben Greear @ 2008-01-21 21:11 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel

Chris Friesen wrote:

> Is there anything else we can do to minimize the latency of network 
> packet processing and avoid having to crank the rx ring size up so high?

Why is it such a big deal to crank up the rx queue length?  Seems like
a perfectly normal way to handle bursts like this...

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-21 19:53 ` Chris Friesen
  2008-01-21 21:11   ` Ben Greear
@ 2008-01-21 21:31   ` Eric Dumazet
  2008-01-21 23:25     ` Chris Friesen
  1 sibling, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2008-01-21 21:31 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel

Chris Friesen wrote:
> I've done some further digging, and it appears that one of the problems 
> we may be facing is very high instantaneous traffic rates.
> 
> Instrumentation showed up to 222K packets/sec for short periods (at 
> least 1.1 ms, possibly longer), although the long-term average is down 
> around 14-16K packets/sec.

Instrumentation done where, exactly?

tcpdump on the host will lie, since the timestamps are probably host
timestamps, giving the time of RX dequeueing rather than the packet arrival
time seen by the NIC.

> 
> If I bump the rx ring size up to 4096, we can handle all the packets and 
> we still have 44% idle on cpu0 and 27% idle on cpu1.
> 
> Is there anything else we can do to minimize the latency of network 
> packet processing and avoid having to crank the rx ring size up so high?
> 


You have some tasks that disable softirqs too long. Sometimes, bumping RX ring 
size is OK (but you will still have delays), sometimes it is not an option, 
since 4096 is the limit on current hardware.
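
The typical shape of the offending code is something like this (a made-up
sketch; "struct entry" and recompute() are hypothetical):

#include <linux/spinlock.h>

static spinlock_t table_lock = SPIN_LOCK_UNLOCKED;

/* While this CPU holds a _bh lock (or has called local_bh_disable()),
 * NET_RX_SOFTIRQ cannot run locally, so the NIC keeps filling its RX
 * FIFO and ring for the whole duration of the loop below. */
static void rebuild_table(struct entry *tbl, int n)
{
	int i;

	spin_lock_bh(&table_lock);	/* softirqs now off on this CPU */
	for (i = 0; i < n; i++)
		recompute(&tbl[i]);	/* several ms when n is large */
	spin_unlock_bh(&table_lock);	/* pending softirqs run here */
}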


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-21 21:11   ` Ben Greear
@ 2008-01-21 23:15     ` Chris Friesen
  0 siblings, 0 replies; 36+ messages in thread
From: Chris Friesen @ 2008-01-21 23:15 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev, linux-kernel

Ben Greear wrote:
> Chris Friesen wrote:
> 
>> Is there anything else we can do to minimize the latency of network 
>> packet processing and avoid having to crank the rx ring size up so high?
> 
> 
> Why is it such a big deal to crank up the rx queue length?  Seems like
> a perfectly normal way to handle bursts like this...

It means that the latency for handling those packets is higher than it 
could be.  Draining 4096 packets from the queue will take a while.
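
Rough numbers, using the ~222K packets/sec burst rate measured earlier and
assuming the CPU drains the ring at roughly that same rate:

  time for a burst to fill an empty 4096-entry ring:
      4096 packets / 222,000 packets/sec  =  about 18 ms
  worst-case queueing delay for a packet arriving behind a full ring is of
  the same order.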

Ideally we'd like to bring the latency down as much as possible, and 
then bump up the rx queue length to handle the rest.

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-21 21:31   ` Eric Dumazet
@ 2008-01-21 23:25     ` Chris Friesen
  2008-01-22  5:46       ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Chris Friesen @ 2008-01-21 23:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel

Eric Dumazet wrote:
> Chris Friesen wrote:
> 
>> I've done some further digging, and it appears that one of the 
>> problems we may be facing is very high instantaneous traffic rates.
>>
>> Instrumentation showed up to 222K packets/sec for short periods (at 
>> least 1.1 ms, possibly longer), although the long-term average is down 
>> around 14-16K packets/sec.
> 
> 
> Instrumentation done where, exactly?

I added some code to e1000_clean_rx_irq() to track rx_fifo drops, total 
packets received, and an accurate timestamp.

If rx_fifo errors changed, it would dump the information.
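
Roughly along these lines (a simplified sketch rather than the real patch;
read_rx_missed() stands in for however the missed-packet count is read):

/* called from the end of e1000_clean_rx_irq() */
static void log_rx_drops(struct e1000_adapter *adapter, unsigned int cleaned)
{
	static unsigned long last_missed;
	unsigned long missed = read_rx_missed(adapter);	/* hypothetical helper */
	struct timeval tv;

	do_gettimeofday(&tv);
	if (missed != last_missed)
		printk(KERN_INFO "e1000: %lu usec, cleaned %u, missed %lu\n",
		       tv.tv_sec * 1000000UL + tv.tv_usec, cleaned, missed);
	last_missed = missed;
}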

>> Is there anything else we can do to minimize the latency of network 
>> packet processing and avoid having to crank the rx ring size up so high?

> You have some tasks that disable softirqs too long. Sometimes, bumping 
> RX ring size is OK (but you will still have delays), sometimes it is not 
> an option, since 4096 is the limit on current hardware.

I added some instrumentation to take timestamps in __do_softirq() as 
well.  Based on these timestamps, I can see the following code sequence:

2374604616 usec, start processing softirqs in __do_softirq()
2374610337 usec, log values in e1000_clean_rx_irq()
2374611411 usec, log values in e1000_clean_rx_irq()

In between the successive calls to e1000_clean_rx_irq() the rx_fifo 
counts went up.

Does anyone have any patchsets to track down what softirqs are taking a 
long time, and/or who's disabling softirqs?

Chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: questions on NAPI processing latency and dropped network packets
  2008-01-21 23:25     ` Chris Friesen
@ 2008-01-22  5:46       ` Eric Dumazet
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2008-01-22  5:46 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel

Chris Friesen wrote:
> Eric Dumazet wrote:
>> Chris Friesen wrote:
>>
>>> I've done some further digging, and it appears that one of the 
>>> problems we may be facing is very high instantaneous traffic rates.
>>>
>>> Instrumentation showed up to 222K packets/sec for short periods (at 
>>> least 1.1 ms, possibly longer), although the long-term average is 
>>> down around 14-16K packets/sec.
>>
>>
>> Instrumentation done where, exactly?
> 
> I added some code to e1000_clean_rx_irq() to track rx_fifo drops, total 
> packets received, and an accurate timestamp.
> 
> If rx_fifo errors changed, it would dump the information.
> 
>>> Is there anything else we can do to minimize the latency of network 
>>> packet processing and avoid having to crank the rx ring size up so high?
> 
>> You have some tasks that disable softirqs too long. Sometimes, bumping 
>> RX ring size is OK (but you will still have delays), sometimes it is 
>> not an option, since 4096 is the limit on current hardware.
> 
> I added some instrumentation to take timestamps in __do_softirq() as 
> well.  Based on these timestamps, I can see the following code sequence:
> 
> 2374604616 usec, start processing softirqs in __do_softirq()
> 2374610337 usec, log values in e1000_clean_rx_irq()
> 2374611411 usec, log values in e1000_clean_rx_irq()
> 
> In between the successive calls to e1000_clean_rx_irq() the rx_fifo 
> counts went up.
> 
> Does anyone have any patchsets to track down what softirqs are taking a 
> long time, and/or who's disabling softirqs?
> 

Not for linux-2.6.10, unfortunately.

Check net/ipv4/route.c, where many improvements can be made, especially if you
have a large rt cache:

grep . /proc/sys/net/ipv4/route/*
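
Also, to get an idea of how big the cache is and how hard its garbage
collection is working (generic 2.6-era commands, nothing specific to this
report):

  ip route show cache            # dump the cached entries
  cat /proc/net/stat/rt_cache    # per-CPU hit, miss and GC counters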


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2008-01-22  5:52 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-10 17:24 questions on NAPI processing latency and dropped network packets Chris Friesen
2008-01-10 17:37 ` Kok, Auke
2008-01-10 18:12   ` Chris Friesen
2008-01-10 18:26     ` Kok, Auke
2008-01-10 18:25 ` James Chapman
2008-01-10 21:29   ` Chris Friesen
2008-01-10 18:41 ` Rick Jones
2008-01-10 19:01   ` Kok, Auke
2008-01-11  1:20 ` David Miller
2008-01-11 14:59   ` Chris Friesen
2008-01-11 22:29     ` Herbert Xu
2008-01-12  1:53     ` David Miller
2008-01-14 15:58       ` Chris Friesen
2008-01-15  7:19         ` Jarek Poplawski
2008-01-15 14:47           ` Chris Friesen
2008-01-15 15:17             ` Radoslaw Szkodzinski
2008-01-15 17:14               ` Chris Friesen
2008-01-15 17:23                 ` Eric Dumazet
2008-01-15 20:29             ` Jarek Poplawski
2008-01-16  0:17               ` Herbert Xu
2008-01-16  6:58                 ` Jarek Poplawski
2008-01-16 20:04                   ` Willy Tarreau
2008-01-16 22:42                     ` Jarek Poplawski
2008-01-12  5:37 ` Ray Lee
2008-01-14 15:49   ` Chris Friesen
2008-01-14 16:56     ` Eric Dumazet
2008-01-14 19:25       ` Chris Friesen
2008-01-14 19:33         ` Eric Dumazet
2008-01-14 20:02           ` Chris Friesen
2008-01-15 15:09             ` Vlad Yasevich
2008-01-21 19:53 ` Chris Friesen
2008-01-21 21:11   ` Ben Greear
2008-01-21 23:15     ` Chris Friesen
2008-01-21 21:31   ` Eric Dumazet
2008-01-21 23:25     ` Chris Friesen
2008-01-22  5:46       ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).