All of lore.kernel.org
* Multicast packet loss
@ 2009-01-30 17:49 Kenny Chang
  2009-01-30 19:04 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Kenny Chang @ 2009-01-30 17:49 UTC (permalink / raw)
  To: netdev

Hi all,

We've been having some issues with multicast packet loss, we were wondering
if anyone knows anything about the behavior we're seeing.

Background: we use multicast messaging with lots of messages per sec for our
work. We recently transitioned many of our systems from an Ubuntu Dapper Drake
ia32 distribution to Ubuntu Hardy Heron x86_64. Since the transition, we've
noticed much more multicast packet loss, and we think it's related to the
transition. Our particular theory is that it's specifically a 32 vs 64-bit
issue.

We narrowed the problem down to the attached program (mcasttest.cc).  Run
"mcasttest server" on one machine -- it'll send 500,000 small messages
to a multicast group at 50,000 messages per second.  If we run "mcasttest client"
on another machine, it'll receive all those messages and print a count at the
end of how many messages it saw. It almost never loses any messages. However,
if we run 4 copies of the client on the same machine, receiving the same data,
then the programs usually see fewer than 500,000 messages. We're running with:

for i in $(seq 1 4); do (./mcasttest client &); done

We know this because the program prints a count, and the dropped packets also
show up in ifconfig's "RX packets" section.

Things we're curious about: do other people see similar problems?  The tests
we've done: we've tried this program on a bunch of different machines, all of
which are running either dapper ia32 or hardy x86_64. Uniformly, the dapper
machines have no problems but on certain machines, Hardy shows significant loss. 
We did some experiments on a troubled machine, varying the OS install, 
including mixed installations where the kernel was 64-bit and the userspace was
32-bit. This is what we found:

On machines that exhibit this problem, the ksoftirqd process seems to be 
pegged to 100% CPU when receiving packets.

Note: while we're on Ubuntu, we've tried this with other distros and have seen
similar results; we just haven't tabulated them.

> ----------------------------------------------------------------------------
> userland | userland arch | kernel           | kernel arch | mode           
> ----------------------------------------------------------------------------
> Dapper   |            32 | 2.6.15-28-server |          32 | no packet loss
> Dapper   |            32 | 2.6.22-generic   |          32 | no packet loss 
> Dapper   |            32 | 2.6.22-server    |          32 | no packet loss 
> Hardy    |            32 | 2.6.24-rt        |          32 | no packet loss
> Hardy    |            32 | 2.6.24-generic   |          32 | ~5% packet loss
> Hardy    |            32 | 2.6.24-server    |          32 | ~10% packet loss

> Hardy    |            32 | 2.6.22-server    |          64 | no packet loss
> Hardy    |            32 | 2.6.24-rt        |          64 | no packet loss
> Hardy    |            32 | 2.6.24-generic   |          64 | 14% packet loss
> Hardy    |            32 | 2.6.24-server    |          64 | 12% packet loss

> Hardy    |            64 | 2.6.22-vanilla   |          64 | packet loss
> Hardy    |            64 | 2.6.24-rt        |          64 | ~5% packet loss
> Hardy    |            64 | 2.6.24-server    |          64 | ~30% packet loss
> Hardy    |            64 | 2.6.24-generic   |          64 | ~5% packet loss
> ----------------------------------------------------------------------------

It's not exactly clear what the problem is, but Dapper shows no issues
regardless of what we try. For Hardy, userspace seems to matter:
the 2.6.24-rt kernel shows no packet loss for both 32- and 64-bit kernels, as long
as the userspace is 32-bit.

Kernel comments:
2.6.15-28-server: This is Ubuntu Dapper's stock kernel build.
2.6.24-*: This is Ubuntu Hardy's stock kernel.
2.6.22-{generic,server}: This is a custom, in-house kernel build, built for ia32.
2.6.22-vanilla: This is our custom, in-house kernel build, built for x86_64.

We don't think it's related to our custom kernels, because the same phenomena
show up with the Ubuntu stock kernels.

Hardware:

The benchmark machine we've been using is a dual-CPU, quad-core Intel Xeon E5440
@ 2.83GHz with Broadcom NetXtreme II BCM5708 (bnx2) networking.

We've also tried AMD machines, as well as machines with Tigon3
(part no. BCM95704A6) tg3 network cards; they all show consistent behavior.

Our hardy x86_64 server machines all appear to have this problem, new and old.

On the other hand, a desktop with an Intel Q6600 quad-core 2.4GHz and Intel 82566DC GigE
seems to work fine.

All of the dapper ia32 machines have no trouble, even our older hardware.


Thanks,
Kenny Chang
Athena Capital Research



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-30 17:49 Multicast packet loss Kenny Chang
@ 2009-01-30 19:04 ` Eric Dumazet
  2009-01-30 19:17 ` Denys Fedoryschenko
  2009-01-30 20:03 ` Neil Horman
  2 siblings, 0 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-01-30 19:04 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

Kenny Chang wrote:
> Hi all,
> 
> We've been having some issues with multicast packet loss, we were wondering
> if anyone knows anything about the behavior we're seeing.
> 
> Background: we use multicast messaging with lots of messages per sec for
> our
> work. We recently transitioned many of our systems from an Ubuntu Dapper
> Drake
> ia32 distribution to Ubuntu Hardy Heron x86_64. Since the transition, we've
> noticed much more multicast packet loss, and we think it's related to the
> transition. Our particular theory is that it's specifically a 32 vs 64-bit
> issue.
> 
> We narrowed the problem down to the attached program (mcasttest.cc).  Run
> "mcasttest server" on one machine -- it'll send 500,000 messages small
> message
> to a multicast group, 50,000 messages per second.  If we run "mcasttest
> client"
> on another machine, it'll receive all those messages and print a count
> at the
> end of how many messages it sees. It almost never loses any messages.
> However,
> if we run 4 copies of the client on the same machine, receiving the same
> data,
> then the programs usually sees fewer than 500,000 messages. We're
> running with:
> 
> for i in $(seq 1 4); do (./mcasttest client &); done
> 
> We know this because the program prints a count, but dropped packets also
> show up in ifconfig's "RX packets" section.
> 
> Things we're curious about: do other people see similar problems?  The
> tests
> we've done: we've tried this program on a bunch of different machines,
> all of
> which are running either dapper ia32 or hardy x86_64. Uniformly, the dapper
> machines have no problems but on certain machines, Hardy shows
> significant loss. We did some experiments on a troubled machine, varying
> the OS install, including mixed installations where the kernel was
> 64-bit and the userspace was
> 32-bit. This is what we found:
> 
> On machines that exhibit this problem, the ksoftirqd process seems to be
> pegged to 100% CPU when receiving packets.
> 
> Note: while we're on Ubuntu, we've tried this with other distros and
> have seen
> similar results, we just haven't tabulated them.
> 
>> ----------------------------------------------------------------------------
>> userland | userland arch | kernel           | kernel arch | mode
>> ----------------------------------------------------------------------------
>> Dapper   |            32 | 2.6.15-28-server |          32 | no packet loss
>> Dapper   |            32 | 2.6.22-generic   |          32 | no packet loss
>> Dapper   |            32 | 2.6.22-server    |          32 | no packet loss
>> Hardy    |            32 | 2.6.24-rt        |          32 | no packet loss
>> Hardy    |            32 | 2.6.24-generic   |          32 | ~5% packet loss
>> Hardy    |            32 | 2.6.24-server    |          32 | ~10% packet loss
> 
>> Hardy    |            32 | 2.6.22-server    |          64 | no packet loss
>> Hardy    |            32 | 2.6.24-rt        |          64 | no packet loss
>> Hardy    |            32 | 2.6.24-generic   |          64 | 14% packet loss
>> Hardy    |            32 | 2.6.24-server    |          64 | 12% packet loss
> 
>> Hardy    |            64 | 2.6.22-vanilla   |          64 | packet loss
>> Hardy    |            64 | 2.6.24-rt        |          64 | ~5% packet loss
>> Hardy    |            64 | 2.6.24-server    |          64 | ~30% packet loss
>> Hardy    |            64 | 2.6.24-generic   |          64 | ~5% packet loss
>> ----------------------------------------------------------------------------
> 
> It's not exactly clear what exactly the problem is but dapper shows no
> issues regardless of what we try. For hardy, userspace seem to matter:
> 2.6.24-rt kernel shows no packet loss for 32&64bit kernels, as long as
> the userspace is 32-bit.
> 
> Kernel comments:
> 2.6.15-28-server: This is Ubuntu Dapper's stock kernel build.
> 2.6.24-*: This is Ubuntu Hardy's stock kernel.
> 2.6.22-{generic,server}: This is a custom, in-house kernel build, built
> for ia32.
> 2.6.22-vanilla: This is our custom, in-house kernel build, built for
> x86_64.
> 
> We don't think it's related to our custom kernels, because the same
> phenomena
> show up with the Ubuntu stock kernels.
> 
> Hardware:
> 
> The benchmark machine We've been using is an Intel Xeon E5440 @2.83GHz
> dual-cpu quad-core with Broadcom NetXtreme II BCM5708 bnx2 networking.
> 
> We've also tried AMD machines, as well as machines with Tigon3
> partno(BCM95704A6) tg3 network cards, they all show consistent behavior.
> 
> Our hardy x86_64 server machines all appear to have this problem, new
> and old.
> 
> On the other hand, a desktop with Intel Q6600 quad core 2.4GHz and Intel
> 82566DC GigE
> seem to work fine.
> 
> All of the dapper ia32 machines have no trouble, even our older hardware.
> 
>

Hi Kenny

Interesting... You forgot to attach the mcasttest.cc program.

Any chance you could try a recent kernel (2.6.29-rcX)?

Could you post "cat /proc/interrupts" results (one for a working
setup, another for a non-working/dropping setup)?



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-30 17:49 Multicast packet loss Kenny Chang
  2009-01-30 19:04 ` Eric Dumazet
@ 2009-01-30 19:17 ` Denys Fedoryschenko
  2009-01-30 20:03 ` Neil Horman
  2 siblings, 0 replies; 70+ messages in thread
From: Denys Fedoryschenko @ 2009-01-30 19:17 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

On Friday 30 January 2009 19:49:48 Kenny Chang wrote:
> Hi all,
>
> We've been having some issues with multicast packet loss, we were wondering
> if anyone knows anything about the behavior we're seeing.

I didn't work much on multicast, but I have heavy unicast UDP streaming (PEP
for satellite).

First things to check:
net.core.wmem_max = 131071
net.core.wmem_default = 124928
net.ipv4.udp_mem = 379008       505344  758016

Usually they are too small by default.

Next:
netstat -s

The important part:
Udp:
    1263992126 packets received
    260196 packets to unknown port received.
    627001 packet receive errors
    74235906 packets sent
    RcvbufErrors: 56683
    SndbufErrors: 4295851


In your case, SndbufErrors matter.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-30 17:49 Multicast packet loss Kenny Chang
  2009-01-30 19:04 ` Eric Dumazet
  2009-01-30 19:17 ` Denys Fedoryschenko
@ 2009-01-30 20:03 ` Neil Horman
  2009-01-30 22:29   ` Kenny Chang
  2 siblings, 1 reply; 70+ messages in thread
From: Neil Horman @ 2009-01-30 20:03 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

On Fri, Jan 30, 2009 at 12:49:48PM -0500, Kenny Chang wrote:
> Hi all,
>
> We've been having some issues with multicast packet loss, we were wondering
> if anyone knows anything about the behavior we're seeing.
>
> Background: we use multicast messaging with lots of messages per sec for our
> work. We recently transitioned many of our systems from an Ubuntu Dapper Drake
> ia32 distribution to Ubuntu Hardy Heron x86_64. Since the transition, we've
> noticed much more multicast packet loss, and we think it's related to the
> transition. Our particular theory is that it's specifically a 32 vs 64-bit
> issue.
>
> We narrowed the problem down to the attached program (mcasttest.cc).  Run
> "mcasttest server" on one machine -- it'll send 500,000 messages small message
> to a multicast group, 50,000 messages per second.  If we run "mcasttest client"
> on another machine, it'll receive all those messages and print a count at the
> end of how many messages it sees. It almost never loses any messages. However,
> if we run 4 copies of the client on the same machine, receiving the same data,
> then the programs usually sees fewer than 500,000 messages. We're running with:
>
> for i in $(seq 1 4); do (./mcasttest client &); done
>
> We know this because the program prints a count, but dropped packets also
> show up in ifconfig's "RX packets" section.
>
> Things we're curious about: do other people see similar problems?  The tests
> we've done: we've tried this program on a bunch of different machines, all of
> which are running either dapper ia32 or hardy x86_64. Uniformly, the dapper
> machines have no problems but on certain machines, Hardy shows 
> significant loss. We did some experiments on a troubled machine, varying 
> the OS install, including mixed installations where the kernel was 64-bit 
> and the userspace was
> 32-bit. This is what we found:
>
> On machines that exhibit this problem, the ksoftirqd process seems to be  
> pegged to 100% CPU when receiving packets.
>
> Note: while we're on Ubuntu, we've tried this with other distros and have seen
> similar results, we just haven't tabulated them.
>
>> ----------------------------------------------------------------------------
>> userland | userland arch | kernel           | kernel arch | mode
>> ----------------------------------------------------------------------------
>> Dapper   |            32 | 2.6.15-28-server |          32 | no packet loss
>> Dapper   |            32 | 2.6.22-generic   |          32 | no packet loss
>> Dapper   |            32 | 2.6.22-server    |          32 | no packet loss
>> Hardy    |            32 | 2.6.24-rt        |          32 | no packet loss
>> Hardy    |            32 | 2.6.24-generic   |          32 | ~5% packet loss
>> Hardy    |            32 | 2.6.24-server    |          32 | ~10% packet loss
>
>> Hardy    |            32 | 2.6.22-server    |          64 | no packet loss
>> Hardy    |            32 | 2.6.24-rt        |          64 | no packet loss
>> Hardy    |            32 | 2.6.24-generic   |          64 | 14% packet loss
>> Hardy    |            32 | 2.6.24-server    |          64 | 12% packet loss
>
>> Hardy    |            64 | 2.6.22-vanilla   |          64 | packet loss
>> Hardy    |            64 | 2.6.24-rt        |          64 | ~5% packet loss
>> Hardy    |            64 | 2.6.24-server    |          64 | ~30% packet loss
>> Hardy    |            64 | 2.6.24-generic   |          64 | ~5% packet loss
>> ----------------------------------------------------------------------------
>
> It's not exactly clear what exactly the problem is but dapper shows no 
> issues regardless of what we try. For hardy, userspace seem to matter:  
> 2.6.24-rt kernel shows no packet loss for 32&64bit kernels, as long as 
> the userspace is 32-bit.
>
> Kernel comments:
> 2.6.15-28-server: This is Ubuntu Dapper's stock kernel build.
> 2.6.24-*: This is Ubuntu Hardy's stock kernel.
> 2.6.22-{generic,server}: This is a custom, in-house kernel build, built for ia32.
> 2.6.22-vanilla: This is our custom, in-house kernel build, built for x86_64.
>
> We don't think it's related to our custom kernels, because the same phenomena
> show up with the Ubuntu stock kernels.
>
> Hardware:
>
> The benchmark machine We've been using is an Intel Xeon E5440 @2.83GHz
> dual-cpu quad-core with Broadcom NetXtreme II BCM5708 bnx2 networking.
>
> We've also tried AMD machines, as well as machines with Tigon3
> partno(BCM95704A6) tg3 network cards, they all show consistent behavior.
>
> Our hardy x86_64 server machines all appear to have this problem, new and old.
>
> On the other hand, a desktop with Intel Q6600 quad core 2.4GHz and Intel 82566DC GigE
> seem to work fine.
>
> All of the dapper ia32 machines have no trouble, even our older hardware.
>

Like Eric mentioned, I'd start with the latest kernel if at all possible.  If it
doesn't happen there, your work is half over: you just need to figure out what
changed and tell Canonical to backport it.

From there, you can solve this like most packet loss issues are solved:

1) Determine whether it's an rx or tx packet loss.  From your comments above it sounds
like this is an rx-side issue.

2) Look at statistics from the hardware up to the application.  Use ethtool and
/proc/net/dev to get hardware packet loss stats, and /proc/net/snmp or netstat -s to
get core network loss stats (a sketch follows after this list).

3) Use those stats to identify where and why packets are getting dropped.
Posting a summary of that data here is something we can help with if need be.

4) Determine how to reduce the loss (i.e., code change vs. tuning).

5) Lather, rinse, repeat (eliminating a drop cause in one location will likely
increase throughput, potentially putting strain on another location in the code
path and possibly leading to more drops elsewhere).
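
A minimal sketch of the /proc/net/snmp side of point 2 -- a hypothetical helper
the test client could call before and after a run to see whether the UDP
InErrors/RcvbufErrors counters grow:

#include <stdio.h>
#include <string.h>

/* Print the UDP counter lines from /proc/net/snmp (header row + value row).
 * Comparing the output before and after a run shows whether datagrams are
 * being dropped at the socket level (InErrors / RcvbufErrors). */
static void dump_udp_counters(void)
{
    FILE *f = fopen("/proc/net/snmp", "r");
    char line[512];

    if (!f) {
        perror("/proc/net/snmp");
        return;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Udp:", 4) == 0)
            fputs(line, stdout);
    }
    fclose(f);
}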


You had mentioned that ifconfig was showing rx drops, which indicates that your
hardware rx buffer is likely overflowing.  Usually the best way to fix that is
to:

1) modify any available interrupt coalescing parameters on the driver such that
interrupts have less latency between packet arrival and assertion

2) increase (if possible) the napi weight (I think that's still the right term)
so that each napi poll iteration receives more frames from the interface,
draining that queue more quickly.

Neil


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-30 20:03 ` Neil Horman
@ 2009-01-30 22:29   ` Kenny Chang
  2009-01-30 22:41     ` Eric Dumazet
  2009-02-02 13:53     ` Eric Dumazet
  0 siblings, 2 replies; 70+ messages in thread
From: Kenny Chang @ 2009-01-30 22:29 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 7188 bytes --]

Ah, sorry, here's the test program attached.

We've tried 2.6.28.1, but no, we haven't tried 2.6.28.2 or 2.6.29-rcX.

Right now, we are trying to step through the kernel versions until we 
see where the performance drops significantly.  We'll try 2.6.29-rc soon 
and post the result.

Neil Horman wrote:

> 1) Determine if its a rx or tx packet loss.  From your comments above it sounds
> like this is an rx side issue

   We're pretty sure it's an rx issue.  Other machines receiving at the same
   time will get all the packets.

I'll gather the information mentioned and summarize in a subsequent email.

Thanks!
Kenny

Neil Horman wrote:
> On Fri, Jan 30, 2009 at 12:49:48PM -0500, Kenny Chang wrote:
>   
>> Hi all,
>>
>> We've been having some issues with multicast packet loss, we were wondering
>> if anyone knows anything about the behavior we're seeing.
>>
>> Background: we use multicast messaging with lots of messages per sec for our
>> work. We recently transitioned many of our systems from an Ubuntu Dapper Drake
>> ia32 distribution to Ubuntu Hardy Heron x86_64. Since the transition, we've
>> noticed much more multicast packet loss, and we think it's related to the
>> transition. Our particular theory is that it's specifically a 32 vs 64-bit
>> issue.
>>
>> We narrowed the problem down to the attached program (mcasttest.cc).  Run
>> "mcasttest server" on one machine -- it'll send 500,000 messages small message
>> to a multicast group, 50,000 messages per second.  If we run "mcasttest client"
>> on another machine, it'll receive all those messages and print a count at the
>> end of how many messages it sees. It almost never loses any messages. However,
>> if we run 4 copies of the client on the same machine, receiving the same data,
>> then the programs usually sees fewer than 500,000 messages. We're running with:
>>
>> for i in $(seq 1 4); do (./mcasttest client &); done
>>
>> We know this because the program prints a count, but dropped packets also
>> show up in ifconfig's "RX packets" section.
>>
>> Things we're curious about: do other people see similar problems?  The tests
>> we've done: we've tried this program on a bunch of different machines, all of
>> which are running either dapper ia32 or hardy x86_64. Uniformly, the dapper
>> machines have no problems but on certain machines, Hardy shows 
>> significant loss. We did some experiments on a troubled machine, varying 
>> the OS install, including mixed installations where the kernel was 64-bit 
>> and the userspace was
>> 32-bit. This is what we found:
>>
>> On machines that exhibit this problem, the ksoftirqd process seems to be  
>> pegged to 100% CPU when receiving packets.
>>
>> Note: while we're on Ubuntu, we've tried this with other distros and have seen
>> similar results, we just haven't tabulated them.
>>
>>     
>>> ----------------------------------------------------------------------------
>>> userland | userland arch | kernel           | kernel arch | mode
>>> ----------------------------------------------------------------------------
>>> Dapper   |            32 | 2.6.15-28-server |          32 | no packet loss
>>> Dapper   |            32 | 2.6.22-generic   |          32 | no packet loss
>>> Dapper   |            32 | 2.6.22-server    |          32 | no packet loss
>>> Hardy    |            32 | 2.6.24-rt        |          32 | no packet loss
>>> Hardy    |            32 | 2.6.24-generic   |          32 | ~5% packet loss
>>> Hardy    |            32 | 2.6.24-server    |          32 | ~10% packet loss
>>>
>>> Hardy    |            32 | 2.6.22-server    |          64 | no packet loss
>>> Hardy    |            32 | 2.6.24-rt        |          64 | no packet loss
>>> Hardy    |            32 | 2.6.24-generic   |          64 | 14% packet loss
>>> Hardy    |            32 | 2.6.24-server    |          64 | 12% packet loss
>>>
>>> Hardy    |            64 | 2.6.22-vanilla   |          64 | packet loss
>>> Hardy    |            64 | 2.6.24-rt        |          64 | ~5% packet loss
>>> Hardy    |            64 | 2.6.24-server    |          64 | ~30% packet loss
>>> Hardy    |            64 | 2.6.24-generic   |          64 | ~5% packet loss
>>> ----------------------------------------------------------------------------
>> It's not exactly clear what exactly the problem is but dapper shows no 
>> issues regardless of what we try. For hardy, userspace seem to matter:  
>> 2.6.24-rt kernel shows no packet loss for 32&64bit kernels, as long as 
>> the userspace is 32-bit.
>>
>> Kernel comments:
>> 2.6.15-28-server: This is Ubuntu Dapper's stock kernel build.
>> 2.6.24-*: This is Ubuntu Hardy's stock kernel.
>> 2.6.22-{generic,server}: This is a custom, in-house kernel build, built for ia32.
>> 2.6.22-vanilla: This is our custom, in-house kernel build, built for x86_64.
>>
>> We don't think it's related to our custom kernels, because the same phenomena
>> show up with the Ubuntu stock kernels.
>>
>> Hardware:
>>
>> The benchmark machine We've been using is an Intel Xeon E5440 @2.83GHz
>> dual-cpu quad-core with Broadcom NetXtreme II BCM5708 bnx2 networking.
>>
>> We've also tried AMD machines, as well as machines with Tigon3
>> partno(BCM95704A6) tg3 network cards, they all show consistent behavior.
>>
>> Our hardy x86_64 server machines all appear to have this problem, new and old.
>>
>> On the other hand, a desktop with Intel Q6600 quad core 2.4GHz and Intel 82566DC GigE
>> seem to work fine.
>>
>> All of the dapper ia32 machines have no trouble, even our older hardware.
>>
>>     
>
> Like Eric mentioned, I'd start with a latest kernel if at all possible.  If it
> doesn't happen there, you're work is half over, you just need to figure out what
> changed, and tell Canonical to backport it.
>
> From there, you can solve this like most packet loss issues are solved:
>
> 1) Determine if its a rx or tx packet loss.  From your comments above it sounds
> like this is an rx side issue
>
> 2) Look at statistics from the hardware to the application.  Use ethtool &
> /proc/net/dev to get hardware packet loss stats, /proc/net/snmp netstat -s to
> get core network loss stats
>
> 3) Use those stats to identify where and why packets are getting dropped.
> Posting some summary of that data here is something we can help with if need be
>
> 4) Determine how to reduce the loss (i.e. code change vs. tuning)
>
> 5) Lather, rinse repeat (given that eliminating a drop cause in one location
> will likely increase througput, potentially putting strain on another location
> in the code path, possibly leading to more drops elsewhere. 
>
>
> You had mentioned that ifconfig was showing rx drops, which indicates that your
> hardware rx buffer is likely overflowing.  Usually the best way to fix that is
> to:
>
> 1) modify any available interrupt coalescing parameters on the driver such that
> interrupts have less latency between packet arrival and assertion
>
> 2) increase (if possible) the napi weight (I think thats still the right term)
> so that each napi poll interation receives more frames on the interface,
> draining that queue more quickly.
>
> Neil
>
>   


[-- Attachment #2: mcasttest.c --]
[-- Type: text/x-csrc, Size: 3166 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <arpa/inet.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <unistd.h>


void error(const char *s)
{
    fprintf(stderr, "%s\n", s);
    exit(1);
}

/* Abort with the errno string if a syscall result check failed. */
void check(int v)
{
    int myerr = errno;
    char *myerrstr = strerror(myerr);
    if(!v)
    {
        fprintf(stderr, "bad return code: %s\n", myerrstr);
        exit(1);
    }
}

const char *g_mcastaddr = "239.100.0.99";
int g_port = 10100;

int main(int argc, char **argv)
{
    if(argc != 2)
        error("usage: mcasttest (server|client)");
    if(strcmp(argv[1], "client") == 0)
    {
        // Client program: subscribes to a multicast group, receives messages
        // and prints a count of messages received once it's done.

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        check(s > 0);
        int val = 1;
        check(setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val)) == 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(g_port);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        check(bind(s, (struct sockaddr *) &addr, sizeof(addr)) == 0);

        struct ip_mreqn mreq;
        memset(&mreq, 0, sizeof(mreq));
        check(inet_pton(AF_INET, g_mcastaddr, &mreq.imr_multiaddr));
        mreq.imr_address.s_addr = htonl(INADDR_ANY);
        mreq.imr_ifindex = 0;
        check(setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) == 0);

        int bufSz;
        socklen_t len = sizeof(bufSz);
        getsockopt(s, SOL_SOCKET, SO_RCVBUF, (char*)(&bufSz), &len);
        printf("bufsz: %d\n", bufSz);

        int npackets = 0;
        char buf[1000];
        memset(buf, 0, sizeof(buf));
        while(1)
        {
            struct sockaddr_in from;
            socklen_t fromlen = sizeof(from);
            check(recvfrom(s, buf, 1000, 0, (struct sockaddr*)&from, &fromlen) == 100);
            ++npackets;
            if(buf[0] == 1) // exit message
                break;
        }
        printf("received %d packets\n", npackets);
    }
    else if(strcmp(argv[1], "server") == 0)
    {
        // Server program: sends 50,000 packets per second to a multicast address,
        // for 10 seconds.
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        int val = 1;
        int i = 1;
        check(s > 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(g_port);
        check(inet_pton(AF_INET, g_mcastaddr, &addr.sin_addr.s_addr));
        check(connect(s, (struct sockaddr *) &addr, sizeof(addr)) == 0);

        int npackets = 500000;
        char buf[100];
        memset(buf, 0, sizeof(buf));
        for(i = 1; i < npackets; ++i)
        {
            check(send(s, buf, sizeof(buf), 0) > 0);
            usleep(20); // 50,000 messages per second
        }

        buf[0] = 1;
        for(i = 1; i < 5; ++i)
        {
            check(send(s, buf, sizeof(buf), 0) > 0);
            sleep(1);
        }
    }
    else
        error("unknown mode");
    return 0;
}

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-30 22:29   ` Kenny Chang
@ 2009-01-30 22:41     ` Eric Dumazet
  2009-01-31 16:03       ` Neil Horman
                         ` (2 more replies)
  2009-02-02 13:53     ` Eric Dumazet
  1 sibling, 3 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-01-30 22:41 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

Kenny Chang wrote:
> Ah, sorry, here's the test program attached.
> 
> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
> 2.6.29.-rcX.
> 
> Right now, we are trying to step through the kernel versions until we
> see where the performance drops significantly.  We'll try 2.6.29-rc soon
> and post the result.

2.6.29-rc contains UDP receive improvements (lockless).

The problem is that multicast handling was not yet updated, but it could be :)


I was asking for "cat /proc/interrupts" because I believe you might
have a problem with NIC interrupts being handled by one CPU only (when the problem occurs).



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-30 22:41     ` Eric Dumazet
@ 2009-01-31 16:03       ` Neil Horman
  2009-02-02 16:13         ` Kenny Chang
  2009-02-02 16:48         ` Kenny Chang
  2009-02-01 12:40       ` Eric Dumazet
  2009-02-27 18:40       ` Christoph Lameter
  2 siblings, 2 replies; 70+ messages in thread
From: Neil Horman @ 2009-01-31 16:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kenny Chang, netdev

On Fri, Jan 30, 2009 at 11:41:23PM +0100, Eric Dumazet wrote:
> Kenny Chang a écrit :
> > Ah, sorry, here's the test program attached.
> > 
> > We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
> > 2.6.29.-rcX.
> > 
> > Right now, we are trying to step through the kernel versions until we
> > see where the performance drops significantly.  We'll try 2.6.29-rc soon
> > and post the result.
> 
> 2.6.29-rc contains UDP receive improvements (lockless)
> 
> Problem is multicast handling was not yet updated, but could be :)
> 
> 
> I was asking you "cat /proc/interrupts" because I believe you might
> have a problem NIC interrupts being handled by one CPU only (when having problems)
> 
That would be expected (if irqbalance is running), and desirable, since
spreading high-volume interrupts like NICs across multiple cores (or more
specifically multiple L2 caches) is going to increase your cache line miss rate
significantly and decrease rx throughput.

Although you do have a point here: if the system isn't running irqbalance, and
the NIC's irq affinity is spread across multiple L2 caches, that would be a
point of improvement performance-wise.

Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo
and your stats that I asked about earlier, that would be a big help.

Regards
Neil


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-30 22:41     ` Eric Dumazet
  2009-01-31 16:03       ` Neil Horman
@ 2009-02-01 12:40       ` Eric Dumazet
  2009-02-02 13:45         ` Neil Horman
  2009-02-27 18:40       ` Christoph Lameter
  2 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-02-01 12:40 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

Eric Dumazet wrote:
> Kenny Chang wrote:
>> Ah, sorry, here's the test program attached.
>>
>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
>> 2.6.29.-rcX.
>>
>> Right now, we are trying to step through the kernel versions until we
>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
>> and post the result.
> 

I tried your program on my dev machines and 2.6.29 (each machine: two quad-core CPUs, 32-bit kernel).

With 8 clients, about 10% packet loss.

Might be a scheduling problem, not sure... 50,000 packets per second x 8 CPUs = 400,000
wakeups per second... But at least the UDP receive path seems OK.

The thing is, the receiver (the softirq that queues the packet) seems to fight over the socket lock with
the readers...

I tried to set up IRQ affinities, but it doesn't work any more on bnx2 (unless using msi_disable=1).

I tried playing with the ethtool -C|c and -G|g params...
And /proc/sys/net/core/rmem_max (and setsockopt(SO_RCVBUF) to set bigger receive buffers in your program).
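
A minimal sketch of that SO_RCVBUF change, in the style of mcasttest.c; the 4 MB
figure is only an illustrative value, and the kernel caps the request at
net.core.rmem_max:

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Ask for a larger receive buffer on the client socket and report what the
 * kernel actually granted.  The request is limited by net.core.rmem_max, so
 * that sysctl may need raising as well; 4 MB is just an assumed value. */
static void grow_rcvbuf(int sock)
{
    int req = 4 * 1024 * 1024;
    socklen_t len = sizeof(req);

    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req)) != 0)
        perror("setsockopt(SO_RCVBUF)");

    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &req, &len) == 0)
        printf("effective SO_RCVBUF: %d bytes\n", req);  /* kernel reports twice the request */
}

In the client branch it would be called right after socket(), before bind().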

I can have 0% packet loss if booting with msi_disable and

echo 1 >/proc/irq/16/smp_affinity

(16 being the interrupt of the eth0 NIC)

Then a second run gave me errors, about 2%, oh well...


oprofile numbers without playing with IRQ affinities:

CPU: Core 2, speed 2999.89 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
327928   10.1427  schedule
259625    8.0301  mwait_idle
187337    5.7943  __skb_recv_datagram
109854    3.3977  lock_sock_nested
104713    3.2387  tick_nohz_stop_sched_tick
98831     3.0568  select_nohz_load_balancer
88163     2.7268  skb_release_data
78552     2.4296  update_curr
75241     2.3272  getnstimeofday
71400     2.2084  set_next_entity
67629     2.0917  get_next_timer_interrupt
67375     2.0839  sched_clock_tick
58112     1.7974  enqueue_entity
56462     1.7463  udp_recvmsg
55049     1.7026  copy_to_user
54277     1.6788  sched_clock_cpu
54031     1.6712  __copy_skb_header
51859     1.6040  __slab_free
51786     1.6017  prepare_to_wait_exclusive
51776     1.6014  sock_def_readable
50062     1.5484  try_to_wake_up
42182     1.3047  __switch_to
41631     1.2876  read_tsc
38337     1.1857  tick_nohz_restart_sched_tick
34358     1.0627  cpu_idle
34194     1.0576  native_sched_clock
33812     1.0458  pick_next_task_fair
33685     1.0419  resched_task
33340     1.0312  sys_recvfrom
33287     1.0296  dst_release
32439     1.0033  kmem_cache_free
32131     0.9938  hrtimer_start_range_ns
29807     0.9219  udp_queue_rcv_skb
27815     0.8603  task_rq_lock
26875     0.8312  __update_sched_clock
23912     0.7396  sock_queue_rcv_skb
21583     0.6676  __wake_up_sync
21001     0.6496  effective_load
20531     0.6350  hrtick_start_fair




With IRQ affinities and msi_disable (no packet drops)

CPU: Core 2, speed 3000.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
79788    10.3815  schedule
69422     9.0328  mwait_idle
44877     5.8391  __skb_recv_datagram
28629     3.7250  tick_nohz_stop_sched_tick
27252     3.5459  select_nohz_load_balancer
24320     3.1644  lock_sock_nested
20833     2.7107  getnstimeofday
20666     2.6889  skb_release_data
18612     2.4217  set_next_entity
17785     2.3141  get_next_timer_interrupt
17691     2.3018  udp_recvmsg
17271     2.2472  sched_clock_tick
16032     2.0860  copy_to_user
14785     1.9237  update_curr
12512     1.6280  prepare_to_wait_exclusive
12498     1.6262  __slab_free
11380     1.4807  read_tsc
11145     1.4501  sched_clock_cpu
10598     1.3789  __switch_to
9588      1.2475  pick_next_task_fair
9480      1.2335  cpu_idle
9218      1.1994  sys_recvfrom
9008      1.1721  tick_nohz_restart_sched_tick
8977      1.1680  dst_release
8930      1.1619  native_sched_clock
8392      1.0919  kmem_cache_free
8124      1.0570  hrtimer_start_range_ns
7274      0.9464  bnx2_interrupt
7175      0.9336  __copy_skb_header
7006      0.9116  try_to_wake_up
6949      0.9042  sock_def_readable
6787      0.8831  enqueue_entity
6772      0.8811  __update_sched_clock
6349      0.8261  finish_task_switch
6164      0.8020  copy_from_user
5096      0.6631  resched_task
5007      0.6515  sysenter_past_esp


I will try to investigate a little bit more in the following days if time permits.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-01 12:40       ` Eric Dumazet
@ 2009-02-02 13:45         ` Neil Horman
  2009-02-02 16:57           ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Neil Horman @ 2009-02-02 13:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kenny Chang, netdev

On Sun, Feb 01, 2009 at 01:40:39PM +0100, Eric Dumazet wrote:
> Eric Dumazet a écrit :
> > Kenny Chang a écrit :
> >> Ah, sorry, here's the test program attached.
> >>
> >> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
> >> 2.6.29.-rcX.
> >>
> >> Right now, we are trying to step through the kernel versions until we
> >> see where the performance drops significantly.  We'll try 2.6.29-rc soon
> >> and post the result.
> > 
> 
> I tried your program on my dev machines and 2.6.29 (each machine : two quad core cpus, 32bits kernel)
> 
> With 8 clients, about 10% packet loss, 
> 
> Might be a scheduling problem, not sure... 50.000 packets per second, x 8 cpus = 400.000
> wakeups per second... But at least UDP receive path seems OK.
> 
> Thing is the receiver (softirq that queues the packet) seems to fight on socket lock with
> readers...
> 
> I tried to setup IRQ affinities, but it doesnt work any more on bnx2 (unless using msi_disable=1)
> 
> I tried playing with ethtool -C|c G|g params...
> And /proc/net/core/rmem_max (and setsockopt(RCVBUF) to set bigger receive buffers in your program)
> 
> I can have 0% packet loss if booting with msi_disable and
> 
> echo 1 >/proc/irq/16/smp_affinities
> 
> (16 being interrupt of eth0 NIC)
> 
> then, a second run gave me errors, about 2%, oh well...
> 
> 
> oprofile numbers without playing IRQ affinities:
> 
> CPU: Core 2, speed 2999.89 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        symbol name
> 327928   10.1427  schedule
> 259625    8.0301  mwait_idle
> 187337    5.7943  __skb_recv_datagram
> 109854    3.3977  lock_sock_nested
> 104713    3.2387  tick_nohz_stop_sched_tick
> 98831     3.0568  select_nohz_load_balancer
> 88163     2.7268  skb_release_data
> 78552     2.4296  update_curr
> 75241     2.3272  getnstimeofday
> 71400     2.2084  set_next_entity
> 67629     2.0917  get_next_timer_interrupt
> 67375     2.0839  sched_clock_tick
> 58112     1.7974  enqueue_entity
> 56462     1.7463  udp_recvmsg
> 55049     1.7026  copy_to_user
> 54277     1.6788  sched_clock_cpu
> 54031     1.6712  __copy_skb_header
> 51859     1.6040  __slab_free
> 51786     1.6017  prepare_to_wait_exclusive
> 51776     1.6014  sock_def_readable
> 50062     1.5484  try_to_wake_up
> 42182     1.3047  __switch_to
> 41631     1.2876  read_tsc
> 38337     1.1857  tick_nohz_restart_sched_tick
> 34358     1.0627  cpu_idle
> 34194     1.0576  native_sched_clock
> 33812     1.0458  pick_next_task_fair
> 33685     1.0419  resched_task
> 33340     1.0312  sys_recvfrom
> 33287     1.0296  dst_release
> 32439     1.0033  kmem_cache_free
> 32131     0.9938  hrtimer_start_range_ns
> 29807     0.9219  udp_queue_rcv_skb
> 27815     0.8603  task_rq_lock
> 26875     0.8312  __update_sched_clock
> 23912     0.7396  sock_queue_rcv_skb
> 21583     0.6676  __wake_up_sync
> 21001     0.6496  effective_load
> 20531     0.6350  hrtick_start_fair
> 
> 
> 
> 
> With IRQ affinities and msi_disable (no packet drops)
> 
> CPU: Core 2, speed 3000.13 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        symbol name
> 79788    10.3815  schedule
> 69422     9.0328  mwait_idle
> 44877     5.8391  __skb_recv_datagram
> 28629     3.7250  tick_nohz_stop_sched_tick
> 27252     3.5459  select_nohz_load_balancer
> 24320     3.1644  lock_sock_nested
> 20833     2.7107  getnstimeofday
> 20666     2.6889  skb_release_data
> 18612     2.4217  set_next_entity
> 17785     2.3141  get_next_timer_interrupt
> 17691     2.3018  udp_recvmsg
> 17271     2.2472  sched_clock_tick
> 16032     2.0860  copy_to_user
> 14785     1.9237  update_curr
> 12512     1.6280  prepare_to_wait_exclusive
> 12498     1.6262  __slab_free
> 11380     1.4807  read_tsc
> 11145     1.4501  sched_clock_cpu
> 10598     1.3789  __switch_to
> 9588      1.2475  pick_next_task_fair
> 9480      1.2335  cpu_idle
> 9218      1.1994  sys_recvfrom
> 9008      1.1721  tick_nohz_restart_sched_tick
> 8977      1.1680  dst_release
> 8930      1.1619  native_sched_clock
> 8392      1.0919  kmem_cache_free
> 8124      1.0570  hrtimer_start_range_ns
> 7274      0.9464  bnx2_interrupt
> 7175      0.9336  __copy_skb_header
> 7006      0.9116  try_to_wake_up
> 6949      0.9042  sock_def_readable
> 6787      0.8831  enqueue_entity
> 6772      0.8811  __update_sched_clock
> 6349      0.8261  finish_task_switch
> 6164      0.8020  copy_from_user
> 5096      0.6631  resched_task
> 5007      0.6515  sysenter_past_esp
> 
> 
> I will try to investigate a litle bit more in following days if time permits.
> 
I'm not 100% versed on this, but IIRC, some hardware simply can't set irq
affinity when operating in MSI interrupt mode.  If this is the case with this
particular bnx2 card, then I would expect some packet loss, simply due to the
constant cache misses.  It would be interesting to re-run your oprofile cases,
counting L2 cache hits/misses (if your CPU supports that class of counter), for
bnx2 running in both MSI-enabled and MSI-disabled modes.  It would also be
interesting to use a different card that can set irq affinity, and compare loss
with irqbalance on versus irqbalance off with irq affinity set to all CPUs.

Neil


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-30 22:29   ` Kenny Chang
  2009-01-30 22:41     ` Eric Dumazet
@ 2009-02-02 13:53     ` Eric Dumazet
  1 sibling, 0 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-02-02 13:53 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

Kenny Chang wrote:
> Ah, sorry, here's the test program attached.
> 
> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
> 2.6.29.-rcX.
> 
> Right now, we are trying to step through the kernel versions until we
> see where the performance drops significantly.  We'll try 2.6.29-rc soon
> and post the result.
> 
> Neil Norman wrote:

On the latest kernels, we have a "timer_slack_ns" default of 50,000 ns, i.e. 50 us.

So usleep(20) sleeps much longer than expected.

You might add to your program a call to prctl() to set up a smaller timer_slack:

#ifndef PR_SET_TIMERSLACK
#define PR_SET_TIMERSLACK 29
#endif
/*
 * Setup a timer resolution of 1000 ns : 1 us
 */
prctl(PR_SET_TIMERSLACK, 1000); 
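
A sketch of how that could be wired into the posted program; the helper name is
made up, and it would be called at the top of the "server" branch, before the
send loop:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_TIMERSLACK
#define PR_SET_TIMERSLACK 29   /* as above, for headers that lack the define */
#endif

/* Shrink this task's timer slack to 1 us so that usleep(20) wakes up close
 * to 20 us instead of being rounded up by the default ~50 us slack. */
static void set_timer_slack_1us(void)
{
    if (prctl(PR_SET_TIMERSLACK, 1000UL, 0, 0, 0) != 0)
        perror("prctl(PR_SET_TIMERSLACK)");
}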





^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-31 16:03       ` Neil Horman
@ 2009-02-02 16:13         ` Kenny Chang
  2009-02-02 16:48         ` Kenny Chang
  1 sibling, 0 replies; 70+ messages in thread
From: Kenny Chang @ 2009-02-02 16:13 UTC (permalink / raw)
  To: netdev

Neil Horman wrote:
> On Fri, Jan 30, 2009 at 11:41:23PM +0100, Eric Dumazet wrote:
>   
>> Kenny Chang a écrit :
>>     
>>> Ah, sorry, here's the test program attached.
>>>
>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
>>> 2.6.29.-rcX.
>>>
>>> Right now, we are trying to step through the kernel versions until we
>>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
>>> and post the result.
>>>       
>> 2.6.29-rc contains UDP receive improvements (lockless)
>>
>> Problem is multicast handling was not yet updated, but could be :)
>>
>>
>> I was asking you "cat /proc/interrupts" because I believe you might
>> have a problem NIC interrupts being handled by one CPU only (when having problems)
>>
>>     
> That would be expected (if irqbalance is running), and desireable, since
> spreading high volume interrupts like NICS accross multiple cores (or more
> specifically multiple L2 caches), is going increase your cache line miss rate
> significantly and decrease rx throughput.
>
> Although you do have a point here, if the system isn't running irqbalance, and
> the NICS irq affinity is spread accross multiple L2 caches, that would be a
> point of improvement performance-wise.  
>
> Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo
> and your stats that I asked about earlier, that would be a big help.
>
> Regards
> Neil
>
>   
Hi Neil,

Here's the information you requested.

Kenny

kchang@beast8:~$ uname -a
Linux beast8 2.6.24-19-server #1 SMP Wed Aug 20 18:43:06 UTC 2008 x86_64 
GNU/Linux
kchang@beast8:~$ cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5322.91
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

processor    : 1
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 1
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5320.03
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

processor    : 2
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 2
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5320.06
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

processor    : 3
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 3
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5320.06
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

processor    : 4
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 1
siblings    : 4
core id        : 0
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5320.07
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

processor    : 5
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 1
siblings    : 4
core id        : 1
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5320.07
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

processor    : 6
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 1
siblings    : 4
core id        : 2
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5320.08
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

processor    : 7
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 1
siblings    : 4
core id        : 3
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5320.08
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

kchang@beast8:~$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
   0:         67          0          1          0          0          0          0          0   IO-APIC-edge      timer
   1:          0          0          0          0          0          0          0          0   IO-APIC-edge      i8042
   8:          0          0          0          1          0          0          0          0   IO-APIC-edge      rtc
   9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
  14:         12         13         13         13         12         10         13         13   IO-APIC-edge      libata
  15:          0          0          0          0          0          0          0          0   IO-APIC-edge      libata
  17:        294        295        293        294        294        296        293        288   IO-APIC-fasteoi   aacraid
  22:          6          5          5          5          6          6          6          6   IO-APIC-fasteoi   uhci_hcd:usb3
  23:          7          8          8          7          7          8          7          8   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
2294:         48         46         48         48         49         47         48         51   PCI-MSI-edge      eth0
 NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
 LOC:       5088       3394       3129       2835       2561       2938       2576       2798   Local timer interrupts
 RES:         59        119         58         36         34         71         50         17   Rescheduling interrupts
 CAL:        132        128        149        141        138        140        152        140   function call interrupts
 TLB:        285        178        278        183        297        191        295        157   TLB shootdowns
 TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
 ERR:          0


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-31 16:03       ` Neil Horman
  2009-02-02 16:13         ` Kenny Chang
@ 2009-02-02 16:48         ` Kenny Chang
  2009-02-03 11:55           ` Neil Horman
  1 sibling, 1 reply; 70+ messages in thread
From: Kenny Chang @ 2009-02-02 16:48 UTC (permalink / raw)
  To: netdev

Neil Horman wrote:
> On Fri, Jan 30, 2009 at 11:41:23PM +0100, Eric Dumazet wrote:
>   
>> Kenny Chang a écrit :
>>     
>>> Ah, sorry, here's the test program attached.
>>>
>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
>>> 2.6.29.-rcX.
>>>
>>> Right now, we are trying to step through the kernel versions until we
>>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
>>> and post the result.
>>>       
>> 2.6.29-rc contains UDP receive improvements (lockless)
>>
>> Problem is multicast handling was not yet updated, but could be :)
>>
>>
>> I was asking you "cat /proc/interrupts" because I believe you might
>> have a problem NIC interrupts being handled by one CPU only (when having problems)
>>
>>     
> That would be expected (if irqbalance is running), and desireable, since
> spreading high volume interrupts like NICS accross multiple cores (or more
> specifically multiple L2 caches), is going increase your cache line miss rate
> significantly and decrease rx throughput.
>
> Although you do have a point here, if the system isn't running irqbalance, and
> the NICS irq affinity is spread accross multiple L2 caches, that would be a
> point of improvement performance-wise.  
>
> Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo
> and your stats that I asked about earlier, that would be a big help.
>
> Regards
> Neil
>
>   
This is for a working setup.

-Kenny

kchang@fiji:~$ uname -a
Linux fiji 2.6.24-19-generic #1 SMP Wed Aug 20 17:53:40 UTC 2008 x86_64 
GNU/Linux
kchang@fiji:~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
stepping        : 11
cpu MHz         : 1600.000
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr lahf_lm
bogomips    : 4791.31
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 1
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
stepping    : 11
cpu MHz        : 1600.000
cache size    : 4096 KB
physical id    : 0
siblings    : 4
core id        : 1
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr lahf_lm
bogomips    : 4788.05
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 2
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
stepping    : 11
cpu MHz        : 1600.000
cache size    : 4096 KB
physical id    : 0
siblings    : 4
core id        : 2
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr lahf_lm
bogomips    : 4788.08
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 3
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
stepping    : 11
cpu MHz        : 1600.000
cache size    : 4096 KB
physical id    : 0
siblings    : 4
core id        : 3
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr lahf_lm
bogomips    : 4788.07
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

kchang@fiji:~$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3      
  0:        165          0          0          0   IO-APIC-edge      timer
  1:          2          0          0          0   IO-APIC-edge      i8042
  8:          1          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          4          0          0          0   IO-APIC-edge      i8042
 16:   25614400          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3, HDA Intel, eth1, nvidia
 17:     571932          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4, uhci_hcd:usb6
 18:     102824          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb7
 22:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 23:    1636819          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb5
507:   12542966          0          0          0   PCI-MSI-edge      eth0
508:    1201118          0          0          0   PCI-MSI-edge      ahci
NMI:          0          0          0          0   Non-maskable interrupts
LOC:   29214662   20141857   21777347   14279251   Local timer interrupts
RES:     205758     173268     238058     123958   Rescheduling interrupts
CAL:       2623       3732       3814       2747   function call interrupts
TLB:      29961      56621      31440      55783   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-02 13:45         ` Neil Horman
@ 2009-02-02 16:57           ` Eric Dumazet
  2009-02-02 18:22             ` Neil Horman
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-02-02 16:57 UTC (permalink / raw)
  To: Neil Horman; +Cc: Kenny Chang, netdev

Neil Horman a écrit :
> On Sun, Feb 01, 2009 at 01:40:39PM +0100, Eric Dumazet wrote:
>> Eric Dumazet a écrit :
>>> Kenny Chang a écrit :
>>>> Ah, sorry, here's the test program attached.
>>>>
>>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
>>>> 2.6.29.-rcX.
>>>>
>>>> Right now, we are trying to step through the kernel versions until we
>>>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
>>>> and post the result.
>> I tried your program on my dev machines and 2.6.29 (each machine : two quad core cpus, 32bits kernel)
>>
>> With 8 clients, about 10% packet loss, 
>>
>> Might be a scheduling problem, not sure... 50.000 packets per second, x 8 cpus = 400.000
>> wakeups per second... But at least UDP receive path seems OK.
>>
>> Thing is the receiver (softirq that queues the packet) seems to fight on socket lock with
>> readers...
>>
>> I tried to setup IRQ affinities, but it doesnt work any more on bnx2 (unless using msi_disable=1)
>>
>> I tried playing with ethtool -C|c G|g params...
>> And /proc/net/core/rmem_max (and setsockopt(RCVBUF) to set bigger receive buffers in your program)
>>
>> I can have 0% packet loss if booting with msi_disable and
>>
>> echo 1 >/proc/irq/16/smp_affinities
>>
>> (16 being interrupt of eth0 NIC)
>>
>> then, a second run gave me errors, about 2%, oh well...
>>
>>
>> oprofile numbers without playing IRQ affinities:
>>
>> CPU: Core 2, speed 2999.89 MHz (estimated)
>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
>> samples  %        symbol name
>> 327928   10.1427  schedule
>> 259625    8.0301  mwait_idle
>> 187337    5.7943  __skb_recv_datagram
>> 109854    3.3977  lock_sock_nested
>> 104713    3.2387  tick_nohz_stop_sched_tick
>> 98831     3.0568  select_nohz_load_balancer
>> 88163     2.7268  skb_release_data
>> 78552     2.4296  update_curr
>> 75241     2.3272  getnstimeofday
>> 71400     2.2084  set_next_entity
>> 67629     2.0917  get_next_timer_interrupt
>> 67375     2.0839  sched_clock_tick
>> 58112     1.7974  enqueue_entity
>> 56462     1.7463  udp_recvmsg
>> 55049     1.7026  copy_to_user
>> 54277     1.6788  sched_clock_cpu
>> 54031     1.6712  __copy_skb_header
>> 51859     1.6040  __slab_free
>> 51786     1.6017  prepare_to_wait_exclusive
>> 51776     1.6014  sock_def_readable
>> 50062     1.5484  try_to_wake_up
>> 42182     1.3047  __switch_to
>> 41631     1.2876  read_tsc
>> 38337     1.1857  tick_nohz_restart_sched_tick
>> 34358     1.0627  cpu_idle
>> 34194     1.0576  native_sched_clock
>> 33812     1.0458  pick_next_task_fair
>> 33685     1.0419  resched_task
>> 33340     1.0312  sys_recvfrom
>> 33287     1.0296  dst_release
>> 32439     1.0033  kmem_cache_free
>> 32131     0.9938  hrtimer_start_range_ns
>> 29807     0.9219  udp_queue_rcv_skb
>> 27815     0.8603  task_rq_lock
>> 26875     0.8312  __update_sched_clock
>> 23912     0.7396  sock_queue_rcv_skb
>> 21583     0.6676  __wake_up_sync
>> 21001     0.6496  effective_load
>> 20531     0.6350  hrtick_start_fair
>>
>>
>>
>>
>> With IRQ affinities and msi_disable (no packet drops)
>>
>> CPU: Core 2, speed 3000.13 MHz (estimated)
>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
>> samples  %        symbol name
>> 79788    10.3815  schedule
>> 69422     9.0328  mwait_idle
>> 44877     5.8391  __skb_recv_datagram
>> 28629     3.7250  tick_nohz_stop_sched_tick
>> 27252     3.5459  select_nohz_load_balancer
>> 24320     3.1644  lock_sock_nested
>> 20833     2.7107  getnstimeofday
>> 20666     2.6889  skb_release_data
>> 18612     2.4217  set_next_entity
>> 17785     2.3141  get_next_timer_interrupt
>> 17691     2.3018  udp_recvmsg
>> 17271     2.2472  sched_clock_tick
>> 16032     2.0860  copy_to_user
>> 14785     1.9237  update_curr
>> 12512     1.6280  prepare_to_wait_exclusive
>> 12498     1.6262  __slab_free
>> 11380     1.4807  read_tsc
>> 11145     1.4501  sched_clock_cpu
>> 10598     1.3789  __switch_to
>> 9588      1.2475  pick_next_task_fair
>> 9480      1.2335  cpu_idle
>> 9218      1.1994  sys_recvfrom
>> 9008      1.1721  tick_nohz_restart_sched_tick
>> 8977      1.1680  dst_release
>> 8930      1.1619  native_sched_clock
>> 8392      1.0919  kmem_cache_free
>> 8124      1.0570  hrtimer_start_range_ns
>> 7274      0.9464  bnx2_interrupt
>> 7175      0.9336  __copy_skb_header
>> 7006      0.9116  try_to_wake_up
>> 6949      0.9042  sock_def_readable
>> 6787      0.8831  enqueue_entity
>> 6772      0.8811  __update_sched_clock
>> 6349      0.8261  finish_task_switch
>> 6164      0.8020  copy_from_user
>> 5096      0.6631  resched_task
>> 5007      0.6515  sysenter_past_esp
>>
>>
>> I will try to investigate a litle bit more in following days if time permits.
>>
> I'm not 100% versed on this, but IIRC, some hardware simply can't set irq
> affinity when operating in msi interrupt mode.  If this is the case with this
> particular bnx2 card, then I would expect some packet loss, simply due to the
> constant cache misses.  It would be interesting to re-run your oprofile cases,
> counting L2 cache hits/misses (if your cpu supports that class of counter) for
> both bnx2 running in msi enabled mode and msi disabled mode.  It would also be
> interesting to use a different card, that can set irq affinity, and compare loss
> with irqbalance on, and irqbalance off with irq afninty set to all cpus.

Booted with msi_disable=1 and the IRQ of eth0 handled by CPU0 only, so the
oprofile results below are sorted on the CPU0 numbers.

We can see the scheduler has a hard time coping with this workload once more
than two CPUs are involved.

It is OK up to 30,000 packets per second (* 8 sockets).

CPU0 spends 100% of its time handling softirq (ksoftirqd/0)


CPU: Core 2, speed 3000.31 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
Samples on CPU 1
Samples on CPU 2
Samples on CPU 3
Samples on CPU 4
Samples on CPU 5
Samples on CPU 6
Samples on CPU 7
samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        symbol name
6152     12.5595  3         0.0098  3         0.0090  5         0.0156  1         0.0582  0              0  2         0.0065  3         0.0169  enqueue_entity
4453      9.0909  2         0.0065  3         0.0090  4         0.0125  5         0.2910  0              0  1         0.0033  2         0.0113  try_to_wake_up
3837      7.8333  3         0.0098  8         0.0241  0              0  0              0  0              0  0              0  0              0  sock_def_readable
3694      7.5414  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __copy_skb_header
2320      4.7363  1         0.0033  2         0.0060  2         0.0062  1         0.0582  1         0.0028  2         0.0065  0              0  resched_task
1818      3.7115  6         0.0196  32        0.0962  0              0  0              0  0              0  0              0  0              0  sock_queue_rcv_skb
1776      3.6257  0              0  0              0  0              0  0              0  0              0  0              0  0              0  udp_queue_rcv_skb
1677      3.4236  0              0  1         0.0030  0              0  1         0.0582  1         0.0028  0              0  0              0  __slab_alloc
1658      3.3848  260       0.8496  303       0.9109  289       0.9021  24        1.3970  418       1.1730  326       1.0626  173       0.9733  sched_clock_cpu
1614      3.2950  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __wake_up_sync
1600      3.2664  0              0  1         0.0030  0              0  1         0.0582  1         0.0028  0              0  0              0  select_task_rq_fair
1569      3.2032  1299      4.2447  1530      4.5996  1271      3.9675  6         0.3492  1677      4.7062  1275      4.1559  759       4.2703  update_curr
1532      3.1276  4         0.0131  4         0.0120  0              0  2         0.1164  1         0.0028  1         0.0033  1         0.0056  task_rq_lock
1325      2.7050  1         0.0033  7         0.0210  0              0  0              0  0              0  0              0  0              0  skb_queue_tail
1273      2.5989  1         0.0033  1         0.0030  1         0.0031  0              0  0              0  1         0.0033  0              0  enqueue_task_fair
1227      2.5050  0              0  0              0  0              0  0              0  0              0  0              0  0              0  effective_load
1071      2.1865  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __udp4_lib_rcv
1009      2.0599  0              0  0              0  0              0  2         0.1164  1         0.0028  0              0  0              0  activate_task
940       1.9190  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __wake_up_common
930       1.8986  0              0  2         0.0060  0              0  0              0  0              0  0              0  1         0.0056  account_scheduler_latency
859       1.7537  0              0  0              0  0              0  0              0  0              0  0              0  1         0.0056  __skb_clone
609       1.2433  0              0  0              0  0              0  1         0.0582  1         0.0028  1         0.0033  0              0  enqueue_task
588       1.2004  3         0.0098  2         0.0060  5         0.0156  8         0.4657  2         0.0056  3         0.0098  2         0.0113  kmem_cache_alloc
477       0.9738  307       1.0032  322       0.9680  358       1.1175  27        1.5716  338       0.9485  315       1.0268  203       1.1421  native_sched_clock
441       0.9003  0              0  0              0  0              0  0              0  0              0  0              0  0              0  skb_clone
408       0.8329  0              0  0              0  0              0  0              0  0              0  0              0  0              0  ip_route_input
375       0.7656  0              0  0              0  0              0  0              0  0              0  0              0  0              0  bnx2_poll_work
366       0.7472  248       0.8104  269       0.8087  293       0.9146  22        1.2806  289       0.8110  332       1.0822  157       0.8833  __update_sched_clock
327       0.6676  1         0.0033  0              0  0              0  0              0  0              0  2         0.0065  1         0.0056  place_entity
265       0.5410  54        0.1765  62        0.1864  39        0.1217  3         0.1746  84        0.2357  61        0.1988  12        0.0675  rb_insert_color
194       0.3961  2662      8.6985  3291      9.8936  3231     10.0858  372      21.6531  2994      8.4021  3299     10.7533  1719      9.6714  mwait_idle



This problem completely disappears if I launch all clients bound to CPU1
(with the NIC irq still on CPU0):

taskset -p2 ./mcasttest.sh

(No packet loss, while CPU1 has 0% idle time...)
We get fewer context switches: once woken, a task can read several packets from its socket.
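
For reference, a client could also pin itself instead of relying on taskset.
A minimal sketch using sched_setaffinity(2) follows; the CPU number and the
idea of calling it at the start of main() are illustrative assumptions, not
part of the attached test program:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Restrict the calling process to a single CPU; cpu 1 matches the
 * taskset mask of 0x2 used above. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 == self */
        perror("sched_setaffinity");
        exit(1);
    }
}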

CPU: Core 2, speed 3000.31 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
Samples on CPU 1
Samples on CPU 2
Samples on CPU 3
Samples on CPU 4
Samples on CPU 5
Samples on CPU 6
Samples on CPU 7
samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        symbol name
25316    13.6664  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __copy_skb_header
14584     7.8729  13        0.0083  2         0.2670  0              0  0              0  1         0.0218  1         0.1577  1         0.2604  task_rq_lock
11624     6.2750  10657     6.7650  3         0.4005  2         0.5013  5         0.7278  36        0.7862  2         0.3155  1         0.2604  update_curr
10038     5.4188  318       0.2019  0              0  0              0  0              0  0              0  0              0  0              0  sock_def_readable
10021     5.4097  0              0  0              0  0              0  0              0  0              0  0              0  0              0  bnx2_interrupt
7777      4.1983  11        0.0070  1         0.1335  2         0.5013  2         0.2911  8         0.1747  3         0.4732  1         0.2604  try_to_wake_up
6559      3.5408  0              0  0              0  0              0  0              0  0              0  0              0  0              0  udp_queue_rcv_skb
6389      3.4490  257       0.1631  0              0  0              0  0              0  0              0  0              0  0              0  sock_queue_rcv_skb
6305      3.4036  6         0.0038  0              0  0              0  0              0  0              0  0              0  1         0.2604  __slab_alloc
5661      3.0560  44        0.0279  2         0.2670  1         0.2506  5         0.7278  0              0  1         0.1577  0              0  kmem_cache_alloc
5529      2.9847  5         0.0032  1         0.1335  4         1.0025  0              0  14        0.3057  3         0.4732  1         0.2604  enqueue_entity
4706      2.5404  64        0.0406  0              0  0              0  0              0  0              0  0              0  0              0  skb_queue_tail
4390      2.3699  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __udp4_lib_rcv
4043      2.1825  0              0  0              0  0              0  0              0  0              0  0              0  0              0  uhci_irq
3897      2.1037  0              0  0              0  0              0  0              0  0              0  0              0  0              0  bnx2_poll_work
3556      1.9196  0              0  296      39.5194  261      65.4135  258      37.5546  650      14.1952  263      41.4826  257      66.9271  mwait_idle
3449      1.8619  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __skb_clone
3348      1.8074  0              0  0              0  0              0  0              0  0              0  0              0  0              0  skb_clone
3243      1.7507  1653      1.0493  7         0.9346  3         0.7519  2         0.2911  63        1.3758  1         0.1577  1         0.2604  sched_clock_cpu
3068      1.6562  1        6.3e-04  0              0  0              0  0              0  0              0  0              0  0              0  __wake_up_sync
2923      1.5779  1        6.3e-04  0              0  0              0  0              0  0              0  0              0  0              0  check_preempt_wakeup
2588      1.3971  1        6.3e-04  4         0.5340  2         0.5013  1         0.1456  1         0.0218  2         0.3155  0              0  enqueue_task_fair
2399      1.2951  0              0  0              0  0              0  0              0  0              0  0              0  0              0  ip_route_input
1986      1.0721  5         0.0032  0              0  0              0  0              0  1         0.0218  0              0  0              0  __wake_up_common
1777      0.9593  132       0.0838  0              0  0              0  3         0.4367  8         0.1747  1         0.1577  1         0.2604  rb_insert_color
1754      0.9469  34        0.0216  0              0  0              0  0              0  0              0  0              0  0              0  __sk_mem_schedule
1550      0.8367  0              0  0              0  0              0  0              0  0              0  0              0  0              0  irq_entries_start
1527      0.8243  0              0  13        1.7356  2         0.5013  0              0  104       2.2712  6         0.9464  0              0  get_next_timer_interrupt
1398      0.7547  6         0.0038  0              0  0              0  1         0.1456  1         0.0218  0              0  0              0  select_task_rq_fair
1159      0.6257  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __alloc_skb


Exchanging CPU0/CPU1 to get the oprofile numbers sorted on the CPU used by the user application:

CPU: Core 2, speed 3000.31 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
Samples on CPU 1
Samples on CPU 2
Samples on CPU 3
Samples on CPU 4
Samples on CPU 5
Samples on CPU 6
Samples on CPU 7
samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        symbol name
6040     10.1815  8         0.0134  4         0.3208  6         1.3699  1         0.1580  3         1.0909  5         0.6039  0              0  schedule
6014     10.1377  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __skb_recv_datagram
4730      7.9733  36        0.0604  0              0  0              0  0              0  0              0  0              0  0              0  skb_release_data
4014      6.7663  2         0.0034  33        2.6464  1         0.2283  6         0.9479  0              0  1         0.1208  0              0  copy_to_user
3732      6.2910  3708      6.2167  4         0.3208  1         0.2283  3         0.4739  3         1.0909  2         0.2415  0              0  update_curr
3446      5.8089  0              0  0              0  0              0  0              0  0              0  0              0  0              0  lock_sock_nested
2430      4.0962  0              0  0              0  0              0  0              0  0              0  0              0  0              0  udp_recvmsg
2028      3.4186  73        0.1224  0              0  0              0  2         0.3160  0              0  1         0.1208  1         0.3861  __slab_free
1898      3.1994  43        0.0721  0              0  0              0  0              0  0              0  0              0  0              0  dst_release
1645      2.7730  0              0  0              0  0              0  0              0  0              0  0              0  0              0  memcpy_toiovec
1635      2.7561  0              0  0              0  0              0  1         0.1580  0              0  0              0  0              0  copy_from_user
1407      2.3718  2         0.0034  0              0  2         0.4566  6         0.9479  0              0  5         0.6039  1         0.3861  sysenter_past_esp
1389      2.3414  58        0.0972  3         0.2406  3         0.6849  4         0.6319  2         0.7273  3         0.3623  0              0  kmem_cache_free
1135      1.9133  1         0.0017  0              0  0              0  0              0  0              0  0              0  0              0  release_sock
1069      1.8020  0              0  0              0  0              0  0              0  0              0  0              0  0              0  prepare_to_wait_exclusive
1031      1.7379  3         0.0050  0              0  0              0  0              0  0              0  1         0.1208  0              0  put_prev_task_fair
1007      1.6975  0              0  0              0  0              0  0              0  0              0  0              0  0              0  sock_rfree
926       1.5609  0              0  0              0  0              0  0              0  0              0  0              0  0              0  sys_recvfrom
838       1.4126  0              0  0              0  0              0  0              0  0              0  0              0  0              0  skb_copy_datagram_iovec
697       1.1749  27        0.0453  1         0.0802  0              0  0              0  1         0.3636  0              0  0              0  add_partial
697       1.1749  0              0  0              0  0              0  0              0  0              0  1         0.1208  0              0  dequeue_task
604       1.0182  0              0  0              0  0              0  0              0  0              0  0              0  0              0  sock_recvmsg
582       0.9811  933       1.5642  17        1.3633  1         0.2283  2         0.3160  1         0.3636  6         0.7246  0              0  sched_clock_cpu
525       0.8850  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __sk_mem_reclaim
512       0.8631  0              0  0              0  1         0.2283  0              0  0              0  1         0.1208  1         0.3861  __switch_to
489       0.8243  0              0  0              0  0              0  0              0  0              0  0              0  0              0  fget_light
450       0.7586  1         0.0017  2         0.1604  1         0.2283  0              0  0              0  4         0.4831  0              0  rb_erase
409       0.6894  95        0.1593  0              0  0              0  0              0  0              0  0              0  0              0  wakeup_preempt_entity
360       0.6068  0              0  0              0  0              0  0              0  0              0  0              0  0              0  move_addr_to_user
348       0.5866  0              0  0              0  0              0  0              0  0              0  0              0  0              0  sys_socketcall
347       0.5849  1         0.0017  0              0  0              0  0              0  0              0  0              0  1         0.3861  set_next_entity


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-02 16:57           ` Eric Dumazet
@ 2009-02-02 18:22             ` Neil Horman
  2009-02-02 19:51               ` Wes Chow
  0 siblings, 1 reply; 70+ messages in thread
From: Neil Horman @ 2009-02-02 18:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kenny Chang, netdev

On Mon, Feb 02, 2009 at 05:57:24PM +0100, Eric Dumazet wrote:
> Neil Horman a écrit :
> > On Sun, Feb 01, 2009 at 01:40:39PM +0100, Eric Dumazet wrote:
> >> Eric Dumazet a écrit :
> >>> Kenny Chang a écrit :
> >>>> Ah, sorry, here's the test program attached.
> >>>>
> >>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
> >>>> 2.6.29.-rcX.
> >>>>
> >>>> Right now, we are trying to step through the kernel versions until we
> >>>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
> >>>> and post the result.
> >> I tried your program on my dev machines and 2.6.29 (each machine : two quad core cpus, 32bits kernel)
> >>
> >> With 8 clients, about 10% packet loss, 
> >>
> >> Might be a scheduling problem, not sure... 50.000 packets per second, x 8 cpus = 400.000
> >> wakeups per second... But at least UDP receive path seems OK.
> >>
> >> Thing is the receiver (softirq that queues the packet) seems to fight on socket lock with
> >> readers...
> >>
> >> I tried to setup IRQ affinities, but it doesnt work any more on bnx2 (unless using msi_disable=1)
> >>
> >> I tried playing with ethtool -C|c G|g params...
> >> And /proc/net/core/rmem_max (and setsockopt(RCVBUF) to set bigger receive buffers in your program)
> >>
> >> I can have 0% packet loss if booting with msi_disable and
> >>
> >> echo 1 >/proc/irq/16/smp_affinities
> >>
> >> (16 being interrupt of eth0 NIC)
> >>
> >> then, a second run gave me errors, about 2%, oh well...
> >>
> >>
> >> oprofile numbers without playing IRQ affinities:
> >>
> >> CPU: Core 2, speed 2999.89 MHz (estimated)
> >> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> >> samples  %        symbol name
> >> 327928   10.1427  schedule
> >> 259625    8.0301  mwait_idle
> >> 187337    5.7943  __skb_recv_datagram
> >> 109854    3.3977  lock_sock_nested
> >> 104713    3.2387  tick_nohz_stop_sched_tick
> >> 98831     3.0568  select_nohz_load_balancer
> >> 88163     2.7268  skb_release_data
> >> 78552     2.4296  update_curr
> >> 75241     2.3272  getnstimeofday
> >> 71400     2.2084  set_next_entity
> >> 67629     2.0917  get_next_timer_interrupt
> >> 67375     2.0839  sched_clock_tick
> >> 58112     1.7974  enqueue_entity
> >> 56462     1.7463  udp_recvmsg
> >> 55049     1.7026  copy_to_user
> >> 54277     1.6788  sched_clock_cpu
> >> 54031     1.6712  __copy_skb_header
> >> 51859     1.6040  __slab_free
> >> 51786     1.6017  prepare_to_wait_exclusive
> >> 51776     1.6014  sock_def_readable
> >> 50062     1.5484  try_to_wake_up
> >> 42182     1.3047  __switch_to
> >> 41631     1.2876  read_tsc
> >> 38337     1.1857  tick_nohz_restart_sched_tick
> >> 34358     1.0627  cpu_idle
> >> 34194     1.0576  native_sched_clock
> >> 33812     1.0458  pick_next_task_fair
> >> 33685     1.0419  resched_task
> >> 33340     1.0312  sys_recvfrom
> >> 33287     1.0296  dst_release
> >> 32439     1.0033  kmem_cache_free
> >> 32131     0.9938  hrtimer_start_range_ns
> >> 29807     0.9219  udp_queue_rcv_skb
> >> 27815     0.8603  task_rq_lock
> >> 26875     0.8312  __update_sched_clock
> >> 23912     0.7396  sock_queue_rcv_skb
> >> 21583     0.6676  __wake_up_sync
> >> 21001     0.6496  effective_load
> >> 20531     0.6350  hrtick_start_fair
> >>
> >>
> >>
> >>
> >> With IRQ affinities and msi_disable (no packet drops)
> >>
> >> CPU: Core 2, speed 3000.13 MHz (estimated)
> >> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> >> samples  %        symbol name
> >> 79788    10.3815  schedule
> >> 69422     9.0328  mwait_idle
> >> 44877     5.8391  __skb_recv_datagram
> >> 28629     3.7250  tick_nohz_stop_sched_tick
> >> 27252     3.5459  select_nohz_load_balancer
> >> 24320     3.1644  lock_sock_nested
> >> 20833     2.7107  getnstimeofday
> >> 20666     2.6889  skb_release_data
> >> 18612     2.4217  set_next_entity
> >> 17785     2.3141  get_next_timer_interrupt
> >> 17691     2.3018  udp_recvmsg
> >> 17271     2.2472  sched_clock_tick
> >> 16032     2.0860  copy_to_user
> >> 14785     1.9237  update_curr
> >> 12512     1.6280  prepare_to_wait_exclusive
> >> 12498     1.6262  __slab_free
> >> 11380     1.4807  read_tsc
> >> 11145     1.4501  sched_clock_cpu
> >> 10598     1.3789  __switch_to
> >> 9588      1.2475  pick_next_task_fair
> >> 9480      1.2335  cpu_idle
> >> 9218      1.1994  sys_recvfrom
> >> 9008      1.1721  tick_nohz_restart_sched_tick
> >> 8977      1.1680  dst_release
> >> 8930      1.1619  native_sched_clock
> >> 8392      1.0919  kmem_cache_free
> >> 8124      1.0570  hrtimer_start_range_ns
> >> 7274      0.9464  bnx2_interrupt
> >> 7175      0.9336  __copy_skb_header
> >> 7006      0.9116  try_to_wake_up
> >> 6949      0.9042  sock_def_readable
> >> 6787      0.8831  enqueue_entity
> >> 6772      0.8811  __update_sched_clock
> >> 6349      0.8261  finish_task_switch
> >> 6164      0.8020  copy_from_user
> >> 5096      0.6631  resched_task
> >> 5007      0.6515  sysenter_past_esp
> >>
> >>
> >> I will try to investigate a litle bit more in following days if time permits.
> >>
> > I'm not 100% versed on this, but IIRC, some hardware simply can't set irq
> > affinity when operating in msi interrupt mode.  If this is the case with this
> > particular bnx2 card, then I would expect some packet loss, simply due to the
> > constant cache misses.  It would be interesting to re-run your oprofile cases,
> > counting L2 cache hits/misses (if your cpu supports that class of counter) for
> > both bnx2 running in msi enabled mode and msi disabled mode.  It would also be
> > interesting to use a different card, that can set irq affinity, and compare loss
> > with irqbalance on, and irqbalance off with irq afninty set to all cpus.
> 
> booted with msi_disable=1, IRQ of eth0 handled by CPU0 only, so that
> oprofile results sorted on CPU0 numbers.
> 
> We can see scheduler has hard time to cope with this workload with more of two CPUS
> 
> OK up to 30.000 (* 8 sockets) packets per second. 
> 
> CPU0 is 100% handling softirq (ksoftirqd/0)
> 

This explains a lot.  If the application is scheduled to run on the same cpu that
has the irq for the NIC bound to it, you get a perf boost by not having to warm
up two caches (one for the app cpu and one for the irq & softirq work), but you
lose it and then some fighting for cpu time.  If both the app and the irq are
on the same cpu, and we spend so much time in softirq context, we will
eventually overflow higher up the network stack, as the application doesn't have
enough time to dequeue frames.

It may also speak to the need to make the bnx2 napi routine more efficient :)

Neil


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-02 18:22             ` Neil Horman
@ 2009-02-02 19:51               ` Wes Chow
  2009-02-02 20:29                 ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Wes Chow @ 2009-02-02 19:51 UTC (permalink / raw)
  To: netdev



(I'm Kenny's colleague, and I've been doing the kernel builds)

First I'd like to note that there were a lot of bnx2 NAPI changes between 
2.6.21 and 2.6.22. As a reminder, 2.6.21 shows tiny amounts of packet loss,
whereas loss in 2.6.22 is significant.

Second, some CPU affinity info: if I do as Eric did and pin all of the
apps onto a single CPU, I see no packet loss. Also, I do *not* see
ksoftirqd show up in top at all!

If I pin half the processes on one CPU and the other half on another CPU, one
ksoftirqd process shows up in top and completely pegs one CPU. My packet loss
in that case is significant (25%).

Now, the strange case: if I pin 3 processes to one CPU and 1 process to 
another, I get about 25% packet loss and ksoftirqd pins one CPU. However, one
of the apps takes significantly less CPU than the others, and all apps lose the
*exact same number of packets*. In all other situations where we see packet
loss, the actual number lost per application instance appears random.

We're about to plug an Intel ethernet card into this machine to collect more
rigorous testing data. Please note, though, that we have seen packet loss with
a tg3 chipset as well. For now, I'm assuming that this is purely a bnx2
problem.

If I understand correctly, when the nic signals a hardware interrupt, the 
kernel grabs it and defers the meaty work to the softirq handler -- how does it
decide which ksoftirqd gets the interrupts? Is this something determined by how
the driver implements the NAPI?


Wes



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-02 19:51               ` Wes Chow
@ 2009-02-02 20:29                 ` Eric Dumazet
  2009-02-02 21:09                   ` Wes Chow
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-02-02 20:29 UTC (permalink / raw)
  To: Wes Chow; +Cc: netdev

Wes Chow a écrit :
> 
> (I'm Kenny's colleague, and I've been doing the kernel builds)
> 
> First I'd like to note that there were a lot of bnx2 NAPI changes between 
> 2.6.21 and 2.6.22. As a reminder, 2.6.21 shows tiny amounts of packet loss,
> whereas loss in 2.6.22 is significant.
> 
> Second, some CPU affinity info: if I do like Eric and pin all of the
> apps onto a single CPU, I see no packet loss. Also, I do *not* see
> ksoftirqd show up on top at all!
> 
> If I pin half the processes on one CPU and the other half on another CPU, one 
> ksoftirqd processes shows up in top and completely pegs one CPU. My packet loss
> in that case is significant (25%).
> 
> Now, the strange case: if I pin 3 processes to one CPU and 1 process to 
> another, I get about 25% packet loss and ksoftirqd pins one CPU. However, one
> of the apps takes significantly less CPU than the others, and all apps lose the
> *exact same number of packets*. In all other situations where we see packet
> loss, the actual number lost per application instance appears random.

You see the same number of packets lost because they are lost at the NIC level
(check ifconfig eth0 for dropped packets).

If softirq is too busy to process packets, we are not able to get them
from the hardware in time.
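
(As a quick check, here is a sketch that reads the interface's rx_dropped
counter from sysfs -- assuming the standard /sys/class/net statistics files
are present on these kernels -- so a before/after diff shows the NIC-level
drops:)

#include <stdio.h>

int main(void)
{
    unsigned long long dropped;
    FILE *f = fopen("/sys/class/net/eth0/statistics/rx_dropped", "r");

    if (f == NULL || fscanf(f, "%llu", &dropped) != 1) {
        perror("rx_dropped");
        return 1;
    }
    fclose(f);
    printf("eth0 rx_dropped: %llu\n", dropped);
    return 0;
}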

> 
> We're about to plug in an Intel ethernet card into this machine to collect more 
> rigorous testing data. Please note, though, that we have seen packet loss with
> a tg3 chipset as well. For now, though, I'm assuming that this is purely a bnx2
> problem.
> 
> If I understand correctly, when the nic signals a hardware interrupt, the 
> kernel grabs it and defers the meaty work to the softirq handler -- how does it
> decide which ksoftirqd gets the interrupts? Is this something determined by how
> the driver implements the NAPI?


Normally, softirq runs on the same cpu (the one handling the hard irq)


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-02 20:29                 ` Eric Dumazet
@ 2009-02-02 21:09                   ` Wes Chow
  2009-02-02 21:31                     ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Wes Chow @ 2009-02-02 21:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev



Eric Dumazet wrote:
> Wes Chow a écrit :
>> (I'm Kenny's colleague, and I've been doing the kernel builds)
>>
>> First I'd like to note that there were a lot of bnx2 NAPI changes between 
>> 2.6.21 and 2.6.22. As a reminder, 2.6.21 shows tiny amounts of packet loss,
>> whereas loss in 2.6.22 is significant.
>>
>> Second, some CPU affinity info: if I do like Eric and pin all of the
>> apps onto a single CPU, I see no packet loss. Also, I do *not* see
>> ksoftirqd show up on top at all!
>>
>> If I pin half the processes on one CPU and the other half on another CPU, one 
>> ksoftirqd processes shows up in top and completely pegs one CPU. My packet loss
>> in that case is significant (25%).
>>
>> Now, the strange case: if I pin 3 processes to one CPU and 1 process to 
>> another, I get about 25% packet loss and ksoftirqd pins one CPU. However, one
>> of the apps takes significantly less CPU than the others, and all apps lose the
>> *exact same number of packets*. In all other situations where we see packet
>> loss, the actual number lost per application instance appears random.
> 
> You see same number of packet lost because they are lost at NIC level

Understood.

I have a new observation: if I pin processes to just CPUs 0 and 1, I see 
no packet loss. Pinning to 0 and 2, I do see packet loss. Pinning 2 and 
3, no packet loss. 4 & 5 - no packet loss, 6 & 7 - no packet loss. Any 
other combination appears to produce loss (though I have not tried all 
28 combinations, this seems to be the case).

At first I thought maybe it had to do with processes pinned to the same
CPU, but different cores. The machine is a dual quad core, which means
that CPUs 0-3 should be one physical CPU, correct? Pinning to 0/2 and 0/3
produces packet loss.

I've also noticed that it does not matter which of the working pairs I
pin to. For example, pinning 5 processes in any combination on 0/1
produces no packet loss, and pinning all 5 to just CPU 0 also produces no
packet loss.

The failures are also sudden. In all of the working cases mentioned 
above, I don't see ksoftirqd on top at all. But when I run 6 processes 
on a single CPU, ksoftirqd shoots up to 100% and I lose a huge number of 
packets.

> 
> Normaly, softirq runs on same cpu (the one handling hard irq)

What determines which CPU the hard irq occurs on?


Wes


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-02 21:09                   ` Wes Chow
@ 2009-02-02 21:31                     ` Eric Dumazet
  2009-02-03 17:34                       ` Kenny Chang
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-02-02 21:31 UTC (permalink / raw)
  To: Wes Chow; +Cc: netdev

Wes Chow a écrit :
> 
> 
> Eric Dumazet wrote:
>> Wes Chow a écrit :
>>> (I'm Kenny's colleague, and I've been doing the kernel builds)
>>>
>>> First I'd like to note that there were a lot of bnx2 NAPI changes
>>> between 2.6.21 and 2.6.22. As a reminder, 2.6.21 shows tiny amounts
>>> of packet loss,
>>> whereas loss in 2.6.22 is significant.
>>>
>>> Second, some CPU affinity info: if I do like Eric and pin all of the
>>> apps onto a single CPU, I see no packet loss. Also, I do *not* see
>>> ksoftirqd show up on top at all!
>>>
>>> If I pin half the processes on one CPU and the other half on another
>>> CPU, one ksoftirqd processes shows up in top and completely pegs one
>>> CPU. My packet loss
>>> in that case is significant (25%).
>>>
>>> Now, the strange case: if I pin 3 processes to one CPU and 1 process
>>> to another, I get about 25% packet loss and ksoftirqd pins one CPU.
>>> However, one
>>> of the apps takes significantly less CPU than the others, and all
>>> apps lose the
>>> *exact same number of packets*. In all other situations where we see
>>> packet
>>> loss, the actual number lost per application instance appears random.
>>
>> You see same number of packet lost because they are lost at NIC level
> 
> Understood.
> 
> I have a new observation: if I pin processes to just CPUs 0 and 1, I see
> no packet loss. Pinning to 0 and 2, I do see packet loss. Pinning 2 and
> 3, no packet loss. 4 & 5 - no packet loss, 6 & 7 - no packet loss. Any
> other combination appears to produce loss (though I have not tried all
> 28 combinations, this seems to be the case).
> 
> At first I thought maybe it had to do with processes pinned to the same
> CPU, but different cores. The machine is a dual quad core, which means
> that CPUs 0-3 should be a physical CPU, correct? Pinning to 0/2 and 0/3
> produce packet loss.

A quad core is really a 2 x 2 core.

The L2 cache is split into two blocks, one block used by CPU0/1, the other by CPU2/3.

You are at the limit of the machine with this workload, so as soon as your
CPUs have to transfer 64-byte lines between those two L2 blocks, you lose.
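
(To see which logical CPUs share an L2 on a given box, here is a small sketch
assuming the cpu cache sysfs entries exported by these kernels; on Core 2,
index2 is typically the unified L2:)

#include <stdio.h>

int main(void)
{
    char map[64];
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map", "r");

    if (f == NULL || fgets(map, sizeof(map), f) == NULL) {
        perror("shared_cpu_map");
        return 1;
    }
    fclose(f);
    /* Hex bitmask of CPUs, e.g. "03" means CPU0 and CPU1 share cpu0's L2. */
    printf("CPUs sharing cpu0's L2: %s", map);
    return 0;
}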


> 
> I've also noticed that it does not matter which of the working pairs I
> pin to. For example, pinning 5 processes in any combination on either
> 0/1 produce no packet loss, pinning all 5 to just CPU 0 also produces no
> packet loss.
> 
> The failures are also sudden. In all of the working cases mentioned
> above, I don't see ksoftirqd on top at all. But when I run 6 processes
> on a single CPU, ksoftirqd shoots up to 100% and I lose a huge number of
> packets.
> 
>>
>> Normaly, softirq runs on same cpu (the one handling hard irq)
> 
> What determines which CPU the hard irq occurs on?
> 

Check /proc/irq/{irqnumber}/smp_affinity

If you want IRQ 16 to be served only by CPU0:

echo 1 >/proc/irq/16/smp_affinity
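
(The value is a hex bitmask of CPUs: bit N selects CPU N, so CPU0 is 1, CPU1 is
2, and CPU0+CPU1 is 3.  Here is a small sketch that builds the mask for one CPU
and writes it; the IRQ number 16 is just an example:)

#include <stdio.h>

int main(void)
{
    int irq = 16, cpu = 0;
    unsigned int mask = 1u << cpu;  /* CPU0 -> 0x1, CPU1 -> 0x2, ... */
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (f == NULL) {
        perror(path);
        return 1;
    }
    fprintf(f, "%x\n", mask);
    fclose(f);
    return 0;
}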


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-02 16:48         ` Kenny Chang
@ 2009-02-03 11:55           ` Neil Horman
  2009-02-03 15:20             ` Kenny Chang
  0 siblings, 1 reply; 70+ messages in thread
From: Neil Horman @ 2009-02-03 11:55 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

On Mon, Feb 02, 2009 at 11:48:25AM -0500, Kenny Chang wrote:
> Neil Horman wrote:
>> On Fri, Jan 30, 2009 at 11:41:23PM +0100, Eric Dumazet wrote:
>>   
>>> Kenny Chang a écrit :
>>>     
>>>> Ah, sorry, here's the test program attached.
>>>>
>>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
>>>> 2.6.29.-rcX.
>>>>
>>>> Right now, we are trying to step through the kernel versions until we
>>>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
>>>> and post the result.
>>>>       
>>> 2.6.29-rc contains UDP receive improvements (lockless)
>>>
>>> Problem is multicast handling was not yet updated, but could be :)
>>>
>>>
>>> I was asking you "cat /proc/interrupts" because I believe you might
>>> have a problem NIC interrupts being handled by one CPU only (when having problems)
>>>
>>>     
>> That would be expected (if irqbalance is running), and desireable, since
>> spreading high volume interrupts like NICS accross multiple cores (or more
>> specifically multiple L2 caches), is going increase your cache line miss rate
>> significantly and decrease rx throughput.
>>
>> Although you do have a point here, if the system isn't running irqbalance, and
>> the NICS irq affinity is spread accross multiple L2 caches, that would be a
>> point of improvement performance-wise.  
>>
>> Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo
>> and your stats that I asked about earlier, that would be a big help.
>>
>> Regards
>> Neil
>>
>>   
> This is for a working setup.
>

Are these quad core systems?  Or dual core w/ hyperthreading?  I ask because in
your working setup you have 1/2 the number of cpus, and I was not sure if you
removed an entire package or if you just disabled hyperthreading.


Neil


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-03 11:55           ` Neil Horman
@ 2009-02-03 15:20             ` Kenny Chang
  2009-02-04  1:15               ` Neil Horman
  0 siblings, 1 reply; 70+ messages in thread
From: Kenny Chang @ 2009-02-03 15:20 UTC (permalink / raw)
  To: netdev

Neil Horman wrote:
> On Mon, Feb 02, 2009 at 11:48:25AM -0500, Kenny Chang wrote:
>   
>> Neil Horman wrote:
>>     
>>> On Fri, Jan 30, 2009 at 11:41:23PM +0100, Eric Dumazet wrote:
>>>   
>>>       
>>>> Kenny Chang a écrit :
>>>>     
>>>>         
>>>>> Ah, sorry, here's the test program attached.
>>>>>
>>>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
>>>>> 2.6.29.-rcX.
>>>>>
>>>>> Right now, we are trying to step through the kernel versions until we
>>>>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
>>>>> and post the result.
>>>>>       
>>>>>           
>>>> 2.6.29-rc contains UDP receive improvements (lockless)
>>>>
>>>> Problem is multicast handling was not yet updated, but could be :)
>>>>
>>>>
>>>> I was asking you "cat /proc/interrupts" because I believe you might
>>>> have a problem NIC interrupts being handled by one CPU only (when having problems)
>>>>
>>>>     
>>>>         
>>> That would be expected (if irqbalance is running), and desireable, since
>>> spreading high volume interrupts like NICS accross multiple cores (or more
>>> specifically multiple L2 caches), is going increase your cache line miss rate
>>> significantly and decrease rx throughput.
>>>
>>> Although you do have a point here, if the system isn't running irqbalance, and
>>> the NICS irq affinity is spread accross multiple L2 caches, that would be a
>>> point of improvement performance-wise.  
>>>
>>> Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo
>>> and your stats that I asked about earlier, that would be a big help.
>>>
>>> Regards
>>> Neil
>>>
>>>   
>>>       
>> This is for a working setup.
>>
>>     
>
> Are these quad core systems?  Or dual core w/ hyperthreading?  I ask because in
> your working setup you have 1/2 the number of cpus' and was not sure if you
> removed an entire package of if you just disabled hyperthreading.
>
>
> Neil
>
>   
Yeah, these are quad core systems.  The 8 cpu system is a dual-processor 
quad-core.  The other is my desktop, single cpu quad core.

Kenny


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-02 21:31                     ` Eric Dumazet
@ 2009-02-03 17:34                       ` Kenny Chang
  2009-02-04  1:21                         ` Neil Horman
  0 siblings, 1 reply; 70+ messages in thread
From: Kenny Chang @ 2009-02-03 17:34 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 4951 bytes --]

Eric Dumazet wrote:
> Wes Chow a écrit :
>   
>> Eric Dumazet wrote:
>>     
>>> Wes Chow a écrit :
>>>       
>>>> (I'm Kenny's colleague, and I've been doing the kernel builds)
>>>>
>>>> First I'd like to note that there were a lot of bnx2 NAPI changes
>>>> between 2.6.21 and 2.6.22. As a reminder, 2.6.21 shows tiny amounts
>>>> of packet loss,
>>>> whereas loss in 2.6.22 is significant.
>>>>
>>>> Second, some CPU affinity info: if I do like Eric and pin all of the
>>>> apps onto a single CPU, I see no packet loss. Also, I do *not* see
>>>> ksoftirqd show up on top at all!
>>>>
>>>> If I pin half the processes on one CPU and the other half on another
>>>> CPU, one ksoftirqd processes shows up in top and completely pegs one
>>>> CPU. My packet loss
>>>> in that case is significant (25%).
>>>>
>>>> Now, the strange case: if I pin 3 processes to one CPU and 1 process
>>>> to another, I get about 25% packet loss and ksoftirqd pins one CPU.
>>>> However, one
>>>> of the apps takes significantly less CPU than the others, and all
>>>> apps lose the
>>>> *exact same number of packets*. In all other situations where we see
>>>> packet
>>>> loss, the actual number lost per application instance appears random.
>>>>         
>>> You see same number of packet lost because they are lost at NIC level
>>>       
>> Understood.
>>
>> I have a new observation: if I pin processes to just CPUs 0 and 1, I see
>> no packet loss. Pinning to 0 and 2, I do see packet loss. Pinning 2 and
>> 3, no packet loss. 4 & 5 - no packet loss, 6 & 7 - no packet loss. Any
>> other combination appears to produce loss (though I have not tried all
>> 28 combinations, this seems to be the case).
>>
>> At first I thought maybe it had to do with processes pinned to the same
>> CPU, but different cores. The machine is a dual quad core, which means
>> that CPUs 0-3 should be a physical CPU, correct? Pinning to 0/2 and 0/3
>> produce packet loss.
>>     
>
> a quad core is really a 2 x 2 core
>
> L2 cache is splited on two blocks, one block used by CPU0/1, other by CPU2/3 
>
> You are at the limit of the machine with such workload, so as soon as your
> CPUs have to transfert 64 bytes lines between those two L2 blocks, you loose.
>
>
>   
>> I've also noticed that it does not matter which of the working pairs I
>> pin to. For example, pinning 5 processes in any combination on either
>> 0/1 produce no packet loss, pinning all 5 to just CPU 0 also produces no
>> packet loss.
>>
>> The failures are also sudden. In all of the working cases mentioned
>> above, I don't see ksoftirqd on top at all. But when I run 6 processes
>> on a single CPU, ksoftirqd shoots up to 100% and I lose a huge number of
>> packets.
>>
>>     
>>> Normaly, softirq runs on same cpu (the one handling hard irq)
>>>       
>> What determines which CPU the hard irq occurs on?
>>
>>     
>
> Check /proc/irq/{irqnumber}/smp_affinity
>
> If you want IRQ16 only served by CPU0 :
>
> echo 1 >/proc/irq/16/smp_affinity
>
>   
Hi everyone,

First, thanks for all the effort so far. I think we've learned more about the
problem in the last couple of days than we had in the previous month.

Just to summarize where we are:

* pinning processes to specific cores/CPUs alleviates the problem
* issues exist from 2.6.22 up to 2.6.29-rc3
* the issue does not appear to be isolated to 64-bit; 32-bit has problems too
* I'm attaching an updated test program with the PR_SET_TIMERSLACK call added
* on troubled machines, we are seeing a high number of context switches and interrupts
* we've ordered an Intel card to try in our machine to see if we can circumvent the issue with a different driver

Kernel Version         Has Problem?     Notes
----------             ----------       ----------
2.6.15.x                N
2.6.16.x                -
2.6.17.x                -               Doesn't build on Hardy
2.6.18.x                -               Doesn't boot (kernel panic)
2.6.19.7                N               ksoftirqd is up there, but not pegging a CPU.
                                        Takes roughly same amount of CPU as the other
                                        processes, all of which are from 20-40%
2.6.20.21               N
2.6.21.7                N               sort of lopsided load, but no load from
                                        ksoftirqd -- strange
2.6.22.19               Y               First broken kernel
2.6.23.x                -
2.6.24-19               Y               (from Hardy)
2.6.25.x                -
2.6.26.x                -
2.6.27.x                Y               (from Intrepid)
2.6.28.1                Y
2.6.29-rc               Y


Correct me if I'm wrong, but from what we've seen, it looks like it's
pointing to some inefficiency in the softirq handling.  The question is
whether it's something in the driver or the kernel.  If we can isolate
that, maybe we can take some action to have it fixed.

Kenny

[-- Attachment #2: mcasttest.c --]
[-- Type: text/x-csrc, Size: 3392 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <arpa/inet.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <unistd.h>
#include <sys/prctl.h>   /* for prctl(PR_SET_TIMERSLACK, ...) below */

#ifndef PR_SET_TIMERSLACK
#define PR_SET_TIMERSLACK 29
#endif

const char *g_mcastaddr = "239.100.0.99";
int g_port = 10100;

void error(const char *s)
{
    fprintf(stderr, "%s\n", s);
    exit(1);
}

void check(int v)
{
    /* Capture errno before anything else can clobber it, and report it. */
    int myerr = errno;
    if(!v)
    {
        fprintf(stderr, "bad return code: %s\n", strerror(myerr));
        exit(1);
    }
}

int main(int argc, char **argv)
{
    if(argc != 2)
        error("usage: mcasttest (server|client)");
    if(strcmp(argv[1], "client") == 0)
    {
        /*
         * Client program: subscribes to a multicast group, receives messages
         * and prints a count of messages received once it's done.
         */
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        check(s > 0);
        int val = 1;
        check(setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val)) == 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(g_port);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        check(bind(s, (struct sockaddr *) &addr, sizeof(addr)) == 0);

        struct ip_mreqn mreq;
        memset(&mreq, 0, sizeof(mreq));
        check(inet_pton(AF_INET, g_mcastaddr, &mreq.imr_multiaddr));
        mreq.imr_address.s_addr = htonl(INADDR_ANY);
        mreq.imr_ifindex = 0;
        check(setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) == 0);

        int bufSz;
        socklen_t len = sizeof(bufSz);
        getsockopt(s, SOL_SOCKET, SO_RCVBUF, (char*)(&bufSz), &len);
        printf("bufsz: %d\n", bufSz);

        int npackets = 0;
        char buf[1000];
        memset(buf, 0, sizeof(buf));
        while(1)
        {
            struct sockaddr_in from;
            socklen_t fromlen = sizeof(from);
            check(recvfrom(s, buf, 1000, 0, (struct sockaddr*)&from, &fromlen) == 100);
            ++npackets;
            if(buf[0] == 1) // exit message
                break;
        }
        printf("received %d packets\n", npackets);
    }
    else if(strcmp(argv[1], "server") == 0)
    {
        /*
         * Setup a timer resolution of 1000 ns : 1 us
         */
        prctl(PR_SET_TIMERSLACK, 1000); 
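        /*
         * A smaller timer slack keeps the usleep(20) in the send loop from
         * being rounded up by the kernel, so the send rate stays close to
         * 50,000 packets per second.
         */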

        /*
         * Server program: sends 50,000 packets per second to a multicast address,
         * for 10 seconds.
         */
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        int val = 1;
        int i = 1;
        check(s > 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(g_port);
        check(inet_pton(AF_INET, g_mcastaddr, &addr.sin_addr.s_addr));
        check(connect(s, (struct sockaddr *) &addr, sizeof(addr)) == 0);

        int npackets = 500000;
        char buf[100];
        memset(buf, 0, sizeof(buf));
        for(i = 1; i < npackets; ++i)
        {
            check(send(s, buf, sizeof(buf), 0) > 0);
            usleep(20); // 50,000 messages per second
        }

        buf[0] = 1;
        for(i = 1; i < 5; ++i)
        {
            check(send(s, buf, sizeof(buf), 0) > 0);
            sleep(1);
        }
    }
    else
        error("unknown mode");
    return 0;
}

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-03 15:20             ` Kenny Chang
@ 2009-02-04  1:15               ` Neil Horman
  2009-02-04 16:07                 ` Kenny Chang
  0 siblings, 1 reply; 70+ messages in thread
From: Neil Horman @ 2009-02-04  1:15 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

On Tue, Feb 03, 2009 at 10:20:13AM -0500, Kenny Chang wrote:
> Neil Horman wrote:
>> On Mon, Feb 02, 2009 at 11:48:25AM -0500, Kenny Chang wrote:
>>   
>>> Neil Horman wrote:
>>>     
>>>> On Fri, Jan 30, 2009 at 11:41:23PM +0100, Eric Dumazet wrote:
>>>>         
>>>>> Kenny Chang a écrit :
>>>>>             
>>>>>> Ah, sorry, here's the test program attached.
>>>>>>
>>>>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
>>>>>> 2.6.29.-rcX.
>>>>>>
>>>>>> Right now, we are trying to step through the kernel versions until we
>>>>>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
>>>>>> and post the result.
>>>>>>                 
>>>>> 2.6.29-rc contains UDP receive improvements (lockless)
>>>>>
>>>>> Problem is multicast handling was not yet updated, but could be :)
>>>>>
>>>>>
>>>>> I was asking you "cat /proc/interrupts" because I believe you might
>>>>> have a problem NIC interrupts being handled by one CPU only (when having problems)
>>>>>
>>>>>             
>>>> That would be expected (if irqbalance is running), and desireable, since
>>>> spreading high volume interrupts like NICS accross multiple cores (or more
>>>> specifically multiple L2 caches), is going increase your cache line miss rate
>>>> significantly and decrease rx throughput.
>>>>
>>>> Although you do have a point here, if the system isn't running irqbalance, and
>>>> the NICS irq affinity is spread accross multiple L2 caches, that would be a
>>>> point of improvement performance-wise.  
>>>>
>>>> Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo
>>>> and your stats that I asked about earlier, that would be a big help.
>>>>
>>>> Regards
>>>> Neil
>>>>
>>>>         
>>> This is for a working setup.
>>>
>>>     
>>
>> Are these quad core systems?  Or dual core w/ hyperthreading?  I ask because in
>> your working setup you have 1/2 the number of cpus' and was not sure if you
>> removed an entire package of if you just disabled hyperthreading.
>>
>>
>> Neil
>>
>>   
> Yeah, these are quad core systems.  The 8 cpu system is a dual-processor  
> quad-core.  The other is my desktop, single cpu quad core.
>
Ok, so they're separate systems then.  Did you actually experience drops on the
8-core system since the last reboot?  I ask because even when it's distributed
across all 8 cores, you only have about 500 total interrupts from the NIC, and
if you did get drops, something more than just affinity is wrong.

Regardless, spreading interrupts across cores is definitely a problem.  As Eric
says, quad core chips are actually 2x2 cores, so you'll want to either just run
irqbalance to assign an appropriate affinity to the NIC, or manually look at each
core's physical id and sibling id, to assign affinity to a core or cores that
share an L2 cache.  If you need to, as you've found, you may have to disable MSI
interrupt mode on your bnx2 driver.  That kinda stinks, but bnx2 IIRC isn't
multiqueue, so it's not like MSI provides you any real performance gain.
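
(As an aside, IRQ affinity is set by writing a hex CPU mask to
/proc/irq/<irq>/smp_affinity, e.g. "echo 3 > /proc/irq/16/smp_affinity" to allow
CPU0+CPU1; below is a minimal C sketch of the same write, for scripting it per
NIC.  The file name, arguments and example mask are illustrative only -- pick
the mask from the physical id / core id pairs in /proc/cpuinfo, and run as root.)

/* set_irq_affinity.c -- write a hex CPU mask to /proc/irq/<irq>/smp_affinity */
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <irq> <hex-cpu-mask>  (e.g. 16 3 = CPU0+CPU1)\n",
                argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity", argv[1]);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%s\n", argv[2]);               /* hex bitmask of allowed CPUs */
    if (fclose(f) != 0) {                      /* the write is flushed here */
        perror(path);
        return 1;
    }
    return 0;
}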

Neil

> Kenny
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-03 17:34                       ` Kenny Chang
@ 2009-02-04  1:21                         ` Neil Horman
  2009-02-26 17:15                           ` Kenny Chang
  0 siblings, 1 reply; 70+ messages in thread
From: Neil Horman @ 2009-02-04  1:21 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

On Tue, Feb 03, 2009 at 12:34:54PM -0500, Kenny Chang wrote:
> Eric Dumazet wrote:
>> Wes Chow a écrit :
>>   
>>> Eric Dumazet wrote:
>>>     
>>>> Wes Chow a écrit :
>>>>       
>>>>> (I'm Kenny's colleague, and I've been doing the kernel builds)
>>>>>
>>>>> First I'd like to note that there were a lot of bnx2 NAPI changes
>>>>> between 2.6.21 and 2.6.22. As a reminder, 2.6.21 shows tiny amounts
>>>>> of packet loss,
>>>>> whereas loss in 2.6.22 is significant.
>>>>>
>>>>> Second, some CPU affinity info: if I do like Eric and pin all of the
>>>>> apps onto a single CPU, I see no packet loss. Also, I do *not* see
>>>>> ksoftirqd show up on top at all!
>>>>>
>>>>> If I pin half the processes on one CPU and the other half on another
>>>>> CPU, one ksoftirqd processes shows up in top and completely pegs one
>>>>> CPU. My packet loss
>>>>> in that case is significant (25%).
>>>>>
>>>>> Now, the strange case: if I pin 3 processes to one CPU and 1 process
>>>>> to another, I get about 25% packet loss and ksoftirqd pins one CPU.
>>>>> However, one
>>>>> of the apps takes significantly less CPU than the others, and all
>>>>> apps lose the
>>>>> *exact same number of packets*. In all other situations where we see
>>>>> packet
>>>>> loss, the actual number lost per application instance appears random.
>>>>>         
>>>> You see same number of packet lost because they are lost at NIC level
>>>>       
>>> Understood.
>>>
>>> I have a new observation: if I pin processes to just CPUs 0 and 1, I see
>>> no packet loss. Pinning to 0 and 2, I do see packet loss. Pinning 2 and
>>> 3, no packet loss. 4 & 5 - no packet loss, 6 & 7 - no packet loss. Any
>>> other combination appears to produce loss (though I have not tried all
>>> 28 combinations, this seems to be the case).
>>>
>>> At first I thought maybe it had to do with processes pinned to the same
>>> CPU, but different cores. The machine is a dual quad core, which means
>>> that CPUs 0-3 should be a physical CPU, correct? Pinning to 0/2 and 0/3
>>> produce packet loss.
>>>     
>>
>> a quad core is really a 2 x 2 core
>>
>> L2 cache is splited on two blocks, one block used by CPU0/1, other by 
>> CPU2/3 
>>
>> You are at the limit of the machine with such workload, so as soon as your
>> CPUs have to transfert 64 bytes lines between those two L2 blocks, you loose.
>>
>>
>>   
>>> I've also noticed that it does not matter which of the working pairs I
>>> pin to. For example, pinning 5 processes in any combination on either
>>> 0/1 produce no packet loss, pinning all 5 to just CPU 0 also produces no
>>> packet loss.
>>>
>>> The failures are also sudden. In all of the working cases mentioned
>>> above, I don't see ksoftirqd on top at all. But when I run 6 processes
>>> on a single CPU, ksoftirqd shoots up to 100% and I lose a huge number of
>>> packets.
>>>
>>>     
>>>> Normaly, softirq runs on same cpu (the one handling hard irq)
>>>>       
>>> What determines which CPU the hard irq occurs on?
>>>
>>>     
>>
>> Check /proc/irq/{irqnumber}/smp_affinity
>>
>> If you want IRQ16 only served by CPU0 :
>>
>> echo 1 >/proc/irq/16/smp_affinity
>>
>>   
> Hi everyone,
>
> First, thanks for all the effort so far, I think we've learned so much  
> more about the problem in the last couple of days than we had previously  
> in a month.
>
> Just to summarize where we are:
>
> * pinning processes to specific cores/CPUs alleviate the problem
> * issues exist from 2.6.22 up to 2.6.29-rc3
> * issue does not appear to be isolated to 64-bit, 32-bits have problems  
> too.
> * I'm attaching an updated test program with the PR_SET_TIMERSTACK call  
> added.
> * on troubled machines, we are seeing high number of context switches  
> and interrupts.
> * we've ordered an Intel card to try in our machine to see if we can  
> circumvent the issue with a different driver.
>
> Kernel Version         Has Problem?     Notes
> ----------             ----------       ----------
> 2.6.15.x                N
> 2.6.16.x                -
> 2.6.17.x                -               Doesn't build on Hardy
> 2.6.18.x                -               Doesn't boot (kernel panic)
> 2.6.19.7                N               ksoftirqd is up there, but not
>                                         pegging a CPU.  Takes roughly same
>                                         amount of CPU as the other processes,
>                                         all of which are from 20-40%
> 2.6.20.21               N
> 2.6.21.7                N               sort of lopsided load, but no load
>                                         from ksoftirqd -- strange
> 2.6.22.19               Y               First broken kernel
> 2.6.23.x                -
> 2.6.24-19               Y               (from Hardy)
> 2.6.25.x                -
> 2.6.26.x                -
> 2.6.27.x                Y               (from Intrepid)
> 2.6.28.1                Y
> 2.6.29-rc               Y
>
>
> Correct me if I'm wrong, from what we've seen, it looks like its  
> pointing to some inefficiency in the softirq handling.  The question is  
> whether it's something in the driver or the kernel.  If we can isolate  
> that, maybe we can take some action to have it fixed.
>
I don't think it's softirq inefficiencies (oprofile would have shown that).  I
know I keep harping on this, but I still think irq affinity is your problem.
I'd be interested in knowing what your /proc/interrupts file looked like on
each of the above kernels.  Perhaps it's not that the bnx2 card you have can't
handle the setting of MSI interrupt affinities, but rather that something
changed to break irq affinity on this card.

Neil

>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-04  1:15               ` Neil Horman
@ 2009-02-04 16:07                 ` Kenny Chang
  2009-02-04 16:46                   ` Wesley Chow
  2009-02-05 13:29                   ` Neil Horman
  0 siblings, 2 replies; 70+ messages in thread
From: Kenny Chang @ 2009-02-04 16:07 UTC (permalink / raw)
  To: netdev

Neil Horman wrote:
> On Tue, Feb 03, 2009 at 10:20:13AM -0500, Kenny Chang wrote:
>   
>> Neil Horman wrote:
>>     
>>> On Mon, Feb 02, 2009 at 11:48:25AM -0500, Kenny Chang wrote:
>>>   
>>>       
>>>> Neil Horman wrote:
>>>>     
>>>>         
>>>>> On Fri, Jan 30, 2009 at 11:41:23PM +0100, Eric Dumazet wrote:
>>>>>         
>>>>>           
>>>>>> Kenny Chang a écrit :
>>>>>>             
>>>>>>             
>>>>>>> Ah, sorry, here's the test program attached.
>>>>>>>
>>>>>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
>>>>>>> 2.6.29.-rcX.
>>>>>>>
>>>>>>> Right now, we are trying to step through the kernel versions until we
>>>>>>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
>>>>>>> and post the result.
>>>>>>>                 
>>>>>>>               
>>>>>> 2.6.29-rc contains UDP receive improvements (lockless)
>>>>>>
>>>>>> Problem is multicast handling was not yet updated, but could be :)
>>>>>>
>>>>>>
>>>>>> I was asking you "cat /proc/interrupts" because I believe you might
>>>>>> have a problem NIC interrupts being handled by one CPU only (when having problems)
>>>>>>
>>>>>>             
>>>>>>             
>>>>> That would be expected (if irqbalance is running), and desireable, since
>>>>> spreading high volume interrupts like NICS accross multiple cores (or more
>>>>> specifically multiple L2 caches), is going increase your cache line miss rate
>>>>> significantly and decrease rx throughput.
>>>>>
>>>>> Although you do have a point here, if the system isn't running irqbalance, and
>>>>> the NICS irq affinity is spread accross multiple L2 caches, that would be a
>>>>> point of improvement performance-wise.  
>>>>>
>>>>> Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo
>>>>> and your stats that I asked about earlier, that would be a big help.
>>>>>
>>>>> Regards
>>>>> Neil
>>>>>
>>>>>         
>>>>>           
>>>> This is for a working setup.
>>>>
>>>>     
>>>>         
>>> Are these quad core systems?  Or dual core w/ hyperthreading?  I ask because in
>>> your working setup you have 1/2 the number of cpus' and was not sure if you
>>> removed an entire package of if you just disabled hyperthreading.
>>>
>>>
>>> Neil
>>>
>>>   
>>>       
>> Yeah, these are quad core systems.  The 8 cpu system is a dual-processor  
>> quad-core.  The other is my desktop, single cpu quad core.
>>
>>     
> Ok, so their separate systms then.  Did you actually experience drops on the
> 8-core system since the last reboot?  I ask because even when its distributed
> across all 8 cores, you only have about 500 total interrupts from the NIC, and
> if you did get drops, something more than just affinity is wrong.
>
> Regardless, spreading interrupts across cores is definately a problem.  As eric
> says, quad core chips are actually 2x2 cores, so you'll want to either just run
> irqbalance to assign an apropriate affinity to the NIC, or manually look at each
> cores physical id and sibling id, to assign affininty to a core or cores that
> share an L2 cache.  If you need to, as you've found, you may need to disable msi
> interrupt mode on your bnx2 driver.  That kinda stinks, but bnx2 IIRC isn't
> multiqueue, so its not like msi provides you any real performance gain.
>
> Neil
>
>   
Hi Neil,

Yeah, we've been rebooting this system left and right to switch kernels.
The results are fairly consistent.  We were able to set the irq
affinities, and as Wes had mentioned, what we see is that if we pin the
softirq to 1 core and pin the app to its sibling, we see really good
performance, but as we load up other cores, the machine reaches a
breaking point where all hell breaks loose and we drop a bunch.  (We
hadn't turned off MSI, btw.)

While we were able to tune and adjust performance like that, in the end
it doesn't really explain the difference between earlier and recent
kernels, and it doesn't quite explain the difference between machines
either.

You mentioned it would be good to see the interrupts for each kernel; in
light of the above information, would it still be useful for me to
provide that?

Kenny
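
(For anyone reproducing the pinning experiments described above: pinning a
process to a core can be done with taskset or, programmatically, with
sched_setaffinity().  A minimal sketch follows; the file name, the default CPU
number and the assumption that CPU0/CPU1 share an L2 cache are illustrative,
not part of the thread's test program.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* pin_self.c -- pin the calling process to a single CPU before receiving */
int main(int argc, char **argv)
{
    int cpu = (argc > 1) ? atoi(argv[1]) : 1;   /* default: CPU1, CPU0's sibling */
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPU%d\n", cpu);
    /* ... run the multicast receive loop from mcasttest.c here ... */
    return 0;
}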


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-04 16:07                 ` Kenny Chang
@ 2009-02-04 16:46                   ` Wesley Chow
  2009-02-04 18:11                     ` Eric Dumazet
  2009-02-05 13:29                   ` Neil Horman
  1 sibling, 1 reply; 70+ messages in thread
From: Wesley Chow @ 2009-02-04 16:46 UTC (permalink / raw)
  To: netdev; +Cc: Kenny Chang

>>>>>
>>>>>
>>>> Are these quad core systems?  Or dual core w/ hyperthreading?  I  
>>>> ask because in
>>>> your working setup you have 1/2 the number of cpus' and was not  
>>>> sure if you
>>>> removed an entire package of if you just disabled hyperthreading.
>>>>
>>>>
>>>> Neil
>>>>
>>>>
>>> Yeah, these are quad core systems.  The 8 cpu system is a dual- 
>>> processor  quad-core.  The other is my desktop, single cpu quad  
>>> core.
>>>
>>>


Just to be clear: on the 2 x quad core system, we can run with a  
2.6.15 kernel and see no packet drops. In fact, we can run with  
2.6.19, 2.6.20, and 2.6.21 just fine. 2.6.22 is the first kernel that  
shows problems.

Kenny posted results from a working setup on a different machine.

What I would really like to know is if whatever changed between 2.6.21  
and 2.6.22 that broke things is confined just to bnx2. To make this a  
rigorous test, we would need to use the same machine with a different  
nic, which we don't have quite yet. An Intel Pro 1000 ethernet card is  
in the mail as I type this.

I also tried forward porting the bnx2 driver in 2.6.21 to 2.6.22  
(unsuccessfully), and building the most recent driver from the  
Broadcom site to Ubuntu Hardy's 2.6.24. The most recent driver with  
hardy 2.6.24 showed similar packet dropping problems. Hm, perhaps I'll  
try to build the most recent broadcom driver against 2.6.21.


Wes


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-04 16:46                   ` Wesley Chow
@ 2009-02-04 18:11                     ` Eric Dumazet
  2009-02-05 13:33                       ` Neil Horman
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-02-04 18:11 UTC (permalink / raw)
  To: Wesley Chow; +Cc: netdev, Kenny Chang

Wesley Chow a écrit :
>>>>>>
>>>>>>
>>>>> Are these quad core systems?  Or dual core w/ hyperthreading?  I
>>>>> ask because in
>>>>> your working setup you have 1/2 the number of cpus' and was not
>>>>> sure if you
>>>>> removed an entire package of if you just disabled hyperthreading.
>>>>>
>>>>>
>>>>> Neil
>>>>>
>>>>>
>>>> Yeah, these are quad core systems.  The 8 cpu system is a
>>>> dual-processor  quad-core.  The other is my desktop, single cpu quad
>>>> core.
>>>>
>>>>
> 
> 
> Just to be clear: on the 2 x quad core system, we can run with a 2.6.15
> kernel and see no packet drops. In fact, we can run with 2.6.19, 2.6.20,
> and 2.6.21 just fine. 2.6.22 is the first kernel that shows problems.
> 
> Kenny posted results from a working setup on a different machine.
> 
> What I would really like to know is if whatever changed between 2.6.21
> and 2.6.22 that broke things is confined just to bnx2. To make this a
> rigorous test, we would need to use the same machine with a different
> nic, which we don't have quite yet. An Intel Pro 1000 ethernet card is
> in the mail as I type this.
> 
> I also tried forward porting the bnx2 driver in 2.6.21 to 2.6.22
> (unsuccessfully), and building the most recent driver from the Broadcom
> site to Ubuntu Hardy's 2.6.24. The most recent driver with hardy 2.6.24
> showed similar packet dropping problems. Hm, perhaps I'll try to build
> the most recent broadcom driver against 2.6.21.
> 

Try an oprofile session, you should see a scheduler effect (I don't want to call
this a regression, no need for another flame war).

Also give us "vmstat 1" results (number of context switches per second).

On recent kernels, the scheduler might be faster than before: you get more wakeups
per second and more work to do by the softirq handler (it makes more calls into the
scheduler, thus fewer cpu cycles available for draining the NIC RX queue in time).

opcontrol --vmlinux=/path/vmlinux --start
<run benchmark>
opreport -l /path/vmlinux | head -n 50

Recent schedulers tend to be optimized for lower latencies (and thus, with
a high rate of wakeups, you get less bandwidth because softirq uses
a whole CPU).

For example, if you have one thread receiving data on 4 or 8 sockets, you'll
probably notice better throughput (because it will sleep less often).

Multicast receiving on N sockets, with one thread waiting on each socket,
is basically a way to trigger a scheduler storm (N wakeups per packet).
So it's more a benchmark to stress the scheduler than to stress the network stack...


Maybe it's time to change the user side, and not try to find an appropriate kernel :)

If you know you have to receive N frames per 20us unit, then it's better to
use non-blocking sockets and a loop like this:

{
	usleep(20); // or try to compensate if this thread is slowed too much by the following code
	for (i = 0; i < N; i++) {
		while (recvfrom(socket[i], ....) != -1)
			receive_frame(...);
	}
}

That way, you are pretty sure the network softirq handler won't have to spend time
trying to wake up one thread 400,000 times per second. All cpu cycles can be spent
in the NIC driver and the network stack.

Your thread will do 50,000 calls to nanosleep() per second, which is not really
expensive, then N recvfrom() calls per iteration. It should work on all past,
current and future kernels.
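
(To make that concrete, below is a minimal sketch of the client rewritten along
those lines, reusing the setup from mcasttest.c.  The O_NONBLOCK/fcntl handling
and the EAGAIN drain loop are illustrative additions, not part of the original
test program, and error checking is kept to a minimum.)

/* mcastpoll.c -- non-blocking variant of the mcasttest client: sleep ~20us,
 * then drain everything queued on the socket before sleeping again. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(10100);                 /* same port as mcasttest.c */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *) &addr, sizeof(addr));

    struct ip_mreqn mreq;
    memset(&mreq, 0, sizeof(mreq));
    inet_pton(AF_INET, "239.100.0.99", &mreq.imr_multiaddr);
    setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    /* the key difference: never block in recvfrom() */
    fcntl(s, F_SETFL, fcntl(s, F_GETFL, 0) | O_NONBLOCK);

    long npackets = 0;
    char buf[1000];
    for (;;) {
        usleep(20);                               /* one wakeup per ~20us */
        for (;;) {                                /* drain the whole queue */
            ssize_t n = recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);
            if (n < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;                        /* queue empty, sleep again */
                perror("recvfrom");
                return 1;
            }
            ++npackets;
            if (buf[0] == 1) {                    /* exit marker from server */
                printf("received %ld packets\n", npackets);
                return 0;
            }
        }
    }
}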



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-04 16:07                 ` Kenny Chang
  2009-02-04 16:46                   ` Wesley Chow
@ 2009-02-05 13:29                   ` Neil Horman
  1 sibling, 0 replies; 70+ messages in thread
From: Neil Horman @ 2009-02-05 13:29 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev

On Wed, Feb 04, 2009 at 11:07:13AM -0500, Kenny Chang wrote:
> Neil Horman wrote:
>> On Tue, Feb 03, 2009 at 10:20:13AM -0500, Kenny Chang wrote:
>>   
>>> Neil Horman wrote:
>>>     
>>>> On Mon, Feb 02, 2009 at 11:48:25AM -0500, Kenny Chang wrote:
>>>>         
>>>>> Neil Horman wrote:
>>>>>             
>>>>>> On Fri, Jan 30, 2009 at 11:41:23PM +0100, Eric Dumazet wrote:
>>>>>>                   
>>>>>>> Kenny Chang a écrit :
>>>>>>>                         
>>>>>>>> Ah, sorry, here's the test program attached.
>>>>>>>>
>>>>>>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
>>>>>>>> 2.6.29.-rcX.
>>>>>>>>
>>>>>>>> Right now, we are trying to step through the kernel versions until we
>>>>>>>> see where the performance drops significantly.  We'll try 2.6.29-rc soon
>>>>>>>> and post the result.
>>>>>>>>                               
>>>>>>> 2.6.29-rc contains UDP receive improvements (lockless)
>>>>>>>
>>>>>>> Problem is multicast handling was not yet updated, but could be :)
>>>>>>>
>>>>>>>
>>>>>>> I was asking you "cat /proc/interrupts" because I believe you might
>>>>>>> have a problem NIC interrupts being handled by one CPU only (when having problems)
>>>>>>>
>>>>>>>                         
>>>>>> That would be expected (if irqbalance is running), and desireable, since
>>>>>> spreading high volume interrupts like NICS accross multiple cores (or more
>>>>>> specifically multiple L2 caches), is going increase your cache line miss rate
>>>>>> significantly and decrease rx throughput.
>>>>>>
>>>>>> Although you do have a point here, if the system isn't running irqbalance, and
>>>>>> the NICS irq affinity is spread accross multiple L2 caches, that would be a
>>>>>> point of improvement performance-wise.  
>>>>>>
>>>>>> Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo
>>>>>> and your stats that I asked about earlier, that would be a big help.
>>>>>>
>>>>>> Regards
>>>>>> Neil
>>>>>>
>>>>>>                   
>>>>> This is for a working setup.
>>>>>
>>>>>             
>>>> Are these quad core systems?  Or dual core w/ hyperthreading?  I ask because in
>>>> your working setup you have 1/2 the number of cpus' and was not sure if you
>>>> removed an entire package of if you just disabled hyperthreading.
>>>>
>>>>
>>>> Neil
>>>>
>>>>         
>>> Yeah, these are quad core systems.  The 8 cpu system is a 
>>> dual-processor  quad-core.  The other is my desktop, single cpu quad 
>>> core.
>>>
>>>     
>> Ok, so their separate systms then.  Did you actually experience drops on the
>> 8-core system since the last reboot?  I ask because even when its distributed
>> across all 8 cores, you only have about 500 total interrupts from the NIC, and
>> if you did get drops, something more than just affinity is wrong.
>>
>> Regardless, spreading interrupts across cores is definately a problem.  As eric
>> says, quad core chips are actually 2x2 cores, so you'll want to either just run
>> irqbalance to assign an apropriate affinity to the NIC, or manually look at each
>> cores physical id and sibling id, to assign affininty to a core or cores that
>> share an L2 cache.  If you need to, as you've found, you may need to disable msi
>> interrupt mode on your bnx2 driver.  That kinda stinks, but bnx2 IIRC isn't
>> multiqueue, so its not like msi provides you any real performance gain.
>>
>> Neil
>>
>>   
> Hi Neil,
>
> Yeah, we've been rebooting this system left and right switch kernels.   
> The results are fairly consistent.  We were able to set the irq  
> affinities, and as Wes had mentioned, what we see is that if we pin the  
> softirq to 1 core, and pin the app to its sibling, we see really good  
> performance, but as we load up other cores, the machine reaches a  
> breaking point where all hell breaks loose and we drop a bunch.  (we  
> hadn't turned off msi btw..)
>
> While we were able to tune and adjust performance like that, in the end,  
> it doesn't really explain the difference between earlier and recent  
> kernels, also it doesn't quite explain the difference between machines.
>
> You mentioned it would be good to see the interrupts for each kernel, in  
> light of the above information, would it still be useful for me to  
> provide that?
>
In light of what you said, I probably don't need to see it, no, although if you
go through testing on all these kernels again, I would suggest you take a look
at the /proc/interrupts numbers yourself.  Like you said, it's odd that this
behavior changed, since the fast receive path is fairly consistent.  It may be
that the NIC driver you're using (bnx2 I think?) had a change that either broke
the ability to set affinity for MSI interrupts, forcing an irq spread and
killing performance, or perhaps some large inefficiency was introduced either in
the interrupt handler itself, or in the napi poll method of the driver.  Another
good analysis technique would be to grab the latest kernel (which is 'broken' I
think your chart indicated), and the NIC driver from the last working kernel.
Merge the driver into the latest kernel and see if the problem persists.  If
not, that's a pretty good indicator that a change in the driver has at least
some culpability.

Best
Neil

> Kenny
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-04 18:11                     ` Eric Dumazet
@ 2009-02-05 13:33                       ` Neil Horman
  2009-02-05 13:46                         ` Wesley Chow
  0 siblings, 1 reply; 70+ messages in thread
From: Neil Horman @ 2009-02-05 13:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Wesley Chow, netdev, Kenny Chang

On Wed, Feb 04, 2009 at 07:11:36PM +0100, Eric Dumazet wrote:
> Wesley Chow a écrit :
> >>>>>>
> >>>>>>
> >>>>> Are these quad core systems?  Or dual core w/ hyperthreading?  I
> >>>>> ask because in
> >>>>> your working setup you have 1/2 the number of cpus' and was not
> >>>>> sure if you
> >>>>> removed an entire package of if you just disabled hyperthreading.
> >>>>>
> >>>>>
> >>>>> Neil
> >>>>>
> >>>>>
> >>>> Yeah, these are quad core systems.  The 8 cpu system is a
> >>>> dual-processor  quad-core.  The other is my desktop, single cpu quad
> >>>> core.
> >>>>
> >>>>
> > 
> > 
> > Just to be clear: on the 2 x quad core system, we can run with a 2.6.15
> > kernel and see no packet drops. In fact, we can run with 2.6.19, 2.6.20,
> > and 2.6.21 just fine. 2.6.22 is the first kernel that shows problems.
> > 
> > Kenny posted results from a working setup on a different machine.
> > 
> > What I would really like to know is if whatever changed between 2.6.21
> > and 2.6.22 that broke things is confined just to bnx2. To make this a
> > rigorous test, we would need to use the same machine with a different
> > nic, which we don't have quite yet. An Intel Pro 1000 ethernet card is
> > in the mail as I type this.
> > 
> > I also tried forward porting the bnx2 driver in 2.6.21 to 2.6.22
> > (unsuccessfully), and building the most recent driver from the Broadcom
> > site to Ubuntu Hardy's 2.6.24. The most recent driver with hardy 2.6.24
> > showed similar packet dropping problems. Hm, perhaps I'll try to build
> > the most recent broadcom driver against 2.6.21.
> > 
> 
> Try oprofile session, you shall see a scheduler effect (dont want to call
> this a regression, no need for another flame war).
> 
> also give us "vmstat 1" results  (number of context switches per second)
> 
> On recent kernels, scheduler might be faster than before: You get more wakeups per
> second and more work to do by softirq handler (it does more calls to scheduler,
> thus less cpu cycles available for draining NIC RX queue in time)
> 
> opcontrol --vmlinux=/path/vmlinux --start
> <run benchmark>
> opreport -l /path/vmlinux | head -n 50
> 
> Recent schedulers tend to be optimum for lower latencies (and thus, on
> a high level of wakeups, you get less bandwidth because of sofirq using
> a whole CPU)
> 
> For example, if you have one tread receiving data on 4 or 8 sockets, you'll
> probably notice better throughput (because it will sleep less often)
> 
> Multicast receiving on N sockets, with one thread waiting on each socket
> is basically a way to trigger a scheduler storm. (N wakeups per packet).
> So its more a benchmark to stress scheduler than stressing network stack...
> 
> 
> Maybe its time to change user side, and not try to find an appropriate kernel :)
> 
> If you know you have to receive N frames per 20us units, then its better to :
> Use non blocking sockets, and doing such loop :
> 
> {
> usleep(20); // or try to compensate if this thread is slowed too much by following code
> for (i = 0 ; i < N ; i++) {
> 	while (revfrom(socket[N], ....) != -1)
> 		receive_frame(...);
> 	}
> }
> 
> That way, you are pretty sure network softirq handler wont have to spend time trying
> to wakeup 400.000 time per second one thread. All cpu cycles can be spent in NIC driver
> and network stack.
> 
> Your thread will do 50.000 calls to nanosleep() per second, that is not really expensive,
> then N recvfrom() per iteration. It should work on all past , current and future kernels.
> 
+1 to this idea.  Since the last oprofile traces showed significant variance in
the time spent in schedule(), it might be worthwhile to investigate the effects
of the application behavior on this.  It might also be worth adding a systemtap
probe to sys_recvmsg, to count how many times we receive frames on a working and
non-working system.  If the app is behaving differently on different kernels,
and it's affecting the number of times you go to get a frame out of the stack,
that would affect your drop rates, and it would show up in sys_recvmsg.

Neil


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-05 13:33                       ` Neil Horman
@ 2009-02-05 13:46                         ` Wesley Chow
  0 siblings, 0 replies; 70+ messages in thread
From: Wesley Chow @ 2009-02-05 13:46 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, Kenny Chang, Neil Horman


>>
>>
>> Maybe its time to change user side, and not try to find an  
>> appropriate kernel :)
>>
>> If you know you have to receive N frames per 20us units, then its  
>> better to :
>> Use non blocking sockets, and doing such loop :
>>
>> {
>> usleep(20); // or try to compensate if this thread is slowed too  
>> much by following code
>> for (i = 0 ; i < N ; i++) {
>> 	while (revfrom(socket[N], ....) != -1)
>> 		receive_frame(...);
>> 	}
>> }
>>
>> That way, you are pretty sure network softirq handler wont have to  
>> spend time trying
>> to wakeup 400.000 time per second one thread. All cpu cycles can be  
>> spent in NIC driver
>> and network stack.
>>
>> Your thread will do 50.000 calls to nanosleep() per second, that is  
>> not really expensive,
>> then N recvfrom() per iteration. It should work on all past ,  
>> current and future kernels.
>>
> +1 to this idea.  Since the last oprofile traces showed significant  
> variance in
> the time spent in schedule(), it might be worthwhile to investigate  
> the affects
> of the application behavior on this.  I might also be worth adding a  
> systemtap
> probe to sys_recvmsg, to count how many times we receive frames on a  
> working and
> non-working system.  If the app is behaving differently on different  
> kernels,
> and its affecting the number of times you go to get a frame out of  
> the stack,
> that would affect your drop rates, and it would show up in sys_recvmsg
>


I did some work on our test program to spin on a non-blocking socket,
and it does indeed seem to fix the problem, at least for 2.6.28.1, which
was a kernel we had problems with. The number of context switches drops
drastically -- from 200,000+ to fewer than 50!

I haven't done totally comprehensive tests yet, so I don't want to  
officially state any results. I'm also out today, but Kenny may get a  
chance to play with it. Spinning on the socket is looking like an  
interesting solution, but we're a bit nervous about seeing our  
processes constantly running at 100% CPU. Does C++ have a  
MachineOnFire exception?


Wes


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-04  1:21                         ` Neil Horman
@ 2009-02-26 17:15                           ` Kenny Chang
  2009-02-28  8:51                             ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Kenny Chang @ 2009-02-26 17:15 UTC (permalink / raw)
  To: netdev

Neil Horman wrote:
> On Tue, Feb 03, 2009 at 12:34:54PM -0500, Kenny Chang wrote:
>> Eric Dumazet wrote:
>>> Wes Chow a écrit :
>>>   
>>>> Eric Dumazet wrote:
>>>>     
>>>>> Wes Chow a écrit :
>>>>>       
>>>>>> (I'm Kenny's colleague, and I've been doing the kernel builds)
>>>>>>
>>>>>> First I'd like to note that there were a lot of bnx2 NAPI changes
>>>>>> between 2.6.21 and 2.6.22. As a reminder, 2.6.21 shows tiny amounts
>>>>>> of packet loss,
>>>>>> whereas loss in 2.6.22 is significant.
>>>>>>
>>>>>> Second, some CPU affinity info: if I do like Eric and pin all of the
>>>>>> apps onto a single CPU, I see no packet loss. Also, I do *not* see
>>>>>> ksoftirqd show up on top at all!
>>>>>>
>>>>>> If I pin half the processes on one CPU and the other half on another
>>>>>> CPU, one ksoftirqd processes shows up in top and completely pegs one
>>>>>> CPU. My packet loss
>>>>>> in that case is significant (25%).
>>>>>>
>>>>>> Now, the strange case: if I pin 3 processes to one CPU and 1 process
>>>>>> to another, I get about 25% packet loss and ksoftirqd pins one CPU.
>>>>>> However, one
>>>>>> of the apps takes significantly less CPU than the others, and all
>>>>>> apps lose the
>>>>>> *exact same number of packets*. In all other situations where we see
>>>>>> packet
>>>>>> loss, the actual number lost per application instance appears random.
>>>>>>         
>>>>> You see same number of packet lost because they are lost at NIC level
>>>>>       
>>>> Understood.
>>>>
>>>> I have a new observation: if I pin processes to just CPUs 0 and 1, I see
>>>> no packet loss. Pinning to 0 and 2, I do see packet loss. Pinning 2 and
>>>> 3, no packet loss. 4 & 5 - no packet loss, 6 & 7 - no packet loss. Any
>>>> other combination appears to produce loss (though I have not tried all
>>>> 28 combinations, this seems to be the case).
>>>>
>>>> At first I thought maybe it had to do with processes pinned to the same
>>>> CPU, but different cores. The machine is a dual quad core, which means
>>>> that CPUs 0-3 should be a physical CPU, correct? Pinning to 0/2 and 0/3
>>>> produce packet loss.
>>>>     
>>> a quad core is really a 2 x 2 core
>>>
>>> L2 cache is splited on two blocks, one block used by CPU0/1, other by 
>>> CPU2/3 
>>>
>>> You are at the limit of the machine with such workload, so as soon as your
>>> CPUs have to transfert 64 bytes lines between those two L2 blocks, you loose.
>>>
>>>
>>>   
>>>> I've also noticed that it does not matter which of the working pairs I
>>>> pin to. For example, pinning 5 processes in any combination on either
>>>> 0/1 produce no packet loss, pinning all 5 to just CPU 0 also produces no
>>>> packet loss.
>>>>
>>>> The failures are also sudden. In all of the working cases mentioned
>>>> above, I don't see ksoftirqd on top at all. But when I run 6 processes
>>>> on a single CPU, ksoftirqd shoots up to 100% and I lose a huge number of
>>>> packets.
>>>>
>>>>     
>>>>> Normaly, softirq runs on same cpu (the one handling hard irq)
>>>>>       
>>>> What determines which CPU the hard irq occurs on?
>>>>
>>>>     
>>> Check /proc/irq/{irqnumber}/smp_affinity
>>>
>>> If you want IRQ16 only served by CPU0 :
>>>
>>> echo 1 >/proc/irq/16/smp_affinity
>>>
>>>   
>> Hi everyone,
>>
>> -snip-
>> Correct me if I'm wrong, from what we've seen, it looks like its  
>> pointing to some inefficiency in the softirq handling.  The question is  
>> whether it's something in the driver or the kernel.  If we can isolate  
>> that, maybe we can take some action to have it fixed.
>>
> I don't think its sofirq ineffeciencies (oprofile would have shown that).  I
> know I keep harping on this, but I still think irq affininty is your problem.
> I'd be interested in knowning what your /proc/interrupts file looked like on
> each of the above kenrels.  Perhaps its not that the bnx2 card you have can't
> handle the setting of MSI interrupt affinities, but rather that something
> changeed to break irq affinity on this card.
>
> Neil
>
>
It's been a while since I updated this thread.  We've been running
through the different suggestions and tabulating their effects, as well
as trying out an Intel card.  The short story is that setting affinity
and MSI works to some extent, and the Intel card doesn't seem to change
things significantly.  The results don't seem consistent enough for us
to be able to point to a smoking gun.

It does look like the 2.6.29-rc4 kernel performs okay with the Intel
card, but this is not a real-time build and it's not likely to be in a
supported Ubuntu distribution any time soon.  We've reached the point
where we'd like to find an expert dedicated to working on this problem
for a period of time.  The final result would be some sort of solution
that produces a realtime configuration with a reasonably "aged" kernel
(.24~.28) that has multicast performance greater than or equal to that
of 2.6.15.

If anybody is interested in devoting some compensated time to this
issue, we're offering up a bounty:
http://www.athenacr.com/bounties/multicast-performance/

For completeness, here's the table of our experiment results:

====================== ================= ===== ======== =============== =============== =============== ===============
Kernel                 flavor            IRQ   affinity *4x mcasttest*  *5x mcasttest*  *6x mcasttest*  *Mtools2* [4]_
====================== ================= ===== ======== =============== =============== =============== ===============
Intel e1000e
-----------------------------------------------------------------------------------------------------------------------
2.6.24.19              rt                      any      OK              Maybe           X
2.6.24.19              rt                      CPU0     OK              OK              X
2.6.24.19              generic                 any      X
2.6.24.19              generic                 CPU0     OK
2.6.29-rc3             vanilla-server          any      X
2.6.29-rc3             vanilla-server          CPU0     OK
2.6.29-rc4             vanilla-generic         any      X                                               OK
2.6.29-rc4             vanilla-generic         CPU0     OK              OK              OK [5]_         OK
-----------------------------------------------------------------------------------------------------------------------
Broadcom BNX2
-----------------------------------------------------------------------------------------------------------------------
2.6.24-19              rt                MSI   any      OK              OK              X
2.6.24-19              rt                MSI   CPU0     OK              Maybe           X
2.6.24-19              rt                APIC  any      OK              OK              X
2.6.24-19              rt                APIC  CPU0     OK              Maybe           X
2.6.24-19-bnx-latest   rt                APIC  CPU0     OK              X
2.6.24-19              server            MSI   any      X
2.6.24-19              server            MSI   CPU0     OK
2.6.24-19              generic           APIC  any      X
2.6.24-19              generic           APIC  CPU0     OK
2.6.27-11              generic           APIC  any      X
2.6.27-11              generic           APIC  CPU0     OK              10% drop
2.6.28-8               generic           APIC  any      OK              X
2.6.28-8               generic           APIC  CPU0     OK              OK              0.5% drop
2.6.29-rc3             vanilla-server    MSI   any      X
2.6.29-rc3             vanilla-server    MSI   CPU0     X
2.6.29-rc3             vanilla-server    APIC  any      OK              X
2.6.29-rc3             vanilla-server    APIC  CPU0     OK              OK
2.6.29-rc4             vanilla-generic   APIC  any      X
2.6.29-rc4             vanilla-generic   APIC  CPU0     OK              3% drop         10% drop        X
====================== ================= ===== ======== =============== =============== =============== ===============

* [4] MTools2 is a test from 29West: http://www.29west.com/docs/TestNet/
* [5] In 5 trials, 1 of the trials dropped 2%, 4 of the trials dropped nothing.

Kenny


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-01-30 22:41     ` Eric Dumazet
  2009-01-31 16:03       ` Neil Horman
  2009-02-01 12:40       ` Eric Dumazet
@ 2009-02-27 18:40       ` Christoph Lameter
  2009-02-27 18:56         ` Eric Dumazet
  2 siblings, 1 reply; 70+ messages in thread
From: Christoph Lameter @ 2009-02-27 18:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kenny Chang, netdev

On Fri, 30 Jan 2009, Eric Dumazet wrote:

> 2.6.29-rc contains UDP receive improvements (lockless)
>
> Problem is multicast handling was not yet updated, but could be :)

When will that happen?


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-27 18:40       ` Christoph Lameter
@ 2009-02-27 18:56         ` Eric Dumazet
  2009-02-27 19:45           ` Christoph Lameter
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-02-27 18:56 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Kenny Chang, netdev

Christoph Lameter a écrit :
> On Fri, 30 Jan 2009, Eric Dumazet wrote:
> 
>> 2.6.29-rc contains UDP receive improvements (lockless)
>>
>> Problem is multicast handling was not yet updated, but could be :)
> 
> When will that happen?
> 

When proven necessary :)

Kenny's problem is about a scheduling storm. The extra spin_lock() in UDP
multicast receive is not a problem.



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-27 18:56         ` Eric Dumazet
@ 2009-02-27 19:45           ` Christoph Lameter
  2009-02-27 20:12             ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Christoph Lameter @ 2009-02-27 19:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kenny Chang, netdev

On Fri, 27 Feb 2009, Eric Dumazet wrote:

> Christoph Lameter a ?crit :
> > On Fri, 30 Jan 2009, Eric Dumazet wrote:
> >> 2.6.29-rc contains UDP receive improvements (lockless)
> >> Problem is multicast handling was not yet updated, but could be :)
> > When will that happen?
> When proven necessary :)
>
> Kenny problem is about scheduling storm. The extra spin_lock() in UDP
> multicast receive is not a problem.

My tests here show that 2.6.29-rc5 still loses ~5 usec vs. 2.6.22 via
UDP. This would fix a regression.....


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-27 19:45           ` Christoph Lameter
@ 2009-02-27 20:12             ` Eric Dumazet
  2009-02-27 21:36               ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-02-27 20:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Kenny Chang, netdev

Christoph Lameter a écrit :
> On Fri, 27 Feb 2009, Eric Dumazet wrote:
> 
>> Christoph Lameter a ?crit :
>>> On Fri, 30 Jan 2009, Eric Dumazet wrote:
>>>> 2.6.29-rc contains UDP receive improvements (lockless)
>>>> Problem is multicast handling was not yet updated, but could be :)
>>> When will that happen?
>> When proven necessary :)
>>
>> Kenny problem is about scheduling storm. The extra spin_lock() in UDP
>> multicast receive is not a problem.
> 
> My tests here show that 2.6.29-rc5 still looses ~5usec vs. 2.6.22 via
> UDP. This would fix a regression.....
> 

Could you elaborate?

I just retried Kenny's test here. As one cpu is looping in ksoftirqd, only this cpu
touches the spin_lock, so spin_lock()/spin_unlock() is free.

oprofile shows that the udp stack is lightweight in this case. The problem is about
waking up so many threads...

CPU: Core 2, speed 3000.16 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
356857   356857        15.1789  15.1789    schedule
274028   630885        11.6557  26.8346    mwait_idle
189218   820103         8.0484  34.8829    __skb_recv_datagram
116903   937006         4.9725  39.8554    skb_release_data
103152   1040158        4.3876  44.2430    lock_sock_nested
89600    1129758        3.8111  48.0541    udp_recvmsg
74171    1203929        3.1549  51.2089    copy_to_user
72299    1276228        3.0752  54.2842    set_next_entity
60392    1336620        2.5688  56.8529    sched_clock_cpu
54026    1390646        2.2980  59.1509    __slab_free
50212    1440858        2.1358  61.2867    prepare_to_wait_exclusive
38689    1479547        1.6456  62.9323    cpu_idle
38142    1517689        1.6224  64.5547    __switch_to
36701    1554390        1.5611  66.1157    hrtick_start_fair
36673    1591063        1.5599  67.6756    dst_release
36268    1627331        1.5427  69.2183    sys_recvfrom
35052    1662383        1.4909  70.7092    kmem_cache_free
32680    1695063        1.3900  72.0992    pick_next_task_fair
31209    1726272        1.3275  73.4267    try_to_wake_up
30382    1756654        1.2923  74.7190    dequeue_task_fair
29048    1785702        1.2356  75.9545    __copy_skb_header
28801    1814503        1.2250  77.1796    sock_def_readable
28655    1843158        1.2188  78.3984    enqueue_task_fair
27232    1870390        1.1583  79.5567    update_curr
21688    1892078        0.9225  80.4792    copy_from_user
18832    1910910        0.8010  81.2802    sysenter_past_esp
17732    1928642        0.7542  82.0345    finish_task_switch
17583    1946225        0.7479  82.7824    resched_task
17367    1963592        0.7387  83.5211    native_sched_clock
15691    1979283        0.6674  84.1885    task_rq_lock
15352    1994635        0.6530  84.8415    sock_queue_rcv_skb
15022    2009657        0.6390  85.4804    udp_queue_rcv_skb
13999    2023656        0.5954  86.0759    __update_sched_clock
12284    2035940        0.5225  86.5984    skb_copy_datagram_iovec
11869    2047809        0.5048  87.1032    release_sock
10986    2058795        0.4673  87.5705    __wake_up_sync
10488    2069283        0.4461  88.0166    sock_recvmsg
9686     2078969        0.4120  88.4286    skb_queue_tail
9425     2088394        0.4009  88.8295    sys_socketcall



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-27 20:12             ` Eric Dumazet
@ 2009-02-27 21:36               ` Eric Dumazet
  0 siblings, 0 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-02-27 21:36 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Kenny Chang, netdev

Eric Dumazet a écrit :
> Christoph Lameter a écrit :
>> On Fri, 27 Feb 2009, Eric Dumazet wrote:
>>
>>> Christoph Lameter a ?crit :
>>>> On Fri, 30 Jan 2009, Eric Dumazet wrote:
>>>>> 2.6.29-rc contains UDP receive improvements (lockless)
>>>>> Problem is multicast handling was not yet updated, but could be :)
>>>> When will that happen?
>>> When proven necessary :)
>>>
>>> Kenny problem is about scheduling storm. The extra spin_lock() in UDP
>>> multicast receive is not a problem.
>> My tests here show that 2.6.29-rc5 still looses ~5usec vs. 2.6.22 via
>> UDP. This would fix a regression.....
>>
> 
> Could you elaborate ?
> 
> I just retried Kenny test here. As one cpu is looping in ksoftirqd, only this cpu
> touches the spin_lock, so spin_lock()/spin_unlock() is free.
> 
> oprofile shows that udp stack is lightweight in this case. Problem is about wakeing up
> so many threads...
> 
> CPU: Core 2, speed 3000.16 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  cum. samples  %        cum. %     symbol name
> 356857   356857        15.1789  15.1789    schedule
> 274028   630885        11.6557  26.8346    mwait_idle
> 189218   820103         8.0484  34.8829    __skb_recv_datagram
> 116903   937006         4.9725  39.8554    skb_release_data
> 103152   1040158        4.3876  44.2430    lock_sock_nested
> 89600    1129758        3.8111  48.0541    udp_recvmsg
> 74171    1203929        3.1549  51.2089    copy_to_user
> 72299    1276228        3.0752  54.2842    set_next_entity
> 60392    1336620        2.5688  56.8529    sched_clock_cpu
> 54026    1390646        2.2980  59.1509    __slab_free
> 50212    1440858        2.1358  61.2867    prepare_to_wait_exclusive
> 38689    1479547        1.6456  62.9323    cpu_idle
> 38142    1517689        1.6224  64.5547    __switch_to
> 36701    1554390        1.5611  66.1157    hrtick_start_fair
> 36673    1591063        1.5599  67.6756    dst_release
> 36268    1627331        1.5427  69.2183    sys_recvfrom
> 35052    1662383        1.4909  70.7092    kmem_cache_free
> 32680    1695063        1.3900  72.0992    pick_next_task_fair
> 31209    1726272        1.3275  73.4267    try_to_wake_up
> 30382    1756654        1.2923  74.7190    dequeue_task_fair
> 29048    1785702        1.2356  75.9545    __copy_skb_header
> 28801    1814503        1.2250  77.1796    sock_def_readable
> 28655    1843158        1.2188  78.3984    enqueue_task_fair
> 27232    1870390        1.1583  79.5567    update_curr
> 21688    1892078        0.9225  80.4792    copy_from_user
> 18832    1910910        0.8010  81.2802    sysenter_past_esp
> 17732    1928642        0.7542  82.0345    finish_task_switch
> 17583    1946225        0.7479  82.7824    resched_task
> 17367    1963592        0.7387  83.5211    native_sched_clock
> 15691    1979283        0.6674  84.1885    task_rq_lock
> 15352    1994635        0.6530  84.8415    sock_queue_rcv_skb
> 15022    2009657        0.6390  85.4804    udp_queue_rcv_skb
> 13999    2023656        0.5954  86.0759    __update_sched_clock
> 12284    2035940        0.5225  86.5984    skb_copy_datagram_iovec
> 11869    2047809        0.5048  87.1032    release_sock
> 10986    2058795        0.4673  87.5705    __wake_up_sync
> 10488    2069283        0.4461  88.0166    sock_recvmsg
> 9686     2078969        0.4120  88.4286    skb_queue_tail
> 9425     2088394        0.4009  88.8295    sys_socketcall
> 
> 

My guess is commit 95766fff6b9a78d11fc2d3812dd035381690b55d
(UDP: Add memory accounting)
Hideo Aoki [Mon, 31 Dec 2007 08:29:24 +0000 (00:29 -0800)]

and 3ab224be6d69de912ee21302745ea45a99274dbc
[NET] CORE: Introducing new memory accounting interface.
Date:   Mon Dec 31 00:11:19 2007 -0800

are responsible for slowdown, because they add some
lock_sock()/release_sock() pairs.

function udp_recvmsg()

out_free:
+       lock_sock(sk);
        skb_free_datagram(sk, skb);
+       release_sock(sk);
 out:

I wonder why we can call __sk_mem_reclaim() when dequeueing *one* UDP
frame from the queue, while many others can still be in sk_receive_queue.
This defeats memory accounting, no?

We should avoid lock_sock() if possible, or risk delaying
softirq RX in udp_queue_rcv_skb()



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-26 17:15                           ` Kenny Chang
@ 2009-02-28  8:51                             ` Eric Dumazet
  2009-03-01 17:03                               ` Eric Dumazet
  2009-03-04  8:16                               ` David Miller
  0 siblings, 2 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-02-28  8:51 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev, David S. Miller, Christoph Lameter

Kenny Chang a écrit :
> It's been a while since I updated this thread.  We've been running
> through the different suggestions and tabulating their effects, as well
> as trying out an Intel card.  The short story is that setting affinity
> and MSI works to some extent, and the Intel card doesn't seem to change
> things significantly.  The results don't seem consistent enough for us
> to be able to point to a smoking gun.
> 
> It does look like the 2.6.29-rc4 kernel performs okay with the Intel
> card, but this is not a real-time build and it's not likely to be in a
> supported Ubuntu distribution real soon.  We've reached the point where
> we'd like to look for an expert dedicated to work on this problem for a
> period of time.  The final result being some sort of solution to produce
> a realtime configuration with a reasonably "aged" kernel (.24~.28) that
> has multicast performance greater than or equal to that of 2.6.15.
> 
> If anybody is interested in devoting some compensated time to this
> issue, we're offering up a bounty:
> http://www.athenacr.com/bounties/multicast-performance/
> 
> For completeness, here's the table of our experiment results:
> 
> -snip- (experiment results table, quoted in full in Kenny's message above)
> 
> Kenny
> 

Hi Kenny

I am investigating how to reduce contention (and schedule() calls) on this workload.

The following patch already gives me fewer packet drops (but it is not yet *perfect*):
10% packet loss instead of 30%, with 8 receivers on my 8-cpu machine.


David, this is preliminary work, not meant for inclusion as is;
comments are welcome.

Thank you

[PATCH] net: sk_forward_alloc becomes an atomic_t

Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
(UDP: Add memory accounting) introduced a regression for high-rate UDP flows,
because of the extra lock_sock() in udp_recvmsg().

In order to reduce the need for lock_sock() in the UDP receive path, we might need
to declare sk_forward_alloc as an atomic_t.

udp_recvmsg() can then avoid a lock_sock()/release_sock() pair.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/sock.h   |   14 +++++++-------
 net/core/sock.c      |   31 +++++++++++++++++++------------
 net/core/stream.c    |    2 +-
 net/ipv4/af_inet.c   |    2 +-
 net/ipv4/inet_diag.c |    2 +-
 net/ipv4/tcp_input.c |    2 +-
 net/ipv4/udp.c       |    2 --
 net/ipv6/udp.c       |    2 --
 net/sched/em_meta.c  |    2 +-
 9 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..c4befb9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -250,7 +250,7 @@ struct sock {
 	struct sk_buff_head	sk_async_wait_queue;
 #endif
 	int			sk_wmem_queued;
-	int			sk_forward_alloc;
+	atomic_t		sk_forward_alloc;
 	gfp_t			sk_allocation;
 	int			sk_route_caps;
 	int			sk_gso_type;
@@ -823,7 +823,7 @@ static inline int sk_wmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
+	return size <= atomic_read(&sk->sk_forward_alloc) ||
 		__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
@@ -831,7 +831,7 @@ static inline int sk_rmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
+	return size <= atomic_read(&sk->sk_forward_alloc) ||
 		__sk_mem_schedule(sk, size, SK_MEM_RECV);
 }
 
@@ -839,7 +839,7 @@ static inline void sk_mem_reclaim(struct sock *sk)
 {
 	if (!sk_has_account(sk))
 		return;
-	if (sk->sk_forward_alloc >= SK_MEM_QUANTUM)
+	if (atomic_read(&sk->sk_forward_alloc) >= SK_MEM_QUANTUM)
 		__sk_mem_reclaim(sk);
 }
 
@@ -847,7 +847,7 @@ static inline void sk_mem_reclaim_partial(struct sock *sk)
 {
 	if (!sk_has_account(sk))
 		return;
-	if (sk->sk_forward_alloc > SK_MEM_QUANTUM)
+	if (atomic_read(&sk->sk_forward_alloc) > SK_MEM_QUANTUM)
 		__sk_mem_reclaim(sk);
 }
 
@@ -855,14 +855,14 @@ static inline void sk_mem_charge(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return;
-	sk->sk_forward_alloc -= size;
+	atomic_sub(size, &sk->sk_forward_alloc);
 }
 
 static inline void sk_mem_uncharge(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return;
-	sk->sk_forward_alloc += size;
+	atomic_add(size, &sk->sk_forward_alloc);
 }
 
 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
diff --git a/net/core/sock.c b/net/core/sock.c
index 0620046..8489105 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1081,7 +1081,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 
 		newsk->sk_dst_cache	= NULL;
 		newsk->sk_wmem_queued	= 0;
-		newsk->sk_forward_alloc = 0;
+		atomic_set(&newsk->sk_forward_alloc, 0);
 		newsk->sk_send_head	= NULL;
 		newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
 
@@ -1479,7 +1479,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	int amt = sk_mem_pages(size);
 	int allocated;
 
-	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
+	atomic_add(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
 	allocated = atomic_add_return(amt, prot->memory_allocated);
 
 	/* Under limit. */
@@ -1520,7 +1520,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 		if (prot->sysctl_mem[2] > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
 				 atomic_read(&sk->sk_rmem_alloc) +
-				 sk->sk_forward_alloc))
+				 atomic_read(&sk->sk_forward_alloc)))
 			return 1;
 	}
 
@@ -1537,7 +1537,7 @@ suppress_allocation:
 	}
 
 	/* Alas. Undo changes. */
-	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
+	atomic_sub(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
 	atomic_sub(amt, prot->memory_allocated);
 	return 0;
 }
@@ -1551,14 +1551,21 @@ EXPORT_SYMBOL(__sk_mem_schedule);
 void __sk_mem_reclaim(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
-
-	atomic_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
-		   prot->memory_allocated);
-	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
-
-	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
-		*prot->memory_pressure = 0;
+	int val = atomic_read(&sk->sk_forward_alloc);
+
+begin:
+	val = atomic_read(&sk->sk_forward_alloc);
+	if (val >= SK_MEM_QUANTUM) {
+		if (atomic_cmpxchg(&sk->sk_forward_alloc, val,
+				   val & (SK_MEM_QUANTUM - 1)) != val)
+			goto begin;
+		atomic_sub(val >> SK_MEM_QUANTUM_SHIFT,
+			   prot->memory_allocated);
+
+		if (prot->memory_pressure && *prot->memory_pressure &&
+		    (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
+			*prot->memory_pressure = 0;
+	}
 }
 
 EXPORT_SYMBOL(__sk_mem_reclaim);
diff --git a/net/core/stream.c b/net/core/stream.c
index 8727cea..4d04d28 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -198,7 +198,7 @@ void sk_stream_kill_queues(struct sock *sk)
 	sk_mem_reclaim(sk);
 
 	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN_ON(atomic_read(&sk->sk_forward_alloc));
 
 	/* It is _impossible_ for the backlog to contain anything
 	 * when we get here.  All user references to this socket
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 627be4d..7a1475c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -152,7 +152,7 @@ void inet_sock_destruct(struct sock *sk)
 	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
 	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
 	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN_ON(atomic_read(&sk->sk_forward_alloc));
 
 	kfree(inet->opt);
 	dst_release(sk->sk_dst_cache);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 588a779..903ad66 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -158,7 +158,7 @@ static int inet_csk_diag_fill(struct sock *sk,
 	if (minfo) {
 		minfo->idiag_rmem = atomic_read(&sk->sk_rmem_alloc);
 		minfo->idiag_wmem = sk->sk_wmem_queued;
-		minfo->idiag_fmem = sk->sk_forward_alloc;
+		minfo->idiag_fmem = atomic_read(&sk->sk_forward_alloc);
 		minfo->idiag_tmem = atomic_read(&sk->sk_wmem_alloc);
 	}
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a6961d7..5e08f37 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5258,7 +5258,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 
 				tcp_rcv_rtt_measure_ts(sk, skb);
 
-				if ((int)skb->truesize > sk->sk_forward_alloc)
+				if ((int)skb->truesize > atomic_read(&sk->sk_forward_alloc))
 					goto step5;
 
 				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4bd178a..dcc246a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -955,9 +955,7 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
 out:
 	return err;
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 84b1a29..582b80a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -257,9 +257,7 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
 out:
 	return err;
 
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 72cf86e..94d90b6 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -383,7 +383,7 @@ META_COLLECTOR(int_sk_wmem_queued)
 META_COLLECTOR(int_sk_fwd_alloc)
 {
 	SKIP_NONLOCAL(skb);
-	dst->value = skb->sk->sk_forward_alloc;
+	dst->value = atomic_read(&skb->sk->sk_forward_alloc);
 }
 
 META_COLLECTOR(int_sk_sndbuf)


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-28  8:51                             ` Eric Dumazet
@ 2009-03-01 17:03                               ` Eric Dumazet
  2009-03-04  8:16                               ` David Miller
  1 sibling, 0 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-03-01 17:03 UTC (permalink / raw)
  To: Kenny Chang; +Cc: netdev, David S. Miller, Christoph Lameter

Eric Dumazet wrote:
> Kenny Chang wrote:
>> It's been a while since I updated this thread.  We've been running
>> through the different suggestions and tabulating their effects, as well
>> as trying out an Intel card.  The short story is that setting affinity
>> and MSI works to some extent, and the Intel card doesn't seem to change
>> things significantly.  The results don't seem consistent enough for us
>> to be able to point to a smoking gun.
>>
>> It does look like the 2.6.29-rc4 kernel performs okay with the Intel
>> card, but this is not a real-time build and it's not likely to be in a
>> supported Ubuntu distribution real soon.  We've reached the point where
>> we'd like to look for an expert dedicated to work on this problem for a
>> period of time.  The final result being some sort of solution to produce
>> a realtime configuration with a reasonably "aged" kernel (.24~.28) that
>> has multicast performance greater than or equal to that of 2.6.15.
>>
>> If anybody is interested in devoting some compensated time to this
>> issue, we're offering up a bounty:
>> http://www.athenacr.com/bounties/multicast-performance/
>>
>> For completeness, here's the table of our experiment results:
>>
>> ---------------------+-----------------+------+----------+--------------+--------------+--------------+-------------
>> Kernel               | flavor          | IRQ  | affinity | 4x mcasttest | 5x mcasttest | 6x mcasttest | Mtools2 [4]
>> ---------------------+-----------------+------+----------+--------------+--------------+--------------+-------------
>> Intel e1000e
>> ---------------------+-----------------+------+----------+--------------+--------------+--------------+-------------
>> 2.6.24.19            | rt              |      | any      | OK           | Maybe        | X            |
>> 2.6.24.19            | rt              |      | CPU0     | OK           | OK           | X            |
>> 2.6.24.19            | generic         |      | any      | X            |              |              |
>> 2.6.24.19            | generic         |      | CPU0     | OK           |              |              |
>> 2.6.29-rc3           | vanilla-server  |      | any      | X            |              |              |
>> 2.6.29-rc3           | vanilla-server  |      | CPU0     | OK           |              |              |
>> 2.6.29-rc4           | vanilla-generic |      | any      | X            |              |              | OK
>> 2.6.29-rc4           | vanilla-generic |      | CPU0     | OK           | OK           | OK [5]       | OK
>> ---------------------+-----------------+------+----------+--------------+--------------+--------------+-------------
>> Broadcom BNX2
>> ---------------------+-----------------+------+----------+--------------+--------------+--------------+-------------
>> 2.6.24-19            | rt              | MSI  | any      | OK           | OK           | X            |
>> 2.6.24-19            | rt              | MSI  | CPU0     | OK           | Maybe        | X            |
>> 2.6.24-19            | rt              | APIC | any      | OK           | OK           | X            |
>> 2.6.24-19            | rt              | APIC | CPU0     | OK           | Maybe        | X            |
>> 2.6.24-19-bnx-latest | rt              | APIC | CPU0     | OK           | X            |              |
>> 2.6.24-19            | server          | MSI  | any      | X            |              |              |
>> 2.6.24-19            | server          | MSI  | CPU0     | OK           |              |              |
>> 2.6.24-19            | generic         | APIC | any      | X            |              |              |
>> 2.6.24-19            | generic         | APIC | CPU0     | OK           |              |              |
>> 2.6.27-11            | generic         | APIC | any      | X            |              |              |
>> 2.6.27-11            | generic         | APIC | CPU0     | OK           | 10% drop     |              |
>> 2.6.28-8             | generic         | APIC | any      | OK           | X            |              |
>> 2.6.28-8             | generic         | APIC | CPU0     | OK           | OK           | 0.5% drop    |
>> 2.6.29-rc3           | vanilla-server  | MSI  | any      | X            |              |              |
>> 2.6.29-rc3           | vanilla-server  | MSI  | CPU0     | X            |              |              |
>> 2.6.29-rc3           | vanilla-server  | APIC | any      | OK           | X            |              |
>> 2.6.29-rc3           | vanilla-server  | APIC | CPU0     | OK           | OK           |              |
>> 2.6.29-rc4           | vanilla-generic | APIC | any      | X            |              |              |
>> 2.6.29-rc4           | vanilla-generic | APIC | CPU0     | OK           | 3% drop      | 10% drop     | X
>> ---------------------+-----------------+------+----------+--------------+--------------+--------------+-------------
>>
>> * [4] MTools2 is a test from 29West: http://www.29west.com/docs/TestNet/
>> * [5] In 5 trials, 1 of the trials dropped 2%, 4 of the trials dropped
>> nothing.
>>
>> Kenny
>>
> 
> Hi Kenny
> 
> I am investigating how to reduce contention (and schedule() calls) on this workload.
> 

I bound the NIC (gigabit BNX2) IRQ to cpu 0, so that the oprofile results on this cpu can show us
where ksoftirqd is spending its time.

We can see the scheduler at work :)

Also, one thing to note is __copy_skb_header(): 9.49% of cpu0 time.
The problem comes from dst_clone() (6.05% in total, so 2/3 of __copy_skb_header()),
which touches a highly contended cache line (the other cpus are doing the decrement of
the dst refcounter).

CPU: Core 2, speed 3000.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) 
with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
(samples for other cpus 1..7 omitted)
samples  cum. samples  %        cum. %     symbol name
23750    23750          9.8159   9.8159    try_to_wake_up
22972    46722          9.4944  19.3103    __copy_skb_header
20217    66939          8.3557  27.6660    enqueue_task_fair
14565    81504          6.0197  33.6857    sock_def_readable
13454    94958          5.5606  39.2463    task_rq_lock
13381    108339         5.5304  44.7767    resched_task
13090    121429         5.4101  50.1868    udp_queue_rcv_skb
11441    132870         4.7286  54.9154    skb_queue_tail
10109    142979         4.1781  59.0935    sock_queue_rcv_skb
10024    153003         4.1429  63.2364    __wake_up_sync
9952     162955         4.1132  67.3496    update_curr
8761     171716         3.6209  70.9705    sched_clock_cpu
7414     179130         3.0642  74.0347    rb_insert_color
7381     186511         3.0506  77.0853    select_task_rq_fair
6749     193260         2.7894  79.8747    __slab_alloc
5881     199141         2.4306  82.3053    __wake_up_common
5432     204573         2.2451  84.5504    __skb_clone
4306     208879         1.7797  86.3300    kmem_cache_alloc
3524     212403         1.4565  87.7865    place_entity
2783     215186         1.1502  88.9367    skb_clone
2576     217762         1.0647  90.0014    __udp4_lib_rcv
2430     220192         1.0043  91.0057    bnx2_poll_work
2184     222376         0.9027  91.9084    ipt_do_table
2090     224466         0.8638  92.7722    ip_route_input
1877     226343         0.7758  93.5479    __alloc_skb
1495     227838         0.6179  94.1658    native_sched_clock
1166     229004         0.4819  94.6477    __update_sched_clock
1083     230087         0.4476  95.0953    netif_receive_skb
1062     231149         0.4389  95.5343    activate_task
644      231793         0.2662  95.8004    __kmalloc_track_caller
638      232431         0.2637  96.0641    nf_iterate
549      232980         0.2269  96.2910    skb_put
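
For readers following along, dst_clone() as called from __copy_skb_header() is
essentially just a reference-count increment on the shared dst_entry (a sketch of
the mainline helper of that era, not a proposed change); every cloned multicast
skb bumps the same counter, so every cpu delivering a copy fights over that one
cache line:

static inline struct dst_entry *dst_clone(struct dst_entry *dst)
{
	if (dst)
		atomic_inc(&dst->__refcnt);	/* same cache line touched by every cpu */
	return dst;
}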


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-02-28  8:51                             ` Eric Dumazet
  2009-03-01 17:03                               ` Eric Dumazet
@ 2009-03-04  8:16                               ` David Miller
  2009-03-04  8:36                                 ` Eric Dumazet
  1 sibling, 1 reply; 70+ messages in thread
From: David Miller @ 2009-03-04  8:16 UTC (permalink / raw)
  To: dada1; +Cc: kchang, netdev, cl

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 28 Feb 2009 09:51:11 +0100

> David, this is a preliminary work, not meant for inclusion as is,
> comments are welcome.
> 
> [PATCH] net: sk_forward_alloc becomes an atomic_t
> 
> Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
> (UDP: Add memory accounting) introduced a regression for high rate UDP flows,
> because of extra lock_sock() in udp_recvmsg()
> 
> In order to reduce need for lock_sock() in UDP receive path, we might need
> to declare sk_forward_alloc as an atomic_t.
> 
> udp_recvmsg() can avoid a lock_sock()/release_sock() pair.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

This adds new overhead for TCP which has to hold the socket
lock for other reasons in these paths.

I don't get how an atomic_t operation is cheaper than a
lock_sock/release_sock.  Is it the case that in many
executions of these paths only atomic_read()'s are necessary?

I actually think this scheme is racy.  There is a reason we
have to hold the socket lock when doing memory scheduling.
Two threads can get in there and say "hey I have enough space
already" even though only enough space is allocated for one
of their requests.

What did I miss? :)


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-04  8:16                               ` David Miller
@ 2009-03-04  8:36                                 ` Eric Dumazet
  2009-03-07  7:46                                   ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-03-04  8:36 UTC (permalink / raw)
  To: David Miller; +Cc: kchang, netdev, cl

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sat, 28 Feb 2009 09:51:11 +0100
> 
>> David, this is a preliminary work, not meant for inclusion as is,
>> comments are welcome.
>>
>> [PATCH] net: sk_forward_alloc becomes an atomic_t
>>
>> Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
>> (UDP: Add memory accounting) introduced a regression for high rate UDP flows,
>> because of extra lock_sock() in udp_recvmsg()
>>
>> In order to reduce need for lock_sock() in UDP receive path, we might need
>> to declare sk_forward_alloc as an atomic_t.
>>
>> udp_recvmsg() can avoid a lock_sock()/release_sock() pair.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> This adds new overhead for TCP which has to hold the socket
> lock for other reasons in these paths.
> 
> I don't get how an atomic_t operation is cheaper than a
> lock_sock/release_sock.  Is it the case that in many
> executions of these paths only atomic_read()'s are necessary?
> 
> I actually think this scheme is racy.  There is a reason we
> have to hold the socket lock when doing memory scheduling.
> Two threads can get in there and say "hey I have enough space
> already" even though only enough space is allocated for one
> of their requests.
> 
> What did I miss? :)
> 

I believe you are right, and in fact I was about to post a "don't look at this patch",
since it doesn't help the multicast reception at all; I redid the tests more carefully
and got nothing but noise.

We have a cache-line ping-pong mess here, and it needs more thought.

I rewrote Kenny's program to use non-blocking sockets.

The receivers are doing (a complete sketch follows below):

        int delay = 50;
	fcntl(s, F_SETFL, O_NDELAY);
        while(1)
        {
            struct sockaddr_in from;
            socklen_t fromlen = sizeof(from);
            res = recvfrom(s, buf, 1000, 0, (struct sockaddr*)&from, &fromlen);
            if (res == -1) {
                      delay++;
                      usleep(delay);
                      continue;
            }
            if (delay > 40)
                delay--;
            ++npackets;

With this little user-space change and 8 receivers on my dual quad core, ksoftirqd
only takes 8% of one cpu and there are no drops at all (instead of 100% cpu and 30% drops).

So this is definitely a problem of scheduler cache-line ping-pongs mixing with network
stack cache-line ping-pongs.

We could reorder fields so that fewer cache lines are touched by the softirq processing;
I tried this, but still got packet drops.
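
For anyone who wants to reproduce the experiment, here is a self-contained sketch of
such a non-blocking receiver; the group address, port and backoff constants are
placeholders, and only the recvfrom()/usleep() loop is the part under discussion:

/* mcast_nonblock.c : minimal non-blocking multicast receiver sketch */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in addr;
	struct ip_mreq mreq;
	char buf[1500];
	long npackets = 0;
	int delay = 50;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(12345);				/* placeholder port */
	bind(s, (struct sockaddr *)&addr, sizeof(addr));

	mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");	/* placeholder group */
	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
	setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

	fcntl(s, F_SETFL, O_NONBLOCK);				/* O_NDELAY above is the same flag */

	while (1) {
		ssize_t res = recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);

		if (res == -1) {
			delay++;				/* queue empty: back off a bit more */
			usleep(delay);
			continue;
		}
		if (delay > 40)
			delay--;				/* packets flowing: poll faster again */
		if (++npackets % 100000 == 0)
			printf("%ld packets\n", npackets);
	}
	return 0;
}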



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-04  8:36                                 ` Eric Dumazet
@ 2009-03-07  7:46                                   ` Eric Dumazet
  2009-03-08 16:46                                     ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-03-07  7:46 UTC (permalink / raw)
  To: kchang; +Cc: David Miller, netdev, cl, Brian Bloniarz

Eric Dumazet wrote:
> David Miller wrote:
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Sat, 28 Feb 2009 09:51:11 +0100
>>
>>> David, this is a preliminary work, not meant for inclusion as is,
>>> comments are welcome.
>>>
>>> [PATCH] net: sk_forward_alloc becomes an atomic_t
>>>
>>> Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
>>> (UDP: Add memory accounting) introduced a regression for high rate UDP flows,
>>> because of extra lock_sock() in udp_recvmsg()
>>>
>>> In order to reduce need for lock_sock() in UDP receive path, we might need
>>> to declare sk_forward_alloc as an atomic_t.
>>>
>>> udp_recvmsg() can avoid a lock_sock()/release_sock() pair.
>>>
>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> This adds new overhead for TCP which has to hold the socket
>> lock for other reasons in these paths.
>>
>> I don't get how an atomic_t operation is cheaper than a
>> lock_sock/release_sock.  Is it the case that in many
>> executions of these paths only atomic_read()'s are necessary?
>>
>> I actually think this scheme is racy.  There is a reason we
>> have to hold the socket lock when doing memory scheduling.
>> Two threads can get in there and say "hey I have enough space
>> already" even though only enough space is allocated for one
>> of their requests.
>>
>> What did I miss? :)
>>
> 
> I believe you are right, and in fact was about to post a "dont look at this patch"
> since it doesnt help the multicast reception at all, I redone tests more carefuly 
> and got nothing but noise.
> 
> We have a cache line ping pong mess here, and need more thinking.
> 
> I rewrote Kenny prog to use non blocking sockets.
> 
> Receivers are doing :
> 
>         int delay = 50;
> 	fcntl(s, F_SETFL, O_NDELAY);
>         while(1)
>         {
>             struct sockaddr_in from;
>             socklen_t fromlen = sizeof(from);
>             res = recvfrom(s, buf, 1000, 0, (struct sockaddr*)&from, &fromlen);
>             if (res == -1) {
>                       delay++;
>                       usleep(delay);
>                       continue;
>             }
>             if (delay > 40)
>                 delay--;
>             ++npackets;
> 
> With this litle user space change and 8 receivers on my dual quad core, softirqd
> only takes 8% of one cpu and no drops at all (instead of 100% cpu and 30% drops)
> 
> So this is definitly a problem mixing scheduler cache line ping pongs with network
> stack cache line ping pongs.
> 
> We could reorder fields so that fewer cache lines are touched by the softirq processing,
> I tried this but still got packet drops.
> 
> 
> 

I have more questions:

What is the maximum latency you can afford on the delivery of the packet(s)?

Are the user apps using real-time scheduling?

I had an idea: keep the cpu handling NIC interrupts only delivering packets to
socket queues, without touching the scheduler: fast queueing, then waking up
a workqueue (on another cpu) to perform the scheduler work. But that means
some extra latency (on the order of 2 or 3 us, I guess).

We could enter this mode automatically if the NIC rx handler *sees* more than
N packets waiting in the NIC queue: under moderate or light traffic, no
extra latency would be needed. This would mean some changes in the NIC driver.

Hmm, then again, if the NIC rx handler runs beside ksoftirqd, we already know
we are in a stress situation, so maybe no driver changes are necessary:
just test whether we are running in ksoftirqd...



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-07  7:46                                   ` Eric Dumazet
@ 2009-03-08 16:46                                     ` Eric Dumazet
  2009-03-09  2:49                                       ` David Miller
  2009-03-09 22:56                                       ` Brian Bloniarz
  0 siblings, 2 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-03-08 16:46 UTC (permalink / raw)
  To: kchang; +Cc: David Miller, netdev, cl, Brian Bloniarz

Eric Dumazet wrote:
> 
> I have more questions :
> 
> What is the maximum latency you can afford on the delivery of the packet(s) ?
> 
> Are user apps using real time scheduling ?
> 
> I had an idea, that keep cpu handling NIC interrupts only delivering packets to
> socket queues, and not messing with scheduler : fast queueing, and wakeing up
> a workqueue (on another cpu) to perform the scheduler work. But that means
> some extra latency (in the order of 2 or 3 us I guess)
> 
> We could enter in this mode automatically, if the NIC rx handler *see* more than
> N packets are waiting in NIC queue : In case of moderate or light trafic, no
> extra latency would be necessary. This would mean some changes in NIC driver.
> 
> Hum, then, if NIC rx handler is run beside the ksoftirqd, we already know
> we are in a stress situation, so maybe no driver changes are necessary :
> Just test if we run ksoftirqd...
> 

Here is a patch that helps. It's still an RFC of course, since it's somewhat ugly :)

I am now able to have 8 receivers on my 8-cpu machine, with one multicast packet every 10 us,
without any loss (standard setup, no affinity games).

The oprofile results show that the scheduler overhead vanished; we are back to a pure network profile :)

(The first offender is __copy_skb_header, because of the atomic_inc on the dst refcount.)

CPU: Core 2, speed 3000.43 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
126329   126329        20.4296  20.4296    __copy_skb_header
31395    157724         5.0771  25.5067    udp_queue_rcv_skb
29191    186915         4.7207  30.2274    sock_def_readable
26284    213199         4.2506  34.4780    sock_queue_rcv_skb
26010    239209         4.2063  38.6842    kmem_cache_alloc
20040    259249         3.2408  41.9251    __udp4_lib_rcv
19570    278819         3.1648  45.0899    skb_queue_tail
17799    296618         2.8784  47.9683    bnx2_poll_work
17267    313885         2.7924  50.7606    skb_release_data
14663    328548         2.3713  53.1319    __skb_recv_datagram
14443    342991         2.3357  55.4676    __slab_alloc
13248    356239         2.1424  57.6100    copy_to_user
13138    369377         2.1246  59.7347    __sk_mem_schedule
12004    381381         1.9413  61.6759    __skb_clone
11924    393305         1.9283  63.6042    skb_clone
11077    404382         1.7913  65.3956    lock_sock_nested
10320    414702         1.6689  67.0645    ip_route_input
9622     424324         1.5560  68.6205    dst_release
8344     432668         1.3494  69.9699    __slab_free
8124     440792         1.3138  71.2837    mwait_idle
7066     447858         1.1427  72.4264    udp_recvmsg
6652     454510         1.0757  73.5021    netif_receive_skb
6386     460896         1.0327  74.5349    ipt_do_table
6010     466906         0.9719  75.5068    release_sock
6003     472909         0.9708  76.4776    memcpy_toiovec
5697     478606         0.9213  77.3989    __alloc_skb
5671     484277         0.9171  78.3160    copy_from_user
5031     489308         0.8136  79.1296    sysenter_past_esp
4753     494061         0.7686  79.8982    bnx2_interrupt
4429     498490         0.7162  80.6145    sock_rfree


[PATCH] softirq: Introduce mechanism to defer wakeups

Some network workloads need to call the scheduler too many times. For example,
each received multicast frame can wake up many threads. ksoftirqd is then
not able to drain the NIC RX queues and we get frame losses and high latencies.

This patch adds an infrastructure to delay part of the work done in
sock_def_readable() until the end of do_softirq().


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/interrupt.h |    9 +++++++++
 include/net/sock.h        |    1 +
 kernel/softirq.c          |   29 ++++++++++++++++++++++++++++-
 net/core/sock.c           |   21 +++++++++++++++++++--
 4 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9127f6b..62caaae 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -295,6 +295,15 @@ extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softir
 extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
 				  int this_cpu, int softirq);
 
+/*
+ * delayed works : should be delayed at do_softirq() end
+ */
+struct softirq_del {
+	struct softirq_del	*next;
+	void 			(*func)(struct softirq_del *);
+};
+int softirq_del(struct softirq_del *sdel, void (*func)(struct softirq_del *));
+
 /* Tasklets --- multithreaded analogue of BHs.
 
    Main feature differing them of generic softirqs: tasklet
diff --git a/include/net/sock.h b/include/net/sock.h
index eefeeaf..95841de 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -260,6 +260,7 @@ struct sock {
 	unsigned long	        sk_lingertime;
 	struct sk_buff_head	sk_error_queue;
 	struct proto		*sk_prot_creator;
+	struct softirq_del	sk_del;
 	rwlock_t		sk_callback_lock;
 	int			sk_err,
 				sk_err_soft;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bdbe9de..40fe527 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -158,6 +158,33 @@ void local_bh_enable_ip(unsigned long ip)
 }
 EXPORT_SYMBOL(local_bh_enable_ip);
 
+
+static DEFINE_PER_CPU(struct softirq_del *, softirq_del_head);
+int softirq_del(struct softirq_del *sdel, void (*func)(struct softirq_del *))
+{
+	if (cmpxchg(&sdel->func, NULL, func) == NULL) {
+		sdel->next = __get_cpu_var(softirq_del_head);
+		__get_cpu_var(softirq_del_head) = sdel;
+		return 1;
+	}
+	return 0;
+}
+
+static void softirqdel_exec(void)
+{
+	struct softirq_del *sdel;
+	void (*func)(struct softirq_del *);
+
+	while ((sdel = __get_cpu_var(softirq_del_head)) != NULL) {
+		__get_cpu_var(softirq_del_head) = sdel->next;
+		func = sdel->func;
+		sdel->func = NULL;
+		(*func)(sdel);
+		}
+}
+
+
+
 /*
  * We restart softirq processing MAX_SOFTIRQ_RESTART times,
  * and we fall back to softirqd after that.
@@ -219,7 +246,7 @@ restart:
 
 	if (pending)
 		wakeup_softirqd();
-
+	softirqdel_exec();
 	trace_softirq_exit();
 
 	account_system_vtime(current);
diff --git a/net/core/sock.c b/net/core/sock.c
index 5f97caa..f9ee8dd 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1026,6 +1026,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 #endif
 
 		rwlock_init(&newsk->sk_dst_lock);
+		newsk->sk_del.func = NULL;
 		rwlock_init(&newsk->sk_callback_lock);
 		lockdep_set_class_and_name(&newsk->sk_callback_lock,
 				af_callback_keys + newsk->sk_family,
@@ -1634,12 +1635,27 @@ static void sock_def_error_report(struct sock *sk)
 	read_unlock(&sk->sk_callback_lock);
 }
 
+static void sock_readable_defer(struct softirq_del *sdel)
+{
+	struct sock *sk = container_of(sdel, struct sock, sk_del);
+
+	wake_up_interruptible_sync(sk->sk_sleep);
+	read_unlock(&sk->sk_callback_lock);
+}
+
 static void sock_def_readable(struct sock *sk, int len)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_sync(sk->sk_sleep);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
+		if (in_softirq()) {
+			if (!softirq_del(&sk->sk_del, sock_readable_defer))
+				goto unlock;
+			return;
+		}
+		wake_up_interruptible_sync(sk->sk_sleep);
+	}
+unlock:
 	read_unlock(&sk->sk_callback_lock);
 }
 
@@ -1720,6 +1736,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 		sk->sk_sleep	=	NULL;
 
 	rwlock_init(&sk->sk_dst_lock);
+	sk->sk_del.func		=	NULL;
 	rwlock_init(&sk->sk_callback_lock);
 	lockdep_set_class_and_name(&sk->sk_callback_lock,
 			af_callback_keys + sk->sk_family,


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-08 16:46                                     ` Eric Dumazet
@ 2009-03-09  2:49                                       ` David Miller
  2009-03-09  6:36                                         ` Eric Dumazet
  2009-03-09 22:56                                       ` Brian Bloniarz
  1 sibling, 1 reply; 70+ messages in thread
From: David Miller @ 2009-03-09  2:49 UTC (permalink / raw)
  To: dada1; +Cc: kchang, netdev, cl, bmb

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sun, 08 Mar 2009 17:46:13 +0100

> +	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
> +		if (in_softirq()) {
> +			if (!softirq_del(&sk->sk_del, sock_readable_defer))
> +				goto unlock;
> +			return;
> +		}

This is interesting.

I think you should make softirq_del() more flexible.  Make it the
socket's job to make sure it doesn't try to defer different
functions, and put the onus on locking there as well.

The cmpxchg() and all of this checking is just wasted work.

I'd really like to get rid of that callback lock too, then we'd
really be in business. :-)


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-09  2:49                                       ` David Miller
@ 2009-03-09  6:36                                         ` Eric Dumazet
  2009-03-13 21:51                                           ` David Miller
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-03-09  6:36 UTC (permalink / raw)
  To: David Miller; +Cc: kchang, netdev, cl, bmb

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sun, 08 Mar 2009 17:46:13 +0100
> 
>> +	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
>> +		if (in_softirq()) {
>> +			if (!softirq_del(&sk->sk_del, sock_readable_defer))
>> +				goto unlock;
>> +			return;
>> +		}
> 
> This is interesting.
> 
> I think you should make softirq_del() more flexible.  Make it the
> socket's job to make sure it doesn't try to defer different
> functions, and put the onus on locking there as well.
> 
> The cmpxchg() and all of this checking is just wasted work.
> 
> I'd really like to get rid of that callback lock too, then we'd
> really be in business. :-)

First, thanks for your review, David.

I chose cmpxchg() because I needed some form of exclusion here.
I first added a spinlock inside "struct softirq_del", then I realized
I could use cmpxchg() instead and keep the structure small. As the
synchronization is only needed at queueing time, we could pass
the address of a spinlock XXX to the softirq_del() call.

Also, when an event was queued for later invocation, I needed to keep
a reference on "struct socket" to make sure it doesn't disappear before
the invocation. Not all sockets are RCU-guarded (we added RCU only for
some protocols: TCP, UDP, ...). So keeping a read_lock
on the callback lock was the easiest thing to do. I now realize we might
overflow preempt_count, so special care is needed.

About your first point, maybe we should make softirq_del() (a poor name, I confess)
take only one argument (a pointer to struct softirq_del), and initialize
the function pointer at socket init time. That would ensure "struct softirq_del"
is associated with one callback only. The cmpxchg() test would then have to be
done on the "next" field (or use the spinlock XXX).

I am not sure the output path needs such tricks, since threads rarely
block on output: we don't trigger 400,000 wakeups per second there, do we?

Another point: I did a tbench test and got 2517 MB/s with the patch,
instead of 2538 MB/s (using Linus' 2.6 git tree); that's a ~0.8% regression
for this workload.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-08 16:46                                     ` Eric Dumazet
  2009-03-09  2:49                                       ` David Miller
@ 2009-03-09 22:56                                       ` Brian Bloniarz
  2009-03-10  5:28                                         ` Eric Dumazet
  1 sibling, 1 reply; 70+ messages in thread
From: Brian Bloniarz @ 2009-03-09 22:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: kchang, netdev

Eric Dumazet wrote:
> Here is a patch that helps. It's still an RFC of course, since its somewhat ugly :)

Hi Eric,

I did some experimenting with this patch today -- we're users, not kernel hackers,
but the performance looks great. We see no loss with mcasttest, and no loss with
our internal test programs (which do much more user-space work). We're very
encouraged :)

One thing I'm curious about: previously, setting /proc/irq/<eth0>/smp_affinity
to one CPU made things perform better, but with this patch, performance is better
with smp_affinity == ff than with smp_affinity == 1. Do you know why that
is? Our tests are all with bnx2 msi_disable=1. I can investigate with oprofile
tomorrow.

Thank you for your continued help, we all deeply appreciate having someone
looking at this workload.

Thanks,
Brian Bloniarz

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-09 22:56                                       ` Brian Bloniarz
@ 2009-03-10  5:28                                         ` Eric Dumazet
  2009-03-10 23:22                                           ` Brian Bloniarz
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-03-10  5:28 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: kchang, netdev

Brian Bloniarz wrote:
> Eric Dumazet wrote:
>> Here is a patch that helps. It's still an RFC of course, since its
>> somewhat ugly :)
> 
> Hi Eric,
> 
> I did some experimenting with this patch today -- we're users, not
> kernel hackers,
> but the performance looks great. We see no loss with mcasttest, and no
> loss with
> our internal test programs (which do much more user-space work). We're very
> encouraged :)
> 
> One thing I'm curious about: previously, setting
> /proc/irq/<eth0>/smp_affinity
> to one CPU made things perform better, but with this patch, performance
> is better
> with smp_affinity == ff than with smp_affinity == 1. Do you know why that
> is? Our tests are all with bnx2 msi_disable=1. I can investigate with
> oprofile
> tomorrow.
> 

Well, in my opinion smp_affinity can help if you dedicate
one cpu to the NIC and the others to the user apps, and if the average
work done per packet is large. If the load is light, it's better
to use the same cpu to perform all the work, since no expensive
bus traffic is needed between cpus to exchange memory lines.

If you only change /proc/irq/<eth0>/smp_affinity, and let the scheduler
choose any cpu for your user-space work, which can have long latencies,
I would not expect better performance.

Try to affine your tasks to 0xFE to get better determinism.
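
In case it helps, here is a minimal sketch of what affining the receivers to mask 0xFE
means in code, assuming an 8-cpu box with the NIC interrupt left on CPU0 (from a shell,
taskset with mask 0xfe does the same thing):

/* Pin the calling process to CPUs 1..7 (mask 0xFE), leaving CPU0 free for
 * the NIC interrupt.  Error handling omitted for brevity. */
#define _GNU_SOURCE
#include <sched.h>

static void pin_away_from_cpu0(void)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	for (cpu = 1; cpu < 8; cpu++)
		CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* pid 0 == the calling task */
}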




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-10  5:28                                         ` Eric Dumazet
@ 2009-03-10 23:22                                           ` Brian Bloniarz
  2009-03-11  3:00                                             ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Brian Bloniarz @ 2009-03-10 23:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: kchang, netdev

Hi Eric,

FYI: with your patch applied and lockdep enabled, I see:
[   39.114628] ================================================
[   39.121964] [ BUG: lock held when returning to user space! ]
[   39.127704] ------------------------------------------------
[   39.133461] msgtest/5242 is leaving the kernel with locks still held!
[   39.140132] 1 lock held by msgtest/5242:
[   39.144287]  #0:  (clock-AF_INET){-.-?}, at: [<ffffffff8041f5b9>] sock_def_readable+0x19/0xb0

I can't reproduce this with the mcasttest program yet; it
was with an internal test program which does some userspace
processing on the messages. I'll let you know if I find a way
to reproduce it with a simple program I can share.

 > Well, smp_affinity could help in my opininon if you dedicate
 > one cpu for the NIC, and others for user apps, if the average
 > work done per packet is large. If load is light, its better
 > to use the same cpu to perform all the work, since no expensive
 > bus trafic is needed between cpu to exchange memory lines.

I tried this setup as well: an 8-core box with 4 userspace
processes, each affined to an individual CPU1-4. The IRQ was on
CPU0. On most kernels, this setup loses fewer packets than the default
affinity (though they both lose some). With your patch enabled, the
default affinity loses 0 packets, and this setup loses some.

Thanks,
Brian Bloniarz


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-10 23:22                                           ` Brian Bloniarz
@ 2009-03-11  3:00                                             ` Eric Dumazet
  2009-03-12 15:47                                               ` Brian Bloniarz
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-03-11  3:00 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: kchang, netdev, David S. Miller

Brian Bloniarz wrote:
> Hi Eric,
> 
> FYI: with your patch applied and lockdep enabled, I see:
> [   39.114628] ================================================
> [   39.121964] [ BUG: lock held when returning to user space! ]
> [   39.127704] ------------------------------------------------
> [   39.133461] msgtest/5242 is leaving the kernel with locks still held!
> [   39.140132] 1 lock held by msgtest/5242:
> [   39.144287]  #0:  (clock-AF_INET){-.-?}, at: [<ffffffff8041f5b9>]
> sock_def_readable+0x19/0xb0

And you told me you were not a kernel hacker ;)

> 
> I can't reproduced this with the mcasttest program yet, it
> was with an internal test program which does some userspace
> processing on the messages. I'll let you know if I find a way
> to reproduce it with a simple program I can share.

I reproduced it as well here quite easily with a tcpdump of a TCP session;
thanks for the report.

It seems "if (in_softirq())" doesn't do what I thought.

I wanted to test whether we were called from the __do_softirq() handler,
since only that function calls softirq_delay_exec()
to dequeue events.

It appears I have to make current->softirq_context available
even if !CONFIG_TRACE_IRQFLAGS.
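
For the record, the distinction being drawn here, paraphrased from the headers of
that era (not new code): in_softirq() only looks at the softirq bits of
preempt_count, so it is also true in process context under local_bh_disable(),
where softirq_delay_exec() never runs and the queued wakeup (plus the read-held
sk_callback_lock) leaks back to user space, which would explain the lockdep splat
above. current->softirq_context, on the other hand, counts actual handler nesting:

#define softirq_count()	(preempt_count() & SOFTIRQ_MASK)
#define in_softirq()	(softirq_count())	/* also true under local_bh_disable() */
/* current->softirq_context is only incremented around the softirq handler,
 * via trace_softirq_enter()/trace_softirq_exit(); see the patch below. */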

Here is an updated version of the patch.

I also made sure the call to softirq_delay_exec() is performed
with interrupts enabled, and that the preempt count won't
overflow if many events are queued.

[PATCH] softirq: Introduce mechanism to defer wakeups

Some network workloads need to call the scheduler too many times. For example,
each received multicast frame can wake up many threads. ksoftirqd is then
not able to drain the NIC RX queues and we get frame losses and high latencies.

This patch adds an infrastructure to delay part of the work done in
sock_def_readable() until the end of do_softirq(). This requires making
current->softirq_context available even if !CONFIG_TRACE_IRQFLAGS.


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/interrupt.h |   18 +++++++++++++++++
 include/linux/irqflags.h  |    7 ++----
 include/linux/sched.h     |    2 -
 include/net/sock.h        |    1
 kernel/softirq.c          |   34 +++++++++++++++++++++++++++++++++
 net/core/sock.c           |   37 ++++++++++++++++++++++++++++++++++--
 6 files changed, 92 insertions(+), 7 deletions(-)


diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9127f6b..a773d0c 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -295,6 +295,24 @@ extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softir
 extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
 				  int this_cpu, int softirq);
 
+/*
+ * softirq delayed works : should be delayed at do_softirq() end
+ */
+struct softirq_delay {
+	struct softirq_delay	*next;
+	void 			(*func)(struct softirq_delay *);
+};
+
+int softirq_delay_queue(struct softirq_delay *sdel);
+
+static inline void softirq_delay_init(struct softirq_delay *sdel,
+				      void (*func)(struct softirq_delay *))
+{
+	sdel->next = NULL;
+	sdel->func = func;
+}
+
+
 /* Tasklets --- multithreaded analogue of BHs.
 
    Main feature differing them of generic softirqs: tasklet
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 74bde13..fe55ec4 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -13,6 +13,9 @@
 
 #include <linux/typecheck.h>
 
+#define trace_softirq_enter()	do { current->softirq_context++; } while (0)
+#define trace_softirq_exit()	do { current->softirq_context--; } while (0)
+
 #ifdef CONFIG_TRACE_IRQFLAGS
   extern void trace_softirqs_on(unsigned long ip);
   extern void trace_softirqs_off(unsigned long ip);
@@ -24,8 +27,6 @@
 # define trace_softirqs_enabled(p)	((p)->softirqs_enabled)
 # define trace_hardirq_enter()	do { current->hardirq_context++; } while (0)
 # define trace_hardirq_exit()	do { current->hardirq_context--; } while (0)
-# define trace_softirq_enter()	do { current->softirq_context++; } while (0)
-# define trace_softirq_exit()	do { current->softirq_context--; } while (0)
 # define INIT_TRACE_IRQFLAGS	.softirqs_enabled = 1,
 #else
 # define trace_hardirqs_on()		do { } while (0)
@@ -38,8 +39,6 @@
 # define trace_softirqs_enabled(p)	0
 # define trace_hardirq_enter()		do { } while (0)
 # define trace_hardirq_exit()		do { } while (0)
-# define trace_softirq_enter()		do { } while (0)
-# define trace_softirq_exit()		do { } while (0)
 # define INIT_TRACE_IRQFLAGS
 #endif
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8c216e0..5dd8487 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1320,8 +1320,8 @@ struct task_struct {
 	unsigned long softirq_enable_ip;
 	unsigned int softirq_enable_event;
 	int hardirq_context;
-	int softirq_context;
 #endif
+	int softirq_context;
 #ifdef CONFIG_LOCKDEP
 # define MAX_LOCK_DEPTH 48UL
 	u64 curr_chain_key;
diff --git a/include/net/sock.h b/include/net/sock.h
index eefeeaf..1bfd9b8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -260,6 +260,7 @@ struct sock {
 	unsigned long	        sk_lingertime;
 	struct sk_buff_head	sk_error_queue;
 	struct proto		*sk_prot_creator;
+	struct softirq_delay	sk_delay;
 	rwlock_t		sk_callback_lock;
 	int			sk_err,
 				sk_err_soft;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 9041ea7..c601730 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -158,6 +158,38 @@ void local_bh_enable_ip(unsigned long ip)
 }
 EXPORT_SYMBOL(local_bh_enable_ip);
 
+
+#define SOFTIRQ_DELAY_END (struct softirq_delay *)1L
+
+static DEFINE_PER_CPU(struct softirq_delay *, softirq_delay_head) = {
+	SOFTIRQ_DELAY_END
+};
+
+/*
+ * Preemption is disabled by caller
+ */
+int softirq_delay_queue(struct softirq_delay *sdel)
+{
+	if (cmpxchg(&sdel->next, NULL, __get_cpu_var(softirq_delay_head)) == NULL) {
+		__get_cpu_var(softirq_delay_head) = sdel;
+		return 1;
+	}
+	return 0;
+}
+
+static void softirq_delay_exec(void)
+{
+	struct softirq_delay *sdel;
+
+	while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
+		__get_cpu_var(softirq_delay_head) = sdel->next;
+		sdel->next = NULL;
+		sdel->func(sdel);
+		}
+}
+
+
+
 /*
  * We restart softirq processing MAX_SOFTIRQ_RESTART times,
  * and we fall back to softirqd after that.
@@ -211,6 +243,8 @@ restart:
 		pending >>= 1;
 	} while (pending);
 
+	softirq_delay_exec();
+
 	local_irq_disable();
 
 	pending = local_softirq_pending();
diff --git a/net/core/sock.c b/net/core/sock.c
index 5f97caa..d51d57d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -212,6 +212,8 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+static void sock_readable_defer(struct softirq_delay *sdel);
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1026,6 +1028,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 #endif
 
 		rwlock_init(&newsk->sk_dst_lock);
+		softirq_delay_init(&newsk->sk_delay, sock_readable_defer);
 		rwlock_init(&newsk->sk_callback_lock);
 		lockdep_set_class_and_name(&newsk->sk_callback_lock,
 				af_callback_keys + newsk->sk_family,
@@ -1634,12 +1637,41 @@ static void sock_def_error_report(struct sock *sk)
 	read_unlock(&sk->sk_callback_lock);
 }
 
+static void sock_readable_defer(struct softirq_delay *sdel)
+{
+	struct sock *sk = container_of(sdel, struct sock, sk_delay);
+
+	wake_up_interruptible_sync(sk->sk_sleep);
+	/*
+	 * Before unlocking, we increase preempt_count,
+	 * as it was decreased in sock_def_readable()
+	 */
+	preempt_disable();
+	read_unlock(&sk->sk_callback_lock);
+}
+
 static void sock_def_readable(struct sock *sk, int len)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_sync(sk->sk_sleep);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
+		if (current->softirq_context) {
+			/*
+			 * If called from __do_softirq(), we want to delay
+			 * calls to wake_up_interruptible_sync()
+			 */
+			if (!softirq_delay_queue(&sk->sk_delay))
+				goto unlock;
+			/*
+			 * We keep sk->sk_callback_lock read locked,
+			 * but decrease preempt_count to avoid an overflow
+			 */
+			preempt_enable_no_resched();
+			return;
+		}
+		wake_up_interruptible_sync(sk->sk_sleep);
+	}
+unlock:
 	read_unlock(&sk->sk_callback_lock);
 }
 
@@ -1720,6 +1752,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 		sk->sk_sleep	=	NULL;
 
 	rwlock_init(&sk->sk_dst_lock);
+	softirq_delay_init(&sk->sk_delay, sock_readable_defer);
 	rwlock_init(&sk->sk_callback_lock);
 	lockdep_set_class_and_name(&sk->sk_callback_lock,
 			af_callback_keys + sk->sk_family,


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-11  3:00                                             ` Eric Dumazet
@ 2009-03-12 15:47                                               ` Brian Bloniarz
  2009-03-12 16:34                                                 ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: Brian Bloniarz @ 2009-03-12 15:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: kchang, netdev, David S. Miller

Eric Dumazet wrote:
> Here is an updated version of the patch.

This works great in all my tests so far.

Thanks,
Brian Bloniarz

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-12 15:47                                               ` Brian Bloniarz
@ 2009-03-12 16:34                                                 ` Eric Dumazet
  0 siblings, 0 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-03-12 16:34 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: kchang, netdev, David S. Miller

Brian Bloniarz wrote:
> Eric Dumazet wrote:
>> Here is an updated version of the patch.
> 
> This works great in all my tests so far.
> 
> Thanks,
> Brian Bloniarz

Cool

I am wondering if we should extend the mechanism and change
softirq_delay_exec() to wake up a workqueue instead of
doing the loop from the softirq handler, in case a given
level of stress / load is hit.

This could help machines with several cpus and one NIC (without
multiple RX queues) flooded by messages (not necessarily multicast traffic).
Imagine a media/chat server receiving XXX.000 frames / second.

One cpu could be dedicated to pure softirq/network handling,
and other cpus could participate and handle the scheduler part if any.

The condition could be:

- We run __do_softirq() from ksoftirqd and 
- We queued more than N 'struct softirq_delay' in softirq_delay_head
- We have more than one cpu online
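
A rough sketch of what that check could look like (purely illustrative, not
a posted patch: softirq_delay_count, softirq_delay_work and
SOFTIRQ_DELAY_THRESH are made-up names, and comparing current against the
per-cpu ksoftirqd task is just one possible way to detect the first
condition):

	/*
	 * Hypothetical sketch: offload the deferred wakeups to process
	 * context via a workqueue when ksoftirqd is under pressure,
	 * instead of running them from the softirq handler.
	 */
	static void softirq_delay_exec_or_offload(void)
	{
		if (current == __get_cpu_var(ksoftirqd) &&	/* running from ksoftirqd */
		    __get_cpu_var(softirq_delay_count) > SOFTIRQ_DELAY_THRESH &&
		    num_online_cpus() > 1)
			/* defer the drain to process context instead */
			schedule_work(&__get_cpu_var(softirq_delay_work));
		else
			softirq_delay_exec();
	}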


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-09  6:36                                         ` Eric Dumazet
@ 2009-03-13 21:51                                           ` David Miller
  2009-03-13 22:30                                             ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: David Miller @ 2009-03-13 21:51 UTC (permalink / raw)
  To: dada1; +Cc: kchang, netdev, cl, bmb

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 09 Mar 2009 07:36:57 +0100

> I chose cmpxchg() because I needed some form of exclusion here.
> I first added a spinlock inside "struct softirq_del" then I realize
> I could use cmpxchg() instead and keep the structure small. As the
> synchronization is only needed at queueing time, we could pass
> the address of a spinlock XXX to sofirq_del() call.

I don't understand why you need the mutual exclusion in the
first place.  The function pointer always has the same value.
And this locking isn't protecting the list insertion either,
as that isn't even necessary.

It just looks like plain overhead to me.

> Also, when an event was queued for later invocation, I also needed to keep
> a reference on "struct socket" to make sure it doesnt disappear before
> the invocation. Not all sockets are RCU guarded (we added RCU only for 
> some protocols (TCP, UDP ...). So I found keeping a read_lock
> on callback was the easyest thing to do. I now realize we might
> overflow preempt_count, so special care is needed.

You're using this in UDP so... make the rule that you can't use
this with a non-RCU-quiescent protocol.

> About your first point, maybe we should make sofirq_del() (poor name
> I confess) only have one argument (pointer to struct softirq_del),
> and initialize the function pointer at socket init time. That would
> insure "struct softirq_del" is associated to one callback
> only. cmpxchg() test would have to be done on "next" field then (or
> use the spinlock XXX)

Why?  You run this from softirq-safe context, therefore nothing can run
other softirqs on this cpu and corrupt the list.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-13 21:51                                           ` David Miller
@ 2009-03-13 22:30                                             ` Eric Dumazet
  2009-03-13 22:38                                               ` David Miller
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-03-13 22:30 UTC (permalink / raw)
  To: David Miller; +Cc: kchang, netdev, cl, bmb

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Mon, 09 Mar 2009 07:36:57 +0100
> 
>> I chose cmpxchg() because I needed some form of exclusion here.
>> I first added a spinlock inside "struct softirq_del" then I realize
>> I could use cmpxchg() instead and keep the structure small. As the
>> synchronization is only needed at queueing time, we could pass
>> the address of a spinlock XXX to sofirq_del() call.
> 
> I don't understand why you need the mutual exclusion in the
> first place.  The function pointer always has the same value.
> And this locking isn't protecting the list insertion either,
> as that isn't even necessary.
> 
> It just looks like plain overhead to me.

I was too lazy to check that all callers (all protocols) hold a lock on the sock,
and preferred safety.

I was fooled by the read_lock(), and thought several cpus could call
this function in parallel.


> 
>> Also, when an event was queued for later invocation, I also needed to keep
>> a reference on "struct socket" to make sure it doesnt disappear before
>> the invocation. Not all sockets are RCU guarded (we added RCU only for 
>> some protocols (TCP, UDP ...). So I found keeping a read_lock
>> on callback was the easyest thing to do. I now realize we might
>> overflow preempt_count, so special care is needed.
> 
> You're using this in UDP so... make the rule that you can't use
> this with a non-RCU-quiescent protocol.

UDP/TCP only? I thought many other protocols (not all using RCU) were
using sock_def_readable() too...



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-13 22:30                                             ` Eric Dumazet
@ 2009-03-13 22:38                                               ` David Miller
  2009-03-13 22:45                                                 ` Eric Dumazet
  2009-03-16 22:22                                                 ` Multicast packet loss Eric Dumazet
  0 siblings, 2 replies; 70+ messages in thread
From: David Miller @ 2009-03-13 22:38 UTC (permalink / raw)
  To: dada1; +Cc: kchang, netdev, cl, bmb

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 13 Mar 2009 23:30:31 +0100

> David Miller a écrit :
> >> Also, when an event was queued for later invocation, I also needed to keep
> >> a reference on "struct socket" to make sure it doesnt disappear before
> >> the invocation. Not all sockets are RCU guarded (we added RCU only for 
> >> some protocols (TCP, UDP ...). So I found keeping a read_lock
> >> on callback was the easyest thing to do. I now realize we might
> >> overflow preempt_count, so special care is needed.
> > 
> > You're using this in UDP so... make the rule that you can't use
> > this with a non-RCU-quiescent protocol.
> 
> UDP/TCP only ? I though many other protocols (not all using RCU) were
> using sock_def_readable() too...

Maybe create a inet_def_readable() just for this purpose :-)

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-13 22:38                                               ` David Miller
@ 2009-03-13 22:45                                                 ` Eric Dumazet
  2009-03-14  9:03                                                   ` [PATCH] net: reorder fields of struct socket Eric Dumazet
  2009-03-16 22:22                                                 ` Multicast packet loss Eric Dumazet
  1 sibling, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-03-13 22:45 UTC (permalink / raw)
  To: David Miller; +Cc: kchang, netdev, cl, bmb

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 13 Mar 2009 23:30:31 +0100
> 
>> David Miller a écrit :
>>>> Also, when an event was queued for later invocation, I also needed to keep
>>>> a reference on "struct socket" to make sure it doesnt disappear before
>>>> the invocation. Not all sockets are RCU guarded (we added RCU only for 
>>>> some protocols (TCP, UDP ...). So I found keeping a read_lock
>>>> on callback was the easyest thing to do. I now realize we might
>>>> overflow preempt_count, so special care is needed.
>>> You're using this in UDP so... make the rule that you can't use
>>> this with a non-RCU-quiescent protocol.
>> UDP/TCP only ? I though many other protocols (not all using RCU) were
>> using sock_def_readable() too...
> 
> Maybe create a inet_def_readable() just for this purpose :-)

I must be tired, I should have had this idea before you :)

I'll post a new patch after some rest, I definitely should not still be awake!



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH] net: reorder fields of struct socket
  2009-03-13 22:45                                                 ` Eric Dumazet
@ 2009-03-14  9:03                                                   ` Eric Dumazet
  2009-03-16  2:59                                                     ` David Miller
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-03-14  9:03 UTC (permalink / raw)
  To: David Miller; +Cc: kchang, netdev, bmb

On x86_64, it's rather unfortunate that the "wait_queue_head_t wait"
field of "struct socket" spans two cache lines (assuming a 64
byte cache line on current cpus):

offsetof(struct socket, wait)=0x30
sizeof(wait_queue_head_t)=0x18

That is, the field occupies bytes 0x30-0x47 (48-71) and straddles the
64-byte line boundary.

This might explain why Kenny Chang noticed that his multicast workload
was performing badly with 64-bit kernels, since more cache line ping-pongs
were involved.

This little patch moves the "wait" field next to "fasync_list" so that both
fields share a single cache line, to speed up sock_def_readable().

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/include/linux/net.h b/include/linux/net.h
index 4515efa..4fc2ffd 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -129,11 +129,15 @@ struct socket {
 	socket_state		state;
 	short			type;
 	unsigned long		flags;
-	const struct proto_ops	*ops;
+	/*
+	 * Please keep fasync_list & wait fields in the same cache line
+	 */
 	struct fasync_struct	*fasync_list;
+	wait_queue_head_t	wait;
+
 	struct file		*file;
 	struct sock		*sk;
-	wait_queue_head_t	wait;
+	const struct proto_ops	*ops;
 };
 
 struct vm_area_struct;


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH] net: reorder fields of struct socket
  2009-03-14  9:03                                                   ` [PATCH] net: reorder fields of struct socket Eric Dumazet
@ 2009-03-16  2:59                                                     ` David Miller
  0 siblings, 0 replies; 70+ messages in thread
From: David Miller @ 2009-03-16  2:59 UTC (permalink / raw)
  To: dada1; +Cc: kchang, netdev, bmb

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 14 Mar 2009 10:03:54 +0100

> On x86_64, its rather unfortunate that "wait_queue_head_t wait"
> field of "struct socket" spans two cache lines (assuming a 64
> bytes cache line in current cpus)
> 
> offsetof(struct socket, wait)=0x30
> sizeof(wait_queue_head_t)=0x18
> 
> This might explain why Kenny Chang noticed that his multicast workload
> was performing bad with 64 bit kernels, since more cache lines ping pongs
> were involved.
> 
> This litle patch moves "wait" field next "fasync_list" so that both
> fields share a single cache line, to speedup sock_def_readable()
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Applied, thanks a lot Eric.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-13 22:38                                               ` David Miller
  2009-03-13 22:45                                                 ` Eric Dumazet
@ 2009-03-16 22:22                                                 ` Eric Dumazet
  2009-03-17 10:11                                                   ` Peter Zijlstra
  2009-04-03 19:28                                                   ` Brian Bloniarz
  1 sibling, 2 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-03-16 22:22 UTC (permalink / raw)
  To: David Miller; +Cc: kchang, netdev, cl, bmb

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 13 Mar 2009 23:30:31 +0100
> 
>> David Miller a écrit :
>>>> Also, when an event was queued for later invocation, I also needed to keep
>>>> a reference on "struct socket" to make sure it doesnt disappear before
>>>> the invocation. Not all sockets are RCU guarded (we added RCU only for 
>>>> some protocols (TCP, UDP ...). So I found keeping a read_lock
>>>> on callback was the easyest thing to do. I now realize we might
>>>> overflow preempt_count, so special care is needed.
>>> You're using this in UDP so... make the rule that you can't use
>>> this with a non-RCU-quiescent protocol.
>> UDP/TCP only ? I though many other protocols (not all using RCU) were
>> using sock_def_readable() too...
> 
> Maybe create a inet_def_readable() just for this purpose :-)


Here is the last incantation of the patch, which of course should be
split in two parts, with a better changelog, for further discussion on lkml.

We need to take a reference on the sock when it is queued on a softirq delay
list. RCU won't help here because of the SLAB_DESTROY_BY_RCU thing:
another cpu could free/reuse the socket before we have a chance to
call softirq_delay_exec().

UDP & UDPLite use this delayed wakeup feature.

Thank you

[PATCH] softirq: Introduce mechanism to defer wakeups

Some network workloads need to call the scheduler too many times. For example,
each received multicast frame can wake up many threads. ksoftirqd is then
not able to drain NIC RX queues in time and we get frame losses and high
latencies.

This patch adds an infrastructure to delay work done in
sock_def_readable() to the end of do_softirq(). This needs to
make current->softirq_context available even if !CONFIG_TRACE_IRQFLAGS.


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/interrupt.h |   18 +++++++++++++++
 include/linux/irqflags.h  |   11 ++++-----
 include/linux/sched.h     |    2 -
 include/net/sock.h        |    2 +
 include/net/udplite.h     |    1
 kernel/lockdep.c          |    2 -
 kernel/softirq.c          |   42 ++++++++++++++++++++++++++++++++++--
 lib/locking-selftest.c    |    4 +--
 net/core/sock.c           |   41 +++++++++++++++++++++++++++++++++++
 net/ipv4/udp.c            |    7 ++++++
 net/ipv6/udp.c            |    7 ++++++
 11 files changed, 125 insertions(+), 12 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9127f6b..a773d0c 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -295,6 +295,24 @@ extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softir
 extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
 				  int this_cpu, int softirq);
 
+/*
+ * softirq delayed works : should be delayed at do_softirq() end
+ */
+struct softirq_delay {
+	struct softirq_delay	*next;
+	void 			(*func)(struct softirq_delay *);
+};
+
+int softirq_delay_queue(struct softirq_delay *sdel);
+
+static inline void softirq_delay_init(struct softirq_delay *sdel,
+				      void (*func)(struct softirq_delay *))
+{
+	sdel->next = NULL;
+	sdel->func = func;
+}
+
+
 /* Tasklets --- multithreaded analogue of BHs.
 
    Main feature differing them of generic softirqs: tasklet
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 74bde13..30c1e01 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -13,19 +13,21 @@
 
 #include <linux/typecheck.h>
 
+#define softirq_enter()	do { current->softirq_context++; } while (0)
+#define softirq_exit()	do { current->softirq_context--; } while (0)
+#define softirq_context(p)	((p)->softirq_context)
+#define running_from_softirq()  (softirq_context(current) > 0)
+
 #ifdef CONFIG_TRACE_IRQFLAGS
   extern void trace_softirqs_on(unsigned long ip);
   extern void trace_softirqs_off(unsigned long ip);
   extern void trace_hardirqs_on(void);
   extern void trace_hardirqs_off(void);
 # define trace_hardirq_context(p)	((p)->hardirq_context)
-# define trace_softirq_context(p)	((p)->softirq_context)
 # define trace_hardirqs_enabled(p)	((p)->hardirqs_enabled)
 # define trace_softirqs_enabled(p)	((p)->softirqs_enabled)
 # define trace_hardirq_enter()	do { current->hardirq_context++; } while (0)
 # define trace_hardirq_exit()	do { current->hardirq_context--; } while (0)
-# define trace_softirq_enter()	do { current->softirq_context++; } while (0)
-# define trace_softirq_exit()	do { current->softirq_context--; } while (0)
 # define INIT_TRACE_IRQFLAGS	.softirqs_enabled = 1,
 #else
 # define trace_hardirqs_on()		do { } while (0)
@@ -33,13 +35,10 @@
 # define trace_softirqs_on(ip)		do { } while (0)
 # define trace_softirqs_off(ip)		do { } while (0)
 # define trace_hardirq_context(p)	0
-# define trace_softirq_context(p)	0
 # define trace_hardirqs_enabled(p)	0
 # define trace_softirqs_enabled(p)	0
 # define trace_hardirq_enter()		do { } while (0)
 # define trace_hardirq_exit()		do { } while (0)
-# define trace_softirq_enter()		do { } while (0)
-# define trace_softirq_exit()		do { } while (0)
 # define INIT_TRACE_IRQFLAGS
 #endif
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8c216e0..5dd8487 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1320,8 +1320,8 @@ struct task_struct {
 	unsigned long softirq_enable_ip;
 	unsigned int softirq_enable_event;
 	int hardirq_context;
-	int softirq_context;
 #endif
+	int softirq_context;
 #ifdef CONFIG_LOCKDEP
 # define MAX_LOCK_DEPTH 48UL
 	u64 curr_chain_key;
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..0160a83 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -260,6 +260,7 @@ struct sock {
 	unsigned long	        sk_lingertime;
 	struct sk_buff_head	sk_error_queue;
 	struct proto		*sk_prot_creator;
+	struct softirq_delay	sk_delay;
 	rwlock_t		sk_callback_lock;
 	int			sk_err,
 				sk_err_soft;
@@ -960,6 +961,7 @@ extern void *sock_kmalloc(struct sock *sk, int size,
 			  gfp_t priority);
 extern void sock_kfree_s(struct sock *sk, void *mem, int size);
 extern void sk_send_sigurg(struct sock *sk);
+extern void inet_def_readable(struct sock *sk, int len);
 
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
diff --git a/include/net/udplite.h b/include/net/udplite.h
index afdffe6..7ce0ee0 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -25,6 +25,7 @@ static __inline__ int udplite_getfrag(void *from, char *to, int  offset,
 /* Designate sk as UDP-Lite socket */
 static inline int udplite_sk_init(struct sock *sk)
 {
+	sk->sk_data_ready = inet_def_readable;
 	udp_sk(sk)->pcflag = UDPLITE_BIT;
 	return 0;
 }
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 06b0c35..9873b40 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -1807,7 +1807,7 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,
 	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
 		curr->comm, task_pid_nr(curr),
 		trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
-		trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
+		softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
 		trace_hardirqs_enabled(curr),
 		trace_softirqs_enabled(curr));
 	print_lock(this);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bdbe9de..91a1714 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -158,6 +158,42 @@ void local_bh_enable_ip(unsigned long ip)
 }
 EXPORT_SYMBOL(local_bh_enable_ip);
 
+
+#define SOFTIRQ_DELAY_END (struct softirq_delay *)1L
+static DEFINE_PER_CPU(struct softirq_delay *, softirq_delay_head) = {
+	SOFTIRQ_DELAY_END
+};
+
+/*
+ * Caller must disable preemption, and take care of appropriate
+ * locking and refcounting
+ */
+int softirq_delay_queue(struct softirq_delay *sdel)
+{
+	if (!sdel->next) {
+		sdel->next = __get_cpu_var(softirq_delay_head);
+		__get_cpu_var(softirq_delay_head) = sdel;
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * Because locking is provided by subsystem, please note
+ * that sdel->func(sdel) is responsible for setting sdel->next to NULL
+ */
+static void softirq_delay_exec(void)
+{
+	struct softirq_delay *sdel;
+
+	while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
+		__get_cpu_var(softirq_delay_head) = sdel->next;
+		sdel->func(sdel);	/*	sdel->next = NULL;*/
+		}
+}
+
+
+
 /*
  * We restart softirq processing MAX_SOFTIRQ_RESTART times,
  * and we fall back to softirqd after that.
@@ -180,7 +216,7 @@ asmlinkage void __do_softirq(void)
 	account_system_vtime(current);
 
 	__local_bh_disable((unsigned long)__builtin_return_address(0));
-	trace_softirq_enter();
+	softirq_enter();
 
 	cpu = smp_processor_id();
 restart:
@@ -211,6 +247,8 @@ restart:
 		pending >>= 1;
 	} while (pending);
 
+	softirq_delay_exec();
+
 	local_irq_disable();
 
 	pending = local_softirq_pending();
@@ -220,7 +258,7 @@ restart:
 	if (pending)
 		wakeup_softirqd();
 
-	trace_softirq_exit();
+	softirq_exit();
 
 	account_system_vtime(current);
 	_local_bh_enable();
diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index 280332c..1aa7351 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -157,11 +157,11 @@ static void init_shared_classes(void)
 #define SOFTIRQ_ENTER()				\
 		local_bh_disable();		\
 		local_irq_disable();		\
-		trace_softirq_enter();		\
+		softirq_enter();		\
 		WARN_ON(!in_softirq());
 
 #define SOFTIRQ_EXIT()				\
-		trace_softirq_exit();		\
+		softirq_exit();		\
 		local_irq_enable();		\
 		local_bh_enable();
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 0620046..c8745d1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -213,6 +213,8 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+static void sock_readable_defer(struct softirq_delay *sdel);
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1074,6 +1076,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 #endif
 
 		rwlock_init(&newsk->sk_dst_lock);
+		softirq_delay_init(&newsk->sk_delay, sock_readable_defer);
 		rwlock_init(&newsk->sk_callback_lock);
 		lockdep_set_class_and_name(&newsk->sk_callback_lock,
 				af_callback_keys + newsk->sk_family,
@@ -1691,6 +1694,43 @@ static void sock_def_readable(struct sock *sk, int len)
 	read_unlock(&sk->sk_callback_lock);
 }
 
+/*
+ * helper function called by softirq_delay_exec(),
+ * if inet_def_readable() queued us.
+ */
+static void sock_readable_defer(struct softirq_delay *sdel)
+{
+	struct sock *sk = container_of(sdel, struct sock, sk_delay);
+
+	sdel->next = NULL;
+	/*
+	 * At this point, we dont own a lock on socket, only a reference.
+	 * We must commit above write, or another cpu could miss a wakeup
+	 */
+	smp_wmb();
+	sock_def_readable(sk, 0);
+	sock_put(sk);
+}
+
+/*
+ * Custom version of sock_def_readable()
+ * We want to defer scheduler processing at the end of do_softirq()
+ * Called with socket locked.
+ */
+void inet_def_readable(struct sock *sk, int len)
+{
+	if (running_from_softirq()) {
+		if (softirq_delay_queue(&sk->sk_delay))
+			/*
+			 * If we queued this socket, take a reference on it
+			 * Caller owns socket lock, so write to sk_delay.next
+			 * will be committed before unlock.
+			 */
+			sock_hold(sk);
+	} else
+		sock_def_readable(sk, len);
+}
+
 static void sock_def_write_space(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
@@ -1768,6 +1808,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 		sk->sk_sleep	=	NULL;
 
 	rwlock_init(&sk->sk_dst_lock);
+	softirq_delay_init(&sk->sk_delay, sock_readable_defer);
 	rwlock_init(&sk->sk_callback_lock);
 	lockdep_set_class_and_name(&sk->sk_callback_lock,
 			af_callback_keys + sk->sk_family,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 05b7abb..1cc0907 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1342,6 +1342,12 @@ void udp_destroy_sock(struct sock *sk)
 	release_sock(sk);
 }
 
+static int udp_init_sock(struct sock *sk)
+{
+	sk->sk_data_ready = inet_def_readable;
+	return 0;
+}
+
 /*
  *	Socket option code for UDP
  */
@@ -1559,6 +1565,7 @@ struct proto udp_prot = {
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init		   = udp_init_sock,
 	.destroy	   = udp_destroy_sock,
 	.setsockopt	   = udp_setsockopt,
 	.getsockopt	   = udp_getsockopt,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 84b1a29..1a9f8d4 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -960,6 +960,12 @@ void udpv6_destroy_sock(struct sock *sk)
 	inet6_destroy_sock(sk);
 }
 
+static int udpv6_init_sock(struct sock *sk)
+{
+	sk->sk_data_ready = inet_def_readable;
+	return 0;
+}
+
 /*
  *	Socket option code for UDP
  */
@@ -1084,6 +1090,7 @@ struct proto udpv6_prot = {
 	.connect	   = ip6_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init 		   = udpv6_init_sock,
 	.destroy	   = udpv6_destroy_sock,
 	.setsockopt	   = udpv6_setsockopt,
 	.getsockopt	   = udpv6_getsockopt,


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-16 22:22                                                 ` Multicast packet loss Eric Dumazet
@ 2009-03-17 10:11                                                   ` Peter Zijlstra
  2009-03-17 11:08                                                     ` Eric Dumazet
  2009-04-03 19:28                                                   ` Brian Bloniarz
  1 sibling, 1 reply; 70+ messages in thread
From: Peter Zijlstra @ 2009-03-17 10:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kchang, netdev, cl, bmb

On Mon, 2009-03-16 at 23:22 +0100, Eric Dumazet wrote:

> Here is the last incantation of the patch, that of course should be
> split in two parts and better Changelog for further discussion on lkml.

I read the entire thread up to now, and I still don't really understand
the Changelog, sorry :(

> [PATCH] softirq: Introduce mechanism to defer wakeups
> 
> Some network workloads need to call scheduler too many times. For example,
> each received multicast frame can wakeup many threads. ksoftirqd is then
> not able to drain NIC RX queues in time and we get frame losses and high
> latencies.
> 
> This patch adds an infrastructure to delay work done in
> sock_def_readable() at end of do_softirq(). This needs to
> make available current->softirq_context even if !CONFIG_TRACE_IRQFLAGS

How does that solve the wakeup issue?

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---

> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -158,6 +158,42 @@ void local_bh_enable_ip(unsigned long ip)
>  }
>  EXPORT_SYMBOL(local_bh_enable_ip);
>  
> +
> +#define SOFTIRQ_DELAY_END (struct softirq_delay *)1L
> +static DEFINE_PER_CPU(struct softirq_delay *, softirq_delay_head) = {
> +	SOFTIRQ_DELAY_END
> +};

Why the magic termination value? Can't we NULL-terminate the list?

> +
> +/*
> + * Caller must disable preemption, and take care of appropriate
> + * locking and refcounting
> + */

Shouldn't we call it __softirq_delay_queue() if the caller needs to
disable preemption?

Furthermore, don't we always require the caller to take care of lifetime
issues when we queue something?

> +int softirq_delay_queue(struct softirq_delay *sdel)
> +{
> +	if (!sdel->next) {
> +		sdel->next = __get_cpu_var(softirq_delay_head);
> +		__get_cpu_var(softirq_delay_head) = sdel;
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Because locking is provided by subsystem, please note
> + * that sdel->func(sdel) is responsible for setting sdel->next to NULL
> + */
> +static void softirq_delay_exec(void)
> +{
> +	struct softirq_delay *sdel;
> +
> +	while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
> +		__get_cpu_var(softirq_delay_head) = sdel->next;
> +		sdel->func(sdel);	/*	sdel->next = NULL;*/
> +		}
> +}

Why can't we write:

  struct softirq_delay *sdel, *next;

  sdel = __get_cpu_var(softirq_delay_head);
  __get_cpu_var(softirq_delay_head) = NULL;

  while (sdel) {
    next = sdel->next;
    sdel->func(sdel);
    sdel = next;
  }

Why does it matter what happens to sdel->next? We've done the callback.

Aah, the crux is in the re-use policy.. that most certainly does deserve
a comment.

How about we make sdel->next point to itself in the init case?

Then we can write:

  while (sdel) {
    next = sdel->next;
    sdel->next = sdel;
    sdel->func(sdel);
    sdel = next;
  }

and have the enqueue bit look like:

int __softirq_delay_queue(struct softirq_delay *sdel)
{
  struct softirq_delay **head;

  if (sdel->next != sdel)
    return 0;

  head = &__get_cpu_var(softirq_delay_head);
  sdel->next = *head;
  *head = sdel;
  return 1;
}
     
> @@ -1691,6 +1694,43 @@ static void sock_def_readable(struct sock *sk, int len)
>  	read_unlock(&sk->sk_callback_lock);
>  }
>  
> +/*
> + * helper function called by softirq_delay_exec(),
> + * if inet_def_readable() queued us.
> + */
> +static void sock_readable_defer(struct softirq_delay *sdel)
> +{
> +	struct sock *sk = container_of(sdel, struct sock, sk_delay);
> +
> +	sdel->next = NULL;
> +	/*
> +	 * At this point, we dont own a lock on socket, only a reference.
> +	 * We must commit above write, or another cpu could miss a wakeup
> +	 */
> +	smp_wmb();

Where's the matching barrier?

> +	sock_def_readable(sk, 0);
> +	sock_put(sk);
> +}
> +
> +/*
> + * Custom version of sock_def_readable()
> + * We want to defer scheduler processing at the end of do_softirq()
> + * Called with socket locked.
> + */
> +void inet_def_readable(struct sock *sk, int len)
> +{
> +	if (running_from_softirq()) {
> +		if (softirq_delay_queue(&sk->sk_delay))
> +			/*
> +			 * If we queued this socket, take a reference on it
> +			 * Caller owns socket lock, so write to sk_delay.next
> +			 * will be committed before unlock.
> +			 */
> +			sock_hold(sk);
> +	} else
> +		sock_def_readable(sk, len);
> +}

OK, so the idea is to handle a bunch of packets and instead of waking N
threads for each packet, only wake them once at the end of the batch?

Sounds like a sensible idea.. 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-17 10:11                                                   ` Peter Zijlstra
@ 2009-03-17 11:08                                                     ` Eric Dumazet
  2009-03-17 11:57                                                       ` Peter Zijlstra
  2009-03-17 15:00                                                       ` Brian Bloniarz
  0 siblings, 2 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-03-17 11:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: David Miller, kchang, netdev, cl, bmb

Peter Zijlstra a écrit :
> On Mon, 2009-03-16 at 23:22 +0100, Eric Dumazet wrote:
> 
>> Here is the last incantation of the patch, that of course should be
>> split in two parts and better Changelog for further discussion on lkml.
> 
> I read the entire thread up to now, and I still don't really understand
> the Changelog, sorry :(

Sure, I should have taken more time; I will repost this in a couple of hours,
with nice changelogs and split patches.

> 
>> [PATCH] softirq: Introduce mechanism to defer wakeups
>>
>> Some network workloads need to call scheduler too many times. For example,
>> each received multicast frame can wakeup many threads. ksoftirqd is then
>> not able to drain NIC RX queues in time and we get frame losses and high
>> latencies.
>>
>> This patch adds an infrastructure to delay work done in
>> sock_def_readable() at end of do_softirq(). This needs to
>> make available current->softirq_context even if !CONFIG_TRACE_IRQFLAGS
> 
> How does that solve the wakeup issue?

Apparently, on SMP machines this actually helps a lot in the case of multicast
traffic handled by many subscribers. skb cloning involves atomic ops on
route cache entries, and if we wake up threads as we currently do, they
start to consume skbs while the feeder is still doing skb clones for
other sockets. Many cache line ping-pongs slow down the softirq.

I will post the test program to reproduce the problem.

> 
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> ---
> 
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -158,6 +158,42 @@ void local_bh_enable_ip(unsigned long ip)
>>  }
>>  EXPORT_SYMBOL(local_bh_enable_ip);
>>  
>> +
>> +#define SOFTIRQ_DELAY_END (struct softirq_delay *)1L
>> +static DEFINE_PER_CPU(struct softirq_delay *, softirq_delay_head) = {
>> +	SOFTIRQ_DELAY_END
>> +};
> 
> Why the magic termination value? Can't we NULL terminate the list

Yes we can, you are right.

> 
>> +
>> +/*
>> + * Caller must disable preemption, and take care of appropriate
>> + * locking and refcounting
>> + */
> 
> Shouldn't we call it __softirq_delay_queue() if the caller needs to
> disabled preemption?

I was wondering if some BUG_ON() could be added to crash if preemption is enabled
at this point. I could not find an existing check;
doing the 'if (running_from_softirq())' test again might be overkill.
Should I document that the caller should do:

skeleton :

    lock_my_data(data); /* barrier here */
    sdel = &data->sdel;
    if (running_from_softirq()) {
	if (softirq_delay_queue(sdel)) {
		hold a refcount on data;
	} else {
		/* already queued, nothing to do */
	}
    } else {
	/* cannot queue the work , must do it right now */
	do_work(data);
    }
    release_my_data(data);
}

> 
> Futhermore, don't we always require the caller to take care of lifetime
> issues when we queue something?

You mean comment is too verbose... or 

> 
>> +int softirq_delay_queue(struct softirq_delay *sdel)
>> +{
>> +	if (!sdel->next) {
>> +		sdel->next = __get_cpu_var(softirq_delay_head);
>> +		__get_cpu_var(softirq_delay_head) = sdel;
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Because locking is provided by subsystem, please note
>> + * that sdel->func(sdel) is responsible for setting sdel->next to NULL
>> + */
>> +static void softirq_delay_exec(void)
>> +{
>> +	struct softirq_delay *sdel;
>> +
>> +	while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
>> +		__get_cpu_var(softirq_delay_head) = sdel->next;
>> +		sdel->func(sdel);	/*	sdel->next = NULL;*/
>> +		}
>> +}
> 
> Why can't we write:
> 
>   struct softirq_delay *sdel, *next;
> 
>   sdel = __get_cpu_var(softirq_delay_head);
>   __get_cpu_var(softirq_delay_head) = NULL;
> 
>   while (sdel) {
>     next = sdel->next;
>     sdel->func(sdel);
>     sdel = next;
>   }
> 
> Why does it matter what happens to sdel->next? we've done the callback.
> 
> Aah, the crux is in the re-use policy.. that most certainly does deserve
> a comment.

Hum, so my comment was not verbose enough :)

> 
> How about we make sdel->next point to itself in the init case?
> 
> Then we can write:
> 
>   while (sdel) {
>     next = sdel->next;
>     sdel->next = sdel;
>     sdel->func(sdel);
>     sdel = next;
>   }
> 
> and have the enqueue bit look like:
> 
> int __softirq_delay_queue(struct softirq_delay *sdel)
> {
>   struct softirq_delay **head;
> 
>   if (sdel->next != sdel)
>     return 0;

Yes we could do that

> 
>   head = &__get_cpu_var(softirq_delay_head);
>   sdel->next = *head;
>   *head = sdel;
>   return 1;
> }
>      
>> @@ -1691,6 +1694,43 @@ static void sock_def_readable(struct sock *sk, int len)
>>  	read_unlock(&sk->sk_callback_lock);
>>  }
>>  
>> +/*
>> + * helper function called by softirq_delay_exec(),
>> + * if inet_def_readable() queued us.
>> + */
>> +static void sock_readable_defer(struct softirq_delay *sdel)
>> +{
>> +	struct sock *sk = container_of(sdel, struct sock, sk_delay);
>> +
>> +	sdel->next = NULL;
>> +	/*
>> +	 * At this point, we dont own a lock on socket, only a reference.
>> +	 * We must commit above write, or another cpu could miss a wakeup
>> +	 */
>> +	smp_wmb();
> 
> Where's the matching barrier?

Check the softirq_delay_exec(void) comment, where I stated that synchronization has
to be done by the subsystem.

In this socket case, the caller of softirq_delay_exec() has a lock on the socket.

The problem is that I don't want to take this lock again in the sock_readable_defer() callback.

If sdel->next is not committed, another cpu could call _softirq_delay_queue() and
find sdel->next not NULL (or != sdel with your suggestion). Then next->func()
won't be called as it should be (or will be called a little bit too soon).

So the matching barrier is the "lock_my_data(data)" in the previous skeleton?

> 
>> +	sock_def_readable(sk, 0);
>> +	sock_put(sk);
>> +}
>> +
>> +/*
>> + * Custom version of sock_def_readable()
>> + * We want to defer scheduler processing at the end of do_softirq()
>> + * Called with socket locked.
>> + */
>> +void inet_def_readable(struct sock *sk, int len)
>> +{
>> +	if (running_from_softirq()) {
>> +		if (softirq_delay_queue(&sk->sk_delay))
>> +			/*
>> +			 * If we queued this socket, take a reference on it
>> +			 * Caller owns socket lock, so write to sk_delay.next
>> +			 * will be committed before unlock.
>> +			 */
>> +			sock_hold(sk);
>> +	} else
>> +		sock_def_readable(sk, len);
>> +}
> 
> OK, so the idea is to handle a bunch of packets and instead of waking N
> threads for each packet, only wake them once at the end of the batch?
> 
> Sounds like a sensible idea.. 

The idea is to batch wakeups, yes; and if we receive several packets for
the same socket(s), we reduce the number of wakeups to one. In the multicast stress
situation at Athena CR, it really helps: no packets dropped instead of
30% dropped.

Thanks Peter


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-17 11:08                                                     ` Eric Dumazet
@ 2009-03-17 11:57                                                       ` Peter Zijlstra
  2009-03-17 15:00                                                       ` Brian Bloniarz
  1 sibling, 0 replies; 70+ messages in thread
From: Peter Zijlstra @ 2009-03-17 11:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kchang, netdev, cl, bmb

On Tue, 2009-03-17 at 12:08 +0100, Eric Dumazet wrote:

> >> +
> >> +/*
> >> + * Caller must disable preemption, and take care of appropriate
> >> + * locking and refcounting
> >> + */
> > 
> > Shouldn't we call it __softirq_delay_queue() if the caller needs to
> > disabled preemption?
> 
> I was wondering if some BUG_ON() can be added to crash if preemption is enabled
> at this point.

__get_cpu_var() has a preemption check and will generate BUGs when
CONFIG_DEBUG_PREEMPT is set, similar to smp_processor_id().

>  Could not find an existing check,
> doing again the 'if (running_from_softirq())'" test might be overkill,
> should I document caller should do :
> 
> skeleton :
> 
>     lock_my_data(data); /* barrier here */
>     sdel = &data->sdel;
>     if (running_from_softirq()) {

Small nit: I don't particularly like the running_from_softirq() name,
but in_softirq() is already taken, and sadly means something slightly
different.

> 	if (softirq_delay_queue(sdel)) {
> 		hold a refcount on data;
> 	} else {
> 		/* already queued, nothing to do */
> 	}
>     } else {
> 	/* cannot queue the work , must do it right now */
> 	do_work(data);
>     }
>     release_my_data(data);
> }
> 
> > 
> > Futhermore, don't we always require the caller to take care of lifetime
> > issues when we queue something?
> 
> You mean comment is too verbose... or 

Yeah.

> > Aah, the crux is in the re-use policy.. that most certainly does deserve
> > a comment.
> 
> Hum, so my comment was not verbose enough :)

That too :-) 

> >> +static void sock_readable_defer(struct softirq_delay *sdel)
> >> +{
> >> +	struct sock *sk = container_of(sdel, struct sock, sk_delay);
> >> +
> >> +	sdel->next = NULL;
> >> +	/*
> >> +	 * At this point, we dont own a lock on socket, only a reference.
> >> +	 * We must commit above write, or another cpu could miss a wakeup
> >> +	 */
> >> +	smp_wmb();
> > 
> > Where's the matching barrier?
> 
> Check softirq_delay_exec(void) comment, where I stated synchronization had
> to be done by the subsystem.

AFAIU the memory barrier semantics, you cannot pair a wmb with a lock
barrier; it must be paired with a read barrier, read_barrier_depends or a full barrier.

> In this socket case, caller of softirq_delay_exec() has a lock on socket.
> 
> Problem is I dont want to get this lock again in sock_readable_defer() callback
> 
> if sdel->next is not committed, another cpu could call _softirq_delay_queue() and
> find sdel->next being not null (or != sdel with your suggestion). Then next->func()
> wont be called as it should (or called litle bit too soon)

Right, what we can do is put the wmb in the callback and the rmb right
before the __queue op, or simply integrate it into the framework.
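
For illustration only, on top of the patch as posted, that placement would
look roughly like this (whether this pairing is sufficient is exactly what
is being discussed here, so treat it as a sketch of the suggestion, not a
verified fix):

	static void sock_readable_defer(struct softirq_delay *sdel)
	{
		struct sock *sk = container_of(sdel, struct sock, sk_delay);

		sdel->next = NULL;
		smp_wmb();	/* pairs with the smp_rmb() in softirq_delay_queue() */
		sock_def_readable(sk, 0);
		sock_put(sk);
	}

	int softirq_delay_queue(struct softirq_delay *sdel)
	{
		smp_rmb();	/* pairs with the smp_wmb() in the callback above */
		if (!sdel->next) {
			sdel->next = __get_cpu_var(softirq_delay_head);
			__get_cpu_var(softirq_delay_head) = sdel;
			return 1;
		}
		return 0;
	}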

> > OK, so the idea is to handle a bunch of packets and instead of waking N
> > threads for each packet, only wake them once at the end of the batch?
> > 
> > Sounds like a sensible idea.. 
> 
> Idea is to batch wakeups() yes, and if we receive several packets for
> the same socket(s), we reduce number of wakeups to one. In the multicast stress
> situation of Athena CR, it really helps, no packets dropped instead of
> 30%

Yes I can see that helping tremendously.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-17 11:08                                                     ` Eric Dumazet
  2009-03-17 11:57                                                       ` Peter Zijlstra
@ 2009-03-17 15:00                                                       ` Brian Bloniarz
  2009-03-17 15:16                                                         ` Eric Dumazet
  1 sibling, 1 reply; 70+ messages in thread
From: Brian Bloniarz @ 2009-03-17 15:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Peter Zijlstra, David Miller, kchang, netdev, cl

Eric Dumazet wrote:
> Sure, I should have taken more time, will repost this in a couple of hours,
> with nice CHangelogs and split patches.

One small thing: with CONFIG_IPV6=m, inet_def_readable needs to be exported,
right?

Thanks,
Brian Bloniarz

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-17 15:00                                                       ` Brian Bloniarz
@ 2009-03-17 15:16                                                         ` Eric Dumazet
  2009-03-17 19:39                                                           ` David Stevens
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Dumazet @ 2009-03-17 15:16 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: Peter Zijlstra, David Miller, kchang, netdev, cl

Brian Bloniarz a écrit :
> Eric Dumazet wrote:
>> Sure, I should have taken more time, will repost this in a couple of
>> hours,
>> with nice CHangelogs and split patches.
> 
> One small thing: with CONFIG_IPV6=m, inet_def_readable needs to be
> exported,
> right?
> 

Absolutely, thank you !


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-17 15:16                                                         ` Eric Dumazet
@ 2009-03-17 19:39                                                           ` David Stevens
  2009-03-17 21:19                                                             ` Eric Dumazet
  0 siblings, 1 reply; 70+ messages in thread
From: David Stevens @ 2009-03-17 19:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Brian Bloniarz, cl, David Miller, kchang, netdev, netdev-owner,
	Peter Zijlstra

I did some testing with this and see at least a 20% improvement
without drops.

I agree with Peter's recommended changes (esp. sentinel vs NULL),
and also the trivial brace indentation fix in softirq_delay_exec(),
but otherwise it looks good to me. Nice work.

                                        +-DLS


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-17 19:39                                                           ` David Stevens
@ 2009-03-17 21:19                                                             ` Eric Dumazet
  0 siblings, 0 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-03-17 21:19 UTC (permalink / raw)
  To: David Stevens
  Cc: Brian Bloniarz, cl, David Miller, kchang, netdev, netdev-owner,
	Peter Zijlstra

David Stevens a écrit :
> I did some testing with this and see at least a 20% improvement
> without drop.
> 
> I agree with Peter's recommended changes (esp. sentinel vs null),
> and also the trivial brace indentation  in softirq_delay_exec(),
> but otherwise looks  good to me. Nice work.
> 
>                                         +-DLS
> 
> 

Still, I don't like all the softirq.c changes very much. I feel very
uncomfortable justifying one extra call in do_softirq(), and
the interface is not very clean (stuff about locking, barriers...).

An easy way could be to add a new SOFTIRQ, but it's not very wise.

I was wondering if we could use the infrastructure added in commit
54514a70adefe356afe854e2d3912d46668068e6
(softirq: Add support for triggering softirq work on softirqs.)
But I don't understand how it can work...
(softirq_work_list is fed, but never processed)

Alternatively, we could use a framework dedicated to
network use, with well-defined semantics:

Call softirq_delay_exec() from net_rx_action().
From that function, we know if time_squeeze was incremented,
or all of netdev_budget was consumed, and in this stress case
try to hand the wakeup job to another cpu.
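
As a purely illustrative sketch of that alternative (nothing like this was
posted in the thread; softirq_delay_offload() is a hypothetical helper and
the NAPI poll loop is elided), the hook could sit at the end of
net_rx_action(), where the budget information is known:

	static void net_rx_action(struct softirq_action *h)
	{
		int budget = netdev_budget;

		/* ... existing NAPI poll loop, consuming 'budget' ... */

		if (budget <= 0)
			softirq_delay_offload();	/* hypothetical: hand wakeups to another cpu */
		else
			softirq_delay_exec();		/* run deferred wakeups on this cpu */
	}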




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Multicast packet loss
  2009-03-16 22:22                                                 ` Multicast packet loss Eric Dumazet
  2009-03-17 10:11                                                   ` Peter Zijlstra
@ 2009-04-03 19:28                                                   ` Brian Bloniarz
  2009-04-05 13:49                                                     ` Eric Dumazet
  1 sibling, 1 reply; 70+ messages in thread
From: Brian Bloniarz @ 2009-04-03 19:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kchang, netdev, cl

Hi Eric,

We've been experimenting with this softirq-delay patch in production, and
have seen some hard-to-reproduce crashes. We finally managed to capture a
kexec crashdump this morning.

This is the dmesg:

[53417.592868] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[53417.598377]  [<ffffffff80243643>] __do_softirq+0xc3/0x150
[53417.606300] PGD 32abb8067 PUD 32faf5067 PMD 0
[53417.610829] Oops: 0000 [1] SMP
[53417.614032] CPU 2
[53417.616083] Modules linked in: nfs lockd nfs_acl sunrpc openafs(P) autofs4 ipv6 ac sbs sbshc video output dock battery container iptable_filter ip_tables x_tables parport_pc lp parport loop joydev iTCO_wdt iTCO_vendor_support evdev button i5000_edac psmouse serio_raw pcspkr shpchp pci_hotplug edac_core ext3 jbd mbcache sr_mod cdrom ata_generic usbhid hid ata_piix sg sd_mod ehci_hcd pata_acpi uhci_hcd libata bnx2 aacraid usbcore scsi_mod thermal processor fan fbcon tileblit font bitblit softcursor fuse
[53417.662067] Pid: 13039, comm: gball Tainted: P        2.6.24-19acr2-generic #1
[53417.669219] RIP: 0010:[<ffffffff80243643>]  [<ffffffff80243643>] __do_softirq+0xc3/0x150
[53417.677368] RSP: 0018:ffff8103314f3f20  EFLAGS: 00010297
[53417.682697] RAX: ffff810084a1b000 RBX: ffffffff805ba530 RCX: 0000000000000000
[53417.689843] RDX: ffff8103305811e0 RSI: 0000000000000282 RDI: ffff810332ada580
[53417.696993] RBP: 0000000000000000 R08: ffff81032fad9f08 R09: ffff810332382000
[53417.704144] R10: 0000000000000000 R11: ffffffff80316ec0 R12: ffffffff8062b3d8
[53417.711294] R13: ffffffff8062b480 R14: 0000000000000002 R15: 000000000000000a
[53417.718447] FS:  00007fab0d7b8750(0000) GS:ffff810334401b80(0000) knlGS:0000000000000000
[53417.726568] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[53417.732332] CR2: 0000000000000000 CR3: 0000000329e2d000 CR4: 00000000000006e0
[53417.739476] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[53417.746637] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[53417.753787] Process gball (pid: 13039, threadinfo ffff81032adde000, task ffff810329ff77d0)
[53417.761991] Stack:  ffffffff8062b3d8 0000000000000046 ffff8103314f3f68 0000000000000000
[53417.770146]  00000000000000a0 ffff81032addfee8 0000000000000000 ffffffff8020d50c
[53417.777660]  ffff8103314f3f68 00000000000000c1 ffffffff8020ed25 ffffffff8062c870
[53417.784961] Call Trace:
[53417.787635]  <IRQ>  [<ffffffff8020d50c>] call_softirq+0x1c/0x30
[53417.793597]  [<ffffffff8020ed25>] do_softirq+0x35/0x90
[53417.798747]  [<ffffffff80243578>] irq_exit+0x88/0x90
[53417.803727]  [<ffffffff8020ef70>] do_IRQ+0x80/0x100
[53417.808624]  [<ffffffff8020c891>] ret_from_intr+0x0/0xa
[53417.813862]  <EOI>  [<ffffffff803e53c8>] skb_release_all+0x18/0x150
[53417.820164]  [<ffffffff803e4ad9>] __kfree_skb+0x9/0x90
[53417.825327]  [<ffffffff80437612>] udp_recvmsg+0x222/0x260
[53417.830744]  [<ffffffff80231264>] source_load+0x34/0x70
[53417.835984]  [<ffffffff80232a9a>] find_busiest_group+0x1fa/0x850
[53417.842019]  [<ffffffff803e0100>] sock_common_recvmsg+0x30/0x50
[53417.847958]  [<ffffffff803de1ca>] sock_recvmsg+0x14a/0x160
[53417.853462]  [<ffffffff80231c21>] update_curr+0x71/0x100
[53419.858789]  [<ffffffff802320fd>] __dequeue_entity+0x3d/0x50
[53417.864469]  [<ffffffff80253ab0>] autoremove_wake_function+0x0/0x30
[53417.870758]  [<ffffffff8046662f>] thread_return+0x3a/0x57b
[53417.876262]  [<ffffffff803df73e>] sys_recvfrom+0xfe/0x190
[53417.881680]  [<ffffffff802e2a95>] sys_epoll_wait+0x245/0x4e0
[53417.887358]  [<ffffffff80233e20>] default_wake_function+0x0/0x10
[53417.893384]  [<ffffffff8020c37e>] system_call+0x7e/0x83
[53417.898628]
[53417.900134]
[53417.900134] Code: 48 8b 11 48 89 cf 65 48 8b 04 25 08 00 00 00 4a 89 14 20 ff
[53417.909430] RIP  [<ffffffff80243643>] __do_softirq+0xc3/0x150
[53417.915210]  RSP <ffff8103314f3f20>

The disassembly where it crashed:
/local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:273
ffffffff8024361b:       d1 ed                   shr    %ebp
rcu_bh_qsctr_inc():
/local/home/bmb/doc/kernels/linux-hardy-eric/include/linux/rcupdate.h:130
ffffffff8024361d:       48 8b 40 08             mov    0x8(%rax),%rax
ffffffff80243621:       41 c7 44 05 08 01 00    movl   $0x1,0x8(%r13,%rax,1)
ffffffff80243628:       00 00
__do_softirq():
/local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:273
ffffffff8024362a:       75 d8                   jne    ffffffff80243604 <__do_softirq+0x84>
softirq_delay_exec():
/local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:225
ffffffff8024362c:       48 8b 14 24             mov    (%rsp),%rdx
ffffffff80243630:       65 48 8b 04 25 08 00    mov    %gs:0x8,%rax
ffffffff80243637:       00 00
ffffffff80243639:       48 8b 0c 10             mov    (%rax,%rdx,1),%rcx
ffffffff8024363d:       48 83 f9 01             cmp    $0x1,%rcx
ffffffff80243641:       74 29                   je     ffffffff8024366c <__do_softirq+0xec>
/local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:226
ffffffff80243643:       48 8b 11                mov    (%rcx),%rdx
/local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:227
ffffffff80243646:       48 89 cf                mov    %rcx,%rdi
/local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:226
ffffffff80243649:       65 48 8b 04 25 08 00    mov    %gs:0x8,%rax
ffffffff80243650:       00 00
ffffffff80243652:       4a 89 14 20             mov    %rdx,(%rax,%r12,1)
/local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:227
ffffffff80243656:       ff 51 08                callq  *0x8(%rcx)
/local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:225
ffffffff80243659:       65 48 8b 04 25 08 00    mov    %gs:0x8,%rax
ffffffff80243660:       00 00
ffffffff80243662:       4a 8b 0c 20             mov    (%rax,%r12,1),%rcx
ffffffff80243666:       48 83 f9 01             cmp    $0x1,%rcx
ffffffff8024366a:       75 d7                   jne    ffffffff80243643 <__do_softirq+0xc3>
raw_local_irq_disable():
/local/home/bmb/doc/kernels/linux-hardy-eric/debian/build/build-generic/include2/asm/irqflags_64.h:76
ffffffff8024366c:       fa                      cli

And softirq.c line numbers:
    218   * Because locking is provided by subsystem, please note
    219   * that sdel->func(sdel) is responsible for setting sdel->next to NULL
    220   */
    221  static void softirq_delay_exec(void)
    222  {
    223          struct softirq_delay *sdel;
    224
    225          while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
    226                  __get_cpu_var(softirq_delay_head) = sdel->next;
    227                  sdel->func(sdel);       /*      sdel->next = NULL;*/
    228                  }
    229  }

So it's crashing because __get_cpu_var(softirq_delay_head) is NULL somehow.

We aren't running a recent kernel -- we're running Ubuntu Hardy's 2.6.24-19,
with a backported version of this patch. One more atypical thing is that
we run openafs, 1.4.6.dfsg1-2.

Like I said, I have a full vmcore (3, actually) and would be happy to post any
more information you'd like to know.

Thanks,
Brian Bloniarz

Eric Dumazet wrote:
> David Miller a écrit :
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Fri, 13 Mar 2009 23:30:31 +0100
>>
>>> David Miller a écrit :
>>>>> Also, when an event was queued for later invocation, I also needed to keep
>>>>> a reference on "struct socket" to make sure it doesnt disappear before
>>>>> the invocation. Not all sockets are RCU guarded (we added RCU only for 
>>>>> some protocols (TCP, UDP ...). So I found keeping a read_lock
>>>>> on callback was the easyest thing to do. I now realize we might
>>>>> overflow preempt_count, so special care is needed.
>>>> You're using this in UDP so... make the rule that you can't use
>>>> this with a non-RCU-quiescent protocol.
>>> UDP/TCP only ? I though many other protocols (not all using RCU) were
>>> using sock_def_readable() too...
>> Maybe create a inet_def_readable() just for this purpose :-)
> 
> 
> Here is the last incantation of the patch, that of course should be
> split in two parts and better Changelog for further discussion on lkml.
> 
> We need to take a reference on sock when queued on a softirq delay
> list. RCU wont help here because of SLAB_DESTROY_BY_RCU thing :
> Another cpu could free/reuse the socket before we have a chance to
> call softirq_delay_exec()
> 
> UDP & UDPLite use this delayed wakeup feature.
> 
> Thank you
> 
> [PATCH] softirq: Introduce mechanism to defer wakeups
> 
> Some network workloads need to call scheduler too many times. For example,
> each received multicast frame can wakeup many threads. ksoftirqd is then
> not able to drain NIC RX queues in time and we get frame losses and high
> latencies.
> 
> This patch adds an infrastructure to delay work done in
> sock_def_readable() at end of do_softirq(). This needs to
> make available current->softirq_context even if !CONFIG_TRACE_IRQFLAGS
> 
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  include/linux/interrupt.h |   18 +++++++++++++++
>  include/linux/irqflags.h  |   11 ++++-----
>  include/linux/sched.h     |    2 -
>  include/net/sock.h        |    2 +
>  include/net/udplite.h     |    1
>  kernel/lockdep.c          |    2 -
>  kernel/softirq.c          |   42 ++++++++++++++++++++++++++++++++++--
>  lib/locking-selftest.c    |    4 +--
>  net/core/sock.c           |   41 +++++++++++++++++++++++++++++++++++
>  net/ipv4/udp.c            |    7 ++++++
>  net/ipv6/udp.c            |    7 ++++++
>  11 files changed, 125 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 9127f6b..a773d0c 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -295,6 +295,24 @@ extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softir
>  extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
>  				  int this_cpu, int softirq);
>  
> +/*
> + * softirq delayed works : should be delayed at do_softirq() end
> + */
> +struct softirq_delay {
> +	struct softirq_delay	*next;
> +	void 			(*func)(struct softirq_delay *);
> +};
> +
> +int softirq_delay_queue(struct softirq_delay *sdel);
> +
> +static inline void softirq_delay_init(struct softirq_delay *sdel,
> +				      void (*func)(struct softirq_delay *))
> +{
> +	sdel->next = NULL;
> +	sdel->func = func;
> +}
> +
> +
>  /* Tasklets --- multithreaded analogue of BHs.
>  
>     Main feature differing them of generic softirqs: tasklet
> diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
> index 74bde13..30c1e01 100644
> --- a/include/linux/irqflags.h
> +++ b/include/linux/irqflags.h
> @@ -13,19 +13,21 @@
>  
>  #include <linux/typecheck.h>
>  
> +#define softirq_enter()	do { current->softirq_context++; } while (0)
> +#define softirq_exit()	do { current->softirq_context--; } while (0)
> +#define softirq_context(p)	((p)->softirq_context)
> +#define running_from_softirq()  (softirq_context(current) > 0)
> +
>  #ifdef CONFIG_TRACE_IRQFLAGS
>    extern void trace_softirqs_on(unsigned long ip);
>    extern void trace_softirqs_off(unsigned long ip);
>    extern void trace_hardirqs_on(void);
>    extern void trace_hardirqs_off(void);
>  # define trace_hardirq_context(p)	((p)->hardirq_context)
> -# define trace_softirq_context(p)	((p)->softirq_context)
>  # define trace_hardirqs_enabled(p)	((p)->hardirqs_enabled)
>  # define trace_softirqs_enabled(p)	((p)->softirqs_enabled)
>  # define trace_hardirq_enter()	do { current->hardirq_context++; } while (0)
>  # define trace_hardirq_exit()	do { current->hardirq_context--; } while (0)
> -# define trace_softirq_enter()	do { current->softirq_context++; } while (0)
> -# define trace_softirq_exit()	do { current->softirq_context--; } while (0)
>  # define INIT_TRACE_IRQFLAGS	.softirqs_enabled = 1,
>  #else
>  # define trace_hardirqs_on()		do { } while (0)
> @@ -33,13 +35,10 @@
>  # define trace_softirqs_on(ip)		do { } while (0)
>  # define trace_softirqs_off(ip)		do { } while (0)
>  # define trace_hardirq_context(p)	0
> -# define trace_softirq_context(p)	0
>  # define trace_hardirqs_enabled(p)	0
>  # define trace_softirqs_enabled(p)	0
>  # define trace_hardirq_enter()		do { } while (0)
>  # define trace_hardirq_exit()		do { } while (0)
> -# define trace_softirq_enter()		do { } while (0)
> -# define trace_softirq_exit()		do { } while (0)
>  # define INIT_TRACE_IRQFLAGS
>  #endif
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8c216e0..5dd8487 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1320,8 +1320,8 @@ struct task_struct {
>  	unsigned long softirq_enable_ip;
>  	unsigned int softirq_enable_event;
>  	int hardirq_context;
> -	int softirq_context;
>  #endif
> +	int softirq_context;
>  #ifdef CONFIG_LOCKDEP
>  # define MAX_LOCK_DEPTH 48UL
>  	u64 curr_chain_key;
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 4bb1ff9..0160a83 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -260,6 +260,7 @@ struct sock {
>  	unsigned long	        sk_lingertime;
>  	struct sk_buff_head	sk_error_queue;
>  	struct proto		*sk_prot_creator;
> +	struct softirq_delay	sk_delay;
>  	rwlock_t		sk_callback_lock;
>  	int			sk_err,
>  				sk_err_soft;
> @@ -960,6 +961,7 @@ extern void *sock_kmalloc(struct sock *sk, int size,
>  			  gfp_t priority);
>  extern void sock_kfree_s(struct sock *sk, void *mem, int size);
>  extern void sk_send_sigurg(struct sock *sk);
> +extern void inet_def_readable(struct sock *sk, int len);
>  
>  /*
>   * Functions to fill in entries in struct proto_ops when a protocol
> diff --git a/include/net/udplite.h b/include/net/udplite.h
> index afdffe6..7ce0ee0 100644
> --- a/include/net/udplite.h
> +++ b/include/net/udplite.h
> @@ -25,6 +25,7 @@ static __inline__ int udplite_getfrag(void *from, char *to, int  offset,
>  /* Designate sk as UDP-Lite socket */
>  static inline int udplite_sk_init(struct sock *sk)
>  {
> +	sk->sk_data_ready = inet_def_readable;
>  	udp_sk(sk)->pcflag = UDPLITE_BIT;
>  	return 0;
>  }
> diff --git a/kernel/lockdep.c b/kernel/lockdep.c
> index 06b0c35..9873b40 100644
> --- a/kernel/lockdep.c
> +++ b/kernel/lockdep.c
> @@ -1807,7 +1807,7 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,
>  	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
>  		curr->comm, task_pid_nr(curr),
>  		trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
> -		trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
> +		softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
>  		trace_hardirqs_enabled(curr),
>  		trace_softirqs_enabled(curr));
>  	print_lock(this);
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index bdbe9de..91a1714 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -158,6 +158,42 @@ void local_bh_enable_ip(unsigned long ip)
>  }
>  EXPORT_SYMBOL(local_bh_enable_ip);
>  
> +
> +#define SOFTIRQ_DELAY_END (struct softirq_delay *)1L
> +static DEFINE_PER_CPU(struct softirq_delay *, softirq_delay_head) = {
> +	SOFTIRQ_DELAY_END
> +};
> +
> +/*
> + * Caller must disable preemption, and take care of appropriate
> + * locking and refcounting
> + */
> +int softirq_delay_queue(struct softirq_delay *sdel)
> +{
> +	if (!sdel->next) {
> +		sdel->next = __get_cpu_var(softirq_delay_head);
> +		__get_cpu_var(softirq_delay_head) = sdel;
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Because locking is provided by subsystem, please note
> + * that sdel->func(sdel) is responsible for setting sdel->next to NULL
> + */
> +static void softirq_delay_exec(void)
> +{
> +	struct softirq_delay *sdel;
> +
> +	while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
> +		__get_cpu_var(softirq_delay_head) = sdel->next;
> +		sdel->func(sdel);	/*	sdel->next = NULL;*/
> +		}
> +}
> +
> +
> +
>  /*
>   * We restart softirq processing MAX_SOFTIRQ_RESTART times,
>   * and we fall back to softirqd after that.
> @@ -180,7 +216,7 @@ asmlinkage void __do_softirq(void)
>  	account_system_vtime(current);
>  
>  	__local_bh_disable((unsigned long)__builtin_return_address(0));
> -	trace_softirq_enter();
> +	softirq_enter();
>  
>  	cpu = smp_processor_id();
>  restart:
> @@ -211,6 +247,8 @@ restart:
>  		pending >>= 1;
>  	} while (pending);
>  
> +	softirq_delay_exec();
> +
>  	local_irq_disable();
>  
>  	pending = local_softirq_pending();
> @@ -220,7 +258,7 @@ restart:
>  	if (pending)
>  		wakeup_softirqd();
>  
> -	trace_softirq_exit();
> +	softirq_exit();
>  
>  	account_system_vtime(current);
>  	_local_bh_enable();
> diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
> index 280332c..1aa7351 100644
> --- a/lib/locking-selftest.c
> +++ b/lib/locking-selftest.c
> @@ -157,11 +157,11 @@ static void init_shared_classes(void)
>  #define SOFTIRQ_ENTER()				\
>  		local_bh_disable();		\
>  		local_irq_disable();		\
> -		trace_softirq_enter();		\
> +		softirq_enter();		\
>  		WARN_ON(!in_softirq());
>  
>  #define SOFTIRQ_EXIT()				\
> -		trace_softirq_exit();		\
> +		softirq_exit();		\
>  		local_irq_enable();		\
>  		local_bh_enable();
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 0620046..c8745d1 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -213,6 +213,8 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
>  /* Maximal space eaten by iovec or ancilliary data plus some space */
>  int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
>  
> +static void sock_readable_defer(struct softirq_delay *sdel);
> +
>  static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
>  {
>  	struct timeval tv;
> @@ -1074,6 +1076,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
>  #endif
>  
>  		rwlock_init(&newsk->sk_dst_lock);
> +		softirq_delay_init(&newsk->sk_delay, sock_readable_defer);
>  		rwlock_init(&newsk->sk_callback_lock);
>  		lockdep_set_class_and_name(&newsk->sk_callback_lock,
>  				af_callback_keys + newsk->sk_family,
> @@ -1691,6 +1694,43 @@ static void sock_def_readable(struct sock *sk, int len)
>  	read_unlock(&sk->sk_callback_lock);
>  }
>  
> +/*
> + * helper function called by softirq_delay_exec(),
> + * if inet_def_readable() queued us.
> + */
> +static void sock_readable_defer(struct softirq_delay *sdel)
> +{
> +	struct sock *sk = container_of(sdel, struct sock, sk_delay);
> +
> +	sdel->next = NULL;
> +	/*
> +	 * At this point, we dont own a lock on socket, only a reference.
> +	 * We must commit above write, or another cpu could miss a wakeup
> +	 */
> +	smp_wmb();
> +	sock_def_readable(sk, 0);
> +	sock_put(sk);
> +}
> +
> +/*
> + * Custom version of sock_def_readable()
> + * We want to defer scheduler processing at the end of do_softirq()
> + * Called with socket locked.
> + */
> +void inet_def_readable(struct sock *sk, int len)
> +{
> +	if (running_from_softirq()) {
> +		if (softirq_delay_queue(&sk->sk_delay))
> +			/*
> +			 * If we queued this socket, take a reference on it
> +			 * Caller owns socket lock, so write to sk_delay.next
> +			 * will be committed before unlock.
> +			 */
> +			sock_hold(sk);
> +	} else
> +		sock_def_readable(sk, len);
> +}
> +
>  static void sock_def_write_space(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
> @@ -1768,6 +1808,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
>  		sk->sk_sleep	=	NULL;
>  
>  	rwlock_init(&sk->sk_dst_lock);
> +	softirq_delay_init(&sk->sk_delay, sock_readable_defer);
>  	rwlock_init(&sk->sk_callback_lock);
>  	lockdep_set_class_and_name(&sk->sk_callback_lock,
>  			af_callback_keys + sk->sk_family,
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 05b7abb..1cc0907 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1342,6 +1342,12 @@ void udp_destroy_sock(struct sock *sk)
>  	release_sock(sk);
>  }
>  
> +static int udp_init_sock(struct sock *sk)
> +{
> +	sk->sk_data_ready = inet_def_readable;
> +	return 0;
> +}
> +
>  /*
>   *	Socket option code for UDP
>   */
> @@ -1559,6 +1565,7 @@ struct proto udp_prot = {
>  	.connect	   = ip4_datagram_connect,
>  	.disconnect	   = udp_disconnect,
>  	.ioctl		   = udp_ioctl,
> +	.init		   = udp_init_sock,
>  	.destroy	   = udp_destroy_sock,
>  	.setsockopt	   = udp_setsockopt,
>  	.getsockopt	   = udp_getsockopt,
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 84b1a29..1a9f8d4 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -960,6 +960,12 @@ void udpv6_destroy_sock(struct sock *sk)
>  	inet6_destroy_sock(sk);
>  }
>  
> +static int udpv6_init_sock(struct sock *sk)
> +{
> +	sk->sk_data_ready = inet_def_readable;
> +	return 0;
> +}
> +
>  /*
>   *	Socket option code for UDP
>   */
> @@ -1084,6 +1090,7 @@ struct proto udpv6_prot = {
>  	.connect	   = ip6_datagram_connect,
>  	.disconnect	   = udp_disconnect,
>  	.ioctl		   = udp_ioctl,
> +	.init 		   = udpv6_init_sock,
>  	.destroy	   = udpv6_destroy_sock,
>  	.setsockopt	   = udpv6_setsockopt,
>  	.getsockopt	   = udpv6_getsockopt,
> 
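
As a reading aid, here is a minimal usage sketch of the softirq_delay API the patch above introduces. The "foo" object and the foo_wakeup()/foo_put() helpers are hypothetical stand-ins for a subsystem's own wakeup and refcounting (sock_def_readable() and sock_hold()/sock_put() in the UDP case); only softirq_delay_init(), softirq_delay_queue() and running_from_softirq() come from the patch itself.

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/interrupt.h>	/* softirq_delay API added by the patch above */
#include <asm/atomic.h>

struct foo {
	atomic_t		refcnt;
	spinlock_t		lock;		/* serializes queueing for this object */
	struct softirq_delay	delay;		/* at most one pending deferred wakeup */
};

static void foo_wakeup(struct foo *f);		/* the expensive wakeup being deferred */
static void foo_put(struct foo *f);		/* drops refcnt, frees on zero */

/* Runs from softirq_delay_exec() at the end of __do_softirq() */
static void foo_delay_func(struct softirq_delay *sdel)
{
	struct foo *f = container_of(sdel, struct foo, delay);

	sdel->next = NULL;	/* the callback must re-arm the entry itself */
	foo_wakeup(f);
	foo_put(f);		/* release the reference taken at queue time */
}

static void foo_init(struct foo *f)
{
	atomic_set(&f->refcnt, 1);
	spin_lock_init(&f->lock);
	softirq_delay_init(&f->delay, foo_delay_func);
}

/*
 * Event handler, called with f->lock held so the !sdel->next test in
 * softirq_delay_queue() cannot race with the callback clearing it.
 */
static void foo_event(struct foo *f)
{
	if (running_from_softirq()) {
		if (softirq_delay_queue(&f->delay))
			atomic_inc(&f->refcnt);	/* keep f alive until the callback runs */
	} else {
		foo_wakeup(f);			/* process context: no point deferring */
	}
}

The reference taken at queue time is what the changelog's SLAB_DESTROY_BY_RCU remark is about: the entry stays on the per-CPU list after the queueing softirq handler returns, so the object must not be freed or reused before softirq_delay_exec() has called it back.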



* Re: Multicast packet loss
  2009-04-03 19:28                                                   ` Brian Bloniarz
@ 2009-04-05 13:49                                                     ` Eric Dumazet
  2009-04-06 21:53                                                       ` Brian Bloniarz
  2009-04-07 20:08                                                       ` Brian Bloniarz
  0 siblings, 2 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-04-05 13:49 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: David Miller, kchang, netdev, cl

Brian Bloniarz wrote:
> Hi Eric,
> 
> We've been experimenting with this softirq-delay patch in production, and
> have seen some hard-to-reproduce crashes. We finally managed to capture a
> kexec crashdump this morning.
> 
> This is the dmesg:
> 
> [53417.592868] Unable to handle kernel NULL pointer dereference at
> 0000000000000000 RIP:
> [53417.598377]  [<ffffffff80243643>] __do_softirq+0xc3/0x150
> [53417.606300] PGD 32abb8067 PUD 32faf5067 PMD 0
> [53417.610829] Oops: 0000 [1] SMP
> [53417.614032] CPU 2
> [53417.616083] Modules linked in: nfs lockd nfs_acl sunrpc openafs(P)
> autofs4 ipv6 ac sbs sbshc video output dock battery container
> iptable_filter ip_tables x_tables parport_pc lp parport loop joydev
> iTCO_wdt iTCO_vendor_support evdev button i5000_edac psmouse serio_raw
> pcspkr shpchp pci_hotplug edac_core ext3 jbd mbcache sr_mod cdrom
> ata_generic usbhid hid ata_piix sg sd_mod ehci_hcd pata_acpi uhci_hcd
> libata bnx2 aacraid usbcore scsi_mod thermal processor fan fbcon
> tileblit font bitblit softcursor fuse
> [53417.662067] Pid: 13039, comm: gball Tainted: P       
> 2.6.24-19acr2-generic #1
> [53417.669219] RIP: 0010:[<ffffffff80243643>]  [<ffffffff80243643>]
> __do_softirq+0xc3/0x150
> [53417.677368] RSP: 0018:ffff8103314f3f20  EFLAGS: 00010297
> [53417.682697] RAX: ffff810084a1b000 RBX: ffffffff805ba530 RCX:
> 0000000000000000
> [53417.689843] RDX: ffff8103305811e0 RSI: 0000000000000282 RDI:
> ffff810332ada580
> [53417.696993] RBP: 0000000000000000 R08: ffff81032fad9f08 R09:
> ffff810332382000
> [53417.704144] R10: 0000000000000000 R11: ffffffff80316ec0 R12:
> ffffffff8062b3d8
> [53417.711294] R13: ffffffff8062b480 R14: 0000000000000002 R15:
> 000000000000000a
> [53417.718447] FS:  00007fab0d7b8750(0000) GS:ffff810334401b80(0000)
> knlGS:0000000000000000
> [53417.726568] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [53417.732332] CR2: 0000000000000000 CR3: 0000000329e2d000 CR4:
> 00000000000006e0
> [53417.739476] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [53417.746637] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [53417.753787] Process gball (pid: 13039, threadinfo ffff81032adde000,
> task ffff810329ff77d0)
> [53417.761991] Stack:  ffffffff8062b3d8 0000000000000046
> ffff8103314f3f68 0000000000000000
> [53417.770146]  00000000000000a0 ffff81032addfee8 0000000000000000
> ffffffff8020d50c
> [53417.777660]  ffff8103314f3f68 00000000000000c1 ffffffff8020ed25
> ffffffff8062c870
> [53417.784961] Call Trace:
> [53417.787635]  <IRQ>  [<ffffffff8020d50c>] call_softirq+0x1c/0x30
> [53417.793597]  [<ffffffff8020ed25>] do_softirq+0x35/0x90
> [53417.798747]  [<ffffffff80243578>] irq_exit+0x88/0x90
> [53417.803727]  [<ffffffff8020ef70>] do_IRQ+0x80/0x100
> [53417.808624]  [<ffffffff8020c891>] ret_from_intr+0x0/0xa
> [53417.813862]  <EOI>  [<ffffffff803e53c8>] skb_release_all+0x18/0x150
> [53417.820164]  [<ffffffff803e4ad9>] __kfree_skb+0x9/0x90
> [53417.825327]  [<ffffffff80437612>] udp_recvmsg+0x222/0x260
> [53417.830744]  [<ffffffff80231264>] source_load+0x34/0x70
> [53417.835984]  [<ffffffff80232a9a>] find_busiest_group+0x1fa/0x850
> [53417.842019]  [<ffffffff803e0100>] sock_common_recvmsg+0x30/0x50
> [53417.847958]  [<ffffffff803de1ca>] sock_recvmsg+0x14a/0x160
> [53417.853462]  [<ffffffff80231c21>] update_curr+0x71/0x100
> [53419.858789]  [<ffffffff802320fd>] __dequeue_entity+0x3d/0x50
> [53417.864469]  [<ffffffff80253ab0>] autoremove_wake_function+0x0/0x30
> [53417.870758]  [<ffffffff8046662f>] thread_return+0x3a/0x57b
> [53417.876262]  [<ffffffff803df73e>] sys_recvfrom+0xfe/0x190
> [53417.881680]  [<ffffffff802e2a95>] sys_epoll_wait+0x245/0x4e0
> [53417.887358]  [<ffffffff80233e20>] default_wake_function+0x0/0x10
> [53417.893384]  [<ffffffff8020c37e>] system_call+0x7e/0x83
> [53417.898628]
> [53417.900134]
> [53417.900134] Code: 48 8b 11 48 89 cf 65 48 8b 04 25 08 00 00 00 4a 89
> 14 20 ff
> [53417.909430] RIP  [<ffffffff80243643>] __do_softirq+0xc3/0x150
> [53417.915210]  RSP <ffff8103314f3f20>
> 
> The disassembly where it crashed:
> /local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:273
> ffffffff8024361b:       d1 ed                   shr    %ebp
> rcu_bh_qsctr_inc():
> /local/home/bmb/doc/kernels/linux-hardy-eric/include/linux/rcupdate.h:130
> ffffffff8024361d:       48 8b 40 08             mov    0x8(%rax),%rax
> ffffffff80243621:       41 c7 44 05 08 01 00    movl  
> $0x1,0x8(%r13,%rax,1)
> ffffffff80243628:       00 00
> __do_softirq():
> /local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:273
> ffffffff8024362a:       75 d8                   jne    ffffffff80243604
> <__do_softirq+0x84>
> softirq_delay_exec():
> /local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:225
> ffffffff8024362c:       48 8b 14 24             mov    (%rsp),%rdx
> ffffffff80243630:       65 48 8b 04 25 08 00    mov    %gs:0x8,%rax
> ffffffff80243637:       00 00
> ffffffff80243639:       48 8b 0c 10             mov    (%rax,%rdx,1),%rcx
> ffffffff8024363d:       48 83 f9 01             cmp    $0x1,%rcx
> ffffffff80243641:       74 29                   je     ffffffff8024366c
> <__do_softirq+0xec>
> /local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:226
> ffffffff80243643:       48 8b 11                mov    (%rcx),%rdx
> /local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:227
> ffffffff80243646:       48 89 cf                mov    %rcx,%rdi
> /local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:226
> ffffffff80243649:       65 48 8b 04 25 08 00    mov    %gs:0x8,%rax
> ffffffff80243650:       00 00
> ffffffff80243652:       4a 89 14 20             mov    %rdx,(%rax,%r12,1)
> /local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:227
> ffffffff80243656:       ff 51 08                callq  *0x8(%rcx)
> /local/home/bmb/doc/kernels/linux-hardy-eric/kernel/softirq.c:225
> ffffffff80243659:       65 48 8b 04 25 08 00    mov    %gs:0x8,%rax
> ffffffff80243660:       00 00
> ffffffff80243662:       4a 8b 0c 20             mov    (%rax,%r12,1),%rcx
> ffffffff80243666:       48 83 f9 01             cmp    $0x1,%rcx
> ffffffff8024366a:       75 d7                   jne    ffffffff80243643
> <__do_softirq+0xc3>
> raw_local_irq_disable():
> /local/home/bmb/doc/kernels/linux-hardy-eric/debian/build/build-generic/include2/asm/irqflags_64.h:76
> 
> ffffffff8024366c:       fa                      cli
> 
> And softirq.c line numbers:
>    218   * Because locking is provided by subsystem, please note
>    219   * that sdel->func(sdel) is responsible for setting sdel->next
> to NULL
>    220   */
>    221  static void softirq_delay_exec(void)
>    222  {
>    223          struct softirq_delay *sdel;
>    224
>    225          while ((sdel = __get_cpu_var(softirq_delay_head)) !=
> SOFTIRQ_DELAY_END) {
>    226                  __get_cpu_var(softirq_delay_head) = sdel->next;
>    227                  sdel->func(sdel);       /*      sdel->next =
> NULL;*/
>    228                  }
>    229  }
> 
> So it's crashing because __get_cpu_var(softirq_delay_head) is NULL somehow.
> 
> We aren't running a recent kernel -- we're running Ubuntu Hardy's
> 2.6.24-19,
> with a backported version of this patch. One more atypical thing is that
> we run openafs, 1.4.6.dfsg1-2.
> 
> Like I said, I have a full vmcore (3, actually) and would be happy to
> post any
> more information you'd like to know.
> 
> Thanks,
> Brian Bloniarz

Hi Brian

2.6.24-19 kernel... hmm...

Could you please send me the diff of your backport against this kernel?

I take it you use the Ubuntu Hardy 8.04 LTS server edition?

The pointer being NULL might tell us that we managed to call inet_def_readable()
without the socket lock held...
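
To make that hypothesis concrete: if two CPUs can call inet_def_readable() for the same socket at once (i.e. with nothing serializing them), the same sk_delay entry can be linked twice and later have its ->next cleared while still listed, leaving a NULL where the 0x1 sentinel should be -- the same shape as the oops above. Below is a small self-contained userspace simulation of one such interleaving; the struct and sentinel mirror the patch, and this only illustrates the hypothesis, it is not an analysis of the captured vmcore.

#include <stdio.h>

struct softirq_delay { struct softirq_delay *next; };
#define SOFTIRQ_DELAY_END ((struct softirq_delay *)1L)

int main(void)
{
	struct softirq_delay *head0 = SOFTIRQ_DELAY_END;	/* CPU0's per-cpu list */
	struct softirq_delay *head1 = SOFTIRQ_DELAY_END;	/* CPU1's per-cpu list */
	struct softirq_delay sk_delay = { NULL };		/* one socket's sk_delay */

	/*
	 * Both CPUs run softirq_delay_queue(&sk_delay) with nothing serializing
	 * them: both observe sk_delay.next == NULL and both take the "queue it"
	 * branch.
	 */
	sk_delay.next = head0;		/* CPU0 links the entry ...         */
	head0 = &sk_delay;
	sk_delay.next = head1;		/* ... and CPU1 overwrites the link */
	head1 = &sk_delay;

	/*
	 * CPU1's softirq_delay_exec() pops the entry, and its callback
	 * (sock_readable_defer) sets ->next back to NULL.
	 */
	head1 = sk_delay.next;
	sk_delay.next = NULL;

	/*
	 * CPU0's softirq_delay_exec() now pops the same entry and loads
	 * head0 = sk_delay.next == NULL instead of the sentinel; its next loop
	 * iteration dereferences that NULL sdel, as at the RIP above.
	 */
	head0 = sk_delay.next;
	printf("cpu0 head after pop: %p (should be the 0x1 sentinel), cpu1 head: %p\n",
	       (void *)head0, (void *)head1);
	return 0;
}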



* Re: Multicast packet loss
  2009-04-05 13:49                                                     ` Eric Dumazet
@ 2009-04-06 21:53                                                       ` Brian Bloniarz
  2009-04-06 22:12                                                         ` Brian Bloniarz
  2009-04-07 20:08                                                       ` Brian Bloniarz
  1 sibling, 1 reply; 70+ messages in thread
From: Brian Bloniarz @ 2009-04-06 21:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Eric Dumazet wrote:
> The pointer being NULL might tell us that we managed to call inet_def_readable()
> without the socket lock held...

Trying to track this down: I added:
	BUG_ON(!spin_is_locked(&sk->sk_lock.slock));
to the top of inet_def_readable. This gives me the following panic:

[ 2528.745311] kernel BUG at net/core/sock.c:1674!
[ 2528.745311] invalid opcode: 0000 [#1] PREEMPT SMP
[ 2528.745311] last sysfs file: /sys/devices/system/cpu/cpu7/crash_notes
[ 2528.745311] CPU 6
[ 2528.745311] Modules linked in: iptable_filter ip_tables x_tables parport_pc lp parport loop iTCO_wdt iTCO_vendor_support serio_raw psmouse pcspkr i5k_amb shpchp i5000_edac pci_hotplug button edac_core ipv6 ibmpex joydev ipmi_msghandler evdev ext3 jbd mbcache usbhid hid sr_mod cdrom pata_acpi ata_generic sg sd_mod ata_piix ehci_hcd uhci_hcd libata aacraid usbcore scsi_mod bnx2 thermal processor fan thermal_sys fuse
[ 2528.745311] Pid: 14507, comm: signalgen Not tainted 2.6.29.1-eric2-lowlat-lockdep #3 IBM System x3550 -[7978AC1]-
[ 2528.745311] RIP: 0010:[<ffffffff80444ec2>]  [<ffffffff80444ec2>] inet_def_readable+0x52/0x60
[ 2528.745311] RSP: 0018:ffff88043b985b58  EFLAGS: 00010246
[ 2528.745311] RAX: 0000000000000019 RBX: ffff88043b90c280 RCX: 0000000000000000
[ 2528.745311] RDX: 0000000000001919 RSI: 0000000000000068 RDI: ffff88043b90c280
[ 2528.745311] RBP: ffff88043b985b68 R08: 0000000000000000 R09: 0000000000000000
[ 2528.745311] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88043b811400
[ 2528.745311] R13: 0000000000000000 R14: 0000000000000068 R15: 0000000000000000
[ 2528.745311] FS:  00007f82f0742750(0000) GS:ffff88043dbc8280(0000) knlGS:0000000000000000
[ 2528.745311] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2528.745311] CR2: 000000000057f1a0 CR3: 000000043915e000 CR4: 00000000000406e0
[ 2528.745311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2528.745311] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2528.745311] Process signalgen (pid: 14507, threadinfo ffff88043b984000, task ffff8804309a9ef0)
[ 2528.745311] Stack:
[ 2528.745311]  ffff88043b811400 ffff88043b90c280 ffff88043b985b98 ffffffff80444ff6
[ 2528.745311]  ffff88043b90c280 ffff88043b811400 0000000000000000 ffff88043b90c2c0
[ 2528.745311]  ffff88043b985bc8 ffffffff8049ee67 ffff88043b985bc8 ffff88043b811400
[ 2528.745311] Call Trace:
[ 2528.745311]  [<ffffffff80444ff6>] sock_queue_rcv_skb+0xd6/0x120
[ 2528.745311]  [<ffffffff8049ee67>] __udp_queue_rcv_skb+0x27/0xe0
[ 2528.745311]  [<ffffffff8044406a>] release_sock+0x7a/0xe0
[ 2528.745311]  [<ffffffff804a1d0d>] udp_recvmsg+0x1ed/0x330
[ 2528.745311]  [<ffffffff804437e2>] sock_common_recvmsg+0x32/0x50
[ 2528.745311]  [<ffffffff80441449>] sock_recvmsg+0x139/0x150
[ 2528.745311]  [<ffffffff8025a590>] ? autoremove_wake_function+0x0/0x40
[ 2528.745311]  [<ffffffff8026c4d9>] ? validate_chain+0x469/0x1270
[ 2528.745311]  [<ffffffff8026d60e>] ? __lock_acquire+0x32e/0xa40
[ 2528.745311]  [<ffffffff804429df>] sys_recvfrom+0xaf/0x110
[ 2528.745311]  [<ffffffff804e6109>] ? mutex_unlock+0x9/0x10
[ 2528.745311]  [<ffffffff80310041>] ? sys_epoll_wait+0x4a1/0x510
[ 2528.745311]  [<ffffffff8020c55b>] system_call_fastpath+0x16/0x1b
[ 2528.745311] Code: 85 c0 7e 1b 48 8d bf 98 02 00 00 e8 29 34 e0 ff 85 c0 74 04 f0 ff 43 28 48 83 c4 08 5b c9 c3 e8 15 f3 ff ff 48 83 c4 08 5b c9 c3 <0f> 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec
[ 2528.745311] RIP  [<ffffffff80444ec2>] inet_def_readable+0x52/0x60
[ 2528.745311]  RSP <ffff88043b985b58>

Looks to me like __release_sock will call sk_backlog_rcv() with
the socket unlocked -- does that help at all?

Thanks,
Brian Bloniarz


* Re: Multicast packet loss
  2009-04-06 21:53                                                       ` Brian Bloniarz
@ 2009-04-06 22:12                                                         ` Brian Bloniarz
  0 siblings, 0 replies; 70+ messages in thread
From: Brian Bloniarz @ 2009-04-06 22:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Brian Bloniarz wrote:
>     BUG_ON(!spin_is_locked(&sk->sk_lock.slock));

Oh, sorry, I think I'm just misunderstanding how the socket
lock works. This doesn't actually check that the socket is locked,
right?

-Brian
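
For reference, the distinction Brian ran into: sk->sk_lock is two-level. lock_sock() briefly takes the inner spinlock sk_lock.slock, sets the sk_lock.owned flag, and then releases slock again before returning, so a socket that is "locked" from process context normally has slock free and spin_is_locked(&sk->sk_lock.slock) is false. The softirq receive path instead holds slock via bh_lock_sock() and never sets owned, and release_sock() drops slock while it feeds the backlog through sk_backlog_rcv() with owned still set. A sketch of an assertion that covers both callers (illustration only, based on the 2.6.29-era lock_sock()/bh_lock_sock() semantics, not a suggested patch):

#include <net/sock.h>

/* "Locked" means either owned from process context or slock held from softirq */
static inline void assert_sock_locked(struct sock *sk)
{
	WARN_ON(!sock_owned_by_user(sk) &&
		!spin_is_locked(&sk->sk_lock.slock));
}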


* Re: Multicast packet loss
  2009-04-05 13:49                                                     ` Eric Dumazet
  2009-04-06 21:53                                                       ` Brian Bloniarz
@ 2009-04-07 20:08                                                       ` Brian Bloniarz
  2009-04-08  8:12                                                         ` Eric Dumazet
  1 sibling, 1 reply; 70+ messages in thread
From: Brian Bloniarz @ 2009-04-07 20:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kchang, netdev, cl

Eric Dumazet wrote:
 > Brian Bloniarz wrote:
 >> We've been experimenting with this softirq-delay patch in production, and
 >> have seen some hard-to-reproduce crashes. We finally managed to capture a
 >> kexec crashdump this morning.
 >
 > The pointer being NULL might tell us that we managed to call inet_def_readable()
 > without the socket lock held...

False alarm -- I think I did the backport to 2.6.24 incorrectly. 2.6.24 was
before the UDP receive path started taking the socket lock, so
inet_def_readable's assumption doesn't hold.

Sorry to waste everyone's time.

Thanks,
Brian Bloniarz
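
For context, the assumption in question: inet_def_readable() relies on sk->sk_data_ready() being invoked under the socket lock, which the newer UDP receive path provides but the 2.6.24 one does not. Very roughly, a condensed sketch of the two delivery paths; the _2624/_2629 names are made up for the comparison and the real functions do more work:

#include <net/sock.h>
#include <linux/skbuff.h>

/* static in net/ipv4/udp.c; declared here only so the sketch is self-contained */
extern int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);

/* 2.6.24-era delivery: the skb is queued and sk->sk_data_ready() runs with no
 * socket lock, so two CPUs can hit the same socket -- and the same sk_delay --
 * at the same time. */
static int udp_deliver_2624(struct sock *sk, struct sk_buff *skb)
{
	return sock_queue_rcv_skb(sk, skb);
}

/* 2.6.29-era delivery: serialized by bh_lock_sock() and the backlog, which is
 * what the deferral patch was written against. */
static int udp_deliver_2629(struct sock *sk, struct sk_buff *skb)
{
	int rc = 0;

	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk))
		rc = __udp_queue_rcv_skb(sk, skb);	/* queues + sk->sk_data_ready() */
	else
		sk_add_backlog(sk, skb);	/* the owner delivers it in release_sock() */
	bh_unlock_sock(sk);
	return rc;
}

With the 2.6.24-style path, running_from_softirq() is still true but nothing prevents concurrent queueing of the same sk_delay, which would explain how the per-CPU delay list could be corrupted on the backport.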


* Re: Multicast packet loss
  2009-04-07 20:08                                                       ` Brian Bloniarz
@ 2009-04-08  8:12                                                         ` Eric Dumazet
  0 siblings, 0 replies; 70+ messages in thread
From: Eric Dumazet @ 2009-04-08  8:12 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: David Miller, kchang, netdev, cl

Brian Bloniarz wrote:
> Eric Dumazet wrote:
>> Brian Bloniarz wrote:
>>> We've been experimenting with this softirq-delay patch in production, and
>>> have seen some hard-to-reproduce crashes. We finally managed to capture a
>>> kexec crashdump this morning.
>>
>> The pointer being NULL might tell us that we managed to call inet_def_readable()
>> without the socket lock held...
> 
> False alarm -- I think I did the backport to 2.6.24 incorrectly. 2.6.24 was
> before the UDP receive path started taking the socket lock, so
> inet_def_readable's assumption doesn't hold.
> 
> Sorry to waste everyone's time.
> 

Thanks for doing this discovery work and analysis. 

I am currently away from my computers and cannot look into this until next week.

So if you want to use 2.6.24, we need to backport other patches as well?



* Re: Multicast packet loss
@ 2009-04-05 14:42 bmb
  0 siblings, 0 replies; 70+ messages in thread
From: bmb @ 2009-04-05 14:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kchang, netdev, cl

> Could you please send me the diff of your backport against this kernel?

Sure, the patch is at the bottom. It's against the tag: Ubuntu-2.6.24-19.41
from git://kernel.ubuntu.com/ubuntu/ubuntu-hardy.git

> I take it you use the Ubuntu Hardy 8.04 LTS server edition?

Yes. We can only reproduce this in production right now, so trying out
a newer kernel would take some effort.

Thanks,
Brian Bloniarz

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 2306920..b79a207 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -276,6 +276,24 @@ extern void FASTCALL(raise_softirq_irqoff(unsigned int nr));
 extern void FASTCALL(raise_softirq(unsigned int nr));


+/*
+ * softirq delayed works : should be delayed at do_softirq() end
+ */
+struct softirq_delay {
+	struct softirq_delay	*next;
+	void 			(*func)(struct softirq_delay *);
+};
+
+int softirq_delay_queue(struct softirq_delay *sdel);
+
+static inline void softirq_delay_init(struct softirq_delay *sdel,
+				      void (*func)(struct softirq_delay *))
+{
+	sdel->next = NULL;
+	sdel->func = func;
+}
+
+
 /* Tasklets --- multithreaded analogue of BHs.

    Main feature differing them of generic softirqs: tasklet
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 412e025..f7b48a1 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -11,19 +11,21 @@
 #ifndef _LINUX_TRACE_IRQFLAGS_H
 #define _LINUX_TRACE_IRQFLAGS_H

+#define softirq_enter()	do { current->softirq_context++; } while (0)
+#define softirq_exit()	do { current->softirq_context--; } while (0)
+#define softirq_context(p)	((p)->softirq_context)
+#define running_from_softirq()  (softirq_context(current) > 0)
+
 #ifdef CONFIG_TRACE_IRQFLAGS
   extern void trace_hardirqs_on(void);
   extern void trace_hardirqs_off(void);
   extern void trace_softirqs_on(unsigned long ip);
   extern void trace_softirqs_off(unsigned long ip);
 # define trace_hardirq_context(p)	((p)->hardirq_context)
-# define trace_softirq_context(p)	((p)->softirq_context)
 # define trace_hardirqs_enabled(p)	((p)->hardirqs_enabled)
 # define trace_softirqs_enabled(p)	((p)->softirqs_enabled)
 # define trace_hardirq_enter()	do { current->hardirq_context++; } while (0)
 # define trace_hardirq_exit()	do { current->hardirq_context--; } while (0)
-# define trace_softirq_enter()	do { current->softirq_context++; } while (0)
-# define trace_softirq_exit()	do { current->softirq_context--; } while (0)
 # define INIT_TRACE_IRQFLAGS	.softirqs_enabled = 1,
 #else
 # define trace_hardirqs_on()		do { } while (0)
@@ -31,13 +33,10 @@
 # define trace_softirqs_on(ip)		do { } while (0)
 # define trace_softirqs_off(ip)		do { } while (0)
 # define trace_hardirq_context(p)	0
-# define trace_softirq_context(p)	0
 # define trace_hardirqs_enabled(p)	0
 # define trace_softirqs_enabled(p)	0
 # define trace_hardirq_enter()		do { } while (0)
 # define trace_hardirq_exit()		do { } while (0)
-# define trace_softirq_enter()		do { } while (0)
-# define trace_softirq_exit()		do { } while (0)
 # define INIT_TRACE_IRQFLAGS
 #endif

diff --git a/include/linux/net.h b/include/linux/net.h
index 596131e..9e762e9 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -115,12 +115,16 @@ enum sock_shutdown_cmd {
 struct socket {
 	socket_state		state;
 	unsigned long		flags;
-	const struct proto_ops	*ops;
+	/*
+	 * Please keep fasync_list & wait fields in the same cache line
+	 */
 	struct fasync_struct	*fasync_list;
+	wait_queue_head_t	wait;
+
 	struct file		*file;
 	struct sock		*sk;
-	wait_queue_head_t	wait;
 	short			type;
+	const struct proto_ops	*ops;
 };

 struct vm_area_struct;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc2f4fa..bf0ff49 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1111,8 +1111,8 @@ struct task_struct {
 	unsigned long softirq_enable_ip;
 	unsigned int softirq_enable_event;
 	int hardirq_context;
-	int softirq_context;
 #endif
+	int softirq_context;
 #ifdef CONFIG_LOCKDEP
 # define MAX_LOCK_DEPTH 30UL
 	u64 curr_chain_key;
diff --git a/include/net/sock.h b/include/net/sock.h
index 6e1542d..fb0f719 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -236,6 +236,7 @@ struct sock {
 	unsigned long	        sk_lingertime;
 	struct sk_buff_head	sk_error_queue;
 	struct proto		*sk_prot_creator;
+	struct softirq_delay	sk_delay;
 	rwlock_t		sk_callback_lock;
 	int			sk_err,
 				sk_err_soft;
@@ -859,6 +860,7 @@ extern void *sock_kmalloc(struct sock *sk, int size,
 			  gfp_t priority);
 extern void sock_kfree_s(struct sock *sk, void *mem, int size);
 extern void sk_send_sigurg(struct sock *sk);
+extern void inet_def_readable(struct sock *sk, int len);

 /*
  * Functions to fill in entries in struct proto_ops when a protocol
diff --git a/include/net/udplite.h b/include/net/udplite.h
index 635b0ea..1589817 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -28,6 +28,7 @@ static __inline__ int udplite_getfrag(void *from, char *to, int  offset,
 /* Designate sk as UDP-Lite socket */
 static inline int udplite_sk_init(struct sock *sk)
 {
+	sk->sk_data_ready = inet_def_readable;
 	udp_sk(sk)->pcflag = UDPLITE_BIT;
 	return 0;
 }
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index e2c07ec..decb1f7 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -1643,7 +1643,7 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,
 	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
 		curr->comm, task_pid_nr(curr),
 		trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
-		trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
+		softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
 		trace_hardirqs_enabled(curr),
 		trace_softirqs_enabled(curr));
 	print_lock(this);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bd89bc4..fb116ac 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -194,6 +194,42 @@ void local_bh_enable_ip(unsigned long ip)
 }
 EXPORT_SYMBOL(local_bh_enable_ip);

+
+#define SOFTIRQ_DELAY_END (struct softirq_delay *)1L
+static DEFINE_PER_CPU(struct softirq_delay *, softirq_delay_head) = {
+	SOFTIRQ_DELAY_END
+};
+
+/*
+ * Caller must disable preemption, and take care of appropriate
+ * locking and refcounting
+ */
+int softirq_delay_queue(struct softirq_delay *sdel)
+{
+	if (!sdel->next) {
+		sdel->next = __get_cpu_var(softirq_delay_head);
+		__get_cpu_var(softirq_delay_head) = sdel;
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * Because locking is provided by subsystem, please note
+ * that sdel->func(sdel) is responsible for setting sdel->next to NULL
+ */
+static void softirq_delay_exec(void)
+{
+	struct softirq_delay *sdel;
+
+	while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
+		__get_cpu_var(softirq_delay_head) = sdel->next;
+		sdel->func(sdel);	/*	sdel->next = NULL;*/
+		}
+}
+
+
+
 /*
  * We restart softirq processing MAX_SOFTIRQ_RESTART times,
  * and we fall back to softirqd after that.
@@ -216,7 +252,7 @@ asmlinkage void __do_softirq(void)
 	account_system_vtime(current);

 	__local_bh_disable((unsigned long)__builtin_return_address(0));
-	trace_softirq_enter();
+	softirq_enter();

 	cpu = smp_processor_id();
 restart:
@@ -236,6 +272,8 @@ restart:
 		pending >>= 1;
 	} while (pending);

+	softirq_delay_exec();
+
 	local_irq_disable();

 	pending = local_softirq_pending();
@@ -245,7 +283,7 @@ restart:
 	if (pending)
 		wakeup_softirqd();

-	trace_softirq_exit();
+	softirq_exit();

 	account_system_vtime(current);
 	_local_bh_enable();
diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index 280332c..1aa7351 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -157,11 +157,11 @@ static void init_shared_classes(void)
 #define SOFTIRQ_ENTER()				\
 		local_bh_disable();		\
 		local_irq_disable();		\
-		trace_softirq_enter();		\
+		softirq_enter();		\
 		WARN_ON(!in_softirq());

 #define SOFTIRQ_EXIT()				\
-		trace_softirq_exit();		\
+		softirq_exit();		\
 		local_irq_enable();		\
 		local_bh_enable();

diff --git a/net/core/sock.c b/net/core/sock.c
index c519b43..cb70343 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -213,6 +213,8 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);

+static void sock_readable_defer(struct softirq_delay *sdel);
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -996,6 +998,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 #endif

 		rwlock_init(&newsk->sk_dst_lock);
+		softirq_delay_init(&newsk->sk_delay, sock_readable_defer);
 		rwlock_init(&newsk->sk_callback_lock);
 		lockdep_set_class_and_name(&newsk->sk_callback_lock,
 				af_callback_keys + newsk->sk_family,
@@ -1509,6 +1512,45 @@ static void sock_def_readable(struct sock *sk, int len)
 	read_unlock(&sk->sk_callback_lock);
 }

+/*
+ * helper function called by softirq_delay_exec(),
+ * if inet_def_readable() queued us.
+ */
+static void sock_readable_defer(struct softirq_delay *sdel)
+{
+	struct sock *sk = container_of(sdel, struct sock, sk_delay);
+
+	sdel->next = NULL;
+	/*
+	 * At this point, we dont own a lock on socket, only a reference.
+	 * We must commit above write, or another cpu could miss a wakeup
+	 */
+	smp_wmb();
+	sock_def_readable(sk, 0);
+	sock_put(sk);
+}
+
+/*
+ * Custom version of sock_def_readable()
+ * We want to defer scheduler processing at the end of do_softirq()
+ * Called with socket locked.
+ */
+void inet_def_readable(struct sock *sk, int len)
+{
+	if (running_from_softirq()) {
+		if (softirq_delay_queue(&sk->sk_delay))
+			/*
+			 * If we queued this socket, take a reference on it
+			 * Caller owns socket lock, so write to sk_delay.next
+			 * will be committed before unlock.
+			 */
+			sock_hold(sk);
+	} else
+		sock_def_readable(sk, len);
+}
+
+EXPORT_SYMBOL(inet_def_readable);
+
 static void sock_def_write_space(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
@@ -1586,6 +1628,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 		sk->sk_sleep	=	NULL;

 	rwlock_init(&sk->sk_dst_lock);
+	softirq_delay_init(&sk->sk_delay, sock_readable_defer);
 	rwlock_init(&sk->sk_callback_lock);
 	lockdep_set_class_and_name(&sk->sk_callback_lock,
 			af_callback_keys + sk->sk_family,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 03c400c..cfeb051 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1226,6 +1226,12 @@ int udp_destroy_sock(struct sock *sk)
 	return 0;
 }

+static int udp_init_sock(struct sock *sk)
+{
+	sk->sk_data_ready = inet_def_readable;
+	return 0;
+}
+
 /*
  *	Socket option code for UDP
  */
@@ -1439,6 +1445,7 @@ struct proto udp_prot = {
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init		   = udp_init_sock,
 	.destroy	   = udp_destroy_sock,
 	.setsockopt	   = udp_setsockopt,
 	.getsockopt	   = udp_getsockopt,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index ee1cc3f..fa9ce73 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -856,6 +856,12 @@ int udpv6_destroy_sock(struct sock *sk)
 	return 0;
 }

+static int udpv6_init_sock(struct sock *sk)
+{
+	sk->sk_data_ready = inet_def_readable;
+	return 0;
+}
+
 /*
  *	Socket option code for UDP
  */
@@ -979,6 +985,7 @@ struct proto udpv6_prot = {
 	.connect	   = ip6_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init 		   = udpv6_init_sock,
 	.destroy	   = udpv6_destroy_sock,
 	.setsockopt	   = udpv6_setsockopt,
 	.getsockopt	   = udpv6_getsockopt,



end of thread

Thread overview: 70+ messages
2009-01-30 17:49 Multicast packet loss Kenny Chang
2009-01-30 19:04 ` Eric Dumazet
2009-01-30 19:17 ` Denys Fedoryschenko
2009-01-30 20:03 ` Neil Horman
2009-01-30 22:29   ` Kenny Chang
2009-01-30 22:41     ` Eric Dumazet
2009-01-31 16:03       ` Neil Horman
2009-02-02 16:13         ` Kenny Chang
2009-02-02 16:48         ` Kenny Chang
2009-02-03 11:55           ` Neil Horman
2009-02-03 15:20             ` Kenny Chang
2009-02-04  1:15               ` Neil Horman
2009-02-04 16:07                 ` Kenny Chang
2009-02-04 16:46                   ` Wesley Chow
2009-02-04 18:11                     ` Eric Dumazet
2009-02-05 13:33                       ` Neil Horman
2009-02-05 13:46                         ` Wesley Chow
2009-02-05 13:29                   ` Neil Horman
2009-02-01 12:40       ` Eric Dumazet
2009-02-02 13:45         ` Neil Horman
2009-02-02 16:57           ` Eric Dumazet
2009-02-02 18:22             ` Neil Horman
2009-02-02 19:51               ` Wes Chow
2009-02-02 20:29                 ` Eric Dumazet
2009-02-02 21:09                   ` Wes Chow
2009-02-02 21:31                     ` Eric Dumazet
2009-02-03 17:34                       ` Kenny Chang
2009-02-04  1:21                         ` Neil Horman
2009-02-26 17:15                           ` Kenny Chang
2009-02-28  8:51                             ` Eric Dumazet
2009-03-01 17:03                               ` Eric Dumazet
2009-03-04  8:16                               ` David Miller
2009-03-04  8:36                                 ` Eric Dumazet
2009-03-07  7:46                                   ` Eric Dumazet
2009-03-08 16:46                                     ` Eric Dumazet
2009-03-09  2:49                                       ` David Miller
2009-03-09  6:36                                         ` Eric Dumazet
2009-03-13 21:51                                           ` David Miller
2009-03-13 22:30                                             ` Eric Dumazet
2009-03-13 22:38                                               ` David Miller
2009-03-13 22:45                                                 ` Eric Dumazet
2009-03-14  9:03                                                   ` [PATCH] net: reorder fields of struct socket Eric Dumazet
2009-03-16  2:59                                                     ` David Miller
2009-03-16 22:22                                                 ` Multicast packet loss Eric Dumazet
2009-03-17 10:11                                                   ` Peter Zijlstra
2009-03-17 11:08                                                     ` Eric Dumazet
2009-03-17 11:57                                                       ` Peter Zijlstra
2009-03-17 15:00                                                       ` Brian Bloniarz
2009-03-17 15:16                                                         ` Eric Dumazet
2009-03-17 19:39                                                           ` David Stevens
2009-03-17 21:19                                                             ` Eric Dumazet
2009-04-03 19:28                                                   ` Brian Bloniarz
2009-04-05 13:49                                                     ` Eric Dumazet
2009-04-06 21:53                                                       ` Brian Bloniarz
2009-04-06 22:12                                                         ` Brian Bloniarz
2009-04-07 20:08                                                       ` Brian Bloniarz
2009-04-08  8:12                                                         ` Eric Dumazet
2009-03-09 22:56                                       ` Brian Bloniarz
2009-03-10  5:28                                         ` Eric Dumazet
2009-03-10 23:22                                           ` Brian Bloniarz
2009-03-11  3:00                                             ` Eric Dumazet
2009-03-12 15:47                                               ` Brian Bloniarz
2009-03-12 16:34                                                 ` Eric Dumazet
2009-02-27 18:40       ` Christoph Lameter
2009-02-27 18:56         ` Eric Dumazet
2009-02-27 19:45           ` Christoph Lameter
2009-02-27 20:12             ` Eric Dumazet
2009-02-27 21:36               ` Eric Dumazet
2009-02-02 13:53     ` Eric Dumazet
2009-04-05 14:42 bmb
