* RE: Is bug 200755 in anyone's queue??
       [not found] ` <20190716070246.0745ee6f@hermes.lan>
@ 2019-08-23 19:02   ` Steve Zabele
  2019-08-29 19:26     ` Willem de Bruijn
  2019-08-23 19:04   ` Steve Zabele
  1 sibling, 1 reply; 18+ messages in thread
From: Steve Zabele @ 2019-08-23 19:02 UTC (permalink / raw)
  To: netdev
  Cc: shum, vladimir116, saifi.khan, saifi.khan, daniel, on2k16nm,
	'Stephen Hemminger'

Hi folks,

Is there a way to find out where the SO_REUSEPORT bug reported a year ago in
August (which has apparently been present in kernels later than 4.4) is being
addressed?

The bug characteristics, simple standalone test code demonstrating the bug,
and an assessment of the likely location/cause of the bug within the kernel
are all described here:

https://bugzilla.kernel.org/show_bug.cgi?id=200755

I'm really hoping this gets fixed so we can move forward on updating our
kernels/Ubuntu release from our aging 4.4/16.04 release.

Thanks!

Steve



-----Original Message-----
From: Stephen Hemminger [mailto:stephen@networkplumber.org] 
Sent: Tuesday, July 16, 2019 10:03 AM
To: Steve Zabele
Cc: shum@canndrew.org; vladimir116@gmail.com; saifi.khan@DataSynergy.org;
saifi.khan@strikr.in; daniel@iogearbox.net; on2k16nm@gmail.com
Subject: Re: Is bug 200755 in anyone's queue??

On Tue, 16 Jul 2019 09:43:24 -0400
"Steve Zabele" <zabele@comcast.net> wrote:


> I came across bug report 200755 trying to figure out why some code I had
> provided to customers a while ago no longer works with the current Linux
> kernel. See
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=200755
> 
> I've verified that, as reported, 'connect' no longer works for UDP.
> Moreover, it appears it has been broken since the 4.5 kernel has been
> released. 
> 
>  
> 
> It does also appear that the intended new feature of doing round robin
> assignments to different UDP sockets opened with SO_REUSEPORT also does
> not
> work as described.
> 
>  
> 
> Since the original bug report was made nearly a year ago for the 4.14
> kernel
> (and the bug is also still present in the 4.15 kernel) I'm curious if
> anyone
> is on the hook to get this fixed any time soon.
> 
>  
> 
> I'd rather not have to do my own demultiplexing using a single socket in
> user space to work around what is clearly a (maybe not so recently
> introduced) kernel bug if at all possible. My code had worked just fine on
> 3.X kernels, and appears to work okay up through 4.4. 
> 

Kernel developers do not use bugzilla, I forward bug reports
to netdev@vger.kernel.org (after filtering).



* RE: Is bug 200755 in anyone's queue??
       [not found] ` <20190716070246.0745ee6f@hermes.lan>
  2019-08-23 19:02   ` Is bug 200755 in anyone's queue?? Steve Zabele
@ 2019-08-23 19:04   ` Steve Zabele
  1 sibling, 0 replies; 18+ messages in thread
From: Steve Zabele @ 2019-08-23 19:04 UTC (permalink / raw)
  To: 'Steve Zabele', netdev
  Cc: shum, vladimir116, saifi.khan, daniel, on2k16nm,
	'Stephen Hemminger'

Hi folks,

Is there a way to find out where the SO_REUSEPORT bug reported a year ago in
August (which has apparently been present in kernels later than 4.4) is being
addressed?

The bug characteristics, simple standalone test code demonstrating the bug,
and an assessment of the likely location/cause of the bug within the kernel
are all described here:

https://bugzilla.kernel.org/show_bug.cgi?id=200755

I'm really hoping this gets fixed so we can move forward on updating our
kernels/Ubuntu release from our aging 4.4/16.04 release.

Thanks!

Steve



-----Original Message-----
From: Stephen Hemminger [mailto:stephen@networkplumber.org] 
Sent: Tuesday, July 16, 2019 10:03 AM
To: Steve Zabele
Cc: shum@canndrew.org; vladimir116@gmail.com; saifi.khan@DataSynergy.org;
saifi.khan@strikr.in; daniel@iogearbox.net; on2k16nm@gmail.com
Subject: Re: Is bug 200755 in anyone's queue??

On Tue, 16 Jul 2019 09:43:24 -0400
"Steve Zabele" <zabele@comcast.net> wrote:


> I came across bug report 200755 trying to figure out why some code I had
> provided to customers a while ago no longer works with the current Linux
> kernel. See
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=200755
> 
> I've verified that, as reported, 'connect' no longer works for UDP.
> Moreover, it appears it has been broken since the 4.5 kernel has been
> released. 
> 
>  
> 
> It does also appear that the intended new feature of doing round robin
> assignments to different UDP sockets opened with SO_REUSEPORT also does
> not
> work as described.
> 
>  
> 
> Since the original bug report was made nearly a year ago for the 4.14
> kernel
> (and the bug is also still present in the 4.15 kernel) I'm curious if
> anyone
> is on the hook to get this fixed any time soon.
> 
>  
> 
> I'd rather not have to do my own demultiplexing using a single socket in
> user space to work around what is clearly a (maybe not so recently
> introduced) kernel bug if at all possible. My code had worked just fine on
> 3.X kernels, and appears to work okay up through 4.4. 
> 

Kernel developers do not use bugzilla, I forward bug reports
to netdev@vger.kernel.org (after filtering).



* Re: Is bug 200755 in anyone's queue??
  2019-08-23 19:02   ` Is bug 200755 in anyone's queue?? Steve Zabele
@ 2019-08-29 19:26     ` Willem de Bruijn
  2019-08-30  8:48       ` Steve Zabele
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Willem de Bruijn @ 2019-08-29 19:26 UTC (permalink / raw)
  To: Steve Zabele
  Cc: Network Development, shum, vladimir116, saifi.khan, saifi.khan,
	Daniel Borkmann, on2k16nm, Stephen Hemminger

On Fri, Aug 23, 2019 at 3:11 PM Steve Zabele <zabele@comcast.net> wrote:
>
> Hi folks,
>
> Is there a way to find out where the SO_REUSEPORT bug reported a year ago in
> August (and apparently has been a bug with kernels later than 4.4) is being
> addressed?
>
> The bug characteristics, simple standalone test code demonstrating the bug,
> and an assessment of the likely location/cause of the bug within the kernel
> are all described here
>
> https://bugzilla.kernel.org/show_bug.cgi?id=200755
>
> I'm really hoping this gets fixed so we can move forward on updating our
> kernels/Ubuntu release from our aging 4.4/16.04 release
>
> Thanks!
>
> Steve
>
>
>
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, July 16, 2019 10:03 AM
> To: Steve Zabele
> Cc: shum@canndrew.org; vladimir116@gmail.com; saifi.khan@DataSynergy.org;
> saifi.khan@strikr.in; daniel@iogearbox.net; on2k16nm@gmail.com
> Subject: Re: Is bug 200755 in anyone's queue??
>
> On Tue, 16 Jul 2019 09:43:24 -0400
> "Steve Zabele" <zabele@comcast.net> wrote:
>
>
> > I came across bug report 200755 trying to figure out why some code I had
> > provided to customers a while ago no longer works with the current Linux
> > kernel. See
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=200755
> >
> > I've verified that, as reported, 'connect' no longer works for UDP.
> > Moreover, it appears it has been broken since the 4.5 kernel has been
> > released.
> >
> >
> >
> > It does also appear that the intended new feature of doing round robin
> > assignments to different UDP sockets opened with SO_REUSEPORT also does
> > not
> > work as described.
> >
> >
> >
> > Since the original bug report was made nearly a year ago for the 4.14
> > kernel
> > (and the bug is also still present in the 4.15 kernel) I'm curious if
> > anyone
> > is on the hook to get this fixed any time soon.
> >
> >
> >
> > I'd rather not have to do my own demultiplexing using a single socket in
> > user space to work around what is clearly a (maybe not so recently
> > introduced) kernel bug if at all possible. My code had worked just fine on
> > 3.X kernels, and appears to work okay up through 4.4.
> >
>
> Kernel developers do not use bugzilla, I forward bug reports
> to netdev@vger.kernel.org (after filtering).

SO_REUSEPORT was not intended to be used in this way. Opening
multiple connected sockets with the same local port.

But since the interface allowed connect after joining a group, and
that is being used, I guess that point is moot. Still, I'm a bit
surprised that it ever worked as described.

Also note that the default distribution algorithm is not round robin
assignment, but hash based. So multiple consecutive datagrams arriving
at the same socket is not unexpected.
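
For reference, the default selection is roughly as follows (a simplified
sketch of reuseport_select_sock(), ignoring the optional BPF hook and RCU
details; the hash comes from udp_ehashfn() over the packet's addresses and
ports):

    /* Flow-hash based pick: the same flow hash always maps to the same
     * index, and therefore the same socket -- not a round robin. */
    static struct sock *reuseport_pick(struct sock_reuseport *reuse, u32 hash)
    {
            u16 socks = READ_ONCE(reuse->num_socks);

            return socks ? reuse->socks[reciprocal_scale(hash, socks)] : NULL;
    }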

I suspect that this quick hack might "work". It seemed to on the
supplied .c file:

                  score = compute_score(sk, net, saddr, sport,
                                        daddr, hnum, dif, sdif);
                  if (score > badness) {
  -                       if (sk->sk_reuseport) {
  +                       if (sk->sk_reuseport && !sk->sk_state !=
TCP_ESTABLISHED) {

But a more robust approach, that also works on existing kernels, is to
swap the default distribution algorithm with a custom BPF based one (
SO_ATTACH_REUSEPORT_EBPF).
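
As an illustration of the attach mechanics, here is a minimal sketch using
the simpler classic-BPF variant (SO_ATTACH_REUSEPORT_CBPF); the program's
return value is the index of the group member that should receive the
packet:

    #include <linux/filter.h>
    #include <sys/socket.h>

    /* Steer each datagram to the group member whose index equals the CPU
     * that received it -- the canonical toy example. */
    static int attach_reuseport_cbpf(int fd)
    {
            struct sock_filter code[] = {
                    /* A = number of the CPU handling the packet */
                    { BPF_LD  | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
                    /* return A as the index into the group's socket array */
                    { BPF_RET | BPF_A, 0, 0, 0 },
            };
            struct sock_fprog prog = {
                    .len    = sizeof(code) / sizeof(code[0]),
                    .filter = code,
            };

            return setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                              &prog, sizeof(prog));
    }

The eBPF variant (SO_ATTACH_REUSEPORT_EBPF) takes a program fd obtained
from bpf(2) instead of a sock_fprog.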


* RE: Is bug 200755 in anyone's queue??
  2019-08-29 19:26     ` Willem de Bruijn
@ 2019-08-30  8:48       ` Steve Zabele
  2019-08-30  8:53       ` Steve Zabele
  2019-08-30  8:54       ` Eric Dumazet
  2 siblings, 0 replies; 18+ messages in thread
From: Steve Zabele @ 2019-08-30  8:48 UTC (permalink / raw)
  To: 'Willem de Bruijn'
  Cc: 'Network Development',
	shum, vladimir116, saifi.khan, saifi.khan,
	'Daniel Borkmann', on2k16nm, 'Stephen Hemminger',
	mark.keaton

Hi Willem!

**Thank you** for the reply and the code segment, very much appreciated.

Can we expect that this will make its way into a near-term official release of the kernel? Our customers are really not up to patching and rebuilding kernels, and a local patch "taints" the kernel from a security perspective. Whenever there is a new kernel release (you come in one morning and your kernel has been magically upgraded because you forgot to disable auto updates), you need to rebuild and hope that the previous patch is still good for the new code, etc.

Getting this onto the main branch as part of the official release cycle will be greatly appreciated!

Note that using an ebpf approach can't solve this problem (we know because we tried for quite a while to make it work, no luck). The key issue is that at the point when the ebpf filter gets the packet buffer reference, it is pointing to the start of the UDP portion of the packet, and hence is not able to access the IP source address, which is earlier in the buffer. Plus every time a new socket is opened or closed, a new ebpf has to be created and inserted -- and there is really no good way to figure out which index is (now) associated with which file descriptor.

So thank you and the group for your attention to this.

With respect to your comment:

> SO_REUSEPORT was not intended to be used in this way. Opening
> multiple connected sockets with the same local port.

I'd like to offer that there are a number of reliable transport protocols (alternatives to TCP) that use UDP. NORM (IETF RFC 5740) and Google's new QUIC protocol (https://www.ietf.org/blog/whats-happening-quic) are good examples.

Now consider that users of these protocols will want to create servers using these protocols -- a webserver is a good example. In fact Google has one running on QUIC, and many Chrome users don't even know they are using QUIC when they access Google webservers.

With a client-server model, clients contact the server at a well-known server address and port. Upon first contact from a new client, the server opens another socket with the same local address and port and "connects" to the client's address and ephemeral port so that only traffic for the given five-tuple arrives on the new file descriptor -- this allows the server application to keep concurrent sessions with different clients cleanly separated, even though all sessions use the same local server port. In fact, reusing the same port for different sessions is really important from a firewalling perspective.
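
A minimal sketch of that pattern (udp_accept() is a hypothetical helper
name; error handling omitted; the listening socket must likewise have set
SO_REUSEPORT before its bind):

    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* On first contact from a new client, open a second UDP socket bound
     * to the same local addr:port and connect() it to the client, so that
     * only that five-tuple's traffic arrives on the new descriptor. */
    static int udp_accept(const struct sockaddr_in *local,
                          const struct sockaddr_in *peer)
    {
            int one = 1;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            bind(fd, (const struct sockaddr *)local, sizeof(*local));
            connect(fd, (const struct sockaddr *)peer, sizeof(*peer));
            return fd;
    }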

This is pretty much what our application does, i.e., it uses different sockets/file descriptors to keep sessions straight.

And if it's worth anything, we have been using this mechanism with UDP for a *very* long time; the change in behavior appears to have happened with the 4.5 kernel.

So **thank you**!!

Steve

-----Original Message-----
From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com] 
Sent: Thursday, August 29, 2019 3:27 PM
To: Steve Zabele
Cc: Network Development; shum@canndrew.org; vladimir116@gmail.com; saifi.khan@datasynergy.org; saifi.khan@strikr.in; Daniel Borkmann; on2k16nm@gmail.com; Stephen Hemminger
Subject: Re: Is bug 200755 in anyone's queue??

On Fri, Aug 23, 2019 at 3:11 PM Steve Zabele <zabele@comcast.net> wrote:
>
> Hi folks,
>
> Is there a way to find out where the SO_REUSEPORT bug reported a year ago in
> August (and apparently has been a bug with kernels later than 4.4) is being
> addressed?
>
> The bug characteristics, simple standalone test code demonstrating the bug,
> and an assessment of the likely location/cause of the bug within the kernel
> are all described here
>
> https://bugzilla.kernel.org/show_bug.cgi?id=200755
>
> I'm really hoping this gets fixed so we can move forward on updating our
> kernels/Ubuntu release from our aging 4.4/16.04 release
>
> Thanks!
>
> Steve
>
>
>
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, July 16, 2019 10:03 AM
> To: Steve Zabele
> Cc: shum@canndrew.org; vladimir116@gmail.com; saifi.khan@DataSynergy.org;
> saifi.khan@strikr.in; daniel@iogearbox.net; on2k16nm@gmail.com
> Subject: Re: Is bug 200755 in anyone's queue??
>
> On Tue, 16 Jul 2019 09:43:24 -0400
> "Steve Zabele" <zabele@comcast.net> wrote:
>
>
> > I came across bug report 200755 trying to figure out why some code I had
> > provided to customers a while ago no longer works with the current Linux
> > kernel. See
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=200755
> >
> > I've verified that, as reported, 'connect' no longer works for UDP.
> > Moreover, it appears it has been broken since the 4.5 kernel has been
> > released.
> >
> >
> >
> > It does also appear that the intended new feature of doing round robin
> > assignments to different UDP sockets opened with SO_REUSEPORT also does
> > not
> > work as described.
> >
> >
> >
> > Since the original bug report was made nearly a year ago for the 4.14
> > kernel
> > (and the bug is also still present in the 4.15 kernel) I'm curious if
> > anyone
> > is on the hook to get this fixed any time soon.
> >
> >
> >
> > I'd rather not have to do my own demultiplexing using a single socket in
> > user space to work around what is clearly a (maybe not so recently
> > introduced) kernel bug if at all possible. My code had worked just fine on
> > 3.X kernels, and appears to work okay up through 4.4.
> >
>
> Kernel developers do not use bugzilla, I forward bug reports
> to netdev@vger.kernel.org (after filtering).

SO_REUSEPORT was not intended to be used in this way. Opening
multiple connected sockets with the same local port.

But since the interface allowed connect after joining a group, and
that is being used, I guess that point is moot. Still, I'm a bit
surprised that it ever worked as described.

Also note that the default distribution algorithm is not round robin
assignment, but hash based. So multiple consecutive datagrams arriving
at the same socket is not unexpected.

I suspect that this quick hack might "work". It seemed to on the
supplied .c file:

                  score = compute_score(sk, net, saddr, sport,
                                        daddr, hnum, dif, sdif);
                  if (score > badness) {
  -                       if (sk->sk_reuseport) {
  +                       if (sk->sk_reuseport && !sk->sk_state !=
TCP_ESTABLISHED) {

But a more robust approach, that also works on existing kernels, is to
swap the default distribution algorithm with a custom BPF based one (
SO_ATTACH_REUSEPORT_EBPF).



* RE: Is bug 200755 in anyone's queue??
  2019-08-29 19:26     ` Willem de Bruijn
  2019-08-30  8:48       ` Steve Zabele
@ 2019-08-30  8:53       ` Steve Zabele
  2019-08-30  8:54       ` Eric Dumazet
  2 siblings, 0 replies; 18+ messages in thread
From: Steve Zabele @ 2019-08-30  8:53 UTC (permalink / raw)
  To: 'Steve Zabele', 'Willem de Bruijn'
  Cc: 'Network Development',
	shum, vladimir116, saifi.khan, 'Daniel Borkmann',
	on2k16nm, 'Stephen Hemminger',
	mark.keaton

Resending since the last send bounced with this error:

The following recipient(s) cannot be reached:

      'saifi.khan@datasynergy.org' on 8/30/2019 4:49 AM
            550 5.1.1 <saifi.khan@datasynergy.org> recipient invalid domain

Sorry for the spam.

Steve


-----Original Message-----
From: Steve Zabele [mailto:zabele@comcast.net] 
Sent: Friday, August 30, 2019 4:49 AM
To: 'Willem de Bruijn'
Cc: 'Network Development'; 'shum@canndrew.org'; 'vladimir116@gmail.com'; 'saifi.khan@datasynergy.org'; 'saifi.khan@strikr.in'; 'Daniel Borkmann'; 'on2k16nm@gmail.com'; 'Stephen Hemminger'; 'mark.keaton@raytheon.com'
Subject: RE: Is bug 200755 in anyone's queue??

Hi Willem!

**Thank you** for the reply and the code segment, very much appreciated.

Can we expect that this will make its way into a near-term official release of the kernel? Our customers are really not up to patching and rebuilding kernels, and a local patch "taints" the kernel from a security perspective. Whenever there is a new kernel release (you come in one morning and your kernel has been magically upgraded because you forgot to disable auto updates), you need to rebuild and hope that the previous patch is still good for the new code, etc.

Getting this onto the main branch as part of the official release cycle will be greatly appreciated!

Note that using an ebpf approach can't solve this problem (we know because we tried for quite a while to make it work, no luck). The key issue is that at the point when the ebpf filter gets the packet buffer reference, it is pointing to the start of the UDP portion of the packet, and hence is not able to access the IP source address, which is earlier in the buffer. Plus every time a new socket is opened or closed, a new ebpf has to be created and inserted -- and there is really no good way to figure out which index is (now) associated with which file descriptor.

So thank you and the group for your attention to this.

With respect to your comment:

> SO_REUSEPORT was not intended to be used in this way. Opening
> multiple connected sockets with the same local port.

I'd like to offer that there are a number of reliable transport protocols (alternatives to TCP) that use UDP. NORM (IETF RFC 5740) and Google's new QUIC protocol (https://www.ietf.org/blog/whats-happening-quic) are good examples.

Now consider that users of these protocols will want to create servers using these protocols -- a webserver is a good example. In fact Google has one running on QUIC, and many Chrome users don't even know they are using QUIC when they access Google webservers.

With a client-server model, clients contact the server at a well-known server address and port. Upon first contact from a new client, the server opens another socket with the same local address and port and "connects" to the client's address and ephemeral port so that only traffic for the given five-tuple arrives on the new file descriptor -- this allows the server application to keep concurrent sessions with different clients cleanly separated, even though all sessions use the same local server port. In fact, reusing the same port for different sessions is really important from a firewalling perspective.

This is pretty much what our application does, i.e., it uses different sockets/file descriptors to keep sessions straight.

And if it's worth anything, we have been using this mechanism with UDP for a *very* long time; the change in behavior appears to have happened with the 4.5 kernel.

So **thank you**!!

Steve

-----Original Message-----
From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com] 
Sent: Thursday, August 29, 2019 3:27 PM
To: Steve Zabele
Cc: Network Development; shum@canndrew.org; vladimir116@gmail.com; saifi.khan@datasynergy.org; saifi.khan@strikr.in; Daniel Borkmann; on2k16nm@gmail.com; Stephen Hemminger
Subject: Re: Is bug 200755 in anyone's queue??

On Fri, Aug 23, 2019 at 3:11 PM Steve Zabele <zabele@comcast.net> wrote:
>
> Hi folks,
>
> Is there a way to find out where the SO_REUSEPORT bug reported a year ago in
> August (and apparently has been a bug with kernels later than 4.4) is being
> addressed?
>
> The bug characteristics, simple standalone test code demonstrating the bug,
> and an assessment of the likely location/cause of the bug within the kernel
> are all described here
>
> https://bugzilla.kernel.org/show_bug.cgi?id=200755
>
> I'm really hoping this gets fixed so we can move forward on updating our
> kernels/Ubuntu release from our aging 4.4/16.04 release
>
> Thanks!
>
> Steve
>
>
>
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, July 16, 2019 10:03 AM
> To: Steve Zabele
> Cc: shum@canndrew.org; vladimir116@gmail.com; saifi.khan@DataSynergy.org;
> saifi.khan@strikr.in; daniel@iogearbox.net; on2k16nm@gmail.com
> Subject: Re: Is bug 200755 in anyone's queue??
>
> On Tue, 16 Jul 2019 09:43:24 -0400
> "Steve Zabele" <zabele@comcast.net> wrote:
>
>
> > I came across bug report 200755 trying to figure out why some code I had
> > provided to customers a while ago no longer works with the current Linux
> > kernel. See
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=200755
> >
> > I've verified that, as reported, 'connect' no longer works for UDP.
> > Moreover, it appears it has been broken since the 4.5 kernel has been
> > released.
> >
> >
> >
> > It does also appear that the intended new feature of doing round robin
> > assignments to different UDP sockets opened with SO_REUSEPORT also does
> > not
> > work as described.
> >
> >
> >
> > Since the original bug report was made nearly a year ago for the 4.14
> > kernel
> > (and the bug is also still present in the 4.15 kernel) I'm curious if
> > anyone
> > is on the hook to get this fixed any time soon.
> >
> >
> >
> > I'd rather not have to do my own demultiplexing using a single socket in
> > user space to work around what is clearly a (maybe not so recently
> > introduced) kernel bug if at all possible. My code had worked just fine on
> > 3.X kernels, and appears to work okay up through 4.4.
> >
>
> Kernel developers do not use bugzilla, I forward bug reports
> to netdev@vger.kernel.org (after filtering).

SO_REUSEPORT was not intended to be used in this way. Opening
multiple connected sockets with the same local port.

But since the interface allowed connect after joining a group, and
that is being used, I guess that point is moot. Still, I'm a bit
surprised that it ever worked as described.

Also note that the default distribution algorithm is not round robin
assignment, but hash based. So multiple consecutive datagrams arriving
at the same socket is not unexpected.

I suspect that this quick hack might "work". It seemed to on the
supplied .c file:

                  score = compute_score(sk, net, saddr, sport,
                                        daddr, hnum, dif, sdif);
                  if (score > badness) {
  -                       if (sk->sk_reuseport) {
  +                       if (sk->sk_reuseport && !sk->sk_state !=
TCP_ESTABLISHED) {

But a more robust approach, that also works on existing kernels, is to
swap the default distribution algorithm with a custom BPF based one (
SO_ATTACH_REUSEPORT_EBPF).



* Re: Is bug 200755 in anyone's queue??
  2019-08-29 19:26     ` Willem de Bruijn
  2019-08-30  8:48       ` Steve Zabele
  2019-08-30  8:53       ` Steve Zabele
@ 2019-08-30  8:54       ` Eric Dumazet
  2019-08-30 20:30         ` Willem de Bruijn
  2 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2019-08-30  8:54 UTC (permalink / raw)
  To: Willem de Bruijn, Steve Zabele
  Cc: Network Development, shum, vladimir116, saifi.khan, saifi.khan,
	Daniel Borkmann, on2k16nm, Stephen Hemminger



On 8/29/19 9:26 PM, Willem de Bruijn wrote:

> SO_REUSEPORT was not intended to be used in this way. Opening
> multiple connected sockets with the same local port.
> 
> But since the interface allowed connect after joining a group, and
> that is being used, I guess that point is moot. Still, I'm a bit
> surprised that it ever worked as described.
> 
> Also note that the default distribution algorithm is not round robin
> assignment, but hash based. So multiple consecutive datagrams arriving
> at the same socket is not unexpected.
> 
> I suspect that this quick hack might "work". It seemed to on the
> supplied .c file:
> 
>                   score = compute_score(sk, net, saddr, sport,
>                                         daddr, hnum, dif, sdif);
>                   if (score > badness) {
>   -                       if (sk->sk_reuseport) {
>   +                       if (sk->sk_reuseport && !sk->sk_state !=
> TCP_ESTABLISHED) {
> 
> But a more robust approach, that also works on existing kernels, is to
> swap the default distribution algorithm with a custom BPF based one (
> SO_ATTACH_REUSEPORT_EBPF).
> 

Yes, I suspect that reuseport could still be used to load-balance incoming packets
targeting the same 4-tuple.

So all sockets would have the same score, and we would select the first socket in
the list (if not applying reuseport hashing).



* Re: Is bug 200755 in anyone's queue??
  2019-08-30  8:54       ` Eric Dumazet
@ 2019-08-30 20:30         ` Willem de Bruijn
  2019-09-03 17:55           ` Willem de Bruijn
  0 siblings, 1 reply; 18+ messages in thread
From: Willem de Bruijn @ 2019-08-30 20:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Steve Zabele, Network Development, shum, vladimir116, saifi.khan,
	Daniel Borkmann, on2k16nm, Stephen Hemminger

On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 8/29/19 9:26 PM, Willem de Bruijn wrote:
>
> > SO_REUSEPORT was not intended to be used in this way. Opening
> > multiple connected sockets with the same local port.
> >
> > But since the interface allowed connect after joining a group, and
> > that is being used, I guess that point is moot. Still, I'm a bit
> > surprised that it ever worked as described.
> >
> > Also note that the default distribution algorithm is not round robin
> > assignment, but hash based. So multiple consecutive datagrams arriving
> > at the same socket is not unexpected.
> >
> > I suspect that this quick hack might "work". It seemed to on the
> > supplied .c file:
> >
> >                   score = compute_score(sk, net, saddr, sport,
> >                                         daddr, hnum, dif, sdif);
> >                   if (score > badness) {
> >   -                       if (sk->sk_reuseport) {
> >   +                       if (sk->sk_reuseport && !sk->sk_state !=
> > TCP_ESTABLISHED) {

This won't work for a mix of connected and connectionless sockets, of
course (even ignoring the typo), as it only skips reuseport on the
connected sockets.

> >
> > But a more robust approach, that also works on existing kernels, is to
> > swap the default distribution algorithm with a custom BPF based one (
> > SO_ATTACH_REUSEPORT_EBPF).
> >
>
> Yes, I suspect that reuseport could still be used to load-balance incoming packets
> targeting the same 4-tuple.
>
> So all sockets would have the same score, and we would select the first socket in
> the list (if not applying reuseport hashing)

Can you elaborate a bit?

One option I see is to record in struct sock_reuseport if any port in
the group is connected and, if so, don't return immediately on the
first reuseport_select_sock hit, but continue the search for a higher
scoring connected socket.

Or do return immediately, but do this refined search in
reuseport_select_sock itself, as it has a reference to all sockets in the
group in sock_reuseport->socks[]. Instead of the straightforward hash.

Steve, Re: your point on a scalable QUIC server. That is an
interesting case certainly. Opening a connected socket per flow adds
both memory and port table pressure. I once looked into an SO_TXONLY
udp socket option that does not hash connected sockets into the port
table. In effect receiving on a small set of listening sockets (e.g.,
one per cpu) and sending over separate tx-only sockets. That still
introduces unnecessary memory allocation. OTOH it amortizes some
operations, such as route lookup.

Anyway, that does not fix the immediate issue you reported when using
SO_REUSEPORT as described.


* Re: Is bug 200755 in anyone's queue??
  2019-08-30 20:30         ` Willem de Bruijn
@ 2019-09-03 17:55           ` Willem de Bruijn
  2019-09-04 10:28             ` Steve Zabele
  0 siblings, 1 reply; 18+ messages in thread
From: Willem de Bruijn @ 2019-09-03 17:55 UTC (permalink / raw)
  Cc: Eric Dumazet, Steve Zabele, Network Development, shum,
	vladimir116, saifi.khan, Daniel Borkmann, on2k16nm,
	Stephen Hemminger

On Fri, Aug 30, 2019 at 4:30 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> >
> > On 8/29/19 9:26 PM, Willem de Bruijn wrote:
> >
> > > SO_REUSEPORT was not intended to be used in this way. Opening
> > > multiple connected sockets with the same local port.
> > >
> > > But since the interface allowed connect after joining a group, and
> > > that is being used, I guess that point is moot. Still, I'm a bit
> > > surprised that it ever worked as described.
> > >
> > > Also note that the default distribution algorithm is not round robin
> > > assignment, but hash based. So multiple consecutive datagrams arriving
> > > at the same socket is not unexpected.
> > >
> > > I suspect that this quick hack might "work". It seemed to on the
> > > supplied .c file:
> > >
> > >                   score = compute_score(sk, net, saddr, sport,
> > >                                         daddr, hnum, dif, sdif);
> > >                   if (score > badness) {
> > >   -                       if (sk->sk_reuseport) {
> > >   +                       if (sk->sk_reuseport && !sk->sk_state !=
> > > TCP_ESTABLISHED) {
>
> This won't work for a mix of connected and connectionless sockets, of
> course (even ignoring the typo), as it only skips reuseport on the
> connected sockets.
>
> > >
> > > But a more robust approach, that also works on existing kernels, is to
> > > swap the default distribution algorithm with a custom BPF based one (
> > > SO_ATTACH_REUSEPORT_EBPF).
> > >
> >
> > Yes, I suspect that reuseport could still be used to load-balance incoming packets
> > targeting the same 4-tuple.
> >
> > So all sockets would have the same score, and we would select the first socket in
> > the list (if not applying reuseport hashing)
>
> Can you elaborate a bit?
>
> One option I see is to record in struct sock_reuseport if any port in
> the group is connected and, if so, don't return immediately on the
> first reuseport_select_sock hit, but continue the search for a higher
> scoring connected socket.
>
> Or do return immediately, but do this refined search in
> reuseport_select_sock itself, as it has a reference to all sockets in the
> group in sock_reuseport->socks[]. Instead of the straightforward hash.

That won't work, as reuseport_select_sock does not have access to
protocol specific data, notably inet_dport.

Unfortunately, what I've come up with so far is not concise and slows
down existing reuseport lookup in a busy port table slot. Note that it
is needed for both ipv4 and ipv6.

Do not break out of the port table slot early, but continue to search
for a higher scored match even after matching a reuseport:

"
   @@ -413,28 +413,39 @@ static struct sock *udp4_lib_lookup2(struct net *net,
                                     struct udp_hslot *hslot2,
                                     struct sk_buff *skb)
 {
+       struct sock *reuseport_result = NULL;
        struct sock *sk, *result;
+       int reuseport_score = 0;
        int score, badness;
        u32 hash = 0;

        result = NULL;
        badness = 0;
        udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
                score = compute_score(sk, net, saddr, sport,
                                      daddr, hnum, dif, sdif);
                if (score > badness) {
-                       if (sk->sk_reuseport) {
+                       if (sk->sk_reuseport &&
+                           sk->sk_state != TCP_ESTABLISHED &&
+                           !reuseport_result) {
                                hash = udp_ehashfn(net, daddr, hnum,
                                                   saddr, sport);
-                               result = reuseport_select_sock(sk, hash, skb,
+                               reuseport_result = reuseport_select_sock(sk, hash, skb,
                                                        sizeof(struct udphdr));
-                               if (result)
-                                       return result;
+                               if (reuseport_result)
+                                       reuseport_score = score;
+                               continue;
                        }
                        badness = score;
                        result = sk;
                }
        }
+
+       if (badness < reuseport_score)
+               result = reuseport_result;
+
        return result;
"

To break out after the first reuseport hit when it is safe, i.e., when
the group holds no connected sockets, requires adding this state to struct
sock_reuseport at __ip4_datagram_connect, and modifying
reuseport_select_sock to read it. At least, I have not found a more
elegant solution.
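
A rough sketch of that idea (illustrative only; the field name has_conns
is invented here):

    /* At connect time (__ip4_datagram_connect()), mark the group: */
    struct sock_reuseport *reuse;

    rcu_read_lock();
    reuse = rcu_dereference(sk->sk_reuseport_cb);
    if (reuse)
            WRITE_ONCE(reuse->has_conns, 1);
    rcu_read_unlock();

    /* In the lookup path, with reuse obtained the same way: only take
     * the early-return shortcut when no group member is connected. */
    if (!READ_ONCE(reuse->has_conns)) {
            result = reuseport_select_sock(sk, hash, skb,
                                           sizeof(struct udphdr));
            if (result)
                    return result;
    }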

> Steve, Re: your point on a scalable QUIC server. That is an
> interesting case certainly. Opening a connected socket per flow adds
> both memory and port table pressure. I once looked into an SO_TXONLY
> udp socket option that does not hash connected sockets into the port
> table. In effect receiving on a small set of listening sockets (e.g.,
> one per cpu) and sending over separate tx-only sockets. That still
> introduces unnecessary memory allocation. OTOH it amortizes some
> operations, such as route lookup.
>
> Anyway, that does not fix the immediate issue you reported when using
> SO_REUSEPORT as described.

As for the BPF program: good point on accessing the udp port when
skb->data is already beyond the header.

Programs of type sk_filter can use bpf_skb_load_bytes(_relative),
which I think will work, but have not tested.
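
A sketch of what that could look like (untested, as noted; assumes IPv4
and a made-up group size of 4):

    #include <stddef.h>
    #include <linux/bpf.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>

    SEC("socket")
    int select_by_saddr(struct __sk_buff *skb)
    {
            __u32 saddr = 0;

            /* The offset is relative to the network header, so the IP
             * source address is reachable even though skb->data already
             * points at the UDP header. */
            if (bpf_skb_load_bytes_relative(skb, offsetof(struct iphdr, saddr),
                                            &saddr, sizeof(saddr),
                                            BPF_HDR_START_NET))
                    return 0;

            /* Attached via SO_ATTACH_REUSEPORT_EBPF, the return value is
             * the index of the group member to deliver to. */
            return saddr % 4;
    }

    char _license[] SEC("license") = "GPL";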

As of kernel 4.19 programs of type BPF_PROG_TYPE_SK_REUSEPORT can be
attached (with CAP_SYS_ADMIN). See
tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c for an
example that parses udp headers with bpf_skb_load_bytes.
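
In that style, a minimal SK_REUSEPORT sketch (the map definition follows
the referenced selftest; the key computation is stubbed out, and the
application must keep the REUSEPORT_SOCKARRAY map in sync with its
sockets):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct bpf_map_def SEC("maps") reuseport_socks = {
            .type        = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
            .key_size    = sizeof(__u32),
            .value_size  = sizeof(__u64),
            .max_entries = 16,
    };

    SEC("sk_reuseport")
    int select_sock(struct sk_reuseport_md *md)
    {
            /* Derive the key from the headers, e.g. with
             * bpf_skb_load_bytes(), as the selftest does; fixed to 0 here. */
            __u32 key = 0;

            if (bpf_sk_select_reuseport(md, &reuseport_socks, &key, 0))
                    return SK_DROP;
            return SK_PASS;
    }

    char _license[] SEC("license") = "GPL";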


* RE: Is bug 200755 in anyone's queue??
  2019-09-03 17:55           ` Willem de Bruijn
@ 2019-09-04 10:28             ` Steve Zabele
  2019-09-04 12:00               ` Mark KEATON
  0 siblings, 1 reply; 18+ messages in thread
From: Steve Zabele @ 2019-09-04 10:28 UTC (permalink / raw)
  To: 'Willem de Bruijn'
  Cc: 'Eric Dumazet', 'Network Development',
	shum, vladimir116, saifi.khan, 'Daniel Borkmann',
	on2k16nm, 'Stephen Hemminger',
	mark.keaton

Hi Willem,

Thanks for continuing to poke at this, much appreciated!

> As for the BPF program: good point on accessing the udp port when
> skb->data is already beyond the header.

> Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
> Which I think will work, but have not tested.

Please note that the test code was intentionally set up to make testing as simple as possible. Hence the source addresses for the multiple UDP sessions were identical -- but that is not the general case. In the general case a connected and bound socket should be associated with exactly one five-tuple (source and destination addresses, source and destination ports, and protocol).

So a 'connect bpf' would actually need access to the IP addresses as well, not just the ports. To do this, the load bytes call required negative arguments, which failed miserably when we tried it.

In any event, there remains the issue of figuring out which index to return when a match is detected, since the index is not the same as the file descriptor value and in fact can change as file descriptors are added and deleted. If I understand the kernel mechanism correctly, the operation is something like this: when you add the first socket, it's assigned to the first slot; when you add the second, it's assigned to the second slot; when you delete the first one, the second is moved to the first slot. So tracking this requires figuring out the order stored in the socket array within the kernel and updating the bpf whenever something changes. I don't know if it's even possible to query which slot a given socket currently occupies.

So we think handling this with a bpf is really not viable.

One thing worth mentioning is that the connect mechanism here is meant to (at least used to) work the same as connect does with TCP. Bind sets the expected/required local address and port; connect sets the expected/required remote address and port -- so a socket file descriptor becomes associated with exactly one five-tuple. That's how it's worked for several decades anyway.

Thanks again!!!

Steve

-----Original Message-----
From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com] 
Sent: Tuesday, September 03, 2019 1:56 PM
Cc: Eric Dumazet; Steve Zabele; Network Development; shum@canndrew.org; vladimir116@gmail.com; saifi.khan@strikr.in; Daniel Borkmann; on2k16nm@gmail.com; Stephen Hemminger
Subject: Re: Is bug 200755 in anyone's queue??

On Fri, Aug 30, 2019 at 4:30 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> >
> > On 8/29/19 9:26 PM, Willem de Bruijn wrote:
> >
> > > SO_REUSEPORT was not intended to be used in this way. Opening
> > > multiple connected sockets with the same local port.
> > >
> > > But since the interface allowed connect after joining a group, and
> > > that is being used, I guess that point is moot. Still, I'm a bit
> > > surprised that it ever worked as described.
> > >
> > > Also note that the default distribution algorithm is not round robin
> > > assignment, but hash based. So multiple consecutive datagrams arriving
> > > at the same socket is not unexpected.
> > >
> > > I suspect that this quick hack might "work". It seemed to on the
> > > supplied .c file:
> > >
> > >                   score = compute_score(sk, net, saddr, sport,
> > >                                         daddr, hnum, dif, sdif);
> > >                   if (score > badness) {
> > >   -                       if (sk->sk_reuseport) {
> > >   +                       if (sk->sk_reuseport && !sk->sk_state !=
> > > TCP_ESTABLISHED) {
>
> This won't work for a mix of connected and connectionless sockets, of
> course (even ignoring the typo), as it only skips reuseport on the
> connected sockets.
>
> > >
> > > But a more robust approach, that also works on existing kernels, is to
> > > swap the default distribution algorithm with a custom BPF based one (
> > > SO_ATTACH_REUSEPORT_EBPF).
> > >
> >
> > Yes, I suspect that reuseport could still be used to load-balance incoming packets
> > targeting the same 4-tuple.
> >
> > So all sockets would have the same score, and we would select the first socket in
> > the list (if not applying reuseport hashing)
>
> Can you elaborate a bit?
>
> One option I see is to record in struct sock_reuseport if any port in
> the group is connected and, if so, don't return immediately on the
> first reuseport_select_sock hit, but continue the search for a higher
> scoring connected socket.
>
> Or do return immediately, but do this refined search in
> reuseport_select_sock itself, as it has a reference to all sockets in the
> group in sock_reuseport->socks[]. Instead of the straightforward hash.

That won't work, as reuseport_select_sock does not have access to
protocol specific data, notably inet_dport.

Unfortunately, what I've come up with so far is not concise and slows
down existing reuseport lookup in a busy port table slot. Note that it
is needed for both ipv4 and ipv6.

Do not break out of the port table slot early, but continue to search
for a higher scored match even after matching a reuseport:

"
   @@ -413,28 +413,39 @@ static struct sock *udp4_lib_lookup2(struct net *net,
                                     struct udp_hslot *hslot2,
                                     struct sk_buff *skb)
 {
+       struct sock *reuseport_result = NULL;
        struct sock *sk, *result;
+       int reuseport_score = 0;
        int score, badness;
        u32 hash = 0;

        result = NULL;
        badness = 0;
        udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
                score = compute_score(sk, net, saddr, sport,
                                      daddr, hnum, dif, sdif);
                if (score > badness) {
-                       if (sk->sk_reuseport) {
+                       if (sk->sk_reuseport &&
+                           sk->sk_state != TCP_ESTABLISHED &&
+                           !reuseport_result) {
                                hash = udp_ehashfn(net, daddr, hnum,
                                                   saddr, sport);
-                               result = reuseport_select_sock(sk, hash, skb,
+                               reuseport_result =
reuseport_select_sock(sk, hash, skb,
                                                        sizeof(struct udphdr));
-                               if (result)
-                                       return result;
+                               if (reuseport_result)
+                                       reuseport_score = score;
+                               continue;
                        }
                        badness = score;
                        result = sk;
                }
        }
+
+       if (badness < reuseport_score)
+               result = reuseport_result;
+
        return result;
"

To break out after the first reuseport hit when it is safe, i.e., when
it holds no connected sockets, requires adding this state to struct
> sock_reuseport at __ip4_datagram_connect. And modify
reuseport_select_sock to read this. At least, I have not found a more
elegant solution.

> Steve, Re: your point on a scalable QUIC server. That is an
> interesting case certainly. Opening a connected socket per flow adds
> both memory and port table pressure. I once looked into an SO_TXONLY
> udp socket option that does not hash connected sockets into the port
> table. In effect receiving on a small set of listening sockets (e.g.,
> one per cpu) and sending over separate tx-only sockets. That still
> introduces unnecessary memory allocation. OTOH it amortizes some
> operations, such as route lookup.
>
> Anyway, that does not fix the immediate issue you reported when using
> SO_REUSEPORT as described.

As for the BPF program: good point on accessing the udp port when
skb->data is already beyond the header.

Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
Which I think will work, but have not tested.

As of kernel 4.19 programs of type BPF_PROG_TYPE_SK_REUSEPORT can be
attached (with CAP_SYS_ADMIN). See
tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c for an
example that parses udp headers with bpf_skb_load_bytes.



* Re: RE: Is bug 200755 in anyone's queue??
  2019-09-04 10:28             ` Steve Zabele
@ 2019-09-04 12:00               ` Mark KEATON
  2019-09-04 12:23                 ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Mark KEATON @ 2019-09-04 12:00 UTC (permalink / raw)
  To: Steve Zabele, Willem de Bruijn
  Cc: Eric Dumazet, Network Development, shum, vladimir116, saifi.khan,
	Daniel Borkmann, on2k16nm, Stephen Hemminger

Hi Willem,

I am the person who commented on the original bug report in bugzilla.

In communicating with Steve just now about possible solutions that maintain the efficiency that you are after, what would you think of the following: keep two lists of UDP sockets, those connected and those not connected, and always search the connected list first. If the connected list is empty, then the lookup can quickly use the not connected list to find a socket for load balancing. If there are connected sockets, then only those connected sockets are searched first for an exact match.
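
In rough pseudo-code (illustrative only; neither list exists in the kernel
today, and the helper names are made up):

    /* Pass 1: exact five-tuple match among the connected sockets. */
    sk = lookup_connected(hslot, saddr, sport, daddr, hnum);
    if (sk)
            return sk;

    /* Pass 2: fall back to reuseport load balancing across the
     * unconnected sockets. */
    return lookup_unconnected(hslot, hash);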

Another option might be to do it with a single list if the connected sockets are all at the beginning of the list.  This would require the two separate lookups to start at different points in the list.

Thoughts?

Thanks!
Mark


> On Sep 4, 2019, at 6:28 AM, Steve Zabele <zabele@comcast.net> wrote:
> 
> Hi Willem,
> 
> Thanks for continuing to poke at this, much appreciated!
> 
>> As for the BPF program: good point on accessing the udp port when
>> skb->data is already beyond the header.
> 
>> Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
>> Which I think will work, but have not tested.
> 
> Please note that the test code was intentionally set up to make testing as simple as possible. Hence the source addresses for the multiple UDP sessions were identical -- but that is not the general case. In the general case a connected and bound socket should be associated with exactly one five-tuple (source and destination addresses, source and destination ports, and protocol).
> 
> So a 'connect bpf' would actually need access to the IP addresses as well, not just the ports. To do this, the load bytes call required negative arguments, which failed miserably when we tried it.
> 
> In any event, there remains the issue of figuring out which index to return when a match is detected, since the index is not the same as the file descriptor value and in fact can change as file descriptors are added and deleted. If I understand the kernel mechanism correctly, the operation is something like this: when you add the first socket, it's assigned to the first slot; when you add the second, it's assigned to the second slot; when you delete the first one, the second is moved to the first slot. So tracking this requires figuring out the order stored in the socket array within the kernel and updating the bpf whenever something changes. I don't know if it's even possible to query which slot a given socket currently occupies.
> 
> So we think handling this with a bpf is really not viable.
> 
> One thing worth mentioning is that the connect mechanism here is meant to (at least used to) work the same as connect does with TCP. Bind sets the expected/required local address and port; connect sets the expected/required remote address and port -- so a socket file descriptor becomes associated with exactly one five-tuple. That's how it's worked for several decades anyway.
> 
> Thanks again!!!
> 
> Steve
> 
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com] 
> Sent: Tuesday, September 03, 2019 1:56 PM
> Cc: Eric Dumazet; Steve Zabele; Network Development; shum@canndrew.org; vladimir116@gmail.com; saifi.khan@strikr.in; Daniel Borkmann; on2k16nm@gmail.com; Stephen Hemminger
> Subject: Re: Is bug 200755 in anyone's queue??
> 
> On Fri, Aug 30, 2019 at 4:30 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> 
>> On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> 
>>> 
>>> 
>>> On 8/29/19 9:26 PM, Willem de Bruijn wrote:
>>> 
>>>> SO_REUSEPORT was not intended to be used in this way. Opening
>>>> multiple connected sockets with the same local port.
>>>> 
>>>> But since the interface allowed connect after joining a group, and
>>>> that is being used, I guess that point is moot. Still, I'm a bit
>>>> surprised that it ever worked as described.
>>>> 
>>>> Also note that the default distribution algorithm is not round robin
>>>> assignment, but hash based. So multiple consecutive datagrams arriving
>>>> at the same socket is not unexpected.
>>>> 
>>>> I suspect that this quick hack might "work". It seemed to on the
>>>> supplied .c file:
>>>> 
>>>>                  score = compute_score(sk, net, saddr, sport,
>>>>                                        daddr, hnum, dif, sdif);
>>>>                  if (score > badness) {
>>>>  -                       if (sk->sk_reuseport) {
>>>>  +                       if (sk->sk_reuseport && !sk->sk_state !=
>>>> TCP_ESTABLISHED) {
>> 
>> This won't work for a mix of connected and connectionless sockets, of
>> course (even ignoring the typo), as it only skips reuseport on the
>> connected sockets.
>> 
>>>> 
>>>> But a more robust approach, that also works on existing kernels, is to
>>>> swap the default distribution algorithm with a custom BPF based one (
>>>> SO_ATTACH_REUSEPORT_EBPF).
>>>> 
>>> 
>>> Yes, I suspect that reuseport could still be used to load-balance incoming packets
>>> targeting the same 4-tuple.
>>> 
>>> So all sockets would have the same score, and we would select the first socket in
>>> the list (if not applying reuseport hashing)
>> 
>> Can you elaborate a bit?
>> 
>> One option I see is to record in struct sock_reuseport if any port in
>> the group is connected and, if so, don't return immediately on the
>> first reuseport_select_sock hit, but continue the search for a higher
>> scoring connected socket.
>> 
>> Or do return immediately, but do this refined search in
>> reuseport_select_sock itself, as it has a reference to all sockets in the
>> group in sock_reuseport->socks[]. Instead of the straightforward hash.
> 
> That won't work, as reuseport_select_sock does not have access to
> protocol specific data, notably inet_dport.
> 
> Unfortunately, what I've come up with so far is not concise and slows
> down existing reuseport lookup in a busy port table slot. Note that it
> is needed for both ipv4 and ipv6.
> 
> Do not break out of the port table slot early, but continue to search
> for a higher scored match even after matching a reuseport:
> 
> "
>   @@ -413,28 +413,39 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>                                     struct udp_hslot *hslot2,
>                                     struct sk_buff *skb)
> {
> +       struct sock *reuseport_result = NULL;
>        struct sock *sk, *result;
> +       int reuseport_score = 0;
>        int score, badness;
>        u32 hash = 0;
> 
>        result = NULL;
>        badness = 0;
>        udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
>                score = compute_score(sk, net, saddr, sport,
>                                      daddr, hnum, dif, sdif);
>                if (score > badness) {
> -                       if (sk->sk_reuseport) {
> +                       if (sk->sk_reuseport &&
> +                           sk->sk_state != TCP_ESTABLISHED &&
> +                           !reuseport_result) {
>                                hash = udp_ehashfn(net, daddr, hnum,
>                                                   saddr, sport);
> -                               result = reuseport_select_sock(sk, hash, skb,
> +                               reuseport_result = reuseport_select_sock(sk, hash, skb,
>                                                        sizeof(struct udphdr));
> -                               if (result)
> -                                       return result;
> +                               if (reuseport_result)
> +                                       reuseport_score = score;
> +                               continue;
>                        }
>                        badness = score;
>                        result = sk;
>                }
>        }
> +
> +       if (badness < reuseport_score)
> +               result = reuseport_result;
> +
>        return result;
> "
> 
> To break out after the first reuseport hit when it is safe, i.e., when
> it holds no connected sockets, requires adding this state to struct
> sock_reuseport at __ip4_datagram_connect. And modify
> reuseport_select_sock to read this. At least, I have not found a more
> elegant solution.
> 
>> Steve, Re: your point on a scalable QUIC server. That is an
>> interesting case certainly. Opening a connected socket per flow adds
>> both memory and port table pressure. I once looked into an SO_TXONLY
>> udp socket option that does not hash connected sockets into the port
>> table. In effect receiving on a small set of listening sockets (e.g.,
>> one per cpu) and sending over separate tx-only sockets. That still
>> introduces unnecessary memory allocation. OTOH it amortizes some
>> operations, such as route lookup.
>> 
>> Anyway, that does not fix the immediate issue you reported when using
>> SO_REUSEPORT as described.
> 
> As for the BPF program: good point on accessing the udp port when
> skb->data is already beyond the header.
> 
> Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
> Which I think will work, but have not tested.
> 
> As of kernel 4.19 programs of type BPF_PROG_TYPE_SK_REUSEPORT can be
> attached (with CAP_SYS_ADMIN). See
> tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c for an
> example that parses udp headers with bpf_skb_load_bytes.
> 


* Re: Is bug 200755 in anyone's queue??
  2019-09-04 12:00               ` Mark KEATON
@ 2019-09-04 12:23                 ` Eric Dumazet
  2019-09-04 14:23                   ` Willem de Bruijn
  2019-09-04 14:51                   ` Steve Zabele
  0 siblings, 2 replies; 18+ messages in thread
From: Eric Dumazet @ 2019-09-04 12:23 UTC (permalink / raw)
  To: Mark KEATON, Steve Zabele, Willem de Bruijn
  Cc: Network Development, shum, vladimir116, saifi.khan,
	Daniel Borkmann, on2k16nm, Stephen Hemminger



On 9/4/19 2:00 PM, Mark KEATON wrote:
> Hi Willem,
> 
> I am the person who commented on the original bug report in bugzilla.
> 
> In communicating with Steve just now about possible solutions that maintain the efficiency that you are after, what would you think of the following: keep two lists of UDP sockets, those connected and those not connected, and always search the connected list first.

This was my suggestion.

Note that this requires adding yet another hash table, and yet another lookup
(another cache line miss per incoming packet).

This lookup will slow down DNS and QUIC servers, or any application solely using unconnected sockets.


The word 'quick' you use is slightly misleading, since a change like that is a trade-off.
Some applications might become faster, while others become slower.

Another issue is that a connect() can follow a bind(); we would need to rehash sockets
from one table to another. (Or add another set of anchors in UDP sockets, so that sockets can be in all the hash tables.)


> If the connected list is empty, then the lookup can quickly use the not connected list to find a socket for load balancing.  If there are connected sockets, then only those connected sockets are searched first for an exact match.
> 
> Another option might be to do it with a single list if the connected sockets are all at the beginning of the list.  This would require the two separate lookups to start at different points in the list.
> 
> Thoughts?
> 
> Thanks!
> Mark
> 
> 
>> On Sep 4, 2019, at 6:28 AM, Steve Zabele <zabele@comcast.net> wrote:
>>
>> Hi Willem,
>>
>> Thanks for continuing to poke at this, much appreciated!
>>
>>> As for the BPF program: good point on accessing the udp port when
>>> skb->data is already beyond the header.
>>
>>> Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
>>> Which I think will work, but have not tested.
>>
>> Please note that the test code was intentionally set up to make testing as simple as possible. Hence the source addresses for the multiple UDP sessions were identical -- but that is not the general case. In the general case a connected and bound socket should be associated with exactly one five-tuple (source and destination addresses, source and destination ports, and protocol).
>>
>> So a 'connect bpf' would actually need access to the IP addresses as well, not just the ports. To do this, the load bytes call required negative arguments, which failed miserably when we tried it.
>>
>> In any event, there remains the issue of figuring out which index to return when a match is detected, since the index is not the same as the file descriptor value and in fact can change as file descriptors are added and deleted. If I understand the kernel mechanism correctly, the operation is something like this: when you add the first one, it's assigned to the first slot; when you add the second, it's assigned to the second slot; when you delete the first one, the second is moved to the first slot. So tracking this requires figuring out the order stored in the socket array within the kernel, and updating the bpf whenever something changes. I don't know if it's even possible to query which slot a given socket currently occupies.
>>
>> So we think handling this with a bpf is really not viable.
>>
>> One thing worth mentioning is that the connect mechanism here is meant to (at least used to) work the same as connect does with TCP. Bind sets the expected/required local address and port; connect sets the expected/required remote address and port -- so a socket file descriptor becomes associated with exactly one five-tuple. That's how it's worked for several decades anyway.
>>
>> Thanks again!!!
>>
>> Steve
>>
>> -----Original Message-----
>> From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com] 
>> Sent: Tuesday, September 03, 2019 1:56 PM
>> Cc: Eric Dumazet; Steve Zabele; Network Development; shum@canndrew.org; vladimir116@gmail.com; saifi.khan@strikr.in; Daniel Borkmann; on2k16nm@gmail.com; Stephen Hemminger
>> Subject: Re: Is bug 200755 in anyone's queue??
>>
>> On Fri, Aug 30, 2019 at 4:30 PM Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>>
>>> On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/29/19 9:26 PM, Willem de Bruijn wrote:
>>>>
>>>>> SO_REUSEPORT was not intended to be used in this way. Opening
>>>>> multiple connected sockets with the same local port.
>>>>>
>>>>> But since the interface allowed connect after joining a group, and
>>>>> that is being used, I guess that point is moot. Still, I'm a bit
>>>>> surprised that it ever worked as described.
>>>>>
>>>>> Also note that the default distribution algorithm is not round robin
>>>>> assignment, but hash based. So multiple consecutive datagrams arriving
>>>>> at the same socket is not unexpected.
>>>>>
>>>>> I suspect that this quick hack might "work". It seemed to on the
>>>>> supplied .c file:
>>>>>
>>>>>                  score = compute_score(sk, net, saddr, sport,
>>>>>                                        daddr, hnum, dif, sdif);
>>>>>                  if (score > badness) {
>>>>>  -                       if (sk->sk_reuseport) {
>>>>>  +                       if (sk->sk_reuseport && !sk->sk_state !=
>>>>> TCP_ESTABLISHED) {
>>>
>>> This won't work for a mix of connected and connectionless sockets, of
>>> course (even ignoring the typo), as it only skips reuseport on the
>>> connected sockets.
>>>
>>>>>
>>>>> But a more robust approach, that also works on existing kernels, is to
>>>>> swap the default distribution algorithm with a custom BPF based one (
>>>>> SO_ATTACH_REUSEPORT_EBPF).
>>>>>
>>>>
>>>> Yes, I suspect that reuseport could still be used to load-balance incoming packets
>>>> targeting the same 4-tuple.
>>>>
>>>> So all sockets would have the same score, and we would select the first socket in
>>>> the list (if not applying reuseport hashing)
>>>
>>> Can you elaborate a bit?
>>>
>>> One option I see is to record in struct sock_reuseport if any port in
>>> the group is connected and, if so, don't return immediately on the
>>> first reuseport_select_sock hit, but continue the search for a higher
>>> scoring connected socket.
>>>
>>> Or do return immediately, but do this refined search in
>>> reuseport_select_sock itself, as it has a reference to all sockets in the
>>> group in sock_reuseport->socks[]. Instead of the straightforward hash.
>>
>> That won't work, as reuseport_select_sock does not have access to
>> protocol specific data, notably inet_dport.
>>
>> Unfortunately, what I've come up with so far is not concise and slows
>> down existing reuseport lookup in a busy port table slot. Note that it
>> is needed for both ipv4 and ipv6.
>>
>> Do not break out of the port table slot early, but continue to search
>> for a higher scored match even after matching a reuseport:
>>
>> "
>>   @@ -413,28 +413,39 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>>                                     struct udp_hslot *hslot2,
>>                                     struct sk_buff *skb)
>> {
>> +       struct sock *reuseport_result = NULL;
>>        struct sock *sk, *result;
>> +       int reuseport_score = 0;
>>        int score, badness;
>>        u32 hash = 0;
>>
>>        result = NULL;
>>        badness = 0;
>>        udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
>>                score = compute_score(sk, net, saddr, sport,
>>                                      daddr, hnum, dif, sdif);
>>                if (score > badness) {
>> -                       if (sk->sk_reuseport) {
>> +                       if (sk->sk_reuseport &&
>> +                           sk->sk_state != TCP_ESTABLISHED &&
>> +                           !reuseport_result) {
>>                                hash = udp_ehashfn(net, daddr, hnum,
>>                                                   saddr, sport);
>> -                               result = reuseport_select_sock(sk, hash, skb,
>> +                               reuseport_result = reuseport_select_sock(sk, hash, skb,
>>                                                        sizeof(struct udphdr));
>> -                               if (result)
>> -                                       return result;
>> +                               if (reuseport_result)
>> +                                       reuseport_score = score;
>> +                               continue;
>>                        }
>>                        badness = score;
>>                        result = sk;
>>                }
>>        }
>> +
>> +       if (badness < reuseport_score)
>> +               result = reuseport_result;
>> +
>>        return result;
>> "
>>
>> To break out after the first reuseport hit when it is safe, i.e., when
>> the group holds no connected sockets, requires adding this state to struct
>> sock_reuseport at __ip4_datagram_connect, and modifying
>> reuseport_select_sock to read it. At least, I have not found a more
>> elegant solution.
>>
>>> Steve, Re: your point on a scalable QUIC server. That is an
>>> interesting case certainly. Opening a connected socket per flow adds
>>> both memory and port table pressure. I once looked into an SO_TXONLY
>>> udp socket option that does not hash connected sockets into the port
>>> table. In effect receiving on a small set of listening sockets (e.g.,
>>> one per cpu) and sending over separate tx-only sockets. That still
>>> introduces unnecessary memory allocation. OTOH it amortizes some
>>> operations, such as route lookup.
>>>
>>> Anyway, that does not fix the immediate issue you reported when using
>>> SO_REUSEPORT as described.
>>
>> As for the BPF program: good point on accessing the udp port when
>> skb->data is already beyond the header.
>>
>> Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
>> Which I think will work, but have not tested.
>>
>> As of kernel 4.19 programs of type BPF_PROG_TYPE_SK_REUSEPORT can be
>> attached (with CAP_SYS_ADMIN). See
>> tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c for an
>> example that parses udp headers with bpf_skb_load_bytes.
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Is bug 200755 in anyone's queue??
  2019-09-04 12:23                 ` Eric Dumazet
@ 2019-09-04 14:23                   ` Willem de Bruijn
  2019-09-04 14:51                   ` Steve Zabele
  1 sibling, 0 replies; 18+ messages in thread
From: Willem de Bruijn @ 2019-09-04 14:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mark KEATON, Steve Zabele, Willem de Bruijn, Network Development,
	shum, vladimir116, saifi.khan, Daniel Borkmann, on2k16nm,
	Stephen Hemminger

On Wed, Sep 4, 2019 at 8:23 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 9/4/19 2:00 PM, Mark KEATON wrote:
> > Hi Willem,
> >
> > I am the person who commented on the original bug report in bugzilla.
> >
> > In communicating with Steve just now about possible solutions that maintain the efficiency that you are after, what would you think of the following: keep two lists of UDP sockets, those connected and those not connected, and always search the connected list first.
>
> This was my suggestion.
>
> Note that this requires adding yet another hash table, and yet another lookup
> (another cache line miss per incoming packet)
>
> This lookup will slow down DNS and QUIC servers, or any application solely using unconnected sockets.

Exactly.

The only way around it that I see is to keep the single list and
optionally mark a struct sock_reuseport as having no connected
members, in which case the search can break on the first reuseport
match, as it does today.

"
On top of the main patch it requires something like

@@ -22,6 +22,7 @@ struct sock_reuseport {
        /* ID stays the same even after the size of socks[] grows. */
        unsigned int            reuseport_id;
        bool                    bind_inany;
+       unsigned int             connected;
        struct bpf_prog __rcu   *prog;          /* optional BPF sock selector */
        struct sock             *socks[0];      /* array of sock pointers */
 };

@@ -73,6 +74,15 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
        sk_set_txhash(sk);
        inet->inet_id = jiffies;

+       if (rcu_access_pointer(sk->sk_reuseport_cb)) {
+               struct sock_reuseport *reuse;
+
+               rcu_read_lock();
+               reuse = rcu_dereference(sk->sk_reuseport_cb);
+               reuse->connected = 1;
+               rcu_read_unlock();
+       }
+
        sk_dst_set(sk, &rt->dst);
        err = 0;
"

plus a way for reuseport_select_sock to communicate that. Probably a
variant __reuseport_select_sock with an extra argument.
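
Roughly like this (name and signature hypothetical, untested):

"
/* sketch: like reuseport_select_sock, but also report whether the
 * group contains connected sockets, so callers know not to return
 * early on the first match */
static struct sock *__reuseport_select_sock(struct sock *sk, u32 hash,
					    struct sk_buff *skb,
					    int hdr_len, bool *has_conns)
{
	struct sock_reuseport *reuse;
	struct sock *sk2 = NULL;

	rcu_read_lock();
	reuse = rcu_dereference(sk->sk_reuseport_cb);
	if (reuse) {
		if (has_conns)
			*has_conns = reuse->connected;
		/* ... existing bpf/hash selection ... */
	}
	rcu_read_unlock();
	return sk2;
}
"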

As for BPF: the example I pointed out does read ip addresses and uses
a BPF map for socket selection. But as that feature is new with 4.19
it is probably moot for this purpose, as we are targeting a fix that
can be backported to 4.19 stable.
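
For reference, a minimal program of that type looks roughly like this (a sketch modeled on that selftest; map population and loader code elided):

"
#include <linux/bpf.h>
#include "bpf_helpers.h"

/* the control plane stores the group's sockets in this map */
struct bpf_map_def SEC("maps") reuseport_map = {
	.type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u64),
	.max_entries = 16,
};

SEC("sk_reuseport")
int select_sock(struct sk_reuseport_md *reuse_md)
{
	/* a real program would derive the key from headers read via
	 * bpf_skb_load_bytes(reuse_md, ...) */
	__u32 key = 0;

	if (bpf_sk_select_reuseport(reuse_md, &reuseport_map, &key, 0))
		return SK_DROP;
	return SK_PASS;
}
"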

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Is bug 200755 in anyone's queue??
  2019-09-04 12:23                 ` Eric Dumazet
  2019-09-04 14:23                   ` Willem de Bruijn
@ 2019-09-04 14:51                   ` Steve Zabele
  2019-09-04 15:46                     ` Willem de Bruijn
  1 sibling, 1 reply; 18+ messages in thread
From: Steve Zabele @ 2019-09-04 14:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mark KEATON, Willem de Bruijn, Network Development, shum,
	vladimir116, saifi.khan, Daniel Borkmann, on2k16nm,
	Stephen Hemminger

I think a dual table approach makes a lot of sense here, especially if we look at the different use cases. For the DNS server example, almost certainly there will not be any connected sockets using the server port, so a test of whether the connected table is empty (maybe a boolean stored with the unconnected table?) should get to the existing code very quickly and not require accessing the memory holding the connected table. For our use case, the connected sockets persist for long periods (at network timescales at least), so any rehashing should be infrequent and have limited impact on performance overall.
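
(For concreteness, the pattern on our side is essentially the sketch below -- addresses and counts made up, error handling elided.)

"
int one = 1, i;
struct sockaddr_in local = {
	.sin_family = AF_INET,
	.sin_port = htons(5000),		/* shared local port */
	.sin_addr.s_addr = htonl(INADDR_ANY),
};

for (i = 0; i < 4; i++) {
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in peer = {
		.sin_family = AF_INET,
		.sin_port = htons(6000 + i),	/* distinct remote port */
	};

	inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	bind(fd, (struct sockaddr *)&local, sizeof(local));
	/* each fd is now tied to one 5-tuple; on kernels >= 4.5 the
	 * receive-side lookup may still deliver to a different fd in
	 * the group */
	connect(fd, (struct sockaddr *)&peer, sizeof(peer));
}
"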

So does a dual table approach seem workable to other folks that know the internals?

Thanks!

Steve

Sent from my iPhone

> On Sep 4, 2019, at 8:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
>> On 9/4/19 2:00 PM, Mark KEATON wrote:
>> Hi Willem,
>> 
>> I am the person who commented on the original bug report in bugzilla.
>> 
>> In communicating with Steve just now about possible solutions that maintain the efficiency that you are after, what would you think of the following: keep two lists of UDP sockets, those connected and those not connected, and always search the connected list first. 
> 
> This was my suggestion.
> 
> Note that this requires adding yet another hash table, and yet another lookup
> (another cache line miss per incoming packet)
> 
> This lookup will slow down DNS and QUIC servers, or any application solely using unconnected sockets.
> 
> 
> The word 'quick' you use is slightly misleading, since a change like that is a trade off.
> Some applications might become faster, while others become slower.
> 
> Another issue is that a connect() can follow a bind(), so we would need to rehash sockets
> from one table to another. (Or add another set of anchors in UDP sockets, so that sockets can be in all the hash tables.)
> 
> 
>> If the connected list is empty, then the lookup can quickly use the not connected list to find a socket for load balancing.  If there are connected sockets, then only those connected sockets are searched first for an exact match.
>> 
>> Another option might be to do it with a single list if the connected sockets are all at the beginning of the list.  This would require the two separate lookups to start at different points in the list.
>> 
>> Thoughts?
>> 
>> Thanks!
>> Mark
>> 
>> 
>>> On Sep 4, 2019, at 6:28 AM, Steve Zabele <zabele@comcast.net> wrote:
>>> 
>>> Hi Willem,
>>> 
>>> Thanks for continuing to poke at this, much appreciated!
>>> 
>>>> As for the BPF program: good point on accessing the udp port when
>>>> skb->data is already beyond the header.
>>> 
>>>> Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
>>>> Which I think will work, but have not tested.
>>> 
>>> Please note that the test code was intentionally set up to make testing as simple as possible. Hence the source addresses for the multiple UDP sessions were identical -- but that is not the general case. In the general case a connected and bound socket should be associated with exactly one five tuple (source and dest addresses, source and destination ports, and protocol).
>>> 
>>> So a 'connect bpf' would actually need access to the IP addresses as well, not just the ports. To do this, the load bytes call required negative arguments, which failed miserably when we tried it.
>>> 
>>> In any event, there remains the issue of figuring out which index to return when a match is detected, since the index is not the same as the file descriptor value and in fact can change as file descriptors are added and deleted. If I understand the kernel mechanism correctly, the operation is something like this: when you add the first one, it's assigned to the first slot; when you add the second, it's assigned to the second slot; when you delete the first one, the second is moved to the first slot. So tracking this requires figuring out the order stored in the socket array within the kernel, and updating the bpf whenever something changes. I don't know if it's even possible to query which slot a given socket currently occupies.
>>> 
>>> So we think handling this with a bpf is really not viable.
>>> 
>>> One thing worth mentioning is that the connect mechanism here is meant to (at least used to) work the same as connect does with TCP. Bind sets the expected/required local address and port; connect sets the expected/required remote address and port -- so a socket file descriptor becomes associated with exactly one five-tuple. That's how it's worked for several decades anyway.
>>> 
>>> Thanks again!!!
>>> 
>>> Steve
>>> 
>>> -----Original Message-----
>>> From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com] 
>>> Sent: Tuesday, September 03, 2019 1:56 PM
>>> Cc: Eric Dumazet; Steve Zabele; Network Development; shum@canndrew.org; vladimir116@gmail.com; saifi.khan@strikr.in; Daniel Borkmann; on2k16nm@gmail.com; Stephen Hemminger
>>> Subject: Re: Is bug 200755 in anyone's queue??
>>> 
>>> On Fri, Aug 30, 2019 at 4:30 PM Willem de Bruijn
>>> <willemdebruijn.kernel@gmail.com> wrote:
>>>> 
>>>>> On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 8/29/19 9:26 PM, Willem de Bruijn wrote:
>>>>>> 
>>>>>> SO_REUSEPORT was not intended to be used in this way. Opening
>>>>>> multiple connected sockets with the same local port.
>>>>>> 
>>>>>> But since the interface allowed connect after joining a group, and
>>>>>> that is being used, I guess that point is moot. Still, I'm a bit
>>>>>> surprised that it ever worked as described.
>>>>>> 
>>>>>> Also note that the default distribution algorithm is not round robin
>>>>>> assignment, but hash based. So multiple consecutive datagrams arriving
>>>>>> at the same socket is not unexpected.
>>>>>> 
>>>>>> I suspect that this quick hack might "work". It seemed to on the
>>>>>> supplied .c file:
>>>>>> 
>>>>>>                 score = compute_score(sk, net, saddr, sport,
>>>>>>                                       daddr, hnum, dif, sdif);
>>>>>>                 if (score > badness) {
>>>>>> -                       if (sk->sk_reuseport) {
>>>>>> +                       if (sk->sk_reuseport && !sk->sk_state !=
>>>>>> TCP_ESTABLISHED) {
>>>> 
>>>> This won't work for a mix of connected and connectionless sockets, of
>>>> course (even ignoring the typo), as it only skips reuseport on the
>>>> connected sockets.
>>>> 
>>>>>> 
>>>>>> But a more robust approach, that also works on existing kernels, is to
>>>>>> swap the default distribution algorithm with a custom BPF based one (
>>>>>> SO_ATTACH_REUSEPORT_EBPF).
>>>>>> 
>>>>> 
>>>>> Yes, I suspect that reuseport could still be used to load-balance incoming packets
>>>>> targeting the same 4-tuple.
>>>>> 
>>>>> So all sockets would have the same score, and we would select the first socket in
>>>>> the list (if not applying reuseport hashing)
>>>> 
>>>> Can you elaborate a bit?
>>>> 
>>>> One option I see is to record in struct sock_reuseport if any port in
>>>> the group is connected and, if so, don't return immediately on the
>>>> first reuseport_select_sock hit, but continue the search for a higher
>>>> scoring connected socket.
>>>> 
>>>> Or do return immediately, but do this refined search in
>>>> reuseport_select_sock itself, as it has a reference to all sockets in the
>>>> group in sock_reuseport->socks[]. Instead of the straightforward hash.
>>> 
>>> That won't work, as reuseport_select_sock does not have access to
>>> protocol specific data, notably inet_dport.
>>> 
>>> Unfortunately, what I've come up with so far is not concise and slows
>>> down existing reuseport lookup in a busy port table slot. Note that it
>>> is needed for both ipv4 and ipv6.
>>> 
>>> Do not break out of the port table slot early, but continue to search
>>> for a higher scored match even after matching a reuseport:
>>> 
>>> "
>>>  @@ -413,28 +413,39 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>>>                                    struct udp_hslot *hslot2,
>>>                                    struct sk_buff *skb)
>>> {
>>> +       struct sock *reuseport_result = NULL;
>>>       struct sock *sk, *result;
>>> +       int reuseport_score = 0;
>>>       int score, badness;
>>>       u32 hash = 0;
>>> 
>>>       result = NULL;
>>>       badness = 0;
>>>       udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
>>>               score = compute_score(sk, net, saddr, sport,
>>>                                     daddr, hnum, dif, sdif);
>>>               if (score > badness) {
>>> -                       if (sk->sk_reuseport) {
>>> +                       if (sk->sk_reuseport &&
>>> +                           sk->sk_state != TCP_ESTABLISHED &&
>>> +                           !reuseport_result) {
>>>                               hash = udp_ehashfn(net, daddr, hnum,
>>>                                                  saddr, sport);
>>> -                               result = reuseport_select_sock(sk, hash, skb,
>>> +                               reuseport_result = reuseport_select_sock(sk, hash, skb,
>>>                                                       sizeof(struct udphdr));
>>> -                               if (result)
>>> -                                       return result;
>>> +                               if (reuseport_result)
>>> +                                       reuseport_score = score;
>>> +                               continue;
>>>                       }
>>>                       badness = score;
>>>                       result = sk;
>>>               }
>>>       }
>>> +
>>> +       if (badness < reuseport_score)
>>> +               result = reuseport_result;
>>> +
>>>       return result;
>>> "
>>> 
>>> To break out after the first reuseport hit when it is safe, i.e., when
>>> the group holds no connected sockets, requires adding this state to struct
>>> sock_reuseport at __ip4_datagram_connect, and modifying
>>> reuseport_select_sock to read it. At least, I have not found a more
>>> elegant solution.
>>> 
>>>> Steve, Re: your point on a scalable QUIC server. That is an
>>>> interesting case certainly. Opening a connected socket per flow adds
>>>> both memory and port table pressure. I once looked into an SO_TXONLY
>>>> udp socket option that does not hash connected sockets into the port
>>>> table. In effect receiving on a small set of listening sockets (e.g.,
>>>> one per cpu) and sending over separate tx-only sockets. That still
>>>> introduces unnecessary memory allocation. OTOH it amortizes some
>>>> operations, such as route lookup.
>>>> 
>>>> Anyway, that does not fix the immediate issue you reported when using
>>>> SO_REUSEPORT as described.
>>> 
>>> As for the BPF program: good point on accessing the udp port when
>>> skb->data is already beyond the header.
>>> 
>>> Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
>>> Which I think will work, but have not tested.
>>> 
>>> As of kernel 4.19 programs of type BPF_PROG_TYPE_SK_REUSEPORT can be
>>> attached (with CAP_SYS_ADMIN). See
>>> tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c for an
>>> example that parses udp headers with bpf_skb_load_bytes.
>>> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Is bug 200755 in anyone's queue??
  2019-09-04 14:51                   ` Steve Zabele
@ 2019-09-04 15:46                     ` Willem de Bruijn
  2019-09-04 19:26                       ` Eric Dumazet
  2019-09-10 15:52                       ` Willem de Bruijn
  0 siblings, 2 replies; 18+ messages in thread
From: Willem de Bruijn @ 2019-09-04 15:46 UTC (permalink / raw)
  To: Steve Zabele
  Cc: Eric Dumazet, Mark KEATON, Willem de Bruijn, Network Development,
	shum, vladimir116, saifi.khan, Daniel Borkmann, on2k16nm,
	Stephen Hemminger

On Wed, Sep 4, 2019 at 10:51 AM Steve Zabele <zabele@comcast.net> wrote:
>
> I think a dual table approach makes a lot of sense here, especially if we look at the different use cases. For the DNS server example, almost certainly there will not be any connected sockets using the server port, so a test of whether the connected table is empty (maybe a boolean stored with the unconnected table?) should get to the existing code very quickly and not require accessing the memory holding the connected table. For our use case, the connected sockets persist for long periods (at network timescales at least), so any rehashing should be infrequent and have limited impact on performance overall.
>
> So does a dual table approach seem workable to other folks that know the internals?

Let me take a stab and compare. A dual table does bring it more in
line with how the TCP code is structured.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Is bug 200755 in anyone's queue??
  2019-09-04 15:46                     ` Willem de Bruijn
@ 2019-09-04 19:26                       ` Eric Dumazet
  2019-09-10 15:52                       ` Willem de Bruijn
  1 sibling, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2019-09-04 19:26 UTC (permalink / raw)
  To: Willem de Bruijn, Steve Zabele
  Cc: Eric Dumazet, Mark KEATON, Network Development, shum,
	vladimir116, saifi.khan, Daniel Borkmann, on2k16nm,
	Stephen Hemminger



On 9/4/19 5:46 PM, Willem de Bruijn wrote:
> On Wed, Sep 4, 2019 at 10:51 AM Steve Zabele <zabele@comcast.net> wrote:
>>
>> I think a dual table approach makes a lot of sense here, especially if we look at the different use cases. For the DNS server example, almost certainly there will not be any connected sockets using the server port, so a test of whether the connected table is empty (maybe a boolean stored with the unconnected table?)


UDP hash tables are shared among netns, and the hash function depends on a per-netns salt (net_hash_mix()).

(see udp_hashfn() definition)

So a boolean would be polluted on a slot holding both an unconnected socket in netns A and a connected socket in netns B.
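
For reference, the slot index is salted per netns roughly like this:

"
/* include/net/udp.h */
static inline u32 udp_hashfn(const struct net *net, u32 num, u32 mask)
{
	return (num + net_hash_mix(net)) & mask;
}
"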


>> should get to the existing code very quickly and not require accessing the memory holding the connected table. For our use case, the connected sockets persist for long periods (at network timescales at least), so any rehashing should be infrequent and have limited impact on performance overall.
>>
>> So does a dual table approach seem workable to other folks that know the internals?
> 
> Let me take a stab and compare. A dual table does bring it more in
> line with how the TCP code is structured.
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Is bug 200755 in anyone's queue??
  2019-09-04 15:46                     ` Willem de Bruijn
  2019-09-04 19:26                       ` Eric Dumazet
@ 2019-09-10 15:52                       ` Willem de Bruijn
  2019-09-10 16:56                         ` Paolo Abeni
  1 sibling, 1 reply; 18+ messages in thread
From: Willem de Bruijn @ 2019-09-10 15:52 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Steve Zabele, Eric Dumazet, Mark KEATON, Network Development,
	shum, vladimir116, saifi.khan, Daniel Borkmann, on2k16nm,
	Stephen Hemminger, Craig Gallek

On Wed, Sep 4, 2019 at 11:46 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Sep 4, 2019 at 10:51 AM Steve Zabele <zabele@comcast.net> wrote:
> >
> > I think a dual table approach makes a lot of sense here, especially if we look at the different use cases. For the DNS server example, almost certainly there will not be any connected sockets using the server port, so a test of whether the connected table is empty (maybe a boolean stored with the unconnected table?) should get to the existing code very quickly and not require accessing the memory holding the connected table. For our use case, the connected sockets persist for long periods (at network timescales at least), so any rehashing should be infrequent and have limited impact on performance overall.
> >
> > So does a dual table approach seem workable to other folks that know the internals?
>
> Let me take a stab and compare. A dual table does bring it more in
> line with how the TCP code is structured.

On closer look, I think two tables is too much code churn and risk for
a stable fix. It requires lookup changes across ipv4 and ipv6 unicast,
multicast, early demux, and so on. Though I'm happy to be proven wrong, of
course.

One variant that clearly modifies behavior only for reuseport
groups with connections is to mark those groups as such:

"
@@ -21,7 +21,8 @@ struct sock_reuseport {
        unsigned int            synq_overflow_ts;
        /* ID stays the same even after the size of socks[] grows. */
        unsigned int            reuseport_id;
-       bool                    bind_inany;
+       unsigned int            bind_inany:1;
+       unsigned int            has_conns:1;
        struct bpf_prog __rcu   *prog;          /* optional BPF sock selector */
        struct sock             *socks[0];      /* array of sock pointers */
 };
@@ -37,6 +38,23 @@ extern struct sock *reuseport_select_sock(struct sock *sk,
 extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog);
 extern int reuseport_detach_prog(struct sock *sk);

+static inline bool reuseport_has_conns(struct sock *sk, bool set)
+{
+       struct sock_reuseport *reuse;
+       bool ret = false;
+
+       rcu_read_lock();
+       reuse = rcu_dereference(sk->sk_reuseport_cb);
+       if (reuse) {
+               if (set)
+                       reuse->has_conns = 1;
+               ret = reuse->has_conns;
+       }
+       rcu_read_unlock();
+
+       return ret;
+}

@@ -67,6 +68,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
                if (sk->sk_prot->rehash)
                        sk->sk_prot->rehash(sk);
        }
+       reuseport_has_conns(sk, true);
        inet->inet_daddr = fl4->daddr;
        inet->inet_dport = usin->sin_port;
        sk->sk_state = TCP_ESTABLISHED;
"

Then at lookup, treat connected reuseport sockets as regular sockets
and do not return early on a reuseport match if there may be higher
scoring connections:

"
@@ -423,13 +423,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
                score = compute_score(sk, net, saddr, sport,
                                      daddr, hnum, dif, sdif);
                if (score > badness) {
-                       if (sk->sk_reuseport) {
+                       if (sk->sk_reuseport &&
+                           sk->sk_state != TCP_ESTABLISHED) {
                                hash = udp_ehashfn(net, daddr, hnum,
                                                   saddr, sport);
                                result = reuseport_select_sock(sk, hash, skb,
                                                        sizeof(struct udphdr));
-                               if (result)
+                               if (result && !reuseport_has_conns(sk, false))
                                        return result;
+                               sk = result ? : sk;
                        }
                        badness = score;
                        result = sk;
"

and finally for reuseport matches only return unconnected sockets in the group:

"
@@ -295,8 +295,19 @@ struct sock *reuseport_select_sock(struct sock *sk,

 select_by_hash:
                /* no bpf or invalid bpf result: fall back to hash usage */
-               if (!sk2)
-                       sk2 = reuse->socks[reciprocal_scale(hash, socks)];
+               if (!sk2) {
+                       int i, j;
+
+                       i = j = reciprocal_scale(hash, socks);
+                       while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) {
+                               i++;
+                               if (i == reuse->num_socks)
+                                       i = 0;
+                               if (i == j)
+                                       goto out;
+                       }
+                       sk2 = reuse->socks[i];
+               }
        }
"

This is hardly a short patch, but the behavioral change is contained.

I also coded up the alternative that relies on order in the list: entries
are inserted at the head and the list is traversed from the head. To keep
all connections within a group ahead of all the unconnected sockets in
a group, (1) rehash on connect and (2) do a more complex
insert-after-connected-reuseport for new reuseport sockets:

"
@@ -67,12 +68,16 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
                if (sk->sk_prot->rehash)
                        sk->sk_prot->rehash(sk);
        }
+
        inet->inet_daddr = fl4->daddr;
        inet->inet_dport = usin->sin_port;
        sk->sk_state = TCP_ESTABLISHED;
        sk_set_txhash(sk);
        inet->inet_id = jiffies;

+       if (rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_prot->rehash)
+               sk->sk_prot->rehash(sk);
+
        sk_dst_set(sk, &rt->dst);
        err = 0;

@@ -323,7 +323,21 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
                    sk->sk_family == AF_INET6)
                        hlist_add_tail_rcu(&udp_sk(sk)->udp_portaddr_node,
                                           &hslot2->head);
-               else
+               else if (sk->sk_reuseport) {
+                       struct sock *cur, *last_conn = NULL;
+
+                       udp_portaddr_for_each_entry_rcu(cur, &hslot2->head) {
+                               if (cur->sk_state == TCP_ESTABLISHED &&
+                                   rcu_access_pointer(cur->sk_reuseport_cb))
+                                       last_conn = cur;
+                       }
+                       if (last_conn)
+                               hlist_add_behind_rcu(&udp_sk(sk)->udp_portaddr_node,
+                                                    &udp_sk(last_conn)->udp_portaddr_node);
+                       else
+                               hlist_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
+                                                  &hslot2->head);
+               } else
                        hlist_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
                                           &hslot2->head);

@@ -423,7 +437,8 @@ static struct sock *udp4_lib_lookup2(struct net *net,
                score = compute_score(sk, net, saddr, sport,
                                      daddr, hnum, dif, sdif);
                if (score > badness) {
-                       if (sk->sk_reuseport) {
+                       if (sk->sk_reuseport &&
+                           sk->sk_state != TCP_ESTABLISHED) {
                                hash = udp_ehashfn(net, daddr, hnum,
                                                   saddr, sport);
                                result = reuseport_select_sock(sk, hash, skb,
@@ -1891,10 +1906,12 @@ void udp_lib_rehash(struct sock *sk, u16 newhash)
                                             udp_sk(sk)->udp_port_hash);
                        /* we must lock primary chain too */
                        spin_lock_bh(&hslot->lock);
+#if 0
                        /* TODO: differentiate inet_rcv_saddr change from regular connect */
                        if (rcu_access_pointer(sk->sk_reuseport_cb))
                                reuseport_detach_sock(sk);
+#endif

-                       if (hslot2 != nhslot2) {
+                       if (1) {
                                spin_lock(&hslot2->lock);

hlist_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
                                hslot2->count--;
"

This clearly has some loose ends and is no shorter or simpler. So
unless anyone has comments or a different solution, I'll finish
up the first variant.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Is bug 200755 in anyone's queue??
  2019-09-10 15:52                       ` Willem de Bruijn
@ 2019-09-10 16:56                         ` Paolo Abeni
  2019-09-10 17:10                           ` Willem de Bruijn
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2019-09-10 16:56 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Steve Zabele, Eric Dumazet, Mark KEATON, Network Development,
	shum, vladimir116, saifi.khan, Daniel Borkmann, on2k16nm,
	Stephen Hemminger, Craig Gallek

Hi all,

On Tue, 2019-09-10 at 11:52 -0400, Willem de Bruijn wrote:
> This clearly has some loose ends and is no shorter or simpler. So
> unless anyone has comments or a different solution, I'll finish
> up the first variant.

I'm sorry for the late feedback.

I was wondering if we could use a new UDP-specific setsockopt to remove
the connected socket from the reuseport group at connect() time?

That would not change behavior for existing applications
leveraging the current reuseport implementation, and possibly allows a
simpler implementation, but it would need application changes for UDP
servers doing reuseport + connect().
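
From the application side that might look something like the sketch below (the option name is invented, just for illustration):

"
int one = 1;

/* hypothetical option: drop this socket from its reuseport group
 * once it is connected */
setsockopt(fd, IPPROTO_UDP, UDP_REUSEPORT_DETACH_ON_CONNECT,
	   &one, sizeof(one));
connect(fd, (struct sockaddr *)&peer, sizeof(peer));
"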

WDYT?

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Is bug 200755 in anyone's queue??
  2019-09-10 16:56                         ` Paolo Abeni
@ 2019-09-10 17:10                           ` Willem de Bruijn
  0 siblings, 0 replies; 18+ messages in thread
From: Willem de Bruijn @ 2019-09-10 17:10 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Willem de Bruijn, Steve Zabele, Eric Dumazet, Mark KEATON,
	Network Development, shum, vladimir116, saifi.khan,
	Daniel Borkmann, Stephen Hemminger, Craig Gallek

On Tue, Sep 10, 2019 at 12:56 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi all,
>
> On Tue, 2019-09-10 at 11:52 -0400, Willem de Bruijn wrote:
> > This clearly has some loose ends and is no shorter or simpler. So
> > unless anyone has comments or a different solution, I'll finish
> > up the first variant.
>
> I'm sorry for the late feedback.
>
> I was wondering if we could use a new UDP-specific setsockopt to remove
> the connected socket from the reuseport group at connect() time?
>
> That would not have any behavioral change for existing application
> leveraging the current reuseport implementation and requires possibly a
> simpler implementation, but would need application changes for UDP
> servers doing reuse/connect().
>
> WDYT?

Thanks for taking a look, too, Paolo.

I looked into detaching the sockets from the group at connect time. It
could be done without setsockopt, even. Unfortunately, it brings other
problems.

The reuseport group is still there, so it may still match sockets
before the connection. If the connected socket no longer has
sk_reuseport set, binding new sockets will fail on conflict. And
there are a few more issues like that.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-09-10 17:10 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <010601d53bdc$79c86dc0$6d594940$@net>
     [not found] ` <20190716070246.0745ee6f@hermes.lan>
2019-08-23 19:02   ` Is bug 200755 in anyone's queue?? Steve Zabele
2019-08-29 19:26     ` Willem de Bruijn
2019-08-30  8:48       ` Steve Zabele
2019-08-30  8:53       ` Steve Zabele
2019-08-30  8:54       ` Eric Dumazet
2019-08-30 20:30         ` Willem de Bruijn
2019-09-03 17:55           ` Willem de Bruijn
2019-09-04 10:28             ` Steve Zabele
2019-09-04 12:00               ` Mark KEATON
2019-09-04 12:23                 ` Eric Dumazet
2019-09-04 14:23                   ` Willem de Bruijn
2019-09-04 14:51                   ` Steve Zabele
2019-09-04 15:46                     ` Willem de Bruijn
2019-09-04 19:26                       ` Eric Dumazet
2019-09-10 15:52                       ` Willem de Bruijn
2019-09-10 16:56                         ` Paolo Abeni
2019-09-10 17:10                           ` Willem de Bruijn
2019-08-23 19:04   ` Steve Zabele
