Netdev Archive on lore.kernel.org
 help / color / Atom feed
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Willy Tarreau <w@1wt.eu>
Cc: Tolga Ceylan <tolga.ceylan@gmail.com>,
	Tom Herbert <tom@herbertland.com>,
	cgallek@google.com, Josh Snyder <josh@code406.com>,
	Aaron Conole <aconole@bytheb.org>,
	"David S. Miller" <davem@davemloft.net>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>
Subject: Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
Date: Thu, 24 Mar 2016 09:33:11 -0700
Message-ID: <1458837191.12033.4.camel@edumazet-glaptop3.roam.corp.google.com> (raw)
In-Reply-To: <20160324153053.GA7569@1wt.eu>


On Thu, 2016-03-24 at 16:30 +0100, Willy Tarreau wrote:
> Hi Eric,
> 
> (just lost my e-mail, trying not to forget some points)
> 
> On Thu, Mar 24, 2016 at 07:45:44AM -0700, Eric Dumazet wrote:
> > On Thu, 2016-03-24 at 15:22 +0100, Willy Tarreau wrote:
> > > Hi Eric,
> > 
> > > But that means that any software making use of SO_REUSEPORT needs to
> > > also implement BPF on Linux to achieve the same as what it does on
> > > other OSes ? Also I found a case where a dying process would still
> > > cause trouble in the accept queue, maybe it's not redistributed, I
> > > don't remember, all I remember is that my traffic stopped after a
> > > segfault of only one of them :-/ I'll have to dig a bit regarding
> > > this.
> > 
> > Hi Willy
> > 
> > Problem is : If we add a SO_REUSEPORT_LISTEN_OFF, this wont work with
> > BPF. 
> 
> I wasn't for adding SO_REUSEPORT_LISTEN_OFF either. Instead the idea was
> just to modify the score in compute_score() so that a socket which disables
> SO_REUSEPORT scores less than one which still has it. The application
> wishing to terminate just has to clear the SO_REUSEPORT flag and wait for
> accept() reporting EAGAIN. The patch simply looked like this (copy-pasted
> hence space-mangled) :
> 
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
>                                 return -1;
>                         score += 4;
>                 }
> +               if (sk->sk_reuseport)
> +                       score++;

This wont work with BPF

>                 if (sk->sk_incoming_cpu == raw_smp_processor_id())
>                         score++;

This one does not work either with BPF

>         }
> 
> > BPF makes a decision without knowing individual listeners states.
> 
> But is the decision taken without considering compute_score() ? The point
> really was to be the least possibly intrusive and quite logical for the
> application : "disable SO_REUSEPORT when you don't want to participate to
> incoming load balancing anymore".

Whole point of BPF was to avoid iterate through all sockets [1],
and let user space use whatever selection logic it needs.

[1] This was okay with up to 16 sockets. But with 128 it does not scale.

If you really look at how BPF works, implementing another 'per listener' flag
would break the BPF selection.

You can certainly implement the SO_REUSEPORT_LISTEN_OFF by loading an
updated BPF, why should we add another way in the kernel to do the same,
in a way that would not work in some cases ?

  reply index

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-27  0:30 Tolga Ceylan
2015-09-27  1:04 ` Eric Dumazet
2015-09-27  1:37   ` Tolga Ceylan
2015-09-27  1:44 ` Aaron Conole
2015-09-27  2:02   ` Tolga Ceylan
2015-09-27  2:24     ` Eric Dumazet
2015-11-11  5:41       ` Tom Herbert
2015-11-11  6:19         ` Eric Dumazet
2015-11-11 17:05           ` Tom Herbert
2015-11-11 17:23             ` Eric Dumazet
2015-11-11 18:23               ` Tom Herbert
2015-11-11 18:43                 ` Eric Dumazet
2015-11-12  1:09                   ` Eric Dumazet
2015-12-15 16:14                     ` Willy Tarreau
2015-12-15 17:10                       ` Eric Dumazet
2015-12-15 17:43                         ` Willy Tarreau
2015-12-15 18:21                           ` Eric Dumazet
2015-12-15 19:44                             ` Willy Tarreau
2015-12-15 21:21                               ` Eric Dumazet
2015-12-16  7:38                                 ` Willy Tarreau
2015-12-16 16:15                                   ` Willy Tarreau
2015-12-18 16:33                                     ` Josh Snyder
2015-12-18 18:58                                       ` Willy Tarreau
2015-12-19  2:38                                         ` Eric Dumazet
2015-12-19  7:00                                           ` Willy Tarreau
2015-12-21 20:38                                             ` Tom Herbert
2015-12-21 20:41                                               ` Willy Tarreau
2016-03-24  5:10                                                 ` Tolga Ceylan
2016-03-24  6:12                                                   ` Willy Tarreau
2016-03-24 14:13                                                     ` Eric Dumazet
2016-03-24 14:22                                                       ` Willy Tarreau
2016-03-24 14:45                                                         ` Eric Dumazet
2016-03-24 15:30                                                           ` Willy Tarreau
2016-03-24 16:33                                                             ` Eric Dumazet [this message]
2016-03-24 16:50                                                               ` Willy Tarreau
2016-03-24 17:01                                                                 ` Eric Dumazet
2016-03-24 17:26                                                                   ` Tom Herbert
2016-03-24 17:55                                                                     ` Daniel Borkmann
2016-03-24 18:20                                                                       ` Tolga Ceylan
2016-03-24 18:24                                                                         ` Willy Tarreau
2016-03-24 18:37                                                                         ` Eric Dumazet
2016-03-24 22:40                                                                       ` Yann Ylavic
2016-03-24 22:49                                                                         ` Eric Dumazet
2016-03-24 23:40                                                                           ` Yann Ylavic
2016-03-24 23:54                                                                             ` Tom Herbert
2016-03-25  0:01                                                                               ` Yann Ylavic
2016-03-25  5:28                                                                               ` Willy Tarreau
2016-03-25  6:49                                                                                 ` Eric Dumazet
2016-03-25  8:53                                                                                   ` Willy Tarreau
2016-03-25 11:21                                                                                     ` Yann Ylavic
2016-03-25 13:17                                                                                       ` Eric Dumazet
2016-03-25  0:25                                                                           ` David Miller
2016-03-25  0:24                                                                         ` David Miller
2016-03-24 18:00                                                                   ` Willy Tarreau
2016-03-24 18:21                                                                     ` Willy Tarreau
2016-03-24 18:32                                                                     ` Eric Dumazet
2016-03-25 15:29 Craig Gallek
2016-03-25 16:21 ` Alexei Starovoitov
2016-03-25 16:31   ` Craig Gallek
2016-03-25 17:00     ` Eric Dumazet
2016-03-25 18:31       ` Willem de Bruijn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1458837191.12033.4.camel@edumazet-glaptop3.roam.corp.google.com \
    --to=eric.dumazet@gmail.com \
    --cc=aconole@bytheb.org \
    --cc=cgallek@google.com \
    --cc=davem@davemloft.net \
    --cc=josh@code406.com \
    --cc=netdev@vger.kernel.org \
    --cc=tolga.ceylan@gmail.com \
    --cc=tom@herbertland.com \
    --cc=w@1wt.eu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org
	public-inbox-index netdev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git