* Resurrecting EPOLLROUNDROBIN
@ 2019-03-25 11:38 Marek Majkowski
  2019-03-25 16:54 ` Jason Baron
  2019-03-26  0:23 ` Andy Lutomirski
  0 siblings, 2 replies; 5+ messages in thread
From: Marek Majkowski @ 2019-03-25 11:38 UTC (permalink / raw)
  To: linux-fsdevel, linux-api, jbaron

Hi,

Recently we noticed epoll is not helpful for load balancing when
called on a listen TCP socket. I described this in a blog post:

https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/

The short explanation: new connections going to a listen socket are
not evenly distributed across processes that wait on the EPOLLIN. In
practice the last process doing epoll_wait() will get the new
connection. See the trivial program to reproduce:

https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py

   $ ./epoll-and-accept.py &
   $ for i in `seq 6`; do echo | nc localhost 1024; done
   worker 0
   worker 0
   worker 0
   worker 0
   worker 0
   worker 0

Worker #0 did all the accept() calls. This is because the listen
socket wait queue is a LIFO (not FIFO!). With current behaviour, the
process calling epoll_wait() most recently will be woken up first.
This usually is the busiest process. This leads to uneven load
distribution across worker processes.
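
The linked reproducer is Python; the same pattern in C looks roughly
like this (a hypothetical sketch, not a literal translation of the
linked script, error handling omitted): each forked worker owns its own
epoll instance, registers the shared listen fd, and accepts whatever
epoll_wait() reports:

    #include <arpa/inet.h>
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int lsock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(1024) };

        bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
        listen(lsock, 16);
        /* non-blocking, so a worker that loses the race just loops */
        fcntl(lsock, F_SETFL, O_NONBLOCK);

        for (int worker = 0; worker < 4; worker++) {
            if (fork() == 0) {
                int epfd = epoll_create1(0);
                struct epoll_event ev = { .events = EPOLLIN };

                epoll_ctl(epfd, EPOLL_CTL_ADD, lsock, &ev);
                for (;;) {
                    epoll_wait(epfd, &ev, 1, -1);
                    int csock = accept(lsock, NULL, NULL);
                    if (csock >= 0) {
                        printf("worker %d\n", worker);
                        close(csock);
                    }
                }
            }
        }
        pause();   /* parent just keeps the children alive */
        return 0;
    }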

Notice that the described problem is different from what
EPOLLEXCLUSIVE tries to solve. The exclusive flag is about waking up
exactly one process, as opposed to the default behaviour of waking up
all the subscribers (the thundering herd problem). Without
EPOLLEXCLUSIVE the described load-balancing problem is less prominent,
since there is an inherent race when all the woken-up processes fight
for the new connection. In such a case the other workers have some
chance of getting the new connection. The core problem is still there:
accept() calls are not well balanced across waiting processes.

On a loaded server avoiding EPOLLEXCLUSIVE is wasteful. With a high
number of new connections and dozens of worker processes, waking up
everybody on every new connection is suboptimal.
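
For reference, opting in is just a flag at registration time (fragment;
epfd and lsock as in the C sketch above; EPOLLEXCLUSIVE exists since
Linux 4.5 and may only be set with EPOLL_CTL_ADD):

    struct epoll_event ev = {
        .events  = EPOLLIN | EPOLLEXCLUSIVE,  /* don't wake every waiter */
        .data.fd = lsock,
    };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, lsock, &ev) == -1)
        perror("epoll_ctl");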

Notice that multiple threads doing blocking accept() get proper FIFO
behaviour. In other words: you can achieve round-robin load balancing
by having multiple workers hang in accept(), while you can't get that
behaviour when waiting in epoll_wait().

We are using EPOLLEXCLUSIVE, and as a solution to the load-balancing
problem we backported the EPOLLROUNDROBIN patch submitted by Jason
Baron in 2015. We have been running this patch for the last 6 months,
and it has helped us flatten the load across workers (and reduce tail
latency).

https://lists.openwall.net/linux-kernel/2015/02/17/723

(PS. Generally speaking, EPOLLROUNDROBIN makes no sense in conjunction
with SO_REUSEPORT sockets.)

Jason, would you mind resubmitting it?

Cheers,
    Marek


* Re: Resurrecting EPOLLROUNDROBIN
  2019-03-25 11:38 Resurrecting EPOLLROUNDROBIN Marek Majkowski
@ 2019-03-25 16:54 ` Jason Baron
  2019-03-26  0:23 ` Andy Lutomirski
  1 sibling, 0 replies; 5+ messages in thread
From: Jason Baron @ 2019-03-25 16:54 UTC (permalink / raw)
  To: Marek Majkowski, linux-fsdevel, linux-api



On 3/25/19 7:38 AM, Marek Majkowski wrote:
> Hi,
> 
> Recently we noticed epoll is not helpful for load balancing when
> called on a listen TCP socket. I described this in a blog post:
> 
> https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
> 
> The short explanation: new connections going to a listen socket are
> not evenly distributed across processes that wait on the EPOLLIN. In
> practice the last process doing epoll_wait() will get the new
> connection. See the trivial program to reproduce:
> 
> https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py
> 
>    $ ./epoll-and-accept.py &
>    $ for i in `seq 6`; do echo | nc localhost 1024; done
>    worker 0
>    worker 0
>    worker 0
>    worker 0
>    worker 0
>    worker 0
> 
> Worker #0 did all the accept() calls. This is because the listen
> socket wait queue is a LIFO (not FIFO!). With current behaviour, the
> process calling epoll_wait() most recently will be woken up first.
> This usually is the busiest process. This leads to uneven load
> distribution across worker processes.
> 
> Notice that the described problem is different from what
> EPOLLEXCLUSIVE tries to solve. The exclusive flag is about waking up
> exactly one process, as opposed to the default behaviour of waking up
> all the subscribers (the thundering herd problem). Without
> EPOLLEXCLUSIVE the described load-balancing problem is less prominent,
> since there is an inherent race when all the woken-up processes fight
> for the new connection. In such a case the other workers have some
> chance of getting the new connection. The core problem is still there:
> accept() calls are not well balanced across waiting processes.
> 
> On a loaded server avoiding EPOLLEXCLUSIVE is wasteful. With a high
> number of new connections and dozens of worker processes, waking up
> everybody on every new connection is suboptimal.
> 
> Notice that multiple threads doing blocking accept() get proper FIFO
> behaviour. In other words: you can achieve round-robin load balancing
> by having multiple workers hang in accept(), while you can't get that
> behaviour when waiting in epoll_wait().
> 
> We are using EPOLLEXCLUSIVE, and as a solution to the load-balancing
> problem we backported the EPOLLROUNDROBIN patch submitted by Jason
> Baron in 2015. We have been running this patch for the last 6 months,
> and it has helped us flatten the load across workers (and reduce tail
> latency).
> 
> https://lists.openwall.net/linux-kernel/2015/02/17/723
> 
> (PS. Generally speaking, EPOLLROUNDROBIN makes no sense in conjunction
> with SO_REUSEPORT sockets.)
> 
> Jason, would you mind resubmitting it?
> 
> Cheers,
>     Marek
> 


Hi Marek,

So I think there may have been a couple of issues last time. First, I
wasn't convinced that anybody actually wanted this; it sounds like
there is interest now. Second, it touched some of the core wakeup bits,
and although I don't think any of the scheduler maintainers objected, I
didn't want to change the core wakeup code for this epoll-only feature.
I think I can probably register a generic wakeup callback with the core
code and have only the epoll code be aware of the round-robin behavior.
So I will cook up a patch like that and re-post.

Thanks,

-Jason



* Re: Resurrecting EPOLLROUNDROBIN
  2019-03-25 11:38 Resurrecting EPOLLROUNDROBIN Marek Majkowski
  2019-03-25 16:54 ` Jason Baron
@ 2019-03-26  0:23 ` Andy Lutomirski
  2019-03-26 15:00   ` Jason Baron
  1 sibling, 1 reply; 5+ messages in thread
From: Andy Lutomirski @ 2019-03-26  0:23 UTC (permalink / raw)
  To: Marek Majkowski; +Cc: Linux FS Devel, Linux API, Jason Baron

On Mon, Mar 25, 2019 at 4:38 AM Marek Majkowski <marek@cloudflare.com> wrote:
>
> Hi,
>
> Recently we noticed epoll is not helpful for load balancing when
> called on a listen TCP socket. I described this in a blog post:
>
> https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
>
> The short explanation: new connections going to a listen socket are
> not evenly distributed across processes that wait on the EPOLLIN. In
> practice the last process doing epoll_wait() will get the new
> connection. See the trivial program to reproduce:
>
> https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py
>
>    $ ./epoll-and-accept.py &
>    $ for i in `seq 6`; do echo | nc localhost 1024; done
>    worker 0
>    worker 0
>    worker 0
>    worker 0
>    worker 0
>    worker 0
>
> Worker #0 did all the accept() calls. This is because the listen
> socket wait queue is a LIFO (not FIFO!). With current behaviour, the
> process calling epoll_wait() most recently will be woken up first.
> This usually is the busiest process. This leads to uneven load
> distribution across worker processes.

I recall a discussion of this at a conference several years ago, but
the details have faded.  Anyway:

I read the blog post, and I looked at your example, and the kernel
behavior actually seems quite sane to me.  From the kernel's
perspective, if you're calling accept in a loop in a bunch of threads
(mediated by epoll or otherwise), and one of those threads is able to
call accept() fast enough, then that thread *should* get all the
sockets.  It's cache hot, and bouncing around is expensive.

Now obviously the overall behavior here is suboptimal, but that's
arguably because the user process is being silly, not because the
kernel is doing it wrong.  Shouldn't the user process take the newly
accepted socket and hand it off to an appropriate thread for
servicing?  If I were doing this, I'd get a freshly accepted socket
and either forward it to a thread (or process) that is appropriately
lightly loaded or, even better, that is pinned to the CPU that RFS has
assigned to the flow assuming that that thread isn't overloaded.  If
the program is using threads, then this doesn't need to involve the
kernel at all and, if it's using processes, then SCM_RIGHTS would do
the trick.  But asking the kernel to arbitrarily and awkwardly
round-robin the sockets and then keeping the flows on the threads that
get picked means that, at best, each thread gets an arbitrary
selection of flows and the balancing isn't particularly good.
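
For completeness, the SCM_RIGHTS side of that handoff is just a
sendmsg() with a control message.  A minimal sketch (the helper name
and the 'chan' socketpair end are illustrative; error handling
omitted):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Pass the freshly accepted fd 'csock' to another worker over the
     * Unix domain socket 'chan', e.g. one end of
     * socketpair(AF_UNIX, SOCK_STREAM, 0, sv). */
    static ssize_t send_fd(int chan, int csock)
    {
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &csock, sizeof(int));

        return sendmsg(chan, &msg, 0);
    }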

Now, if someone were to actually try doing this in userspace and it
was too slow, I could see adding some kernel mechanisms to accelerate
the process.  Perhaps a mechanism to ask to accept only new
connections that are RFSified to the calling CPU would be useful.  But
this shouldn't be an *epoll* mechanism, since there is no actual
guarantee that the CPU that returns first from epoll_wait() is the
same CPU that calls accept() under load.  (Under load, multiple new
connections could come in and wake multiple CPUs before any of them
manage to call accept().)

So I think that EPOLLROUNDROBIN is not a great solution to the
problem, and I think that the problem isn't obviously a *kernel*
problem in the first place.


* Re: Resurrecting EPOLLROUNDROBIN
  2019-03-26  0:23 ` Andy Lutomirski
@ 2019-03-26 15:00   ` Jason Baron
  2019-03-27 15:57     ` Marek Majkowski
  0 siblings, 1 reply; 5+ messages in thread
From: Jason Baron @ 2019-03-26 15:00 UTC (permalink / raw)
  To: Andy Lutomirski, Marek Majkowski; +Cc: Linux FS Devel, Linux API



On 3/25/19 8:23 PM, Andy Lutomirski wrote:
> On Mon, Mar 25, 2019 at 4:38 AM Marek Majkowski <marek@cloudflare.com> wrote:
>>
>> Hi,
>>
>> Recently we noticed epoll is not helpful for load balancing when
>> called on a listen TCP socket. I described this in a blog post:
>>
>> https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
>>
>> The short explanation: new connections going to a listen socket are
>> not evenly distributed across processes that wait on the EPOLLIN. In
>> practice the last process doing epoll_wait() will get the new
>> connection. See the trivial program to reproduce:
>>
>> https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py
>>
>>    $ ./epoll-and-accept.py &
>>    $ for i in `seq 6`; do echo | nc localhost 1024; done
>>    worker 0
>>    worker 0
>>    worker 0
>>    worker 0
>>    worker 0
>>    worker 0
>>
>> Worker #0 did all the accept() calls. This is because the listen
>> socket wait queue is a LIFO (not FIFO!). With current behaviour, the
>> process calling epoll_wait() most recently will be woken up first.
>> This usually is the busiest process. This leads to uneven load
>> distribution across worker processes.
> 
> I recall a discussion of this at a conference several years ago, but
> the details have faded.  Anyway:
> 
> I read the blog post, and I looked at your example, and the kernel
> behavior actually seems quite sane to me.  From the kernel's
> perspective, if you're calling accept in a loop in a bunch of threads
> (mediated by epoll or otherwise), and one of those threads is able to
> call accept() fast enough, then that thread *should* get all the
> sockets.  It's cache hot, and bouncing around is expensive.

Yes, the EPOLLEXCLUSIVE flag was what we ended up with after the last
set of discussions on this. It's meant as a sane wakeup behavior when
you have one event-source fd that is attached to multiple epoll fds
(epfds). Without the EPOLLEXCLUSIVE flag, you end up with all of the
epfds getting woken up.

> 
> Now obviously the overall behavior here is suboptimal, but that's
> arguably because the user process is being silly, not because the
> kernel is doing it wrong.  Shouldn't the user process take the newly
> accepted socket and hand it off to an appropriate thread for
> servicing?  If I were doing this, I'd get a freshly accepted socket
> and either forward it to a thread (or process) that is appropriately
> lightly loaded or, even better, that is pinned to the CPU that RFS has
> assigned to the flow assuming that that thread isn't overloaded.  If
> the program is using threads, then this doesn't need to involve the
> kernel at all and, if it's using processes, then SCM_RIGHTS would do
> the trick.  But asking the kernel to arbitrarily and awkwardly
> round-robin the sockets and then keeping the flows on the threads that
> get picked means that, at best, each thread gets an arbitrary
> selection of flows and the balancing isn't particularly good.
> 
> Now, if someone were to actually try doing this in userspace and it
> was too slow, I could see adding some kernel mechanisms to accelerate
> the process.  Perhaps a mechanism to ask to accept only new
> connections that are RFSified to the calling CPU would be useful.  But
> this shouldn't be an *epoll* mechanism, since there is no actual
> guarantee that the CPU that returns first from epoll_wait() is the
> same CPU that calls accept() under load.  (Under load, multiple new
> connections could come in and wake multiple CPUs before any of them
> manage to call accept().)
> 
> So I think that EPOLLROUNDROBIN is not a great solution to the
> problem, and I think that the problem isn't obviously a *kernel*
> problem in the first place.
> 

Another point in this direction: yes, you can try to 'balance' things
at accept() time, but over time things can still become very
unbalanced. Long-lived connections, for example, could end up on some
CPUs and not others. So it seems like some sort of periodic load
balancing would be necessary anyway.

That said, I'm not against some basic wakeup distribution strategies
in the kernel, such as round robin, especially if they can be entirely
contained within the epoll layer (which I think we can do). But clearly
we don't want to introduce a new wakeup distribution strategy for every
use case.

Thanks,

-Jason



* Re: Resurrecting EPOLLROUNDROBIN
  2019-03-26 15:00   ` Jason Baron
@ 2019-03-27 15:57     ` Marek Majkowski
  0 siblings, 0 replies; 5+ messages in thread
From: Marek Majkowski @ 2019-03-27 15:57 UTC (permalink / raw)
  To: Jason Baron; +Cc: Andy Lutomirski, Linux FS Devel, Linux API, kernel-team

Thanks for the response.

From a user's point of view, indeed, we want to select wakeup
strategies at the epoll layer:
 - LIFO for connected sockets on which we do read()
 - FIFO for listen sockets

I can't imagine more complex scenarios. I'm only asking for
FIFO/round-robin and feature parity with the blocking syscalls.
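
Concretely, all I would like to be able to write is one extra flag at
registration time.  A hypothetical sketch (EPOLLROUNDROBIN is the flag
from the out-of-tree 2015 patch, not anything in mainline headers, so
this does not build against a stock <sys/epoll.h>):

    /* Listen socket: exclusive wakeups, handed out FIFO/round-robin. */
    struct epoll_event lev = {
        .events  = EPOLLIN | EPOLLEXCLUSIVE | EPOLLROUNDROBIN,
        .data.fd = lsock,
    };
    epoll_ctl(epfd, EPOLL_CTL_ADD, lsock, &lev);

    /* Connected sockets: keep the default wakeup behaviour. */
    struct epoll_event cev = { .events = EPOLLIN, .data.fd = csock };
    epoll_ctl(epfd, EPOLL_CTL_ADD, csock, &cev);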

On Tue, Mar 26, 2019 at 4:00 PM Jason Baron <jbaron@akamai.com> wrote:
> On 3/25/19 8:23 PM, Andy Lutomirski wrote:
> > On Mon, Mar 25, 2019 at 4:38 AM Marek Majkowski <marek@cloudflare.com> wrote:
> >> Recently we noticed epoll is not helpful for load balancing when
> >> called on a listen TCP socket. I described this in a blog post:
> >>
> >> https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
> >>
> >> The short explanation: new connections going to a listen socket are
> >> not evenly distributed across processes that wait on the EPOLLIN. In
> >> practice the last process doing epoll_wait() will get the new
> >> connection. See the trivial program to reproduce:
> >>
> >> https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py
> >>
> >>    $ ./epoll-and-accept.py &
> >>    $ for i in `seq 6`; do echo | nc localhost 1024; done
> >>    worker 0
> >>    worker 0
> >>    worker 0
> >>    worker 0
> >>    worker 0
> >>    worker 0
> >>
> >> Worker #0 did all the accept() calls. This is because the listen
> >> socket wait queue is a LIFO (not FIFO!). With current behaviour, the
> >> process calling epoll_wait() most recently will be woken up first.
> >> This usually is the busiest process. This leads to uneven load
> >> distribution across worker processes.
> >
> > I recall a discussion of this at a conference several years ago, but
> > the details have faded.  Anyway:
> >
> > I read the blog post, and I looked at your example, and the kernel
> > behavior actually seems quite sane to me.  From the kernel's
> > perspective, if you're calling accept in a loop in a bunch of threads
> > (mediated by epoll or otherwise), and one of those threads is able to
> > call accept() fast enough, then that thread *should* get all the
> > sockets.  It's cache hot, and bouncing around is expensive.

Right. Then, by that logic, the blocking accept() behaviour is wrong,
since it does FIFO.

Blocking accept() having different semantics from epoll_wait() +
non-blocking accept() is mind-boggling. I would expect epoll to be able
to emulate at least the same semantics as the blocking syscalls.
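
To illustrate the parity I mean, a hypothetical, untested sketch: a
handful of threads blocked in plain accept() on the same listen socket
already get the spread I'm asking for:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Each thread blocks in accept(); the kernel hands new connections
     * to these sleepers in FIFO order, so the load spreads out. */
    static void *acceptor(void *arg)
    {
        int lsock = *(int *)arg;

        for (;;) {
            int csock = accept(lsock, NULL, NULL);
            if (csock >= 0) {
                printf("thread %lu got a connection\n",
                       (unsigned long)pthread_self());
                close(csock);
            }
        }
        return NULL;
    }

    /* usage, for each worker i:
     *     pthread_create(&tid[i], NULL, acceptor, &lsock);          */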

> Yes, the EPOLLEXCLUSIVE flag was what we ended up with after the last
> set of discussions on this. It's meant as a sane wakeup behavior when
> you have one event-source fd that is attached to multiple epoll fds
> (epfds). Without the EPOLLEXCLUSIVE flag, you end up with all of the
> epfds getting woken up.
>
> >
> > Now obviously the overall behavior here is suboptimal, but that's
> > arguably because the user process is being silly, not because the
> > kernel is doing it wrong.  Shouldn't the user process take the newly
> > accepted socket and hand it off to an appropriate thread for
> > servicing?

People expect reasonable connection load distribution from epoll.
Nginx and many other pieces of code are designed with that in mind.
The LIFO behaviour of epoll is undocumented and unexpected.

We could make the same argument against implementing EPOLLEXCLUSIVE.
For example, nginx and apache successfully work around the lack of that
flag on older kernels with a mutex:

http://nginx.org/en/docs/ngx_core_module.html#accept_mutex
http://httpd.apache.org/docs/2.2/mod/mpm_common.html#acceptmutex

I'm being slightly unfair here: a mutex in userspace serializes the
accept() calls onto only one CPU at a time. But again, we could argue
that users with a large volume of incoming connections should allow
multiple worker threads to hold the listen socket.

A sensible epoll API would help me with basic FIFO/round-robin TCP
connection distribution across worker processes. This is useful for
common things like HTTP servers. It's not hard, it's what users expect,
and it adds feature parity with blocking accept().

> >  If I were doing this, I'd get a freshly accepted socket
> > and either forward it to a thread (or process) that is appropriately
> > lightly loaded or, even better, that is pinned to the CPU that RFS has
> > assigned to the flow assuming that that thread isn't overloaded.  If
> > the program is using threads, then this doesn't need to involve the
> > kernel at all and, if it's using processes, then SCM_RIGHTS would do
> > the trick.  But asking the kernel to arbitrarily and awkwardly
> > round-robin the sockets and then keeping the flows on the threads that
> > get picked means that, at best, each thread gets an arbitrary
> > selection of flows and the balancing isn't particularly good.
> >
> > Now, if someone were to actually try doing this in userspace and it
> > was too slow, I could see adding some kernel mechanisms to accelerate
> > the process.  Perhaps a mechanism to ask to accept only new
> > connections that are RFSified to the calling CPU would be useful.  But
> > this shouldn't be an *epoll* mechanism, since there is no actual
> > guarantee that the CPU that returns first from epoll_wait() is the
> > same CPU that calls accept() under load.  (Under load, multiple new
> > connections could come in and wake multiple CPUs before any of them
> > manage to call accept().)
> >
> > So I think that EPOLLROUNDROBIN is not a great solution to the
> > problem, and I think that the problem isn't obviously a *kernel*
> > problem in the first place.
> >
>
> Another point in this direction: yes, you can try to 'balance' things
> at accept() time, but over time things can still become very
> unbalanced. Long-lived connections, for example, could end up on some
> CPUs and not others. So it seems like some sort of periodic load
> balancing would be necessary anyway.
>
> That said, I'm not against some basic wakeup distribution strategies
> in the kernel, such as round robin, especially if they can be entirely
> contained within the epoll layer (which I think we can do). But clearly
> we don't want to introduce a new wakeup distribution strategy for every
> use case.

