linux-kernel.vger.kernel.org archive mirror
* Re: epoll and multiple processes - eliminate unneeded process wake-ups
@ 2015-11-28 22:54 Madars Vitolins
  2015-11-30 19:45 ` Jason Baron
  0 siblings, 1 reply; 11+ messages in thread
From: Madars Vitolins @ 2015-11-28 22:54 UTC (permalink / raw)
  To: Jason Baron; +Cc: Eric Wong, linux-kernel

Hi Jason,

I recently ran tests with multiple processes and epoll() on POSIX
queues. You were right about "EP_MAX_NESTS": it is not related to how
many processes are woken up when several processes are waiting in
epoll_wait() on one event source.

With epoll, every process is added to the wait queue of every monitored
event source. Thus when a message is sent to some queue (for example),
all processes polling on it are woken up during the kernel's
mq_timedsend() -> __do_notify() -> wake_up(&info->wait_q) processing.

So for a message to be handled by only one of the processes in
epoll_wait(), the entry in the event source's wait queue has to be
added with the exclusive flag set.

I could create a kernel patch adding a new EPOLLEXCL flag, which would
result in the following functionality:

- fs/eventpoll.c
================================================================================
/*
  * This is the callback that is used to add our wait queue to the
  * target file wakeup lists.
  */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
        struct epitem *epi = ep_item_from_epqueue(pt);
        struct eppoll_entry *pwq;

        if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
                init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
                pwq->whead = whead;
                pwq->base = epi;

                if (epi->event.events & EPOLLEXCL) { /* <<<< new functionality here */
                        add_wait_queue_exclusive(whead, &pwq->wait);
                } else {
                        add_wait_queue(whead, &pwq->wait);
                }
                list_add_tail(&pwq->llink, &epi->pwqlist);
                epi->nwait++;
        } else {
                /* We have to signal that an error occurred */
                epi->nwait = -1;
        }
}
================================================================================
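
For reference, a minimal userspace sketch of how a worker process could
use the proposed flag (assumptions: EPOLLEXCL is not in the glibc
headers yet, so it is defined locally with the bit value discussed
below; the queue name and buffer size are only illustrative):

#include <sys/epoll.h>
#include <mqueue.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef EPOLLEXCL
#define EPOLLEXCL (1 << 28)     /* proposed bit, not an official UAPI value */
#endif

int main(void)
{
        mqd_t q = mq_open("/myqueue", O_RDONLY | O_NONBLOCK);
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCL };
        char buf[8192];         /* must be >= the queue's mq_msgsize */

        ev.data.fd = q;
        if (q == (mqd_t)-1 || epfd < 0 ||
            epoll_ctl(epfd, EPOLL_CTL_ADD, q, &ev) < 0) {
                perror("setup");
                exit(1);
        }

        for (;;) {
                struct epoll_event out;

                if (epoll_wait(epfd, &out, 1, -1) <= 0)
                        continue;
                /* With the exclusive flag only one waiter should reach
                 * this point per message; EAGAIN is still handled in
                 * case of a spurious wakeup. */
                if (mq_receive(q, buf, sizeof(buf), NULL) < 0 &&
                    errno != EAGAIN)
                        perror("mq_receive");
        }
}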

After testing with EPOLLEXCL set in my multiprocessing application
framework (now open source: http://www.endurox.org/ :) ), the results
were good: there were no extra wakeups, so processing is more efficient.

Jason, what do you think, would mainline accept such a patch with a new
flag? Or are there any concerns about this? It would also mean that the
new flag has to be added to the GNU C Library
(/usr/include/sys/epoll.h).

Or maybe somebody else who is familiar with the kernel's epoll
functionality can comment on this?

Regarding the flag's bit value, it seems (1<<28) should be used for
EPOLLEXCL, since epoll_event.events is a 32-bit field and the top bit
1<<31 is already used by EPOLLET (in include/uapi/linux/eventpoll.h).

Thanks a lot in advance,
Madars


Jason Baron @ 2015-08-05 15:32 rakstīja:
> On 08/05/2015 07:06 AM, Madars Vitolins wrote:
>> Jason Baron @ 2015-08-04 18:02 rakstīja:
>>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>>> Madars Vitolins <m@silodev.com> wrote:
>>>>> Hi Folks,
>>>>> 
>>>>> I am developing kind of open systems application, which uses
>>>>> multiple processes/executables where each of them monitors some set
>>>>> of resources (in this case POSIX Queues) via epoll interface. For
>>>>> example when 10 processes on same queue are in state of 
>>>>> epoll_wait()
>>>>> and one message arrives, all 10 processes gets woken up and all of
>>>>> them tries to read the message from Q. One succeeds, the others 
>>>>> gets
>>>>> EAGAIN error. The problem is with those others, which generates
>>>>> extra context switches - useless CPU usage. With more processes
>>>>> inefficiency gets higher.
>>>>> 
>>>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable 
>>>>> for
>>>>> multi-threaded application and not for multi-process application.
>>>> 
>>>> Correct.  Most FDs are not shared across processes.
>>>> 
>>>>> Ideal mechanism for this would be:
>>>>> 1. If multiple epoll sets in kernel matches same event and one or
>>>>> more processes are in state of epoll_wait() - then send event only
>>>>> to one waiter.
>>>>> 2. If none of processes are in wait state, then send the event to
>>>>> all epoll sets (as it is currently). Then the first free process
>>>>> will grab the event.
>>>> 
>>>> Jason Baron was working on this (search LKML archives for
>>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>> 
>>>> However, I was unconvinced about modifying epoll.
>>>> 
>>>> Perhaps I may be more easily convinced about your mqueue case than 
>>>> his
>>>> case for listen sockets, though[*]
>>>> 
>>> 
>>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>>> multiple epoll fds (or epoll sets) attached to the same wakeup 
>>> source,
>>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>>> essentially walks the list of waiters, wakes up the first thread
>>> that is actively in epoll_wait(), stops and moves the woken up
>>> epoll set to the end of the list. So it attempts to balance
>>> the wakeups among the epoll sets, I think in the way that you
>>> were describing.
>>> 
>>> Here is the patchset:
>>> 
>>> https://lkml.org/lkml/2015/2/24/667
>>> 
>>> The test program shows how to use the API. Essentially, you
>>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>>> which you then attach to you're shared wakeup source and
>>> then to your epoll sets. Please let me know if its unclear.
>>> 
>>> Thanks,
>>> 
>>> -Jason
>> 
>> In my particular case I need to work with multiple 
>> processes/executables running (not threads) and listening on same 
>> queues (this concept allows to sysadmin easily manage those processes 
>> (start new ones for balancing or stop them with out service 
>> interruption), and if any process dies for some reason (signal, core, 
>> etc..), the whole application does not get killed, but only one 
>> transaction is lost).
>> 
>> Recently I did tests, and found out that kernel's epoll currently 
>> sends notifications to 4 processes (I think it is EP_MAX_NESTS 
>> constant) waiting on same resource (those other 6 from my example will 
>> stay in sleep state). So it is not as bad as I thought before. It 
>> could be nice if EP_MAX_NESTS could be configurable, but I guess 4 is 
>> fine too.
>> 
> 
> hmmm...EP_MAX_NESTS is about the level 'nesting' epoll sets, IE
> if you can do ep1->ep2->ep3->ep4-> <wakeup src fd>. But you
> can't add in 'ep5'. Where the 'epN' above represent epoll file
> descriptors that are attached together via: EPOLL_CTL_ADD.
> 
> The nesting does not affect how wakeups are down. All epoll fds
> that are attached to the even source fd are going to get wakeups.
> 
> 
>> Jason, does your patch work for multi-process application? How hard it 
>> would be to implement this for such scenario?
> 
> I don't think it would be too hard, but it requires:
> 
> 1) adding the patches
> 2) re-compiling, running new kernel
> 3) modifying your app to the new API.
> 
> Thanks,
> 
> -Jason
> 
> 
>> 
>> Madars
>> 
>>> 
>>>> Typical applications have few (probably only one) listen sockets or
>>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>>> blocking syscalls (accept4 or mq_timedreceive).
>>>> 
>>>> Making blocking syscalls allows exclusive wakeups to avoid 
>>>> thundering
>>>> herds.
>>>> 
>>>>> How do you think, would it be real to implement this? How about
>>>>> concurrency?
>>>>> Can you please give me some hints from which points in code to 
>>>>> start
>>>>> to implement these changes?
>>>> 
>>>> For now, I suggest dedicating a thread in each process to do
>>>> mq_timedreceive/mq_receive, assuming you only have a small amount
>>>> of queues in your system.
>>>> 
>>>> 
>>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>>     staying on the same CPU as much as possible.
>>>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>>>     socket server (e.g. http/memcached/DB), a typical new client
>>>>     socket spends a fair amount of time idle.  Thus I don't believe
>>>>     memory locality inside the kernel is much concern when there's
>>>>     thousands of accepted client sockets.
>>>> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-11-28 22:54 epoll and multiple processes - eliminate unneeded process wake-ups Madars Vitolins
@ 2015-11-30 19:45 ` Jason Baron
  2015-11-30 21:28   ` Madars Vitolins
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Baron @ 2015-11-30 19:45 UTC (permalink / raw)
  To: Madars Vitolins; +Cc: Eric Wong, linux-kernel

Hi Madars,

On 11/28/2015 05:54 PM, Madars Vitolins wrote:
> Hi Jason,
> 
> I did recently tests with multiprocessing and epoll() on Posix Queues.
> You were right about "EP_MAX_NESTS", it is not related with how many
> processes are waken up when multiple process epoll_waits are waiting on
> one event source.
> 
> At doing epoll every process is added to wait queue for every monitored
> event source. Thus when message is sent to some queue (for example), all
> processes polling on it are activated during mq_timedsend() ->
> __do_notify () -> wake_up(&info->wait_q) kernel processing.
> 
> So to get one message to be processed only by one process of
> epoll_wait(), it requires that process in event source's  wait queue is
> added with exclusive flag set.
> 
> I could create a kernel patch, by adding new EPOLLEXCL flag which could
> result in following functionality:
> 
> - fs/eventpoll.c
> ================================================================================
> 
> /*
>  * This is the callback that is used to add our wait queue to the
>  * target file wakeup lists.
>  */
> static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t
> *whead,
>                                  poll_table *pt)
> {
>         struct epitem *epi = ep_item_from_epqueue(pt);
>         struct eppoll_entry *pwq;
> 
>         if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache,
> GFP_KERNEL))) {
>                 init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>                 pwq->whead = whead;
>                 pwq->base = epi;
> 
>                 if (epi->event.events & EPOLLEXCL) { <<<< New
> functionality here!!!
>                         add_wait_queue_exclusive(whead, &pwq->wait);
>                 } else {
>                         add_wait_queue(whead, &pwq->wait);
>                 }
>                 list_add_tail(&pwq->llink, &epi->pwqlist);
>                 epi->nwait++;
>         } else {
>                 /* We have to signal that an error occurred */
>                 epi->nwait = -1;
>         }
> }
> ================================================================================
> 
> 
> After doing test with EPOLLEXCL set in my multiprocessing application
> framework (now it is open source: http://www.endurox.org/ :) ), results
> were good, there were no extra wakeups. Thus more efficient processing.
> 

Cool. If you have any performance numbers to share, that would help
support the case.

> Jason, how do you think would mainline accept such patch with new flag?
> Or are there any concerns about this? Also this will mean that new flag
> will be need to add to GNU C Library (/usr/include/sys/epoll.h).
>

This has come up several times, so imo it would be a reasonable
addition - but I'm only speaking for myself.

In terms of implementation, it might make sense to return 0 from
ep_poll_callback() when ep->wq is empty. That way we continue to
search for an active waiter and service wakeups in a more timely
manner if some threads are busy. We probably also don't want to
allow the flag for nested ep descriptors.
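
In code terms that idea is roughly the following (a sketch only, and
essentially what the updated patch later in this thread does):

        /* at the end of ep_poll_callback() */
        int ewake = 0;

        if (waitqueue_active(&ep->wq)) {
                ewake = 1;              /* a thread really is in epoll_wait() */
                wake_up_locked(&ep->wq);
        }
        /* ... */
        if (epi->event.events & EPOLLEXCLUSIVE)
                return ewake;   /* 0 lets __wake_up_common() try the next waiter */
        return 1;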

Thanks,

-Jason

> Or maybe somebody else who is familiar with kernel epoll functionality
> can comment this?
> 
> Regarding the flag's bitmask, seems like (1<<28) needs to be taken for
> EPOLLEXCL as flags type for epoll_event.events is int32 and last bit
> 1<<31 is used by EPOLLET (in include/uapi/linux/eventpoll.h).
> 
> Thanks a lot in advance,
> Madars
> 
> 
> Jason Baron @ 2015-08-05 15:32 rakstīja:
>> On 08/05/2015 07:06 AM, Madars Vitolins wrote:
>>> Jason Baron @ 2015-08-04 18:02 rakstīja:
>>>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>>>> Madars Vitolins <m@silodev.com> wrote:
>>>>>> Hi Folks,
>>>>>>
>>>>>> I am developing kind of open systems application, which uses
>>>>>> multiple processes/executables where each of them monitors some set
>>>>>> of resources (in this case POSIX Queues) via epoll interface. For
>>>>>> example when 10 processes on same queue are in state of epoll_wait()
>>>>>> and one message arrives, all 10 processes gets woken up and all of
>>>>>> them tries to read the message from Q. One succeeds, the others gets
>>>>>> EAGAIN error. The problem is with those others, which generates
>>>>>> extra context switches - useless CPU usage. With more processes
>>>>>> inefficiency gets higher.
>>>>>>
>>>>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
>>>>>> multi-threaded application and not for multi-process application.
>>>>>
>>>>> Correct.  Most FDs are not shared across processes.
>>>>>
>>>>>> Ideal mechanism for this would be:
>>>>>> 1. If multiple epoll sets in kernel matches same event and one or
>>>>>> more processes are in state of epoll_wait() - then send event only
>>>>>> to one waiter.
>>>>>> 2. If none of processes are in wait state, then send the event to
>>>>>> all epoll sets (as it is currently). Then the first free process
>>>>>> will grab the event.
>>>>>
>>>>> Jason Baron was working on this (search LKML archives for
>>>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>>>
>>>>> However, I was unconvinced about modifying epoll.
>>>>>
>>>>> Perhaps I may be more easily convinced about your mqueue case than his
>>>>> case for listen sockets, though[*]
>>>>>
>>>>
>>>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>>>> multiple epoll fds (or epoll sets) attached to the same wakeup source,
>>>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>>>> essentially walks the list of waiters, wakes up the first thread
>>>> that is actively in epoll_wait(), stops and moves the woken up
>>>> epoll set to the end of the list. So it attempts to balance
>>>> the wakeups among the epoll sets, I think in the way that you
>>>> were describing.
>>>>
>>>> Here is the patchset:
>>>>
>>>> https://lkml.org/lkml/2015/2/24/667
>>>>
>>>> The test program shows how to use the API. Essentially, you
>>>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>>>> which you then attach to you're shared wakeup source and
>>>> then to your epoll sets. Please let me know if its unclear.
>>>>
>>>> Thanks,
>>>>
>>>> -Jason
>>>
>>> In my particular case I need to work with multiple
>>> processes/executables running (not threads) and listening on same
>>> queues (this concept allows to sysadmin easily manage those processes
>>> (start new ones for balancing or stop them with out service
>>> interruption), and if any process dies for some reason (signal, core,
>>> etc..), the whole application does not get killed, but only one
>>> transaction is lost).
>>>
>>> Recently I did tests, and found out that kernel's epoll currently
>>> sends notifications to 4 processes (I think it is EP_MAX_NESTS
>>> constant) waiting on same resource (those other 6 from my example
>>> will stay in sleep state). So it is not as bad as I thought before.
>>> It could be nice if EP_MAX_NESTS could be configurable, but I guess 4
>>> is fine too.
>>>
>>
>> hmmm...EP_MAX_NESTS is about the level 'nesting' epoll sets, IE
>> if you can do ep1->ep2->ep3->ep4-> <wakeup src fd>. But you
>> can't add in 'ep5'. Where the 'epN' above represent epoll file
>> descriptors that are attached together via: EPOLL_CTL_ADD.
>>
>> The nesting does not affect how wakeups are down. All epoll fds
>> that are attached to the even source fd are going to get wakeups.
>>
>>
>>> Jason, does your patch work for multi-process application? How hard
>>> it would be to implement this for such scenario?
>>
>> I don't think it would be too hard, but it requires:
>>
>> 1) adding the patches
>> 2) re-compiling, running new kernel
>> 3) modifying your app to the new API.
>>
>> Thanks,
>>
>> -Jason
>>
>>
>>>
>>> Madars
>>>
>>>>
>>>>> Typical applications have few (probably only one) listen sockets or
>>>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>>>> blocking syscalls (accept4 or mq_timedreceive).
>>>>>
>>>>> Making blocking syscalls allows exclusive wakeups to avoid thundering
>>>>> herds.
>>>>>
>>>>>> How do you think, would it be real to implement this? How about
>>>>>> concurrency?
>>>>>> Can you please give me some hints from which points in code to start
>>>>>> to implement these changes?
>>>>>
>>>>> For now, I suggest dedicating a thread in each process to do
>>>>> mq_timedreceive/mq_receive, assuming you only have a small amount
>>>>> of queues in your system.
>>>>>
>>>>>
>>>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>>>     staying on the same CPU as much as possible.
>>>>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>>>>     socket server (e.g. http/memcached/DB), a typical new client
>>>>>     socket spends a fair amount of time idle.  Thus I don't believe
>>>>>     memory locality inside the kernel is much concern when there's
>>>>>     thousands of accepted client sockets.
>>>>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-11-30 19:45 ` Jason Baron
@ 2015-11-30 21:28   ` Madars Vitolins
  2015-12-01 20:11     ` Jason Baron
  0 siblings, 1 reply; 11+ messages in thread
From: Madars Vitolins @ 2015-11-30 21:28 UTC (permalink / raw)
  To: Jason Baron; +Cc: Eric Wong, linux-kernel

Hi Jason,

Today I searched the mail archive and found the patch you posted back
in February; it basically does the same thing (a flag for
add_wait_queue_exclusive() + balancing).

So I plan to run some tests with your patch, with the flag on and off,
and will provide the results. I guess if I bring up 250 or 500
processes (which is realistic for a production environment) waiting on
one queue, there could be a notable difference in performance with
EPOLLEXCLUSIVE set versus not.

During kernel hacking with debug prints, with 10 processes waiting on
one event source, on the original kernel I saw a lot of unneeded
processing inside eventpoll.c: a single event produced 10 calls to
ep_poll_callback() and related work, which resulted in only a few
processes being woken up in user space (the count probably varies
randomly depending on concurrency).


Meanwhile we are not the only ones talking about this patch, see here:
http://stackoverflow.com/questions/33226842/epollexclusive-and-epollroundrobin-flags-in-mainstream-kernel
- others are asking too.

So what is the current situation with your patch, and what is blocking
it from getting into mainline?

Thanks,
Madars



Jason Baron @ 2015-11-30 21:45 rakstīja:
> Hi Madars,
> 
> On 11/28/2015 05:54 PM, Madars Vitolins wrote:
>> Hi Jason,
>> 
>> I did recently tests with multiprocessing and epoll() on Posix Queues.
>> You were right about "EP_MAX_NESTS", it is not related with how many
>> processes are waken up when multiple process epoll_waits are waiting 
>> on
>> one event source.
>> 
>> At doing epoll every process is added to wait queue for every 
>> monitored
>> event source. Thus when message is sent to some queue (for example), 
>> all
>> processes polling on it are activated during mq_timedsend() ->
>> __do_notify () -> wake_up(&info->wait_q) kernel processing.
>> 
>> So to get one message to be processed only by one process of
>> epoll_wait(), it requires that process in event source's  wait queue 
>> is
>> added with exclusive flag set.
>> 
>> I could create a kernel patch, by adding new EPOLLEXCL flag which 
>> could
>> result in following functionality:
>> 
>> - fs/eventpoll.c
>> ================================================================================
>> 
>> /*
>>  * This is the callback that is used to add our wait queue to the
>>  * target file wakeup lists.
>>  */
>> static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t
>> *whead,
>>                                  poll_table *pt)
>> {
>>         struct epitem *epi = ep_item_from_epqueue(pt);
>>         struct eppoll_entry *pwq;
>> 
>>         if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache,
>> GFP_KERNEL))) {
>>                 init_waitqueue_func_entry(&pwq->wait, 
>> ep_poll_callback);
>>                 pwq->whead = whead;
>>                 pwq->base = epi;
>> 
>>                 if (epi->event.events & EPOLLEXCL) { <<<< New
>> functionality here!!!
>>                         add_wait_queue_exclusive(whead, &pwq->wait);
>>                 } else {
>>                         add_wait_queue(whead, &pwq->wait);
>>                 }
>>                 list_add_tail(&pwq->llink, &epi->pwqlist);
>>                 epi->nwait++;
>>         } else {
>>                 /* We have to signal that an error occurred */
>>                 epi->nwait = -1;
>>         }
>> }
>> ================================================================================
>> 
>> 
>> After doing test with EPOLLEXCL set in my multiprocessing application
>> framework (now it is open source: http://www.endurox.org/ :) ), 
>> results
>> were good, there were no extra wakeups. Thus more efficient 
>> processing.
>> 
> 
> Cool. If you have any performance numbers to share that would be more
> supportive.
> 
>> Jason, how do you think would mainline accept such patch with new 
>> flag?
>> Or are there any concerns about this? Also this will mean that new 
>> flag
>> will be need to add to GNU C Library (/usr/include/sys/epoll.h).
>> 
> 
> This has come up several times - so imo I think it would be a 
> reasonable
> addition - but I'm only speaking for myself.
> 
> In terms of implementation it might make sense to return 0 from
> ep_poll_callback() in case ep->wq is empty. That way we continue to
> search for an active waiter. That way we service wakeups in a more
> timely manner if some threads are busy. We probably also don't want to
> allow the flag for nested ep descriptors.
> 
> Thanks,
> 
> -Jason
> 
>> Or maybe somebody else who is familiar with kernel epoll functionality
>> can comment this?
>> 
>> Regarding the flag's bitmask, seems like (1<<28) needs to be taken for
>> EPOLLEXCL as flags type for epoll_event.events is int32 and last bit
>> 1<<31 is used by EPOLLET (in include/uapi/linux/eventpoll.h).
>> 
>> Thanks a lot in advance,
>> Madars
>> 
>> 
>> Jason Baron @ 2015-08-05 15:32 rakstīja:
>>> On 08/05/2015 07:06 AM, Madars Vitolins wrote:
>>>> Jason Baron @ 2015-08-04 18:02 rakstīja:
>>>>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>>>>> Madars Vitolins <m@silodev.com> wrote:
>>>>>>> Hi Folks,
>>>>>>> 
>>>>>>> I am developing kind of open systems application, which uses
>>>>>>> multiple processes/executables where each of them monitors some 
>>>>>>> set
>>>>>>> of resources (in this case POSIX Queues) via epoll interface. For
>>>>>>> example when 10 processes on same queue are in state of 
>>>>>>> epoll_wait()
>>>>>>> and one message arrives, all 10 processes gets woken up and all 
>>>>>>> of
>>>>>>> them tries to read the message from Q. One succeeds, the others 
>>>>>>> gets
>>>>>>> EAGAIN error. The problem is with those others, which generates
>>>>>>> extra context switches - useless CPU usage. With more processes
>>>>>>> inefficiency gets higher.
>>>>>>> 
>>>>>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable 
>>>>>>> for
>>>>>>> multi-threaded application and not for multi-process application.
>>>>>> 
>>>>>> Correct.  Most FDs are not shared across processes.
>>>>>> 
>>>>>>> Ideal mechanism for this would be:
>>>>>>> 1. If multiple epoll sets in kernel matches same event and one or
>>>>>>> more processes are in state of epoll_wait() - then send event 
>>>>>>> only
>>>>>>> to one waiter.
>>>>>>> 2. If none of processes are in wait state, then send the event to
>>>>>>> all epoll sets (as it is currently). Then the first free process
>>>>>>> will grab the event.
>>>>>> 
>>>>>> Jason Baron was working on this (search LKML archives for
>>>>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>>>> 
>>>>>> However, I was unconvinced about modifying epoll.
>>>>>> 
>>>>>> Perhaps I may be more easily convinced about your mqueue case than 
>>>>>> his
>>>>>> case for listen sockets, though[*]
>>>>>> 
>>>>> 
>>>>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>>>>> multiple epoll fds (or epoll sets) attached to the same wakeup 
>>>>> source,
>>>>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>>>>> essentially walks the list of waiters, wakes up the first thread
>>>>> that is actively in epoll_wait(), stops and moves the woken up
>>>>> epoll set to the end of the list. So it attempts to balance
>>>>> the wakeups among the epoll sets, I think in the way that you
>>>>> were describing.
>>>>> 
>>>>> Here is the patchset:
>>>>> 
>>>>> https://lkml.org/lkml/2015/2/24/667
>>>>> 
>>>>> The test program shows how to use the API. Essentially, you
>>>>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>>>>> which you then attach to you're shared wakeup source and
>>>>> then to your epoll sets. Please let me know if its unclear.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> -Jason
>>>> 
>>>> In my particular case I need to work with multiple
>>>> processes/executables running (not threads) and listening on same
>>>> queues (this concept allows to sysadmin easily manage those 
>>>> processes
>>>> (start new ones for balancing or stop them with out service
>>>> interruption), and if any process dies for some reason (signal, 
>>>> core,
>>>> etc..), the whole application does not get killed, but only one
>>>> transaction is lost).
>>>> 
>>>> Recently I did tests, and found out that kernel's epoll currently
>>>> sends notifications to 4 processes (I think it is EP_MAX_NESTS
>>>> constant) waiting on same resource (those other 6 from my example
>>>> will stay in sleep state). So it is not as bad as I thought before.
>>>> It could be nice if EP_MAX_NESTS could be configurable, but I guess 
>>>> 4
>>>> is fine too.
>>>> 
>>> 
>>> hmmm...EP_MAX_NESTS is about the level 'nesting' epoll sets, IE
>>> if you can do ep1->ep2->ep3->ep4-> <wakeup src fd>. But you
>>> can't add in 'ep5'. Where the 'epN' above represent epoll file
>>> descriptors that are attached together via: EPOLL_CTL_ADD.
>>> 
>>> The nesting does not affect how wakeups are down. All epoll fds
>>> that are attached to the even source fd are going to get wakeups.
>>> 
>>> 
>>>> Jason, does your patch work for multi-process application? How hard
>>>> it would be to implement this for such scenario?
>>> 
>>> I don't think it would be too hard, but it requires:
>>> 
>>> 1) adding the patches
>>> 2) re-compiling, running new kernel
>>> 3) modifying your app to the new API.
>>> 
>>> Thanks,
>>> 
>>> -Jason
>>> 
>>> 
>>>> 
>>>> Madars
>>>> 
>>>>> 
>>>>>> Typical applications have few (probably only one) listen sockets 
>>>>>> or
>>>>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>>>>> blocking syscalls (accept4 or mq_timedreceive).
>>>>>> 
>>>>>> Making blocking syscalls allows exclusive wakeups to avoid 
>>>>>> thundering
>>>>>> herds.
>>>>>> 
>>>>>>> How do you think, would it be real to implement this? How about
>>>>>>> concurrency?
>>>>>>> Can you please give me some hints from which points in code to 
>>>>>>> start
>>>>>>> to implement these changes?
>>>>>> 
>>>>>> For now, I suggest dedicating a thread in each process to do
>>>>>> mq_timedreceive/mq_receive, assuming you only have a small amount
>>>>>> of queues in your system.
>>>>>> 
>>>>>> 
>>>>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>>>>     staying on the same CPU as much as possible.
>>>>>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>>>>>     socket server (e.g. http/memcached/DB), a typical new client
>>>>>>     socket spends a fair amount of time idle.  Thus I don't 
>>>>>> believe
>>>>>>     memory locality inside the kernel is much concern when there's
>>>>>>     thousands of accepted client sockets.
>>>>>> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-11-30 21:28   ` Madars Vitolins
@ 2015-12-01 20:11     ` Jason Baron
  2015-12-05 11:47       ` Madars Vitolins
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Baron @ 2015-12-01 20:11 UTC (permalink / raw)
  To: Madars Vitolins; +Cc: Eric Wong, linux-kernel

Hi Madars,

On 11/30/2015 04:28 PM, Madars Vitolins wrote:
> Hi Jason,
> 
> I today did search the mail archive and checked your offered patch did on February, it basically does the some (flag for add_wait_queue_exclusive() + balance).
> 
> So I plan to run off some tests with your patch, flag on/off and will provide results. I guess if I pull up 250 or 500 processes (which could real for production environment) waiting on one Q, then there could be a notable difference in performance with EPOLLEXCLUSIVE set or not.
> 

Sounds good. Below is an updated patch if you want to try it - it only adds the 'EPOLLEXCLUSIVE' flag.


diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e009ca..265fa7b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,7 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -1002,6 +1002,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1067,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
+		ewake = 1;
 		wake_up_locked(&ep->wq);
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1078,6 +1081,9 @@ out_unlock:
 	if (pwake)
 		ep_poll_safewake(&ep->poll_wait);
 
+	if (epi->event.events & EPOLLEXCLUSIVE)
+		return ewake;
+
 	return 1;
 }
 
@@ -1095,7 +1101,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (epi->event.events & EPOLLEXCLUSIVE)
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1861,6 +1870,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
+		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
+		goto error_tgt_fput;
+
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..925bbfb 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,9 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Add exclusively */
+#define EPOLLEXCLUSIVE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
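
Note for testing: with the epoll_ctl() hunk above, the flag can only be
supplied at EPOLL_CTL_ADD time on a non-epoll target fd; EPOLL_CTL_MOD
with the flag, or adding another epoll fd with it, is rejected. A
minimal sketch (epfd and qfd are illustrative):

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1 << 28)        /* matches the uapi hunk above */
#endif

        struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };

        ev.data.fd = qfd;               /* e.g. a POSIX mqueue descriptor */
        /* must be EPOLL_CTL_ADD; the patch rejects EPOLL_CTL_MOD here */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, qfd, &ev) < 0)
                perror("epoll_ctl");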


> During kernel hacking with debug print, with 10 processes waiting on one event source, with original kernel I did see lot un-needed processing inside of eventpoll.c, it got 10x calls to ep_poll_callback() and other stuff for single event, which results with few processes waken up in user space (count probably gets randomly depending on concurrency).
> 
> 
> Meanwhile we are not the only ones who talk about this patch, see here: http://stackoverflow.com/questions/33226842/epollexclusive-and-epollroundrobin-flags-in-mainstream-kernel others are asking too.
> 
> So what is the current situation with your patch, what is the blocking for getting it into mainline?
> 

If we can show some good test results here I will re-submit it.

Thanks,

-Jason


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-12-01 20:11     ` Jason Baron
@ 2015-12-05 11:47       ` Madars Vitolins
  0 siblings, 0 replies; 11+ messages in thread
From: Madars Vitolins @ 2015-12-05 11:47 UTC (permalink / raw)
  To: Jason Baron; +Cc: Eric Wong, linux-kernel

Hi Jason,

I did the testing and wrote a blog article about it:
https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/

In summary:

Test case:
- One multi-threaded binary with 10 threads makes a total of 1'000'000
calls to 250 single-threaded processes doing epoll() on a POSIX queue
- A 'call' basically sends a message to the shared queue (serviced by
those 250 load-balanced processes), which send the reply back to the
client thread's private queue

Tests were done on the following system:
- Host system: Linux Mint Mate 17.2 64bit, kernel: 3.13.0-24-generic
- CPU: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz (two cores)
- RAM: 16 GB
- Virtualization platform: Oracle VirtualBox 4.3.28
- Guest OS: Gentoo Linux 2015.03, kernel 4.3.0-gentoo, 64 bit.
- CPU for guest: Two cores
- RAM for guest: 5GB (no swap usage, free about 4GB)
- Enduro/X version: 2.3.2


Results with the original kernel (no EPOLLEXCLUSIVE):

$ time ./bankcl
...

real 14m20.561s
user 0m21.823s
sys 10m49.821s


Patched kernel with the EPOLLEXCLUSIVE flag in use:

$ time ./bankcl
...

real 0m24.953s
user 0m17.497s
sys 0m4.445s

Thus 14 minutes vs. 25 seconds (860.6 s / 25.0 s, roughly 34.5x): the
EPOLLEXCLUSIVE flag makes the application run about *35 times faster*!

Guys, this is a MUST HAVE patch!

Thanks,
Madars



Jason Baron @ 2015-12-01 22:11 rakstīja:
> Hi Madars,
> 
> On 11/30/2015 04:28 PM, Madars Vitolins wrote:
>> Hi Jason,
>> 
>> I today did search the mail archive and checked your offered patch did 
>> on February, it basically does the some (flag for 
>> add_wait_queue_exclusive() + balance).
>> 
>> So I plan to run off some tests with your patch, flag on/off and will 
>> provide results. I guess if I pull up 250 or 500 processes (which 
>> could real for production environment) waiting on one Q, then there 
>> could be a notable difference in performance with EPOLLEXCLUSIVE set 
>> or not.
>> 
> 
> Sounds good. Below is an updated patch if you want to try it - it only
> adds the 'EPOLLEXCLUSIVE' flag.
> 
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 1e009ca..265fa7b 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -92,7 +92,7 @@
>   */
> 
>  /* Epoll private bits inside the event mask */
> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | 
> EPOLLEXCLUSIVE)
> 
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> @@ -1002,6 +1002,7 @@ static int ep_poll_callback(wait_queue_t *wait,
> unsigned mode, int sync, void *k
>  	unsigned long flags;
>  	struct epitem *epi = ep_item_from_wait(wait);
>  	struct eventpoll *ep = epi->ep;
> +	int ewake = 0;
> 
>  	if ((unsigned long)key & POLLFREE) {
>  		ep_pwq_from_wait(wait)->whead = NULL;
> @@ -1066,8 +1067,10 @@ static int ep_poll_callback(wait_queue_t *wait,
> unsigned mode, int sync, void *k
>  	 * Wake up ( if active ) both the eventpoll wait list and the 
> ->poll()
>  	 * wait list.
>  	 */
> -	if (waitqueue_active(&ep->wq))
> +	if (waitqueue_active(&ep->wq)) {
> +		ewake = 1;
>  		wake_up_locked(&ep->wq);
> +	}
>  	if (waitqueue_active(&ep->poll_wait))
>  		pwake++;
> 
> @@ -1078,6 +1081,9 @@ out_unlock:
>  	if (pwake)
>  		ep_poll_safewake(&ep->poll_wait);
> 
> +	if (epi->event.events & EPOLLEXCLUSIVE)
> +		return ewake;
> +
>  	return 1;
>  }
> 
> @@ -1095,7 +1101,10 @@ static void ep_ptable_queue_proc(struct file
> *file, wait_queue_head_t *whead,
>  		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>  		pwq->whead = whead;
>  		pwq->base = epi;
> -		add_wait_queue(whead, &pwq->wait);
> +		if (epi->event.events & EPOLLEXCLUSIVE)
> +			add_wait_queue_exclusive(whead, &pwq->wait);
> +		else
> +			add_wait_queue(whead, &pwq->wait);
>  		list_add_tail(&pwq->llink, &epi->pwqlist);
>  		epi->nwait++;
>  	} else {
> @@ -1861,6 +1870,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  	if (f.file == tf.file || !is_file_epoll(f.file))
>  		goto error_tgt_fput;
> 
> +	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
> +		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
> +		goto error_tgt_fput;
> +
>  	/*
>  	 * At this point it is safe to assume that the "private_data" 
> contains
>  	 * our own data structure.
> diff --git a/include/uapi/linux/eventpoll.h 
> b/include/uapi/linux/eventpoll.h
> index bc81fb2..925bbfb 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -26,6 +26,9 @@
>  #define EPOLL_CTL_DEL 2
>  #define EPOLL_CTL_MOD 3
> 
> +/* Add exclusively */
> +#define EPOLLEXCLUSIVE (1 << 28)
> +
>  /*
>   * Request the handling of system wakeup events so as to prevent
> system suspends
>   * from happening while those events are being processed.
> 
> 
>> During kernel hacking with debug print, with 10 processes waiting on 
>> one event source, with original kernel I did see lot un-needed 
>> processing inside of eventpoll.c, it got 10x calls to 
>> ep_poll_callback() and other stuff for single event, which results 
>> with few processes waken up in user space (count probably gets 
>> randomly depending on concurrency).
>> 
>> 
>> Meanwhile we are not the only ones who talk about this patch, see 
>> here: 
>> http://stackoverflow.com/questions/33226842/epollexclusive-and-epollroundrobin-flags-in-mainstream-kernel 
>> others are asking too.
>> 
>> So what is the current situation with your patch, what is the blocking 
>> for getting it into mainline?
>> 
> 
> If we can show some good test results here I will re-submit it.
> 
> Thanks,
> 
> -Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-08-05 11:06     ` Madars Vitolins
@ 2015-08-05 13:32       ` Jason Baron
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Baron @ 2015-08-05 13:32 UTC (permalink / raw)
  To: Madars Vitolins; +Cc: Eric Wong, linux-kernel

On 08/05/2015 07:06 AM, Madars Vitolins wrote:
> Jason Baron @ 2015-08-04 18:02 rakstīja:
>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>> Madars Vitolins <m@silodev.com> wrote:
>>>> Hi Folks,
>>>>
>>>> I am developing kind of open systems application, which uses
>>>> multiple processes/executables where each of them monitors some set
>>>> of resources (in this case POSIX Queues) via epoll interface. For
>>>> example when 10 processes on same queue are in state of epoll_wait()
>>>> and one message arrives, all 10 processes gets woken up and all of
>>>> them tries to read the message from Q. One succeeds, the others gets
>>>> EAGAIN error. The problem is with those others, which generates
>>>> extra context switches - useless CPU usage. With more processes
>>>> inefficiency gets higher.
>>>>
>>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
>>>> multi-threaded application and not for multi-process application.
>>>
>>> Correct.  Most FDs are not shared across processes.
>>>
>>>> Ideal mechanism for this would be:
>>>> 1. If multiple epoll sets in kernel matches same event and one or
>>>> more processes are in state of epoll_wait() - then send event only
>>>> to one waiter.
>>>> 2. If none of processes are in wait state, then send the event to
>>>> all epoll sets (as it is currently). Then the first free process
>>>> will grab the event.
>>>
>>> Jason Baron was working on this (search LKML archives for
>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>
>>> However, I was unconvinced about modifying epoll.
>>>
>>> Perhaps I may be more easily convinced about your mqueue case than his
>>> case for listen sockets, though[*]
>>>
>>
>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>> multiple epoll fds (or epoll sets) attached to the same wakeup source,
>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>> essentially walks the list of waiters, wakes up the first thread
>> that is actively in epoll_wait(), stops and moves the woken up
>> epoll set to the end of the list. So it attempts to balance
>> the wakeups among the epoll sets, I think in the way that you
>> were describing.
>>
>> Here is the patchset:
>>
>> https://lkml.org/lkml/2015/2/24/667
>>
>> The test program shows how to use the API. Essentially, you
>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>> which you then attach to you're shared wakeup source and
>> then to your epoll sets. Please let me know if its unclear.
>>
>> Thanks,
>>
>> -Jason
> 
> In my particular case I need to work with multiple processes/executables running (not threads) and listening on same queues (this concept allows to sysadmin easily manage those processes (start new ones for balancing or stop them with out service interruption), and if any process dies for some reason (signal, core, etc..), the whole application does not get killed, but only one transaction is lost).
> 
> Recently I did tests, and found out that kernel's epoll currently sends notifications to 4 processes (I think it is EP_MAX_NESTS constant) waiting on same resource (those other 6 from my example will stay in sleep state). So it is not as bad as I thought before. It could be nice if EP_MAX_NESTS could be configurable, but I guess 4 is fine too.
> 

hmmm...EP_MAX_NESTS is about the level of 'nesting' of epoll sets, i.e.
you can do ep1->ep2->ep3->ep4-> <wakeup src fd>, but you can't add in
'ep5'. The 'epN' above represent epoll file descriptors that are
attached together via EPOLL_CTL_ADD.

The nesting does not affect how wakeups are done. All epoll fds
that are attached to the event source fd are going to get wakeups.
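
A rough illustration of what that nesting means (ep1 monitors ep2,
which monitors ep3, and so on; EP_MAX_NESTS limits the depth of this
chain, not the number of waiters on one fd):

        int ep1 = epoll_create1(0);
        int ep2 = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };

        /* ep1 -> ep2: ep1 now reports events that become ready on ep2 */
        ev.data.fd = ep2;
        epoll_ctl(ep1, EPOLL_CTL_ADD, ep2, &ev);

        /* repeating this for ep2 -> ep3, ep3 -> ep4, etc. builds the
         * chain; the kernel refuses chains that are nested too deeply */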


> Jason, does your patch work for multi-process application? How hard it would be to implement this for such scenario?

I don't think it would be too hard, but it requires:

1) adding the patches
2) re-compiling, running new kernel
3) modifying your app to the new API.

Thanks,

-Jason


> 
> Madars
> 
>>
>>> Typical applications have few (probably only one) listen sockets or
>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>> blocking syscalls (accept4 or mq_timedreceive).
>>>
>>> Making blocking syscalls allows exclusive wakeups to avoid thundering
>>> herds.
>>>
>>>> How do you think, would it be real to implement this? How about
>>>> concurrency?
>>>> Can you please give me some hints from which points in code to start
>>>> to implement these changes?
>>>
>>> For now, I suggest dedicating a thread in each process to do
>>> mq_timedreceive/mq_receive, assuming you only have a small amount
>>> of queues in your system.
>>>
>>>
>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>     staying on the same CPU as much as possible.
>>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>>     socket server (e.g. http/memcached/DB), a typical new client
>>>     socket spends a fair amount of time idle.  Thus I don't believe
>>>     memory locality inside the kernel is much concern when there's
>>>     thousands of accepted client sockets.
>>>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-08-04 15:02   ` Jason Baron
@ 2015-08-05 11:06     ` Madars Vitolins
  2015-08-05 13:32       ` Jason Baron
  0 siblings, 1 reply; 11+ messages in thread
From: Madars Vitolins @ 2015-08-05 11:06 UTC (permalink / raw)
  To: Jason Baron; +Cc: Eric Wong, linux-kernel

Jason Baron @ 2015-08-04 18:02 rakstīja:
> On 08/03/2015 07:48 PM, Eric Wong wrote:
>> Madars Vitolins <m@silodev.com> wrote:
>>> Hi Folks,
>>>
>>> I am developing kind of open systems application, which uses
>>> multiple processes/executables where each of them monitors some set
>>> of resources (in this case POSIX Queues) via epoll interface. For
>>> example when 10 processes on same queue are in state of 
>>> epoll_wait()
>>> and one message arrives, all 10 processes gets woken up and all of
>>> them tries to read the message from Q. One succeeds, the others 
>>> gets
>>> EAGAIN error. The problem is with those others, which generates
>>> extra context switches - useless CPU usage. With more processes
>>> inefficiency gets higher.
>>>
>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable 
>>> for
>>> multi-threaded application and not for multi-process application.
>>
>> Correct.  Most FDs are not shared across processes.
>>
>>> Ideal mechanism for this would be:
>>> 1. If multiple epoll sets in kernel matches same event and one or
>>> more processes are in state of epoll_wait() - then send event only
>>> to one waiter.
>>> 2. If none of processes are in wait state, then send the event to
>>> all epoll sets (as it is currently). Then the first free process
>>> will grab the event.
>>
>> Jason Baron was working on this (search LKML archives for
>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>
>> However, I was unconvinced about modifying epoll.
>>
>> Perhaps I may be more easily convinced about your mqueue case than 
>> his
>> case for listen sockets, though[*]
>>
>
> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
> multiple epoll fds (or epoll sets) attached to the same wakeup 
> source,
> and have the wakeups 'rotate' among the epoll sets. The wakeup
> essentially walks the list of waiters, wakes up the first thread
> that is actively in epoll_wait(), stops and moves the woken up
> epoll set to the end of the list. So it attempts to balance
> the wakeups among the epoll sets, I think in the way that you
> were describing.
>
> Here is the patchset:
>
> https://lkml.org/lkml/2015/2/24/667
>
> The test program shows how to use the API. Essentially, you
> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
> which you then attach to you're shared wakeup source and
> then to your epoll sets. Please let me know if its unclear.
>
> Thanks,
>
> -Jason

In my particular case I need to work with multiple
processes/executables running (not threads) and listening on the same
queues (this concept lets a sysadmin easily manage those processes:
start new ones for load balancing or stop them without service
interruption; and if any process dies for some reason (signal, core
dump, etc.), the whole application does not get killed, only one
transaction is lost).

Recently I ran tests and found that the kernel's epoll currently sends
notifications to 4 processes (I think it is the EP_MAX_NESTS constant)
waiting on the same resource (the other 6 from my example stay asleep).
So it is not as bad as I thought before. It would be nice if
EP_MAX_NESTS could be configurable, but I guess 4 is fine too.

Jason, does your patch work for a multi-process application? How hard
would it be to implement this for such a scenario?

Madars

>
>> Typical applications have few (probably only one) listen sockets or
>> POSIX mqueues; so I would rather use dedicated threads to issue
>> blocking syscalls (accept4 or mq_timedreceive).
>>
>> Making blocking syscalls allows exclusive wakeups to avoid 
>> thundering
>> herds.
>>
>>> How do you think, would it be real to implement this? How about
>>> concurrency?
>>> Can you please give me some hints from which points in code to 
>>> start
>>> to implement these changes?
>>
>> For now, I suggest dedicating a thread in each process to do
>> mq_timedreceive/mq_receive, assuming you only have a small amount
>> of queues in your system.
>>
>>
>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>     staying on the same CPU as much as possible.
>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>     socket server (e.g. http/memcached/DB), a typical new client
>>     socket spends a fair amount of time idle.  Thus I don't believe
>>     memory locality inside the kernel is much concern when there's
>>     thousands of accepted client sockets.
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-08-03 23:48 ` Eric Wong
@ 2015-08-04 15:02   ` Jason Baron
  2015-08-05 11:06     ` Madars Vitolins
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Baron @ 2015-08-04 15:02 UTC (permalink / raw)
  To: Eric Wong, Madars Vitolins; +Cc: linux-kernel



On 08/03/2015 07:48 PM, Eric Wong wrote:
> Madars Vitolins <m@silodev.com> wrote:
>> Hi Folks,
>>
>> I am developing kind of open systems application, which uses
>> multiple processes/executables where each of them monitors some set
>> of resources (in this case POSIX Queues) via epoll interface. For
>> example when 10 processes on same queue are in state of epoll_wait()
>> and one message arrives, all 10 processes gets woken up and all of
>> them tries to read the message from Q. One succeeds, the others gets
>> EAGAIN error. The problem is with those others, which generates
>> extra context switches - useless CPU usage. With more processes
>> inefficiency gets higher.
>>
>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
>> multi-threaded application and not for multi-process application.
> 
> Correct.  Most FDs are not shared across processes.
> 
>> Ideal mechanism for this would be:
>> 1. If multiple epoll sets in kernel matches same event and one or
>> more processes are in state of epoll_wait() - then send event only
>> to one waiter.
>> 2. If none of processes are in wait state, then send the event to
>> all epoll sets (as it is currently). Then the first free process
>> will grab the event.
> 
> Jason Baron was working on this (search LKML archives for
> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
> 
> However, I was unconvinced about modifying epoll.
> 
> Perhaps I may be more easily convinced about your mqueue case than his
> case for listen sockets, though[*]
> 

Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
multiple epoll fds (or epoll sets) attached to the same wakeup source,
and have the wakeups 'rotate' among the epoll sets. The wakeup
essentially walks the list of waiters, wakes up the first thread
that is actively in epoll_wait(), stops and moves the woken up
epoll set to the end of the list. So it attempts to balance
the wakeups among the epoll sets, I think in the way that you
were describing.

Here is the patchset:

https://lkml.org/lkml/2015/2/24/667

The test program shows how to use the API. Essentially, you
have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
which you then attach to your shared wakeup source and
then to your epoll sets. Please let me know if it's unclear.

Thanks,

-Jason

> Typical applications have few (probably only one) listen sockets or
> POSIX mqueues; so I would rather use dedicated threads to issue
> blocking syscalls (accept4 or mq_timedreceive).
> 
> Making blocking syscalls allows exclusive wakeups to avoid thundering
> herds.
> 
>> How do you think, would it be real to implement this? How about
>> concurrency?
>> Can you please give me some hints from which points in code to start
>> to implement these changes?
> 
> For now, I suggest dedicating a thread in each process to do
> mq_timedreceive/mq_receive, assuming you only have a small amount
> of queues in your system.
> 
> 
> [*] mq_timedreceive may copy a largish buffer which benefits from
>     staying on the same CPU as much as possible.
>     Contrary, accept4 only creates a client socket.  With a C10K+
>     socket server (e.g. http/memcached/DB), a typical new client
>     socket spends a fair amount of time idle.  Thus I don't believe
>     memory locality inside the kernel is much concern when there's
>     thousands of accepted client sockets.
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-07-13 12:34 Madars Vitolins
  2015-07-15 13:07 ` Madars Vitolins
@ 2015-08-03 23:48 ` Eric Wong
  2015-08-04 15:02   ` Jason Baron
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Wong @ 2015-08-03 23:48 UTC (permalink / raw)
  To: Madars Vitolins; +Cc: linux-kernel, Jason Baron

Madars Vitolins <m@silodev.com> wrote:
> Hi Folks,
> 
> I am developing kind of open systems application, which uses
> multiple processes/executables where each of them monitors some set
> of resources (in this case POSIX Queues) via epoll interface. For
> example when 10 processes on same queue are in state of epoll_wait()
> and one message arrives, all 10 processes gets woken up and all of
> them tries to read the message from Q. One succeeds, the others gets
> EAGAIN error. The problem is with those others, which generates
> extra context switches - useless CPU usage. With more processes
> inefficiency gets higher.
> 
> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
> multi-threaded application and not for multi-process application.

Correct.  Most FDs are not shared across processes.

> Ideal mechanism for this would be:
> 1. If multiple epoll sets in kernel matches same event and one or
> more processes are in state of epoll_wait() - then send event only
> to one waiter.
> 2. If none of processes are in wait state, then send the event to
> all epoll sets (as it is currently). Then the first free process
> will grab the event.

Jason Baron was working on this (search LKML archives for
EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)

However, I was unconvinced about modifying epoll.

Perhaps I may be more easily convinced about your mqueue case than his
case for listen sockets, though[*]

Typical applications have few (probably only one) listen sockets or
POSIX mqueues; so I would rather use dedicated threads to issue
blocking syscalls (accept4 or mq_timedreceive).

Making blocking syscalls allows exclusive wakeups to avoid thundering
herds.

> How do you think, would it be real to implement this? How about
> concurrency?
> Can you please give me some hints from which points in code to start
> to implement these changes?

For now, I suggest dedicating a thread in each process to do
mq_timedreceive/mq_receive, assuming you only have a small amount
of queues in your system.
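
A minimal sketch of that dedicated-thread approach (handle_message() is
a hypothetical application callback; the queue is opened blocking,
i.e. without O_NONBLOCK, so that, as noted above, the wakeups are
exclusive and there is no herd):

#include <mqueue.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>

void handle_message(const char *msg, ssize_t len);  /* app-specific */

static void *receiver(void *arg)
{
        mqd_t q = *(mqd_t *)arg;
        char buf[8192];                 /* >= the queue's mq_msgsize */

        for (;;) {
                ssize_t n = mq_receive(q, buf, sizeof(buf), NULL);

                if (n >= 0)
                        handle_message(buf, n);
                else
                        perror("mq_receive");
        }
        return NULL;
}

/* in main(): pthread_create(&tid, NULL, receiver, &q); */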


[*] mq_timedreceive may copy a largish buffer which benefits from
    staying on the same CPU as much as possible.
    Contrary, accept4 only creates a client socket.  With a C10K+
    socket server (e.g. http/memcached/DB), a typical new client
    socket spends a fair amount of time idle.  Thus I don't believe
    memory locality inside the kernel is much concern when there's
    thousands of accepted client sockets.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-07-13 12:34 Madars Vitolins
@ 2015-07-15 13:07 ` Madars Vitolins
  2015-08-03 23:48 ` Eric Wong
  1 sibling, 0 replies; 11+ messages in thread
From: Madars Vitolins @ 2015-07-15 13:07 UTC (permalink / raw)
  To: linux-kernel

Any comments?

Madars

Madars Vitolins @ 2015-07-13 15:34 rakstīja:
> Hi Folks,
>
> I am developing kind of open systems application, which uses multiple
> processes/executables where each of them monitors some set of
> resources (in this case POSIX Queues) via epoll interface. For 
> example
> when 10 processes on same queue are in state of epoll_wait() and one
> message arrives, all 10 processes gets woken up and all of them tries
> to read the message from Q. One succeeds, the others gets EAGAIN
> error. The problem is with those others, which generates extra 
> context
> switches - useless CPU usage. With more processes inefficiency gets
> higher.
>
> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
> multi-threaded application and not for multi-process application.
>
> Ideal mechanism for this would be:
> 1. If multiple epoll sets in kernel matches same event and one or
> more processes are in state of epoll_wait() - then send event only to
> one waiter.
> 2. If none of processes are in wait state, then send the event to all
> epoll sets (as it is currently). Then the first free process will 
> grab
> the event.
>
> How do you think, would it be real to implement this? How about 
> concurrency?
> Can you please give me some hints from which points in code to start
> to implement these changes?
>
>
> Thanks a lot in advance,
> Madars


^ permalink raw reply	[flat|nested] 11+ messages in thread

* epoll and multiple processes - eliminate unneeded process wake-ups
@ 2015-07-13 12:34 Madars Vitolins
  2015-07-15 13:07 ` Madars Vitolins
  2015-08-03 23:48 ` Eric Wong
  0 siblings, 2 replies; 11+ messages in thread
From: Madars Vitolins @ 2015-07-13 12:34 UTC (permalink / raw)
  To: linux-kernel

Hi Folks,

I am developing a kind of open-systems application, which uses multiple
processes/executables where each of them monitors some set of resources
(in this case POSIX queues) via the epoll interface. For example, when
10 processes on the same queue are blocked in epoll_wait() and one
message arrives, all 10 processes get woken up and all of them try to
read the message from the queue. One succeeds, the others get an EAGAIN
error. The problem is with those others, which generate extra context
switches and useless CPU usage. With more processes the inefficiency
gets higher.
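
For illustration, each of those 10 processes runs roughly the following
pattern today (names and buffer size are only examples):

#include <sys/epoll.h>
#include <mqueue.h>
#include <fcntl.h>
#include <errno.h>

/* The queue is opened O_NONBLOCK, so the read after the wakeup can
 * fail with EAGAIN when another process already took the message. */
static void worker(const char *qname)
{
        mqd_t q = mq_open(qname, O_RDONLY | O_NONBLOCK);
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };
        char buf[8192];

        ev.data.fd = q;
        epoll_ctl(epfd, EPOLL_CTL_ADD, q, &ev);

        for (;;) {
                struct epoll_event out;

                if (epoll_wait(epfd, &out, 1, -1) <= 0)
                        continue;
                if (mq_receive(q, buf, sizeof(buf), NULL) < 0 &&
                    errno == EAGAIN)
                        continue;       /* woken up for nothing */
                /* ... process the message ... */
        }
}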

I tried to use EPOLLONESHOT, but it did not help. It seems this is
suitable for a multi-threaded application and not for a multi-process
one.

The ideal mechanism for this would be:
1. If multiple epoll sets in the kernel match the same event and one or
more processes are blocked in epoll_wait(), then send the event to only
one waiter.
2. If none of the processes are waiting, then send the event to all
epoll sets (as is done currently). Then the first free process will
grab the event.

What do you think, would it be feasible to implement this? What about
concurrency?
Can you please give me some hints on which points in the code to start
from to implement these changes?


Thanks a lot in advance,
Madars

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-12-05 11:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2015-11-28 22:54 epoll and multiple processes - eliminate unneeded process wake-ups Madars Vitolins
2015-11-30 19:45 ` Jason Baron
2015-11-30 21:28   ` Madars Vitolins
2015-12-01 20:11     ` Jason Baron
2015-12-05 11:47       ` Madars Vitolins
  -- strict thread matches above, loose matches on Subject: below --
2015-07-13 12:34 Madars Vitolins
2015-07-15 13:07 ` Madars Vitolins
2015-08-03 23:48 ` Eric Wong
2015-08-04 15:02   ` Jason Baron
2015-08-05 11:06     ` Madars Vitolins
2015-08-05 13:32       ` Jason Baron
