* Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found] <836028170.11148.1413630118955.JavaMail.zimbra@efficios.com>
@ 2014-10-18 21:13 ` Mathieu Desnoyers
       [not found] ` <1521253780.11371.1413666790992.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-18 21:13 UTC (permalink / raw)
  To: Ben Maurer, lttng-dev; +Cc: Paul E. McKenney, Yannick Brosseau

Hi,

I have written an implementation of a workqueue based
on wfcqueue (fifo workqueue) and wfstack (lifo wait queue),
including a work-stealing scheme. I think this could
scale very naturally to large workloads, and it skips
the overhead of sys_futex when wakees are already running.
Since the wakeup scheme is very lightweight when work
is ongoing, we use it extensively.

It also makes extensive use of the wfcqueue "splice"
operation to move work batches in bulk between the queue
and worker threads. It is an O(1) operation which does
not need to touch the content of the queue.

RCU is used to look up the sibling worker threads for
work stealing and to wake the siblings of a thread
busy handling work.
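
To make this concrete, here is a minimal sketch of the enqueue and
splice-based batch grab, using only the public wfcqueue API (the
work_item type and helper names are made up; the actual branch code
differs):

#include <urcu/compiler.h>
#include <urcu/wfcqueue.h>

struct work_item {
	struct cds_wfcq_node node;	/* embedded queue node */
	void (*func)(struct work_item *work);
};

/* Shared fifo workqueue; cds_wfcq_init() must be called at startup. */
static struct cds_wfcq_head queue_head;
static struct cds_wfcq_tail queue_tail;

/* Dispatcher side: wait-free enqueue (xchg + store). */
static void dispatch(struct work_item *work)
{
	cds_wfcq_node_init(&work->node);
	(void) cds_wfcq_enqueue(&queue_head, &queue_tail, &work->node);
}

/* Worker side: splice the entire queue content into a local queue
 * in O(1), then consume the batch without further contention. */
static void grab_and_run_batch(void)
{
	struct cds_wfcq_head my_head;
	struct cds_wfcq_tail my_tail;
	struct cds_wfcq_node *node;

	cds_wfcq_init(&my_head, &my_tail);
	(void) cds_wfcq_splice_blocking(&my_head, &my_tail,
					&queue_head, &queue_tail);
	while ((node = cds_wfcq_dequeue_blocking(&my_head, &my_tail))) {
		struct work_item *work =
			caa_container_of(node, struct work_item, node);
		work->func(work);
	}
}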

It's at:

git://git.urcu.so/urcu.git
branch: urcu/workqueue-wakeup

or gitweb of the top commit:
http://git.lttng.org/?p=userspace-rcu.git;a=commit;h=2c6a5414ed2ab44be24211d15a5d9f1f40dbfc74

Note that this is work in progress, compile-tested only.
I also need to write extensive tests.

Feedback is welcome,

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found] ` <1521253780.11371.1413666790992.JavaMail.zimbra@efficios.com>
@ 2014-10-19 19:15   ` Ben Maurer
  2014-10-22 10:57   ` Lai Jiangshan
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Ben Maurer @ 2014-10-19 19:15 UTC (permalink / raw)
  To: Mathieu Desnoyers, lttng-dev; +Cc: Yannick Brosseau, Paul E. McKenney

Hey,

That's pretty cool! Do you have any thoughts on the LifoSem implementation in folly? What are the differences in design decisions between your workqueue and LifoSem?

One other folly abstraction you might want to look at is MPMCQueue -- this is our fifo workqueue that we combine with lifosem: https://github.com/facebook/folly/blob/master/folly/MPMCQueue.h.

One thing I'd point out about LIFO waiting that really applies to RCU is the ability to quiesce threads. When we use futexes to implement data structures like LifoSem, we use an abstraction called MemoryIdler (https://github.com/facebook/folly/blob/master/folly/detail/MemoryIdler.h). When MemoryIdler calls futex() it passes a maximum timeout of 5 seconds. If it takes > 5 seconds to wake up the thread, MemoryIdler releases some of the local resources of that thread. For example, it calls madvise() to release unused portions of the thread's stack and tells jemalloc to free unused per-thread caches. Because we use LIFO waiting, the oldest waiting threads tend to sleep for a long time.
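
To make the idea concrete, a rough sketch of the MemoryIdler pattern
(the two release helpers are hypothetical stand-ins, not folly's API):

#include <errno.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <linux/futex.h>
#include <sys/syscall.h>

/* Hypothetical stand-ins for what MemoryIdler actually does. */
static void release_stack_pages(void) { /* madvise() unused stack */ }
static void flush_thread_caches(void) { /* drop per-thread allocator caches */ }

static void idle_wait(int32_t *uaddr, int32_t expected)
{
	struct timespec timeout = { .tv_sec = 5, .tv_nsec = 0 };

	/* First wait with a bounded timeout... */
	if (syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
		    &timeout, NULL, 0) == -1 && errno == ETIMEDOUT) {
		/* ...still asleep after 5s: we are a cold thread, so
		 * shed local resources, then block indefinitely. */
		release_stack_pages();
		flush_thread_caches();
		syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
			NULL, NULL, 0);
	}
}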

For RCU, I'm guessing that because you require thread registration, parts of the library iterate through all active threads. An abstraction like MemoryIdler could allow threads that sleep for a long time to unregister themselves, reducing the cost of iteration.

One question I've been asking myself: is there an API the kernel could be providing us to do a better job here? One of the things that annoys me about LifoSem is that we don't need exactly LIFO behavior. All we're trying to do is avoid having threads go to sleep if they could spin a bit and acquire new work. We also want inactive threads to sleep for a long time. But among threads that have been running recently, I wish we could give the kernel more freedom in choosing. For example, if the most recently run thread in the LIFO stack ran on CPU 12 but a task is currently running on CPU 12, it'd be great if the kernel could choose a different thread -- one that had recently run on a CPU where it can be immediately scheduled.

Something to think about here: most applications I've seen don't really stress the performance of this queue. The bulk of the performance win we've gotten from LifoSem has been in the application itself -- LIFO wait queues increase application performance through better cache usage. I believe this is where we'd get further wins -- e.g., if the kernel could do a better job of selecting which thread to use.

-b
________________________________________
From: Mathieu Desnoyers [mathieu.desnoyers@efficios.com]
Sent: Saturday, October 18, 2014 2:13 PM
To: Ben Maurer; lttng-dev
Cc: Yannick Brosseau; Paul E. McKenney; Lai Jiangshan; Stephen Hemminger
Subject: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing

[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found] ` <1521253780.11371.1413666790992.JavaMail.zimbra@efficios.com>
  2014-10-19 19:15   ` Ben Maurer
@ 2014-10-22 10:57   ` Lai Jiangshan
       [not found]   ` <54478D8E.1080608@cn.fujitsu.com>
       [not found]   ` <5CDDBDF2D36D9F43B9F5E99003F6A0D4861BCD3E@PRN-MBX02-1.TheFacebook.com>
  3 siblings, 0 replies; 16+ messages in thread
From: Lai Jiangshan @ 2014-10-22 10:57 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, lttng-dev, Ben Maurer, Yannick Brosseau

Hi,

I just glanced at the code quickly...

__urcu_steal_work() and __urcu_wakeup_siblings() don't test
whether they have reached queue->sibling_head.

All the tasks may keep stealing work without making any progress.

Thanks,
Lai

On 10/19/2014 05:13 AM, Mathieu Desnoyers wrote:
> [...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]   ` <54478D8E.1080608@cn.fujitsu.com>
@ 2014-10-22 11:20     ` Mathieu Desnoyers
       [not found]     ` <1569300159.298.1413976834389.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-22 11:20 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Paul E. McKenney, lttng-dev, Ben Maurer, Yannick Brosseau

Hi Lai,

I pushed a few updates within the past days, including one
fixing this issue. The updated branch is available at:

git://git.urcu.so/urcu.git
branch: urcu/workqueue-wakeup

gitweb:
http://git.lttng.org/?p=userspace-rcu.git;a=shortlog;h=refs/heads/urcu/workqueue-wakeup

Please let me know if you notice anything else,

Thanks!

Mathieu

----- Original Message -----
> From: "Lai Jiangshan" <laijs@cn.fujitsu.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "Ben Maurer" <bmaurer@fb.com>, "lttng-dev" <lttng-dev@lists.lttng.org>, "Yannick Brosseau" <scientist@fb.com>,
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Stephen Hemminger" <stephen@networkplumber.org>
> Sent: Wednesday, October 22, 2014 6:57:18 AM
> Subject: Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
> 
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]   ` <5CDDBDF2D36D9F43B9F5E99003F6A0D4861BCD3E@PRN-MBX02-1.TheFacebook.com>
@ 2014-10-22 12:44     ` Mathieu Desnoyers
  0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-22 12:44 UTC (permalink / raw)
  To: Ben Maurer; +Cc: Yannick Brosseau, lttng-dev, Paul E. McKenney

----- Original Message -----
> From: "Ben Maurer" <bmaurer@fb.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>, "lttng-dev" <lttng-dev@lists.lttng.org>
> Cc: "Yannick Brosseau" <yannick.brosseau@fb.com>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Lai Jiangshan"
> <laijs@cn.fujitsu.com>, "Stephen Hemminger" <stephen@networkplumber.org>
> Sent: Sunday, October 19, 2014 3:15:28 PM
> Subject: RE: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
> 
> Hey,
> 
> That's pretty cool! Do you have any thoughts on the LifoSem implementation in
> folly? What are the differences in design decisions between your workqueue
> and LifoSem?

Hi Ben,

I have to admit that I did not have time to look at the Folly
implementation as thoroughly as I would have liked. So I'll focus
on the design guidelines I'm aiming at for my workqueue implementation,
and we can work out the comparison with Folly from there. (more below)

> 
> One other folly abstraction you might want to look at is MPMCQueue -- this is
> our fifo workqueue that we combine with lifosem:
> https://github.com/facebook/folly/blob/master/folly/MPMCQueue.h.
> 
> One thing I'd point out about LIFO waiting that really applies to RCU is the
> ability to quiesce threads. When we use futexes to implement data structures
> like LifoSem, we use an abstraction called MemoryIdler
> (https://github.com/facebook/folly/blob/master/folly/detail/MemoryIdler.h).
> When MemoryIdler calls futex() it passes a maximum timeout of 5 seconds. If
> it takes > 5 seconds to wake up the thread, MemoryIdler releases some of the
> local resources of that thread. For example, it calls madvise() to release
> unused portions of the thread's stack and tells jemalloc to free unused
> per-thread caches. Because we use LIFO waiting, the oldest waiting threads
> tend to sleep for a long time.
> 
> For RCU, I'm guessing that because you require thread registration, parts of
> the library iterate through all active threads. An abstraction like
> MemoryIdler could allow threads that sleep for a long time to unregister
> themselves, reducing the cost of iteration.

Yes, this could indeed speed up iteration a bit.

> 
> One question I've been asking myself: is there an API the kernel could be
> providing us to do a better job here? One of the things that annoys me about
> LifoSem is that we don't need exactly LIFO behavior. All we're trying to do
> is avoid having threads go to sleep if they could spin a bit and acquire new
> work. We also want inactive threads to sleep for a long time. But among
> threads that have been running recently, I wish we could give the kernel
> more freedom in choosing. For example, if the most recently run thread in the
> LIFO stack ran on CPU 12 but a task is currently running on CPU 12, it'd be
> great if the kernel could choose a different thread -- one that had recently
> run on a CPU where it can be immediately scheduled.
> 
> Something to think about here: most applications I've seen don't really
> stress the performance of this queue. The bulk of the performance win we've
> gotten from LifoSem has been in the application itself -- LIFO wait queues
> increase application performance through better cache usage. I believe this
> is where we'd get further wins -- e.g., if the kernel could do a better job of
> selecting which thread to use.

Aha! Well, this is exactly one of the things I'm trying to achieve with my
workqueue/waitqueue implementation: provide exact LIFO ordering of wakeups,
all done in user-space. I achieve this by having one futex int32_t per worker,
and by pushing the worker structures onto a stack (lfstack). Therefore, the
LIFO ordering of wait (stack push) and wakeup (stack pop) is all done in
userspace. Also, the waiters are responsible for busy-waiting for a while
before calling futex wait, which achieves adaptive busy-wait/blocking
similar to pthread mutexes.

I try to bias speed and low-latency guarantees in favor of the dispatcher
(the equivalent of your "network" thread). All the dispatcher has to do is
enqueue work into the workqueue (xchg + store), and then wake up one
worker thread (waitqueue stack pop (cmpxchg), check whether the worker is
running, and if not, futex wake).
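
A self-contained sketch of the shape of this scheme, assuming urcu's
futex compat wrappers (state names and struct layout are made up, and
the real branch code is more refined; in particular, the lfstack pop
needs the ABA protection discussed later in this thread):

#include <stdint.h>
#include <linux/futex.h>
#include <urcu/arch.h>
#include <urcu/compiler.h>
#include <urcu/futex.h>
#include <urcu/lfstack.h>
#include <urcu/uatomic.h>

enum { WORKER_RUNNING = 0, WORKER_WAITING = -1 };

struct worker {
	struct cds_lfs_node wait_node;	/* links us into the waitqueue */
	int32_t futex;			/* one futex word per worker */
};

/* LIFO stack of idle workers; cds_lfs_init() at startup. */
static struct cds_lfs_stack waitqueue;

/* Worker side: push ourself onto the LIFO waitqueue, busy-wait for
 * a while, then block on our private futex (adaptive wait). */
static void worker_wait(struct worker *me)
{
	int i;

	uatomic_set(&me->futex, WORKER_WAITING);
	(void) cds_lfs_push(&waitqueue, &me->wait_node);
	for (i = 0; i < 1024; i++) {
		if (uatomic_read(&me->futex) == WORKER_RUNNING)
			return;		/* awakened during busy-wait */
		caa_cpu_relax();
	}
	while (uatomic_read(&me->futex) == WORKER_WAITING)
		futex_noasync(&me->futex, FUTEX_WAIT, WORKER_WAITING,
			      NULL, NULL, 0);
}

/* Dispatcher side: pop the most recent waiter (LIFO). If every
 * worker is already running, the stack is empty and no sys_futex
 * call is needed at all. */
static void wake_one(void)
{
	struct cds_lfs_node *snode;
	struct worker *w;

	snode = __cds_lfs_pop(&waitqueue);	/* needs RCU/mutex, see below */
	if (!snode)
		return;			/* all workers already running */
	w = caa_container_of(snode, struct worker, wait_node);
	/* Transition the worker to RUNNING; the cmpxchg guards
	 * against a concurrent waker targeting the same worker. */
	if (uatomic_cmpxchg(&w->futex, WORKER_WAITING, WORKER_RUNNING)
			== WORKER_WAITING)
		futex_noasync(&w->futex, FUTEX_WAKE, 1, NULL, NULL, 0);
}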

Worker threads grab the entire content of the workqueue when they are
looking for work to do. This minimizes the amount of cacheline bouncing
that would otherwise be caused by picking work items one by one. We
can do this efficiently thanks to the wfcqueue splice operation, which
allows moving the entire content of a queue without locking and without
touching the internal nodes of the queue.

In order to make sure we don't end up in situations where a thread would
grab a lot of work and delay its execution while other threads are
waiting, I added a work-stealing scheme between neighboring threads.
When a thread is busy processing a work item, it awakens one of its
siblings, and the sibling will try to steal the entire work
batch owned by the busy thread. This should allow work batches to trickle
nicely amongst workers. Worker threads always do this when they wake up:
they try to pick up work from the system workqueue; if no work is
available, they try to steal work from a neighbor.
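
The stealing step itself is cheap; as a sketch, with a hypothetical
per-worker queue type:

#include <stdbool.h>
#include <urcu/wfcqueue.h>

struct worker_queue {
	struct cds_wfcq_head head;
	struct cds_wfcq_tail tail;
};

/* Steal the sibling's entire pending batch by splicing it into our
 * own queue: O(1), no internal node is touched. Returns true if we
 * actually obtained work. */
static bool steal_from_sibling(struct worker_queue *me,
			       struct worker_queue *sibling)
{
	return cds_wfcq_splice_blocking(&me->head, &me->tail,
					&sibling->head, &sibling->tail)
			!= CDS_WFCQ_RET_SRC_EMPTY;
}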

Another goal I have in this design is that if the system is in a state
where all workers are always busy, no calls to sys_futex() are required.
All the synchronization is done in user-space.

Comments, thoughts?

Thanks,

Mathieu


> 
> -b
> ________________________________________
> From: Mathieu Desnoyers [mathieu.desnoyers@efficios.com]
> Sent: Saturday, October 18, 2014 2:13 PM
> To: Ben Maurer; lttng-dev
> Cc: Yannick Brosseau; Paul E. McKenney; Lai Jiangshan; Stephen Hemminger
> Subject: Userspace RCU: workqueue with batching, cheap wakeup, and work
> stealing
> 
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]     ` <1569300159.298.1413976834389.JavaMail.zimbra@efficios.com>
@ 2014-10-23  4:26       ` Lai Jiangshan
       [not found]       ` <54488392.2030008@cn.fujitsu.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Lai Jiangshan @ 2014-10-23  4:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, lttng-dev, Ben Maurer, Yannick Brosseau

Hi, Mathieu

In the code, how is this problem solved?
"All the tasks may keep stealing work without making any progress."

Why does ___urcu_steal_work() need cds_wfcq_dequeue_lock()?

It is not a good idea for a worker to call synchronize_rcu() (in urcu_accept_work()).
New work may be coming soon.

thanks,
Lai

On 10/22/2014 07:20 PM, Mathieu Desnoyers wrote:
> [...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]       ` <54488392.2030008@cn.fujitsu.com>
@ 2014-10-23  9:28         ` Lai Jiangshan
  2014-10-23 13:06         ` Mathieu Desnoyers
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Lai Jiangshan @ 2014-10-23  9:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: lttng-dev, Paul E. McKenney, Ben Maurer, Yannick Brosseau

Hi Mathieu:

What happens when the wait_node is still queued when urcu_worker_unregister() is called?

On 10/23/2014 12:26 PM, Lai Jiangshan wrote:
> 
> Hi, Mathieu
> 
> In the code, how is this problem solved?
> "All the tasks may keep stealing work without making any progress."

Can __urcu_wakeup_siblings() be moved to the end of urcu_dequeue_work()?

> 
> Why does ___urcu_steal_work() need cds_wfcq_dequeue_lock()?
Sorry, it does need it; it locks the source queue...
but we have cds_wfcq_splice_blocking(), which wraps the locking and the splicing.

> 
> It is not a good idea for a worker to call synchronize_rcu() (in urcu_accept_work()).
> New work may be coming soon.


I suggest to add a new API:
__cds_lfs_pop_if(stack, expect) {
	atomic {
		if the top == expect {
			pop it;
		}
	}
}

Then workers can only be woken up by urcu_adaptative_wake_up(),
rather than by urcu_dequeue_wake_single().

Workers dequeue themselves upon a successful urcu_accept_work() by using
__cds_lfs_pop_if().
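
As a sketch, on a bare Treiber stack rather than urcu's actual
lfstack internals, a single cmpxchg gives this "pop only if the top
is the expected node" semantics:

#include <urcu/uatomic.h>

struct lfs_node {
	struct lfs_node *next;
};

struct lfs_stack {
	struct lfs_node *head;
};

/* Pop the top node only if it is the node we expect. As with any
 * Treiber-stack pop, dereferencing expect->next requires that nodes
 * are not freed concurrently (e.g. RCU- or mutex-protected). */
static struct lfs_node *lfs_pop_if(struct lfs_stack *s,
				   struct lfs_node *expect)
{
	if (uatomic_read(&s->head) != expect)
		return NULL;
	if (uatomic_cmpxchg(&s->head, expect, expect->next) == expect)
		return expect;		/* we popped it */
	return NULL;			/* top changed under us */
}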

Thanks,
Lai

> 
> thanks,
> Lai
> 
> [...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]       ` <54488392.2030008@cn.fujitsu.com>
  2014-10-23  9:28         ` Lai Jiangshan
@ 2014-10-23 13:06         ` Mathieu Desnoyers
       [not found]         ` <5448CA38.2010801@cn.fujitsu.com>
       [not found]         ` <1961985429.339.1414069564156.JavaMail.zimbra@efficios.com>
  3 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-23 13:06 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Paul E. McKenney, lttng-dev, Ben Maurer, Yannick Brosseau

----- Original Message -----
> From: "Lai Jiangshan" <laijs@cn.fujitsu.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "Ben Maurer" <bmaurer@fb.com>, "lttng-dev" <lttng-dev@lists.lttng.org>, "Yannick Brosseau" <scientist@fb.com>,
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Stephen Hemminger" <stephen@networkplumber.org>
> Sent: Thursday, October 23, 2014 12:26:58 AM
> Subject: Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
> 
> Hi, Mathieu

Hi Lai,

> 
> In the code, how is this problem solved?
> "All the tasks may keep stealing work without making any progress."

Good point. It is not. I see two possible solutions for this:

1) We break the circular chain of siblings at the list head.
   This has the downside that work could accumulate at the first
   worker of the list, or
2) Whenever we grab work (either from the global queue or from
   a sibling), we could dequeue one element and keep it to ourselves,
   out of reach of siblings. This would ensure progress, since a
   worker stealing work would always keep one work element to itself.

I personally prefer (2), since it has fewer downsides.
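
A sketch of option (2), reusing the hypothetical worker_queue type
from the sketch earlier in the thread: right after splicing a batch,
dequeue one item into local scope, where siblings cannot reach it.

/* Grab a batch and immediately set one item aside: stealers splice
 * from our queue, so an item we have already dequeued is private to
 * us, which guarantees forward progress on at least that item. */
static struct cds_wfcq_node *grab_batch_keep_one(struct worker_queue *me,
						 struct worker_queue *src)
{
	if (cds_wfcq_splice_blocking(&me->head, &me->tail,
				     &src->head, &src->tail)
			== CDS_WFCQ_RET_SRC_EMPTY)
		return NULL;		/* nothing to grab */
	return cds_wfcq_dequeue_blocking(&me->head, &me->tail);
}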

> 
> Why does ___urcu_steal_work() need cds_wfcq_dequeue_lock()?

This lock is to protect operations on the sibling workqueue.
Here are the various actors touching it:

- The worker attempting to steal from the sibling (with splice-from),
- The sibling itself, in urcu_dequeue_work(), with a dequeue
  operation.

Looking at the wfcq synchronization table:

 * Synchronization table:
 *
 * External synchronization techniques described in the API below is
 * required between pairs marked with "X". No external synchronization
 * required between pairs marked with "-".
 *
 * Legend:
 * [1] cds_wfcq_enqueue
 * [2] __cds_wfcq_splice (destination queue)
 * [3] __cds_wfcq_dequeue
 * [4] __cds_wfcq_splice (source queue)
 * [5] __cds_wfcq_first
 * [6] __cds_wfcq_next
 *
 *     [1] [2] [3] [4] [5] [6]
 * [1]  -   -   -   -   -   -
 * [2]  -   -   -   -   -   -
 * [3]  -   -   X   X   X   X
 * [4]  -   -   X   -   X   X
 * [5]  -   -   X   X   -   -
 * [6]  -   -   X   X   -   -

We need a mutex between splice-from and dequeue since they
are executed by different threads. I can replace this with
a call to cds_wfcq_splice_blocking() instead, which takes the
lock internally. That would be neater.
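
As a sketch, the two equivalent forms, on the same hypothetical
per-worker queues as earlier in the thread:

/* Explicit: take the source queue's dequeue lock around the splice. */
static void steal_locked(struct worker_queue *me, struct worker_queue *sib)
{
	cds_wfcq_dequeue_lock(&sib->head, &sib->tail);
	(void) __cds_wfcq_splice_blocking(&me->head, &me->tail,
					  &sib->head, &sib->tail);
	cds_wfcq_dequeue_unlock(&sib->head, &sib->tail);
}

/* Equivalent, and neater: the API takes the source lock internally. */
static void steal_simple(struct worker_queue *me, struct worker_queue *sib)
{
	(void) cds_wfcq_splice_blocking(&me->head, &me->tail,
					&sib->head, &sib->tail);
}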

> 
> It is not a good idea for a worker to call synchronize_rcu() (in
> urcu_accept_work()).
> New work may be coming soon.

The reason why the worker calls synchronize_rcu() there is to protect
the waitqueue lfstack "pop" operation from ABA. It matches the RCU
read-side critical sections encompassing urcu_dequeue_wake_single().

Another way to achieve this would be to grab a mutex across calls to
urcu_dequeue_wake_single(), but unfortunately this call needs to be
done by the dispatcher thread, and I'm trying to keep a low-latency
bias in favor of the dispatcher here. In this case, it is done at the
expense of overhead in the worker when the worker is about to busy-wait
and eventually block, so I thought the tradeoff was not so bad.
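
Sketched, with the lfstack waitqueue from earlier in the thread: the
read-side critical section pins the popped node's memory, so the
pop's cmpxchg cannot suffer ABA from a recycled node, and the
worker's synchronize_rcu() waits out any such reader before the node
is reused.

#include <urcu.h>
#include <urcu/lfstack.h>

/* Dispatcher side: pop under rcu_read_lock() so the node we are
 * inspecting cannot be recycled concurrently. */
static struct cds_lfs_node *pop_waiter(struct cds_lfs_stack *wq)
{
	struct cds_lfs_node *node;

	rcu_read_lock();
	node = __cds_lfs_pop(wq);
	rcu_read_unlock();
	return node;
}

/* Worker side: before reusing (re-pushing) its wait node, wait for
 * all in-flight read-side critical sections, hence for any pop that
 * could still hold a reference to the node. */
static void quiesce_wait_node(void)
{
	synchronize_rcu();
}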

Thoughts?

Thanks for the feedback!

Mathieu

> 
> thanks,
> Lai
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]         ` <5448CA38.2010801@cn.fujitsu.com>
@ 2014-10-23 13:30           ` Mathieu Desnoyers
       [not found]           ` <984401415.431.1414071012829.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-23 13:30 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: lttng-dev, Paul E. McKenney, Ben Maurer, Yannick Brosseau

----- Original Message -----
> From: "Lai Jiangshan" <laijs@cn.fujitsu.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "lttng-dev" <lttng-dev@lists.lttng.org>, "Ben Maurer"
> <bmaurer@fb.com>, "Yannick Brosseau" <scientist@fb.com>
> Sent: Thursday, October 23, 2014 5:28:24 AM
> Subject: Re: [lttng-dev] Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
> 
> Hi Mathieu:
> 
> What happens when the wait_node is still queued when urcu_worker_unregister()
> is called?

Hrm, I think this could indeed be an issue. I'm thinking of adding a "shutdown"
mode to the workqueue (similar to folly's LifoSem). Perhaps this could help
us handle this situation.

> 
> On 10/23/2014 12:26 PM, Lai Jiangshan wrote:
[...]
> > 
> > Hi, Mathieu
> > 
> > In the code, how is this problem solved?
> > "All the tasks may keep stealing work without making any progress."
> 
> Can __urcu_wakeup_siblings() be moved to the end of urcu_dequeue_work()?

It might help, but I don't think it would solve the theoretical
lack-of-progress issue. If we are in a state where all worker threads
are running, none need to be awakened to steal work, and they could fall
into an endless cycle of work-stealing followed by facing an empty work
queue due to other concurrent stealing. I think getting threads to keep
one work item for themselves when they grab work from the global queue
or steal work offers better forward-progress guarantees in this case.

> 
> > 
> > Why does ___urcu_steal_work() need cds_wfcq_dequeue_lock()?
> Sorry, it does need it; it locks the source queue...
> but we have cds_wfcq_splice_blocking(), which wraps the locking and the
> splicing.

Yes,

> 
> > 
> > It is not a good idea for a worker to call synchronize_rcu() (in
> > urcu_accept_work()).
> > A new work may be coming soon.
> 
> 
> I suggest to add a new API:
> __cds_lfs_pop_if(stack, expect) {
> 	atomic {
> 		if the top == expect {
> 			pop it;
> 		}
> 	}
> }
> 
> Then workers can only be woken up by urcu_adaptative_wake_up(),
> rather than by urcu_dequeue_wake_single().
> 
> Workers dequeue themselves upon a successful urcu_accept_work() by using
> __cds_lfs_pop_if().

There is something I don't like about this approach: what if, for some
reason, a worker being awakened takes a long time to start running?
It would then stall the entire waitqueue until it gets to run
and pops itself off the waitqueue.

Thoughts?

Thanks!

Mathieu

> 
> Thanks,
> Lai
> 
> > 
> > thanks,
> > Lai
> > 
> > [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]           ` <984401415.431.1414071012829.JavaMail.zimbra@efficios.com>
@ 2014-10-23 16:23             ` Mathieu Desnoyers
       [not found]             ` <1736995491.608.1414081381945.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-23 16:23 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: lttng-dev, Paul E. McKenney, Ben Maurer, Yannick Brosseau

----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> To: "Lai Jiangshan" <laijs@cn.fujitsu.com>
> Cc: "lttng-dev" <lttng-dev@lists.lttng.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Ben Maurer"
> <bmaurer@fb.com>, "Yannick Brosseau" <scientist@fb.com>
> Sent: Thursday, October 23, 2014 9:30:12 AM
> Subject: Re: [lttng-dev] Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
> 
> [...]

All the above comments should be taken care of by this string of commits:

commit a649215929e3af5a0dd26d9fc5bb9ed49e677773
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Oct 23 12:19:31 2014 -0400

    workqueue: keep one work item from stealing co-workers
    
    Takes care of lack of progress issue that could theoretically occur due
    to a steady work-stealing circular loop amongst co-workers. Keeping a
    work item hidden from co-workers ensures forward progress.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

commit 6e17009c7e4276b3a90aada07e7c9835e9a788be
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Oct 23 11:27:58 2014 -0400

    workqueue: ensure worker is removed from waitqueue upon unregister
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

commit 1b0a9891cd2f9100dc87745d77a4d0069a21adb7
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Oct 23 11:13:24 2014 -0400

    workqueue: implement shutdown
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

commit 0695bd205e8aa88b3b032f483b7dc98c16450784
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Oct 23 09:01:29 2014 -0400

    workqueue: use cds_wfcq_splice_blocking() rather than explicit locks
    
    cds_wfcq_splice_blocking() internally takes the locks.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Please let me know if I missed anything,

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]             ` <1736995491.608.1414081381945.JavaMail.zimbra@efficios.com>
@ 2014-10-23 21:57               ` Mathieu Desnoyers
       [not found]               ` <1650317292.810.1414101428789.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-23 21:57 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: lttng-dev, Paul E. McKenney, Ben Maurer, Yannick Brosseau

The next thing I'm wondering: should we include an
optional bound on the global workqueue size in the API?

I've just had cases here where I stress test the queue
with very frequent dispatch, and it can fill up memory
relatively quickly if the workers have a large amount of
work to do per work-item.

I think the usual way to do this would be to make the
behavior nonblocking when the queue is full, so the
dispatcher can take action and move the work away to
another machine, or report congestion.
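
One possible shape for this, as a sketch with a hypothetical
bounded_queue type (the count is maintained outside the enqueue's
atomicity, so the bound is approximate, which is enough for
congestion control):

#include <errno.h>
#include <urcu/uatomic.h>
#include <urcu/wfcqueue.h>

struct bounded_queue {
	struct cds_wfcq_head head;
	struct cds_wfcq_tail tail;
	long count;	/* approximate number of queued items */
	long max;	/* configured upper bound */
};

/* Nonblocking enqueue: report congestion instead of blocking.
 * Workers uatomic_dec(&q->count) as they consume items. */
static int try_enqueue(struct bounded_queue *q, struct cds_wfcq_node *node)
{
	if (uatomic_read(&q->count) >= q->max)
		return -EAGAIN;		/* dispatcher takes action */
	uatomic_inc(&q->count);
	(void) cds_wfcq_enqueue(&q->head, &q->tail, node);
	return 0;
}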

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]               ` <1650317292.810.1414101428789.JavaMail.zimbra@efficios.com>
@ 2014-10-23 22:09                 ` Ben Maurer
       [not found]                 ` <5CDDBDF2D36D9F43B9F5E99003F6A0D4861C3F96@PRN-MBX02-1.TheFacebook.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Ben Maurer @ 2014-10-23 22:09 UTC (permalink / raw)
  To: Mathieu Desnoyers, Lai Jiangshan
  Cc: Yannick Brosseau, lttng-dev, Paul E. McKenney

Bounds are pretty critical :-). Often during operational incidents we will get large buildups in our queues, and these cause problems.

For us, one of the most critical things isn't the memory usage but the delay caused to the client. For example, if a server has a queue that incoming requests are put into, and that queue grows large, clients experience large delays. Since most calls to the server have a short timeout (seconds), we'd rather prevent items from entering the queue so that we fail fast.

Some of our applications switch to LIFO processing of work items when the queue is large. What this does is focus the processing effort on recent requests -- ones which will hopefully get back to the user in time for them to see a response.

Long story short: when a queue is overloaded, we'd rather drop some requests quickly and serve the other requests with minimal queuing delay. Think of queue buildup as bufferbloat applied to work items. In fact, we have experimented with some of the bufferbloat techniques on our work queues (specifically, CoDel).
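
A deliberately simplified sketch of that last idea (real CoDel also
tracks an interval and a control law; none of this is folly's actual
code):

#include <stdbool.h>
#include <time.h>

#define TARGET_DELAY_MS	5	/* illustrative target sojourn time */

struct timed_work {
	struct timespec enqueue_time;	/* stamped by the dispatcher */
	/* ... work payload ... */
};

/* On dequeue: drop the item if it already waited too long; the
 * client has likely timed out anyway, so fail fast. */
static bool should_drop(const struct timed_work *w)
{
	struct timespec now;
	long delay_ms;

	clock_gettime(CLOCK_MONOTONIC, &now);
	delay_ms = (now.tv_sec - w->enqueue_time.tv_sec) * 1000
			+ (now.tv_nsec - w->enqueue_time.tv_nsec) / 1000000;
	return delay_ms > TARGET_DELAY_MS;
}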

-b
________________________________________
From: Mathieu Desnoyers [mathieu.desnoyers@efficios.com]
Sent: Thursday, October 23, 2014 2:57 PM
To: Lai Jiangshan
Cc: lttng-dev; Paul E. McKenney; Ben Maurer; Yannick Brosseau
Subject: Re: [lttng-dev] Userspace RCU: workqueue with batching, cheap wakeup, and work stealing

[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]                 ` <5CDDBDF2D36D9F43B9F5E99003F6A0D4861C3F96@PRN-MBX02-1.TheFacebook.com>
@ 2014-10-23 22:48                   ` Mathieu Desnoyers
       [not found]                   ` <710327797.877.1414104532195.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-23 22:48 UTC (permalink / raw)
  To: Ben Maurer; +Cc: Yannick Brosseau, lttng-dev, Paul E. McKenney

Interesting point about bufferbloat!

I've just pushed the "approximate queue upper bound" feature
in the last commit.

Going further, when the queue is full, there are indeed a few
options:

1) sleep for a few ms and retry enqueue,
2) grab the entire content of the global workqueue, and discard
   its work elements one by one,
3) in addition to (2), also steal work from all worker
   threads, and discard their work elements.

Making the dispatcher act as a dummy "worker thread" would
allow it to easily accomplish (2). We'd need some tweaks
to "steal all workers' work elements" (3) (new API). This
could be presented as a "urcu_queue_steal_all" or something
like that, and then the dispatcher could iterate over the
work items and either discard them or perform the appropriate
socket action.
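
A sketch of option (2), reusing the hypothetical bounded_queue type
from earlier in the thread: splice everything out in O(1), then
discard or redirect item by item.

static void drain_on_congestion(struct bounded_queue *q)
{
	struct cds_wfcq_head drain_head;
	struct cds_wfcq_tail drain_tail;
	struct cds_wfcq_node *node;

	cds_wfcq_init(&drain_head, &drain_tail);
	(void) cds_wfcq_splice_blocking(&drain_head, &drain_tail,
					&q->head, &q->tail);
	while ((node = cds_wfcq_dequeue_blocking(&drain_head, &drain_tail))) {
		uatomic_dec(&q->count);
		/* discard, report congestion, or redirect the item */
	}
}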

Thoughts?

Thanks,

Mathieu


----- Original Message -----
> From: "Ben Maurer" <bmaurer@fb.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>, "Lai Jiangshan" <laijs@cn.fujitsu.com>
> Cc: "lttng-dev" <lttng-dev@lists.lttng.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Yannick Brosseau"
> <yannick.brosseau@fb.com>
> Sent: Thursday, October 23, 2014 6:09:11 PM
> Subject: RE: [lttng-dev] Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
> 
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]                   ` <710327797.877.1414104532195.JavaMail.zimbra@efficios.com>
@ 2014-10-23 23:24                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-23 23:24 UTC (permalink / raw)
  To: Ben Maurer; +Cc: Yannick Brosseau, lttng-dev, Paul E. McKenney

I just implemented a urcu_workqueue_steal_all(), and the
benchmark test program now uses this when there is congestion
in the queue.

Thanks,

Mathieu

----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> To: "Ben Maurer" <bmaurer@fb.com>
> Cc: "Yannick Brosseau" <yannick.brosseau@fb.com>, "lttng-dev" <lttng-dev@lists.lttng.org>, "Paul E. McKenney"
> <paulmck@linux.vnet.ibm.com>
> Sent: Thursday, October 23, 2014 6:48:52 PM
> Subject: Re: [lttng-dev] Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
> 
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]         ` <1961985429.339.1414069564156.JavaMail.zimbra@efficios.com>
@ 2014-10-24  1:45           ` Lai Jiangshan
       [not found]           ` <5449AF57.2040507@cn.fujitsu.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Lai Jiangshan @ 2014-10-24  1:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, lttng-dev, Ben Maurer, Yannick Brosseau

On 10/23/2014 09:06 PM, Mathieu Desnoyers wrote:
> ----- Original Message -----
>> From: "Lai Jiangshan" <laijs@cn.fujitsu.com>
>> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
>> Cc: "Ben Maurer" <bmaurer@fb.com>, "lttng-dev" <lttng-dev@lists.lttng.org>, "Yannick Brosseau" <scientist@fb.com>,
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Stephen Hemminger" <stephen@networkplumber.org>
>> Sent: Thursday, October 23, 2014 12:26:58 AM
>> Subject: Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
>>
>> Hi, Mathieu
> 
> Hi Lai,
> 
>>
>> In the code, how this problem be solved?
>> "All the tasks may continue stealing works without any progress."
> 
> Good point. It is not. I see two possible solutions for this:
> 
> 1) We break the circular chain of siblings at the list head.
>    It has the downside that work could accumulate at the first
>    worker of the list, or
> 2) Whenever we grab work (either from the global queue or from
>    a sibling), we could dequeue one element and keep it to ourself,
>    out of reach of siblings. This would ensure progress, since
>    worker stealing work would always keep one work element to itself,
>    which would therefore ensure progress.
> 
> I personally prefer (2), since it has fewer downsides.
> 
>>
>> Why ___urcu_steal_work() needs cds_wfcq_dequeue_lock()?
> 
> This lock is to protect operations on the sibling workqueue.
> Here are the various actors touching it:
> 
> - The worker attempting to steal from the sibling (with splice-from),
> - The sibling itself, in urcu_dequeue_work(), with a dequeue
>   operation,
> 
> Looking at the wfcq synchronization table:
> 
>  * Synchronization table:
>  *
>  * External synchronization techniques described in the API below are
>  * required between pairs marked with "X". No external synchronization
>  * is required between pairs marked with "-".
>  *
>  * Legend:
>  * [1] cds_wfcq_enqueue
>  * [2] __cds_wfcq_splice (destination queue)
>  * [3] __cds_wfcq_dequeue
>  * [4] __cds_wfcq_splice (source queue)
>  * [5] __cds_wfcq_first
>  * [6] __cds_wfcq_next
>  *
>  *     [1] [2] [3] [4] [5] [6]
>  * [1]  -   -   -   -   -   -
>  * [2]  -   -   -   -   -   -
>  * [3]  -   -   X   X   X   X
>  * [4]  -   -   X   -   X   X
>  * [5]  -   -   X   X   -   -
>  * [6]  -   -   X   X   -   -
> 
> We need a mutex between splice-from and dequeue since they
> are executed by different threads. I can replace this with
> a call to cds_wfcq_splice_blocking() instead, which takes the
> lock internally. That would be neater.
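> 
> With that change, the steal path would reduce to roughly the
> sketch below (reusing the assumed struct urcu_worker from the
> previous sketch):
> 
> /*
>  * cds_wfcq_splice_blocking() takes the source queue's dequeue
>  * lock internally, so this synchronizes against the sibling's
>  * own dequeue without an explicit mutex here.
>  */
> static enum cds_wfcq_ret urcu_steal_all(struct urcu_worker *self,
> 		struct urcu_worker *sibling)
> {
> 	return cds_wfcq_splice_blocking(&self->head, &self->tail,
> 			&sibling->head, &sibling->tail);
> }
> 
> A CDS_WFCQ_RET_SRC_EMPTY return would tell the caller that the
> sibling had nothing to steal.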
> 
>>
>> It is not a good idea for a worker to call synchronize_rcu() (in
>> urcu_accept_work()).
>> A new work item may arrive soon.
> 
> The reason why the worker calls synchronize_rcu() there is to protect
> the waitqueue lfstack "pop" operation from ABA. It matches the RCU
> read-side critical sections encompassing urcu_dequeue_wake_single().
> 
> Another way to achieve this would be to grab a mutex across calls to
> urcu_dequeue_wake_single(), but unfortunately this call needs to be
> done by the dispatcher thread, and I'm trying to keep a low-latency
> bias in favor of the dispatcher here. In this case, it is done at the
> expense of overhead in the worker when the worker is about to busy-wait
> and eventually block, so I thought the tradeoff was not so bad.
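> 
> To make the pairing concrete, here is a rough sketch of the two
> sides (names are assumptions based on the description above, not
> the branch's exact code):
> 
> #include <urcu.h>
> #include <urcu/lfstack.h>
> 
> /*
>  * Dispatcher side: the RCU read-side critical section protects
>  * the lfstack pop against ABA, keeping the dispatcher mutex-free.
>  */
> static struct cds_lfs_node *wakeup_pop(struct cds_lfs_stack *waitqueue)
> {
> 	struct cds_lfs_node *node;
> 
> 	rcu_read_lock();
> 	node = __cds_lfs_pop(waitqueue);
> 	rcu_read_unlock();
> 	return node;
> }
> 
> /*
>  * Worker side (in urcu_accept_work()): wait for a grace period
>  * before re-pushing the wait node, so no concurrent popper can
>  * still hold a stale reference to it.
>  */
> static void requeue_waiter(struct cds_lfs_stack *waitqueue,
> 		struct cds_lfs_node *wait_node)
> {
> 	synchronize_rcu();
> 	(void) cds_lfs_push(waitqueue, wait_node);
> }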

"single worker pattern" does be a frequent use-case.
you can image the lag and jitter in this case.


How about a mutex- and RCU-protected idle-workers list?

An idle worker would be woken up via the RCU-protected scheme,
and dequeued/requeued (in urcu_accept_work() or
urcu_worker_unregister()) via the mutex-protected scheme.

> 
> Thoughts ?
> 
> Thanks for the feedback!
> 
> Mathieu
> 
>>
>> thanks,
>> Lai
>>
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
       [not found]           ` <5449AF57.2040507@cn.fujitsu.com>
@ 2014-10-24 13:22             ` Mathieu Desnoyers
  0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-10-24 13:22 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Paul E. McKenney, lttng-dev, Ben Maurer, Yannick Brosseau

----- Original Message -----
> From: "Lai Jiangshan" <laijs@cn.fujitsu.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "Ben Maurer" <bmaurer@fb.com>, "lttng-dev" <lttng-dev@lists.lttng.org>, "Yannick Brosseau" <scientist@fb.com>,
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Stephen Hemminger" <stephen@networkplumber.org>
> Sent: Thursday, October 23, 2014 9:45:59 PM
> Subject: Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
> 
> On 10/23/2014 09:06 PM, Mathieu Desnoyers wrote:
> > ----- Original Message -----
> >> From: "Lai Jiangshan" <laijs@cn.fujitsu.com>
> >> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> >> Cc: "Ben Maurer" <bmaurer@fb.com>, "lttng-dev"
> >> <lttng-dev@lists.lttng.org>, "Yannick Brosseau" <scientist@fb.com>,
> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Stephen Hemminger"
> >> <stephen@networkplumber.org>
> >> Sent: Thursday, October 23, 2014 12:26:58 AM
> >> Subject: Re: Userspace RCU: workqueue with batching, cheap wakeup, and
> >> work stealing
> > 
[...]
> >>
> >> It is not a good idea for a worker to call synchronize_rcu() (in
> >> urcu_accept_work()).
> >> A new work item may arrive soon.
> > 
> > The reason why the worker calls synchronize_rcu() there is to protect
> > the waitqueue lfstack "pop" operation from ABA. It matches the RCU
> > read-side critical sections encompassing urcu_dequeue_wake_single().
> > 
> > Another way to achieve this would be to grab a mutex across calls to
> > urcu_dequeue_wake_single(), but unfortunately this call needs to be
> > done by the dispatcher thread, and I'm trying to keep a low-latency
> > bias in favor of the dispatcher here. In this case, it is done at the
> > expense of overhead in the worker when the worker is about to busy-wait
> > and eventually block, so I thought the tradeoff was not so bad.
> 
> "single worker pattern" does be a frequent use-case.
> you can image the lag and jitter in this case.

I'd be happy to find a way to reduce unnecessary jitter, indeed.
However, we need to keep in mind that this synchronize_rcu() is
issued only when the worker thread has no more work to do. And if
there are few registered RCU threads, synchronize_rcu() should be
pretty quick, especially now that multiple grace periods can run
concurrently.

> 
> 
> How about a mutex- and RCU-protected idle-workers list?
> 
> An idle worker would be woken up via the RCU-protected scheme,
> and dequeued/requeued (in urcu_accept_work() or urcu_worker_unregister()) via
> the mutex-protected scheme.

I'm not sure I understand your proposal. Can you give more
detail about the RCU-protected scheme you refer to here,
along with how it is intended to interact with the mutex ?

Note that if we use an RCU read-side critical section around
the lfstack pop, it needs to be matched by a grace period
between any other "pop"/"pop all" and the re-push, to deal
with ABA.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2014-10-24 13:22 UTC | newest]

Thread overview: 16+ messages
     [not found] <836028170.11148.1413630118955.JavaMail.zimbra@efficios.com>
2014-10-18 21:13 ` Userspace RCU: workqueue with batching, cheap wakeup, and work stealing Mathieu Desnoyers
     [not found] ` <1521253780.11371.1413666790992.JavaMail.zimbra@efficios.com>
2014-10-19 19:15   ` Ben Maurer
2014-10-22 10:57   ` Lai Jiangshan
     [not found]   ` <54478D8E.1080608@cn.fujitsu.com>
2014-10-22 11:20     ` Mathieu Desnoyers
     [not found]     ` <1569300159.298.1413976834389.JavaMail.zimbra@efficios.com>
2014-10-23  4:26       ` Lai Jiangshan
     [not found]       ` <54488392.2030008@cn.fujitsu.com>
2014-10-23  9:28         ` Lai Jiangshan
2014-10-23 13:06         ` Mathieu Desnoyers
     [not found]         ` <5448CA38.2010801@cn.fujitsu.com>
2014-10-23 13:30           ` Mathieu Desnoyers
     [not found]           ` <984401415.431.1414071012829.JavaMail.zimbra@efficios.com>
2014-10-23 16:23             ` Mathieu Desnoyers
     [not found]             ` <1736995491.608.1414081381945.JavaMail.zimbra@efficios.com>
2014-10-23 21:57               ` Mathieu Desnoyers
     [not found]               ` <1650317292.810.1414101428789.JavaMail.zimbra@efficios.com>
2014-10-23 22:09                 ` Ben Maurer
     [not found]                 ` <5CDDBDF2D36D9F43B9F5E99003F6A0D4861C3F96@PRN-MBX02-1.TheFacebook.com>
2014-10-23 22:48                   ` Mathieu Desnoyers
     [not found]                   ` <710327797.877.1414104532195.JavaMail.zimbra@efficios.com>
2014-10-23 23:24                     ` Mathieu Desnoyers
     [not found]         ` <1961985429.339.1414069564156.JavaMail.zimbra@efficios.com>
2014-10-24  1:45           ` Lai Jiangshan
     [not found]           ` <5449AF57.2040507@cn.fujitsu.com>
2014-10-24 13:22             ` Mathieu Desnoyers
     [not found]   ` <5CDDBDF2D36D9F43B9F5E99003F6A0D4861BCD3E@PRN-MBX02-1.TheFacebook.com>
2014-10-22 12:44     ` Mathieu Desnoyers
