* call_rcu seems inefficient without futex
       [not found] <157982514329.691.6168767011604689030.ref@pink>
@ 2020-01-24  0:19 ` Alex Xu via lttng-dev
  2020-01-27 15:38   ` Mathieu Desnoyers
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Xu via lttng-dev @ 2020-01-24  0:19 UTC (permalink / raw)
  To: lttng-dev

Hi,

I recently installed knot dns for a very small FreeBSD server. I noticed
that it uses a surprising amount of CPU, even when there is no load:
about 0.25%. That's not huge, but it seems unnecessarily high when my
QPS is less than 0.01.

After some profiling, I came to the conclusion that this is caused by
call_rcu_wait using futex_async to repeatedly wait. Since there is no
futex on FreeBSD (without the Linux compatibility layer), this
effectively turns into a permanent busy waiting loop.
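
Roughly speaking, the no-futex fallback degenerates into something like
the sketch below (simplified, not the actual liburcu source; the real
code lives in src/compat_futex.c):

/*
 * Simplified sketch of the futex compat fallback: with no kernel
 * futex, "waiting" is emulated by re-checking the word every 10ms,
 * so an idle call_rcu worker thread keeps waking up forever.
 */
#include <poll.h>
#include <stdint.h>

static void compat_futex_async_sketch(int32_t *uaddr, int32_t val)
{
	/* Wait until *uaddr no longer equals val. */
	while (*(volatile int32_t *) uaddr == val)
		(void) poll(NULL, 0, 10);	/* sleep 10ms, then re-check */
}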

I think futex_noasync can be used here instead. call_rcu_wait is only
supposed to be called from call_rcu_thread, never from a signal context.
call_rcu calls get_call_rcu_data, which may call
get_default_call_rcu_data, which calls pthread_mutex_lock through
call_rcu_lock. Therefore, call_rcu is not async-signal-safe already.
Also, I think it only makes sense to use call_rcu around a RCU write,
which contradicts the README saying that only RCU reads are allowed in
signal handlers.

I applied "sed -i -e 's/futex_async/futex_noasync/'
src/urcu-call-rcu-impl.h" and knot seems to work correctly with only
0.01% CPU now. I also ran tests/unit and tests/regression with default
and signal backends and all completed successfully.

I think that the other two usages of futex_async are also a little
suspicious, but I didn't look too closely.

Thanks,
Alex.


* Re: call_rcu seems inefficient without futex
  2020-01-24  0:19 ` call_rcu seems inefficient without futex Alex Xu via lttng-dev
@ 2020-01-27 15:38   ` Mathieu Desnoyers
  2020-01-27 18:25     ` Alex Xu via lttng-dev
  2020-01-28  3:45     ` Paul E. McKenney
  0 siblings, 2 replies; 5+ messages in thread
From: Mathieu Desnoyers @ 2020-01-27 15:38 UTC (permalink / raw)
  To: Alex Xu, paulmck; +Cc: lttng-dev

----- On Jan 23, 2020, at 7:19 PM, lttng-dev lttng-dev@lists.lttng.org wrote:

> Hi,
> 
> I recently installed knot dns for a very small FreeBSD server. I noticed
> that it uses a surprising amount of CPU, even when there is no load:
> about 0.25%. That's not huge, but it seems unnecessarily high when my
> QPS is less than 0.01.
> 
> After some profiling, I came to the conclusion that this is caused by
> call_rcu_wait using futex_async to repeatedly wait. Since there is no
> futex on FreeBSD (without the Linux compatibility layer), this
> effectively turns into a permanent busy waiting loop.
> 
> I think futex_noasync can be used here instead. call_rcu_wait is only
> supposed to be called from call_rcu_thread, never from a signal context.
> call_rcu calls get_call_rcu_data, which may call
> get_default_call_rcu_data, which calls pthread_mutex_lock through
> call_rcu_lock. Therefore, call_rcu is not async-signal-safe already.

call_rcu() is meant to be async-signal-safe and lock-free after that
initialization has been performed on first use. Paul, do you know where
we have documented this in liburcu ?

> Also, I think it only makes sense to use call_rcu around a RCU write,
> which contradicts the README saying that only RCU reads are allowed in
> signal handlers.

Not sure what you mean by "use call_rcu around a RCU write" ?

Is there anything similar to sys_futex on FreeBSD ?

It would be good to look into alternative ways to fix this that do not
involve changing the guarantees provided by call_rcu() for that fallback
scenario (no futex available). Perhaps in your use-case you may want to
tweak the retry delay for compat_futex_async(). Currently
src/compat_futex.c:compat_futex_async() has a 10ms delay. Would 100ms
be more acceptable ?

Thanks,

Mathieu

> 
> I applied "sed -i -e 's/futex_async/futex_noasync/'
> src/urcu-call-rcu-impl.h" and knot seems to work correctly with only
> 0.01% CPU now. I also ran tests/unit and tests/regression with default
> and signal backends and all completed successfully.
> 
> I think that the other two usages of futex_async are also a little
> suspicious, but I didn't look too closely.
> 
> Thanks,
> Alex.
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: call_rcu seems inefficient without futex
  2020-01-27 15:38   ` Mathieu Desnoyers
@ 2020-01-27 18:25     ` Alex Xu via lttng-dev
  2020-01-28  3:45     ` Paul E. McKenney
  1 sibling, 0 replies; 5+ messages in thread
From: Alex Xu via lttng-dev @ 2020-01-27 18:25 UTC (permalink / raw)
  To: lttng-dev

Quoting Mathieu Desnoyers (2020-01-27 15:38:05)
> ----- On Jan 23, 2020, at 7:19 PM, lttng-dev lttng-dev@lists.lttng.org wrote:
> call_rcu() is meant to be async-signal-safe and lock-free after that
> initialization has been performed on first use. Paul, do you know where
> we have documented this in liburcu ?

Hm... reading it a little more carefully, it does seem that as long as
you manually initialize it, then it is async-signal-safe afterwards.

> > Also, I think it only makes sense to use call_rcu around a RCU write,
> > which contradicts the README saying that only RCU reads are allowed in
> > signal handlers.
> 
> Not sure what you mean by "use call_rcu around a RCU write" ?

I mean that in general, the pattern is usually to do an RCU write (to
remove an item from a list, for example), then do call_rcu to
asynchronously clean up the item.
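
Something like this minimal sketch of that pattern, using the urcu list
API (the struct and free_node names are just for illustration, and it
assumes the default urcu flavor):

#include <stdlib.h>
#include <urcu.h>		/* default urcu flavor, call_rcu() */
#include <urcu/rculist.h>	/* cds_list_del_rcu() */
#include <urcu/compiler.h>	/* caa_container_of() */

struct mynode {
	int key;
	struct cds_list_head list;
	struct rcu_head rcu;
};

static void free_node(struct rcu_head *head)
{
	struct mynode *node = caa_container_of(head, struct mynode, rcu);

	free(node);
}

static void remove_node(struct mynode *node)
{
	/* The RCU "write": unlink the item (updater-side locking elided). */
	cds_list_del_rcu(&node->list);
	/* Deferred reclaim once all pre-existing readers are done. */
	call_rcu(&node->rcu, free_node);
}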

> Is there anything similar to sys_futex on FreeBSD ?

Doing some more research, it seems that _umtx_op is allowed to be used
by userspace applications, see
https://patchwork.freedesktop.org/patch/200456/.

So we can probably just use that. I'll send a patch shortly.
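
Roughly what I have in mind (untested sketch; the op names are my
reading of sys/umtx.h and would need to be checked against the exact
futex semantics liburcu expects):

#include <sys/types.h>
#include <sys/umtx.h>
#include <stdint.h>

/* FUTEX_WAIT-like: block while *uaddr == val. */
static int umtx_futex_wait(int32_t *uaddr, int32_t val)
{
	return _umtx_op(uaddr, UMTX_OP_WAIT_UINT, (u_long) val, NULL, NULL);
}

/* FUTEX_WAKE-like: wake up to nr_wake waiters blocked on uaddr. */
static int umtx_futex_wake(int32_t *uaddr, int32_t nr_wake)
{
	return _umtx_op(uaddr, UMTX_OP_WAKE, (u_long) nr_wake, NULL, NULL);
}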

> It would be good to look into alternative ways to fix this that do not
> involve changing the guarantees provided by call_rcu() for that fallback
> scenario (no futex available). Perhaps in your use-case you may want to
> tweak the retry delay for compat_futex_async(). Currently
> src/compat_futex.c:compat_futex_async() has a 10ms delay. Would 100ms
> be more acceptable ?

I don't completely understand what poll does here. Does it just mean
that call_rcu callbacks would be delayed up to 100 ms? That's probably
OK for cleanup uses, but I guess other uses may need faster reactions.


* Re: call_rcu seems inefficient without futex
  2020-01-27 15:38   ` Mathieu Desnoyers
  2020-01-27 18:25     ` Alex Xu via lttng-dev
@ 2020-01-28  3:45     ` Paul E. McKenney
  2020-01-28 14:59       ` Mathieu Desnoyers
  1 sibling, 1 reply; 5+ messages in thread
From: Paul E. McKenney @ 2020-01-28  3:45 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev

On Mon, Jan 27, 2020 at 10:38:05AM -0500, Mathieu Desnoyers wrote:
> ----- On Jan 23, 2020, at 7:19 PM, lttng-dev lttng-dev@lists.lttng.org wrote:
> 
> > Hi,
> > 
> > I recently installed knot dns for a very small FreeBSD server. I noticed
> > that it uses a surprising amount of CPU, even when there is no load:
> > about 0.25%. That's not huge, but it seems unnecessarily high when my
> > QPS is less than 0.01.
> > 
> > After some profiling, I came to the conclusion that this is caused by
> > call_rcu_wait using futex_async to repeatedly wait. Since there is no
> > futex on FreeBSD (without the Linux compatibility layer), this
> > effectively turns into a permanent busy waiting loop.
> > 
> > I think futex_noasync can be used here instead. call_rcu_wait is only
> > supposed to be called from call_rcu_thread, never from a signal context.
> > call_rcu calls get_call_rcu_data, which may call
> > get_default_call_rcu_data, which calls pthread_mutex_lock through
> > call_rcu_lock. Therefore, call_rcu is not async-signal-safe already.
> 
> call_rcu() is meant to be async-signal-safe and lock-free after that
> initialization has been performed on first use. Paul, do you know where
> we have documented this in liburcu ?

Lock freedom is the goal, but when not in real-time mode, call_rcu()
does invoke futex_async(), which can acquire locks within the Linux
kernel.

Should BSD instead use POSIX condvars for the call_rcu() waits and
wakeups?
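
(For the sake of discussion, a condvar-based replacement would look
roughly like the sketch below.  It blocks nicely instead of polling,
but the wakeup side takes a mutex, which is exactly what makes it
neither lock-free nor async-signal-safe.)

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER;

/* Block while *uaddr == val, futex_wait-style. */
static void condvar_wait(int32_t *uaddr, int32_t val)
{
	pthread_mutex_lock(&wait_lock);
	while (*uaddr == val)
		pthread_cond_wait(&wait_cond, &wait_lock);
	pthread_mutex_unlock(&wait_lock);
}

/* Wake all waiters, futex_wake-style. */
static void condvar_wake(void)
{
	pthread_mutex_lock(&wait_lock);
	pthread_cond_broadcast(&wait_cond);
	pthread_mutex_unlock(&wait_lock);
}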

> > Also, I think it only makes sense to use call_rcu around a RCU write,
> > which contradicts the README saying that only RCU reads are allowed in
> > signal handlers.

I do not believe that it is always safe to invoke call_rcu() from within
a signal handler.  If you made sure to invoke it outside a signal handler
the first time, and then used real-time mode, that should work.  But in
that case, you aren't invoking the futex code.
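
(The setup I have in mind is roughly the following sketch; the flag and
helper names are from my reading of the call_rcu API in urcu-call-rcu.h
and should be double-checked.)

#include <urcu.h>	/* pulls in the call_rcu()/call_rcu_data API */

/*
 * Untested sketch: do the first-use initialization up front, outside
 * any signal handler, and ask for a real-time (RT) call_rcu worker so
 * that subsequent call_rcu() calls never need the futex wakeup path.
 */
static void init_call_rcu_for_signal_use(void)
{
	struct call_rcu_data *crdp;

	crdp = create_call_rcu_data(URCU_CALL_RCU_RT, -1 /* no CPU affinity */);
	set_thread_call_rcu_data(crdp);	/* this thread's call_rcu() will use it */
}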

> Not sure what you mean by "use call_rcu around a RCU write" ?

I confess to some curiosity on this point as well.  Maybe what is meant
is "around a RCU write" as in "near to an RCU write" as in "in place of
using synchronize_rcu()"?

> Is there anything similar to sys_futex on FreeBSD ?
> 
> It would be good to look into alternative ways to fix this that do not
> involve changing the guarantees provided by call_rcu() for that fallback
> scenario (no futex available). Perhaps in your use-case you may want to
> tweak the retry delay for compat_futex_async(). Currently
> src/compat_futex.c:compat_futex_async() has a 10ms delay. Would 100ms
> be more acceptable ?

If this works for knot dns, it would of course be simpler.

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > I applied "sed -i -e 's/futex_async/futex_noasync/'
> > src/urcu-call-rcu-impl.h" and knot seems to work correctly with only
> > 0.01% CPU now. I also ran tests/unit and tests/regression with default
> > and signal backends and all completed successfully.
> > 
> > I think that the other two usages of futex_async are also a little
> > suspicious, but I didn't look too closely.
> > 
> > Thanks,
> > Alex.
> > _______________________________________________
> > lttng-dev mailing list
> > lttng-dev@lists.lttng.org
> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


* Re: call_rcu seems inefficient without futex
  2020-01-28  3:45     ` Paul E. McKenney
@ 2020-01-28 14:59       ` Mathieu Desnoyers
  0 siblings, 0 replies; 5+ messages in thread
From: Mathieu Desnoyers @ 2020-01-28 14:59 UTC (permalink / raw)
  To: paulmck; +Cc: lttng-dev

----- On Jan 27, 2020, at 10:45 PM, paulmck paulmck@kernel.org wrote:

> On Mon, Jan 27, 2020 at 10:38:05AM -0500, Mathieu Desnoyers wrote:
>> ----- On Jan 23, 2020, at 7:19 PM, lttng-dev lttng-dev@lists.lttng.org wrote:
>> 
>> > Hi,
>> > 
>> > I recently installed knot dns for a very small FreeBSD server. I noticed
>> > that it uses a surprising amount of CPU, even when there is no load:
>> > about 0.25%. That's not huge, but it seems unnecessarily high when my
>> > QPS is less than 0.01.
>> > 
>> > After some profiling, I came to the conclusion that this is caused by
>> > call_rcu_wait using futex_async to repeatedly wait. Since there is no
>> > futex on FreeBSD (without the Linux compatibility layer), this
>> > effectively turns into a permanent busy waiting loop.
>> > 
>> > I think futex_noasync can be used here instead. call_rcu_wait is only
>> > supposed to be called from call_rcu_thread, never from a signal context.
>> > call_rcu calls get_call_rcu_data, which may call
>> > get_default_call_rcu_data, which calls pthread_mutex_lock through
>> > call_rcu_lock. Therefore, call_rcu is not async-signal-safe already.
>> 
>> call_rcu() is meant to be async-signal-safe and lock-free after that
>> initialization has been performed on first use. Paul, do you know where
>> we have documented this in liburcu ?
> 
> Lock freedom is the goal, but when not in real-time mode, call_rcu()
> does invoke futex_async(), which can acquire locks within the Linux
> kernel.
> 
> Should BSD instead use POSIX condvars for the call_rcu() waits and
> wakeups?

There are two distinct benefits to lock-freedom which I think are relevant
here (at least):

- As you stated, lock-freedom is useful for real-time algorithms because it
does not require careful handling of locks (priority inversion and so on),

- Moreover, another characteristic of lock-free algorithms which is useful
beyond the scope of real-time systems is their ability to fail gracefully.
Basically, if a lock-free algorithm crashes at any point, the rest of the
system can still go on. This is especially useful for data structures over
shared memory between processes.

This last point highlights why being lock-free in user-space and being
lock-free over the entire system (including the kernel system call
implementation) do not cover exactly the same requirements. For RT,
indeed, the requirement is to be lock-free on both sides of the
user/kernel boundary, because timings are what matter. However, if
lock-freedom is used as a means to recover from failure gracefully, it
can be sufficient to achieve lock-freedom in the userspace part of the
algorithm, and then rely on non-lock-free algorithms within the kernel,
because a failure within the kernel is an internal kernel failure which
affects the entire system anyway.

> 
>> > Also, I think it only makes sense to use call_rcu around a RCU write,
>> > which contradicts the README saying that only RCU reads are allowed in
>> > signal handlers.
> 
> I do not believe that it is always safe to invoke call_rcu() from within
> a signal handler.  If you made sure to invoke it outside a signal handler
> the first time, and then used real-time mode, that should work.  But in
> that case, you aren't invoking the futex code.

Other than the initialization, what prevents using non-rt call_rcu() from a
signal handler context? AFAIU it should be safe to issue a futex wakeup from
a signal handler context.

> 
>> Not sure what you mean by "use call_rcu around a RCU write" ?
> 
> I confess to some curiosity on this point as well.  Maybe what is meant
> is "around a RCU write" as in "near to an RCU write" as in "in place of
> using synchronize_rcu()"?

From Alex Xu's reply:

"I mean that in general, the pattern is usually to do an RCU write (to
remove an item from a list, for example), then do call_rcu to
asynchronously clean up the item."

> 
>> Is there anything similar to sys_futex on FreeBSD ?

Alex Xu provided a patch set in a separate thread implementing "umtx"
support to basically provide OS support for futex on FreeBSD and
DragonflyBSD.

https://lists.lttng.org/pipermail/lttng-dev/2020-January/029507.html
https://lists.lttng.org/pipermail/lttng-dev/2020-January/029510.html

>> 
>> It would be good to look into alternative ways to fix this that do not
>> involve changing the guarantees provided by call_rcu() for that fallback
>> scenario (no futex available). Perhaps in your use-case you may want to
>> tweak the retry delay for compat_futex_async(). Currently
>> src/compat_futex.c:compat_futex_async() has a 10ms delay. Would 100ms
>> be more acceptable ?
> 
> If this works for knot dns, it would of course be simpler.

I think we should not put too much effort in tweaking the fallback for
scenarios where futex is missing. The proper approach seems to be to
implement proper support for futex-like APIs provided by each OS kernel.

Thanks,

Mathieu

> 
>							Thanx, Paul
> 
>> Thanks,
>> 
>> Mathieu
>> 
>> > 
>> > I applied "sed -i -e 's/futex_async/futex_noasync/'
>> > src/urcu-call-rcu-impl.h" and knot seems to work correctly with only
>> > 0.01% CPU now. I also ran tests/unit and tests/regression with default
>> > and signal backends and all completed successfully.
>> > 
>> > I think that the other two usages of futex_async are also a little
>> > suspicious, but I didn't look too closely.
>> > 
>> > Thanks,
>> > Alex.
>> > _______________________________________________
>> > lttng-dev mailing list
>> > lttng-dev@lists.lttng.org
>> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
