On Mon, Nov 27 2017, Paul E. McKenney wrote:

> On Mon, Nov 27, 2017 at 12:27:04PM +0100, Florian Weimer wrote:
>> On 10/26/2017 02:27 PM, Paul E. McKenney wrote:
>> >But just for completeness, one way to make this work across the board
>> >might be to instead use call_rcu(), with the callback function kicking
>> >off a workqueue handler to do the rest of the unmount.  Of course,
>> >in saying that, I am ignoring any mutexes that you might be holding
>> >across this whole thing, and also ignoring any problems that might arise
>> >when returning to userspace with some portion of the unmount operation
>> >still pending.  (For example, someone unmounting a filesystem and then
>> >immediately remounting that same filesystem.)
>> 
>> You really need to complete all side effects of deallocating a
>> resource before returning to user space.  Otherwise, it will never
>> be possible to allocate and deallocate resources in a tight loop
>> because you either get spurious failures because too many
>> unaccounted deallocations are stuck somewhere in the system (and the
>> user can't tell that this is due to a race), or you get an OOM
>> because the user manages to queue up too much state.
>> 
>> We already have this problem with RLIMIT_NPROC, where waitpid etc.
>> return before the process is completely gone.  On some
>> kernels/configurations, the resulting race is so wide that parallel
>> make no longer works reliably because it runs into fork failures.
>
> Or alternatively, use rcu_barrier() occasionally to wait for all
> preceding deferred deallocations.  And there are quite a few other
> ways to take on this problem.

So, supposing we could package up everything that has to happen after
the current synchronize_rcu() and put it in a call_rcu() callback,
then instead of calling synchronize_rcu_expedited() at the end of
namespace_unlock(), we could possibly call call_rcu() there and
rcu_barrier() at the start of namespace_lock()...

That would mean a single unmount would have low impact, but it would
still slow down a sequence of 1000 consecutive unmounts.
Maybe we would only need the rcu_barrier() before select
namespace_lock() calls.  I would need to study the code closely to
form an opinion.  Interesting idea though.
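
To make the shape of that idea concrete, here is a toy userspace model.
It is only a sketch and not kernel code: every toy_* name is made up for
illustration, the "grace period" is just a singly linked list of queued
callbacks, and nothing here reflects what fs/namespace.c really does
under namespace_lock().  It only shows the ordering: the unlock side
queues the tail of the unmount and returns at once, and the next lock
waits for anything still queued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct rcu_head (hypothetical, userspace only). */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

static struct toy_rcu_head *pending_head;
static struct toy_rcu_head **pending_tail = &pending_head;
static int pending_count;	/* callbacks queued but not yet run */
static int completed_count;	/* unmounts whose deferred part has run */

/* Models call_rcu(): queue the callback and return immediately. */
static void toy_call_rcu(struct toy_rcu_head *head,
			 void (*func)(struct toy_rcu_head *head))
{
	head->func = func;
	head->next = NULL;
	*pending_tail = head;
	pending_tail = &head->next;
	pending_count++;
}

/* Models rcu_barrier(): return only after all queued callbacks ran. */
static void toy_rcu_barrier(void)
{
	while (pending_head) {
		struct toy_rcu_head *head = pending_head;

		pending_head = head->next;
		if (!pending_head)
			pending_tail = &pending_head;
		pending_count--;
		head->func(head);	/* may free head, so unlink first */
	}
}

/* Hypothetical object carrying the deferred part of one unmount. */
struct toy_mount {
	struct toy_rcu_head rcu;	/* first member, so the cast below is valid */
	int id;
};

/* Everything that would have followed synchronize_rcu(). */
static void toy_finish_unmount(struct toy_rcu_head *head)
{
	struct toy_mount *m = (struct toy_mount *)head;

	completed_count++;
	free(m);
}

/* Lock side: wait for any previously deferred unmount work. */
static void toy_namespace_lock(void)
{
	toy_rcu_barrier();
}

/* Unlock side: defer the rest of the unmount instead of waiting. */
static void toy_namespace_unlock(struct toy_mount *m)
{
	if (m)
		toy_call_rcu(&m->rcu, toy_finish_unmount);
}

/* Run n consecutive unmounts; return the largest backlog observed. */
static int toy_unmount_sequence(int n)
{
	int max_pending = 0;

	for (int i = 0; i < n; i++) {
		struct toy_mount *m = malloc(sizeof(*m));

		if (!m)
			break;
		m->id = i;
		toy_namespace_lock();
		toy_namespace_unlock(m);
		if (pending_count > max_pending)
			max_pending = pending_count;
	}
	return max_pending;
}
```

In this model the backlog never grows past one deferred unmount, which
is the property Florian was asking for, but every lock after the first
pays for the previous unmount's deferred work.  That is the "still slow
for a sequence of unmounts" cost mentioned above.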

Hopefully the _expedited() patch will be accepted - I haven't had a
"nak" yet...

thanks,
NeilBrown