From: Oleg Nesterov <oleg@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	linux-kernel@vger.kernel.org, bigeasy@linutronix.de,
	juri.lelli@redhat.com, williams@redhat.com, bristot@redhat.com,
	longman@redhat.com, dave@stgolabs.net, jack@suse.com
Subject: Re: [PATCH] locking/percpu_rwsem: Rewrite to not use rwsem
Date: Tue, 6 Aug 2019 18:17:42 +0200
Message-ID: <20190806161741.GC21454@redhat.com>
In-Reply-To: <20190805140241.GI2332@hirez.programming.kicks-ass.net>

On 08/05, Peter Zijlstra wrote:
>
> Re-implement percpu_rwsem without relying on rwsem.

looks correct... But,

> +bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
>  {
>  	/*
>  	 * Due to having preemption disabled the decrement happens on
>  	 * the same CPU as the increment, avoiding the
>  	 * increment-on-one-CPU-and-decrement-on-another problem.
>  	 *
> -	 * If the reader misses the writer's assignment of readers_block, then
> -	 * the writer is guaranteed to see the reader's increment.
> +	 * If the reader misses the writer's assignment of sem->block, then the
> +	 * writer is guaranteed to see the reader's increment.
>  	 *
>  	 * Conversely, any readers that increment their sem->read_count after
> -	 * the writer looks are guaranteed to see the readers_block value,
> -	 * which in turn means that they are guaranteed to immediately
> -	 * decrement their sem->read_count, so that it doesn't matter that the
> -	 * writer missed them.
> +	 * the writer looks are guaranteed to see the sem->block value, which
> +	 * in turn means that they are guaranteed to immediately decrement
> +	 * their sem->read_count, so that it doesn't matter that the writer
> +	 * missed them.
>  	 */
>  
> +again:
>  	smp_mb(); /* A matches D */
>  
>  	/*
> -	 * If !readers_block the critical section starts here, matched by the
> +	 * If !sem->block the critical section starts here, matched by the
>  	 * release in percpu_up_write().
>  	 */
> -	if (likely(!smp_load_acquire(&sem->readers_block)))
> -		return 1;
> +	if (likely(!atomic_read_acquire(&sem->block)))
> +		return true;
>  
>  	/*
>  	 * Per the above comment; we still have preemption disabled and
>  	 * will thus decrement on the same CPU as we incremented.
>  	 */
> -	__percpu_up_read(sem);
> +	__percpu_up_read(sem); /* implies preempt_enable() */

...

>  void __percpu_up_read(struct percpu_rw_semaphore *sem)
>  {
>  	smp_mb(); /* B matches C */
> @@ -103,9 +109,10 @@ void __percpu_up_read(struct percpu_rw_s
>  	 * critical section.
>  	 */
>  	__this_cpu_dec(*sem->read_count);
> +	preempt_enable();
>  
> -	/* Prod writer to recheck readers_active */
> -	rcuwait_wake_up(&sem->writer);
> +	/* Prod writer to re-evaluate readers_active_check() */
> +	wake_up(&sem->waiters);

but this will also wake all the pending readers up. Every reader will burn
CPU for no reason and likely delay the writer.

In fact I'm afraid this can lead to a live-lock, because every reader will in
turn call __percpu_up_read().

To simplify, let's consider a single-CPU machine.

Suppose we have an active reader which sleeps somewhere and a pending writer
sleeping in wait_event(readers_active_check).

A new reader R1 comes and blocks in wait_event(!sem->block).

Another reader R2 comes and does wake_up(sem->waiters). Quite possibly R2 will
enter wait_event(!sem->block) before R1 gets the CPU.

Then it is quite possible that R1 does __this_cpu_inc() before the writer
passes wait_event(readers_active_check()), or even before the writer gets the CPU.

Now, R1 calls __percpu_up_read(), wakes R2 up, and so on.
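
To spell out the chain I am worried about, here is roughly how I read the
reader slow path from the hunks above (only a sketch, the real code surely
differs in the details):

/*
 * Sketch of the reader slow path, pieced together from the quoted hunks;
 * NOT the actual patch code. Entered from percpu_down_read() with
 * preemption disabled and *read_count already incremented.
 */
static bool percpu_down_read_slowpath(struct percpu_rw_semaphore *sem, bool try)
{
again:
	smp_mb();				/* A matches D */

	if (likely(!atomic_read_acquire(&sem->block)))
		return true;			/* reader side acquired */

	/*
	 * Undo the increment; with a single wait queue this wakes the
	 * pending writer _and_ every reader blocked in wait_event() below.
	 */
	smp_mb();				/* B matches C */
	__this_cpu_dec(*sem->read_count);
	preempt_enable();
	wake_up(&sem->waiters);

	if (try)
		return false;

	wait_event(sem->waiters, !atomic_read(&sem->block));

	preempt_disable();
	__this_cpu_inc(*sem->read_count);
	goto again;
}

Every wake_up() in there hits all the readers sleeping in wait_event(), and
each woken reader repeats the dec/wake_up dance before it can go back to
sleep.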


How about 2 wait queues?

This way we can also make percpu_up_write() more fair; it can probably check
waitqueue_active(writers) before wake_up(readers).
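
Something like this is what I have in mind; only a sketch, the writer_wait /
reader_wait names are made up and the writer-to-writer handoff (who re-takes
sem->block) is glossed over:

/* Two-waitqueue sketch; the field names are invented for illustration. */
void __percpu_up_read(struct percpu_rw_semaphore *sem)
{
	smp_mb();				/* B matches C */
	__this_cpu_dec(*sem->read_count);
	preempt_enable();

	/* Only the pending writer re-evaluates readers_active_check(). */
	wake_up(&sem->writer_wait);
}

void percpu_up_write(struct percpu_rw_semaphore *sem)
{
	atomic_set_release(&sem->block, 0);

	/*
	 * Prefer a pending writer over the blocked readers; note that
	 * waitqueue_active() needs a full barrier before it (see its
	 * kernel-doc), so real code needs more care here.
	 */
	smp_mb();
	if (waitqueue_active(&sem->writer_wait))
		wake_up(&sem->writer_wait);
	else
		wake_up_all(&sem->reader_wait);
}

This way __percpu_up_read() never disturbs the blocked readers, and
percpu_up_write() can hand off to the next writer before letting the
readers run.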

Oleg.

