From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752864AbcKRUcs (ORCPT <rfc822;w@1wt.eu>);
        Fri, 18 Nov 2016 15:32:48 -0500
Received: from mail-pf0-f196.google.com ([209.85.192.196]:34367 "EHLO
        mail-pf0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752452AbcKRUcr (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 18 Nov 2016 15:32:47 -0500
Date: Fri, 18 Nov 2016 12:33:00 -0800
From: Lance Roy <ldr709@gmail.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com,
        dipankar@in.ibm.com, akpm@linux-foundation.org,
        mathieu.desnoyers@efficios.com, josh@joshtriplett.org,
        tglx@linutronix.de, peterz@infradead.org, rostedt@goodmis.org,
        dhowells@redhat.com, edumazet@google.com, dvhart@linux.intel.com,
        fweisbec@gmail.com, oleg@redhat.com, bobby.prani@gmail.com
Subject: Re: [RFC PATCH] SRCU: More efficient reader counts.
Message-ID: <20161118123300.3eda4a78@gmail.com>
In-Reply-To: <20161118140845.GQ3612@linux.vnet.ibm.com>
References: <20161117220335.20177-1-ldr709@gmail.com>
        <20161118140845.GQ3612@linux.vnet.ibm.com>
X-Mailer: Claws Mail 3.13.2 (GTK+ 2.24.30; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 18 Nov 2016 06:08:45 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> However, let's first take a look at the overflow issue.
> 
> If a given program could have ULONG_MAX or more readers at any given
> time, there would of course be overflow.  However, each read must have
> an srcu_read_lock() outstanding, and the resulting four-byte return
> value must be stored somewhere.  Because the full address space is at
> most ULONG_MAX, the maximum number of outstanding readers is at most
> ULONG_MAX/4, even in the degenerate case where a single CPU/task invokes
> srcu_read_lock() in a tight loop.  And even this assumes that the entire
> address space can somehow be devoted to srcu_read_lock() return values.
> ULONG_MAX/4 is therefore a hard upper bound on the number of outstanding
> SRCU readers.
I agree that there should be at most ULONG_MAX/4 active readers at a time, as
long as the compiler doesn't do something crazy like noticing that
srcu_read_lock() always returns 0 or 1 and then packing the return values into
smaller variables.

> Now srcu_readers_active_idx_check() checks for strict equality between
> the number of locks and unlocks, so we can in theory tolerate ULONG_MAX-1
> readers.  So, the question is whether ULONG_MAX/4 readers can result
> in the updater seeing ULONG_MAX reads, due to memory misordering and
> other issues.
> 
> Because there are no memory barriers surrounding srcu_flip(), the updater
> could miss an extremely large number of srcu_read_unlock()s.  However,
> each missed srcu_read_unlock() must have a corresponding srcu_read_lock(),
> and there is a full memory barrier between between the srcu_flip() and
> the read of the lock count.  There is also a full barrier between any
> srcu_read_lock()'s increment of the lock count and that CPU's/task's next
> srcu_read_lock()'s fetch of the index.  Therefore, if the updater misses
> counting a given srcu_read_lock(), that CPU's/task's next srcu_read_lock()
> must see the new value of the index.  Because srcu_read_lock() disables
> preemption across the index fetch and the lock increment, there can be at
> most NR_CPUS-1 srcu_read_lock() calls that missed the recent srcu_flip()'s
> update to the index.  (I said NR_CPUS earlier, but Mathieu is correct
> in pointing out that srcu_flip() has to run somewhere.)
The trouble is that disabling preemption is not enough to ensure that there
is at most one srcu_read_lock() call per CPU that missed the srcu_flip().

Define a reader to be an SRCU lock+unlock pair. A reader is called active if it
has incremented ->lock_count[] but hasn't incremented ->unlock_count[] yet, and
completed if it has incremented ->unlock_count[]. I think that we only want to
limit the number of active readers and the number of CPUs. In particular, I
don't think there should be a limit on the rate of completion of read side
critical section.

The goal of srcu_readers_active_idx_check() is to verify that there were zero
active readers on the inactive index at some time during its execution. To do
this, it totals the unlock counts, executes a memory barrier, totals the lock
counts, and takes the difference. This difference counts the readers that are
active when srcu_readers_lock_idx() gets to their CPU, plus the readers that
completed after srcu_readers_unlock_idx() and before srcu_readers_lock_idx().
If the true (infinite precision) value of the difference is zero, then there
were no active readers at some point while srcu_readers_lock_idx() is running.
However, the difference is only stored in a long, so there is a potential for
overflow if too many readers complete during srcu_readers_active_idx_check().

For example, let's say there are three threads, each running on their own CPU:

int data, flag;
struct srcu_struct *sp = /* ... */;

Thread 0:
	data = 1;
	synchronize_srcu(sp);
	flag = 1;

Thread 1:
	int data_r, flag_r;
	int idx = srcu_read_lock(sp);
	data_r = data;
	flag_r = flag;
	srcu_read_unlock(sp, idx);
	BUG_ON(flag_r == 1 && data_r == 0);

Thread 2:
	while (true) {
		int idx = srcu_read_lock(sp);
		srcu_read_unlock(sp, idx);
	}

Let's take the following execution order. Thread 1 increments
the CPU 1 version of sp->lock_count[0], sets idx to zero, and loads data (0)
into data_r. Thread 0 then sets data to be 1, verifies that there are no
readers on index 1, and increments sp->completed, but the CPU actually doesn't
preform the last operation, putting it off until the next memory barrier. Thread
0 then calls srcu_readers_active_idx_check() on index 0, which runs
srcu_readers_unlock_idx() (returning 0). Right after srcu_readers_unlock_idx()
completes, thread 2 starts running. Since Thread 0 hasn't actually incremented
sp->completed in any way that is visible to thread 2, srcu_read_lock() will
still return 0. Thread 2 can then run for ULONG_MAX iterations, setting
the CPU 2 version of sp->unlock_count[0] to ULONG_MAX. CPU 0 then finally gets
around to incrementing sp->completed, runs its memory barrier, and then reads
the lock counts: 1, 0, ULONG_MAX. The total of ULONG_MAX+1 will overflow to 0
and compare equal with earlier unlock count. Thread 0 will then think that the
grace period is over and set flag to one. Thread 1 can then read flag (1) into
flag_r and run srcu_read_unlock(). The BUG_ON statement will then fail.

Although ULONG_MAX readers completed during srcu_readers_active_idx_check(),
there were at most 2 active readers at any time, so this program doesn't run
into any limit.

I hope that was clear enough.

Thanks,
Lance