* [PATCHSET] percpu_ref, RCU: Audit RCU usages in percpu_ref users
@ 2018-03-06 17:26 Tejun Heo
  2018-03-06 17:33 ` [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2018-03-06 17:26 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team

Hello,

Jann Horn found that aio was depending on the internal RCU grace
periods of percpu_ref and that this is broken because aio uses regular
RCU while percpu_ref uses sched-RCU.

Depending on percpu_ref's internal grace periods isn't a good idea
because

 * The RCU type might not match.

 * percpu_ref's grace periods are used to switch to atomic mode.  They
   don't fall between the last put and the invocation of the release
   callback.  This is easy to get confused about and can lead to
   subtle bugs.

 * percpu_ref might not have grace periods at all depending on its
   current operation mode.
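
For illustration, here is a minimal sketch of the broken pattern (the
struct and function names are hypothetical, not from any particular
patched subsystem):

  /* reader side, regular RCU */
  rcu_read_lock();
  obj = idr_find(&obj_idr, id);         /* RCU protected lookup */
  if (obj && !percpu_ref_tryget_live(&obj->ref))
          obj = NULL;
  rcu_read_unlock();

  /* broken free path: relies solely on percpu_ref's internal
   * sched-RCU grace period, which doesn't synchronize with the
   * regular-RCU reader above */
  static void obj_release(struct percpu_ref *ref)
  {
          kfree(container_of(ref, struct obj, ref));
  }

  /* fixed: insert an explicit regular RCU grace period */
  static void obj_release(struct percpu_ref *ref)
  {
          struct obj *obj = container_of(ref, struct obj, ref);

          call_rcu(&obj->rcu_head, obj_free_rcufn);
  }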

This patchset audits all percpu_ref users for their RCU usages,
clarifies percpu_ref documentation that the internal grace periods
must not be depended upon, and introduces rcu_work to simplify
bouncing to a workqueue after an RCU grace period.

This patchset contains the following seven patches.

 0001-fs-aio-Add-explicit-RCU-grace-period-when-freeing-ki.patch
 0002-fs-aio-Use-RCU-accessors-for-kioctx_table-table.patch
 0003-RDMAVT-Fix-synchronization-around-percpu_ref.patch
 0004-HMM-Remove-superfluous-RCU-protection-around-radix-tr.patch
 0005-block-Remove-superfluous-rcu_read_-un-lock_sched-in-b.patch
 0006-percpu_ref-Update-doc-to-dissuade-users-from-dependi.patch
 0007-RCU-workqueue-Implement-rcu_work.patch

0001-0003 are fixes and tagged -stable.

0004-0005 remove (seemingly) superfluous RCU read lock usages.

0006 updates the doc and 0007 introduces rcu_work.
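
With 0007 applied, the bouncing reduces to the following (sketch taken
from the fs/aio conversion in that patch):

  INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
  queue_rcu_work(system_wq, &ctx->free_rwork);

The work item is queued on system_wq only after a regular RCU grace
period has elapsed.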

This patchset is also available in the following git tree.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git percpu_ref-rcu-audit

diffstat follows.  Thanks.

 block/blk-core.c                  |    2 -
 drivers/infiniband/sw/rdmavt/mr.c |   10 ++++---
 fs/aio.c                          |   39 ++++++++++++++++-----------
 include/linux/cgroup-defs.h       |    2 -
 include/linux/percpu-refcount.h   |   18 ++++++++----
 include/linux/workqueue.h         |   38 ++++++++++++++++++++++++++
 kernel/cgroup/cgroup.c            |   21 ++++----------
 kernel/workqueue.c                |   54 ++++++++++++++++++++++++++++++++++++++
 lib/percpu-refcount.c             |    2 +
 mm/hmm.c                          |   12 +-------
 10 files changed, 145 insertions(+), 53 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx
  2018-03-06 17:26 [PATCHSET] percpu_ref, RCU: Audit RCU usages in percpu_ref users Tejun Heo
@ 2018-03-06 17:33 ` Tejun Heo
  2018-03-06 17:33   ` [PATCH 2/7] fs/aio: Use RCU accessors for kioctx_table->table[] Tejun Heo
                     ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Tejun Heo @ 2018-03-06 17:33 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo, stable

While fixing refcounting, e34ecee2ae79 ("aio: Fix a trinity splat")
incorrectly removed the explicit RCU grace period before freeing kioctx.
The intention seems to have been to depend on the internal RCU grace
periods of percpu_ref; however, percpu_ref uses a different flavor of
RCU, sched-RCU.  This can lead to the kioctx being freed while regular
RCU protected dereferences are still in progress.

Fix it by updating free_ioctx() to go through call_rcu() explicitly.
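
An illustrative interleaving of the use-after-free (lookup_ioctx() is
the real RCU protected reader; the timing shown is hypothetical):

  CPU 0 (lookup_ioctx)                  CPU 1 (teardown)
  rcu_read_lock();
  ctx = table->table[id];               percpu_ref_kill(&ctx->users);
                                        /* only percpu_ref's internal
                                         * sched-RCU grace period */
                                        free_ioctx();   /* frees ctx */
  percpu_ref_get(&ctx->users);          /* use-after-free */
  rcu_read_unlock();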

v2: Comment added to explain double bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: e34ecee2ae79 ("aio: Fix a trinity splat")
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
---
 fs/aio.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index a062d75..eb2e0cf 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -115,7 +115,8 @@ struct kioctx {
 	struct page		**ring_pages;
 	long			nr_pages;
 
-	struct work_struct	free_work;
+	struct rcu_head		free_rcu;
+	struct work_struct	free_work;	/* see free_ioctx() */
 
 	/*
 	 * signals when all in-flight requests are done
@@ -588,6 +589,12 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
 	return cancel(&kiocb->common);
 }
 
+/*
+ * free_ioctx() should be RCU delayed to synchronize against the RCU
+ * protected lookup_ioctx() and also needs process context to call
+ * aio_free_ring(), so the double bouncing through kioctx->free_rcu and
+ * ->free_work.
+ */
 static void free_ioctx(struct work_struct *work)
 {
 	struct kioctx *ctx = container_of(work, struct kioctx, free_work);
@@ -601,6 +608,14 @@ static void free_ioctx(struct work_struct *work)
 	kmem_cache_free(kioctx_cachep, ctx);
 }
 
+static void free_ioctx_rcufn(struct rcu_head *head)
+{
+	struct kioctx *ctx = container_of(head, struct kioctx, free_rcu);
+
+	INIT_WORK(&ctx->free_work, free_ioctx);
+	schedule_work(&ctx->free_work);
+}
+
 static void free_ioctx_reqs(struct percpu_ref *ref)
 {
 	struct kioctx *ctx = container_of(ref, struct kioctx, reqs);
@@ -609,8 +624,8 @@ static void free_ioctx_reqs(struct percpu_ref *ref)
 	if (ctx->rq_wait && atomic_dec_and_test(&ctx->rq_wait->count))
 		complete(&ctx->rq_wait->comp);
 
-	INIT_WORK(&ctx->free_work, free_ioctx);
-	schedule_work(&ctx->free_work);
+	/* Synchronize against RCU protected table->table[] dereferences */
+	call_rcu(&ctx->free_rcu, free_ioctx_rcufn);
 }
 
 /*
@@ -838,7 +853,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 	table->table[ctx->id] = NULL;
 	spin_unlock(&mm->ioctx_lock);
 
-	/* percpu_ref_kill() will do the necessary call_rcu() */
+	/* free_ioctx_reqs() will do the necessary RCU synchronization */
 	wake_up_all(&ctx->wait);
 
 	/*
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/7] fs/aio: Use RCU accessors for kioctx_table->table[]
  2018-03-06 17:33 ` [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
@ 2018-03-06 17:33   ` Tejun Heo
  2018-03-06 17:33   ` [PATCH 3/7] RDMAVT: Fix synchronization around percpu_ref Tejun Heo
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2018-03-06 17:33 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo, stable

While converting ioctx index from a list to a table, db446a08c23d
("aio: convert the ioctx list to table lookup v3") missed tagging
kioctx_table->table[] as an array of RCU pointers and using the
appropriate RCU accessors.  This introduces a small window in the
lookup path where init and access may race.

Mark kioctx_table->table[] with __rcu and use the appropriate RCU
accessors when using the field.
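
A condensed view of what the accessors buy (both calls appear in the
patch below):

  /* publish: orders kioctx initialization before the pointer store */
  rcu_assign_pointer(table->table[i], ctx);

  /* lookup: pairs with the publish above */
  ctx = rcu_dereference(table->table[id]);
  if (ctx && ctx->user_id == ctx_id)
          percpu_ref_get(&ctx->users);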

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3")
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
---
 fs/aio.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index eb2e0cf..6bcd3fb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -68,9 +68,9 @@ struct aio_ring {
 #define AIO_RING_PAGES	8
 
 struct kioctx_table {
-	struct rcu_head	rcu;
-	unsigned	nr;
-	struct kioctx	*table[];
+	struct rcu_head		rcu;
+	unsigned		nr;
+	struct kioctx __rcu	*table[];
 };
 
 struct kioctx_cpu {
@@ -330,7 +330,7 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
 	for (i = 0; i < table->nr; i++) {
 		struct kioctx *ctx;
 
-		ctx = table->table[i];
+		ctx = rcu_dereference(table->table[i]);
 		if (ctx && ctx->aio_ring_file == file) {
 			if (!atomic_read(&ctx->dead)) {
 				ctx->user_id = ctx->mmap_base = vma->vm_start;
@@ -666,9 +666,9 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
 	while (1) {
 		if (table)
 			for (i = 0; i < table->nr; i++)
-				if (!table->table[i]) {
+				if (!rcu_access_pointer(table->table[i])) {
 					ctx->id = i;
-					table->table[i] = ctx;
+					rcu_assign_pointer(table->table[i], ctx);
 					spin_unlock(&mm->ioctx_lock);
 
 					/* While kioctx setup is in progress,
@@ -849,8 +849,8 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 	}
 
 	table = rcu_dereference_raw(mm->ioctx_table);
-	WARN_ON(ctx != table->table[ctx->id]);
-	table->table[ctx->id] = NULL;
+	WARN_ON(ctx != rcu_access_pointer(table->table[ctx->id]));
+	RCU_INIT_POINTER(table->table[ctx->id], NULL);
 	spin_unlock(&mm->ioctx_lock);
 
 	/* free_ioctx_reqs() will do the necessary RCU synchronization */
@@ -895,7 +895,8 @@ void exit_aio(struct mm_struct *mm)
 
 	skipped = 0;
 	for (i = 0; i < table->nr; ++i) {
-		struct kioctx *ctx = table->table[i];
+		struct kioctx *ctx =
+			rcu_dereference_protected(table->table[i], true);
 
 		if (!ctx) {
 			skipped++;
@@ -1084,7 +1085,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 	if (!table || id >= table->nr)
 		goto out;
 
-	ctx = table->table[id];
+	ctx = rcu_dereference(table->table[id]);
 	if (ctx && ctx->user_id == ctx_id) {
 		percpu_ref_get(&ctx->users);
 		ret = ctx;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/7] RDMAVT: Fix synchronization around percpu_ref
  2018-03-06 17:33 ` [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
  2018-03-06 17:33   ` [PATCH 2/7] fs/aio: Use RCU accessors for kioctx_table->table[] Tejun Heo
@ 2018-03-06 17:33   ` Tejun Heo
  2018-03-07 15:39     ` Dennis Dalessandro
  2018-03-06 17:33     ` Tejun Heo
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2018-03-06 17:33 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo,
	Dennis Dalessandro, Mike Marciniszyn, linux-rdma

rvt_mregion uses percpu_ref for reference counting and RCU to protect
accesses via lkey_table.  When an rvt_mregion needs to be freed, it
first gets unregistered from lkey_table and then rvt_check_refs() is
called to wait for in-flight usages before the rvt_mregion is freed.

rvt_check_refs() seems to have a couple of issues.

* It has a fast exit path which tests percpu_ref_is_zero().  However,
  a percpu_ref reading zero doesn't mean that the object can be
  released.  In fact, the ->release() callback might not even have
  started executing yet.  Proceeding with freeing can lead to
  use-after-free.

* lkey_table is RCU protected but there is no RCU grace period in the
  free path.  percpu_ref uses RCU internally but it's sched-RCU whose
  grace periods are different from regular RCU.  Also, it generally
  isn't a good idea to depend on internal behaviors like this.

To address the above issues, this patch removes the the fast exit and
adds an explicit synchronize_rcu().
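
An illustrative interleaving for the removed fast path (the timing is
hypothetical):

  CPU 0                                 CPU 1 (rvt_check_refs)
  percpu_ref_put(&mr->refcount);
    /* count is now zero but
     * ->release() hasn't started
     * (or finished) running yet */
                                        if (percpu_ref_is_zero(&mr->refcount))
                                                return 0;
                                        /* caller frees @mr while
                                         * ->release() is still pending:
                                         * use-after-free */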

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: linux-rdma@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
Hello, Dennis, Mike.

I don't know RDMA at all and this patch is only compile tested.  Can
you please take a careful look?

Thanks.

 drivers/infiniband/sw/rdmavt/mr.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/sw/rdmavt/mr.c b/drivers/infiniband/sw/rdmavt/mr.c
index 1b2e536..cc429b5 100644
--- a/drivers/infiniband/sw/rdmavt/mr.c
+++ b/drivers/infiniband/sw/rdmavt/mr.c
@@ -489,11 +489,13 @@ static int rvt_check_refs(struct rvt_mregion *mr, const char *t)
 	unsigned long timeout;
 	struct rvt_dev_info *rdi = ib_to_rvt(mr->pd->device);
 
-	if (percpu_ref_is_zero(&mr->refcount))
-		return 0;
-	/* avoid dma mr */
-	if (mr->lkey)
+	if (mr->lkey) {
+		/* avoid dma mr */
 		rvt_dereg_clean_qps(mr);
+		/* @mr was indexed on rcu protected @lkey_table */
+		synchronize_rcu();
+	}
+
 	timeout = wait_for_completion_timeout(&mr->comp, 5 * HZ);
 	if (!timeout) {
 		rvt_pr_err(rdi,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/7] HMM: Remove superfluous RCU protection around radix tree lookup
  2018-03-06 17:33 ` [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
@ 2018-03-06 17:33     ` Tejun Heo
  2018-03-06 17:33   ` [PATCH 3/7] RDMAVT: Fix synchronization around percpu_ref Tejun Heo
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2018-03-06 17:33 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo,
	Jérôme Glisse, linux-mm

hmm_devmem_find() requires rcu_read_lock_held() but there's nothing
which actually uses the RCU protection.  The only caller is
hmm_devmem_pages_create() which already grabs the mutex and does
superfluous rcu_read_lock/unlock() around the function.

This doesn't add anything and just adds to confusion.  Remove the RCU
protection and open-code the radix tree lookup.  If this needs to
become more sophisticated in the future, let's add them back when
necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
Hello, Jérôme.

This came up while auditing percpu_ref users for missing explicit RCU
grace periods.  HMM doesn't seem to depend on RCU protection at all,
so I thought it'd be better to remove it for now.  It's only compile
tested.

Thanks.

 mm/hmm.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98..d4627c5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -845,13 +845,6 @@ static void hmm_devmem_release(struct device *dev, void *data)
 	hmm_devmem_radix_release(resource);
 }
 
-static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
-{
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
-}
-
 static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -892,9 +885,8 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
 		struct hmm_devmem *dup;
 
-		rcu_read_lock();
-		dup = hmm_devmem_find(key);
-		rcu_read_unlock();
+		dup = radix_tree_lookup(&hmm_devmem_radix,
+					key >> PA_SECTION_SHIFT);
 		if (dup) {
 			dev_err(device, "%s: collides with mapping for %s\n",
 				__func__, dev_name(dup->device));
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter()
  2018-03-06 17:33 ` [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
                     ` (2 preceding siblings ...)
  2018-03-06 17:33     ` Tejun Heo
@ 2018-03-06 17:33   ` Tejun Heo
  2018-03-06 17:52     ` Bart Van Assche
  2018-03-06 17:33   ` [PATCH 6/7] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods Tejun Heo
  2018-03-06 17:33   ` [PATCH 7/7] RCU, workqueue: Implement rcu_work Tejun Heo
  5 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2018-03-06 17:33 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo, Bart Van Assche,
	Jens Axboe

3a0a529971ec ("block, scsi: Make SCSI quiesce and resume work
reliably") added rcu_read_[un]lock_sched() to blk_queue_enter() along
with other changes but it doesn't seem to be doing anything.

blk_queue_enter() is called with @q - the pointer to the target
request_queue, so the caller obviously has to guarantee that @q can be
dereferenced, and inside the RCU-sched protected area, there's nothing
which needs further protection.

Let's remove the superfluous rcu_read_[un]lock_sched().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
Hello, Bart.

This came up while auditing percpu_ref users for problematic RCU
usages.  I couldn't understand what the RCU protection was doing.
It'd be great if you can take a look and tell me whether I missed
something.

Thanks.

 block/blk-core.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d82c4f..e3b4d1849 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -827,7 +827,6 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		bool success = false;
 		int ret;
 
-		rcu_read_lock_sched();
 		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
 			/*
 			 * The code that sets the PREEMPT_ONLY flag is
@@ -840,7 +839,6 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 				percpu_ref_put(&q->q_usage_counter);
 			}
 		}
-		rcu_read_unlock_sched();
 
 		if (success)
 			return 0;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/7] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods
  2018-03-06 17:33 ` [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
                     ` (3 preceding siblings ...)
  2018-03-06 17:33   ` [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter() Tejun Heo
@ 2018-03-06 17:33   ` Tejun Heo
  2018-03-06 17:33   ` [PATCH 7/7] RCU, workqueue: Implement rcu_work Tejun Heo
  5 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2018-03-06 17:33 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo

percpu_ref internally uses sched-RCU to implement the percpu -> atomic
mode switching and the documentation suggested that this could be
depended upon.  This doesn't seem like a good idea.

* percpu_ref uses sched-RCU, whose grace periods are different from
  those of regular RCU.  Users may combine percpu_ref with regular RCU
  usage and incorrectly believe that regular RCU grace periods are
  performed by percpu_ref.  This can lead to, for example,
  use-after-free due to premature freeing.

* percpu_ref has a grace period when switching from percpu to atomic
  mode.  It doesn't have one between the last put and release.  This
  distinction is subtle and can lead to surprising bugs.

* percpu_ref allows starting in and switching to atomic mode manually
  for debugging and other purposes.  This means that there may not be
  any grace periods from kill to release.

This patch makes it clear that the grace periods are percpu_ref's
internal implementation detail and can't be depended upon by the
users.
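
When a user needs a regular RCU grace period in the shutdown path, it
has to provide one explicitly, e.g. (sketch, hypothetical names):

  /* unpublish from the RCU protected lookup structure, then wait */
  remove_obj_from_lookup_structure(obj);
  synchronize_rcu();    /* no new RCU reader can find @obj now;
                         * in-flight refs are still tracked by
                         * the percpu_ref itself */
  percpu_ref_kill(&obj->ref);

The asynchronous equivalent is a call_rcu() bounce in the free path,
as fs/aio now does (0001).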

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/percpu-refcount.h | 18 ++++++++++++------
 lib/percpu-refcount.c           |  2 ++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 864d167..009cdf3 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -30,10 +30,14 @@
  * calls io_destroy() or the process exits.
  *
  * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
- * calls percpu_ref_kill(), then hlist_del_rcu() and synchronize_rcu() to remove
- * the kioctx from the proccess's list of kioctxs - after that, there can't be
- * any new users of the kioctx (from lookup_ioctx()) and it's then safe to drop
- * the initial ref with percpu_ref_put().
+ * removes the kioctx from the proccess's table of kioctxs and kills percpu_ref.
+ * After that, there can't be any new users of the kioctx (from lookup_ioctx())
+ * and it's then safe to drop the initial ref with percpu_ref_put().
+ *
+ * Note that the free path, free_ioctx(), needs to go through explicit call_rcu()
+ * to synchronize with RCU protected lookup_ioctx().  percpu_ref operations don't
+ * imply RCU grace periods of any kind and if a user wants to combine percpu_ref
+ * with RCU protection, it must be done explicitly.
  *
  * Code that does a two stage shutdown like this often needs some kind of
  * explicit synchronization to ensure the initial refcount can only be dropped
@@ -113,8 +117,10 @@ void percpu_ref_reinit(struct percpu_ref *ref);
  * Must be used to drop the initial ref on a percpu refcount; must be called
  * precisely once before shutdown.
  *
- * Puts @ref in non percpu mode, then does a call_rcu() before gathering up the
- * percpu counters and dropping the initial ref.
+ * Switches @ref into atomic mode before gathering up the percpu counters
+ * and dropping the initial ref.
+ *
+ * There are no implied RCU grace periods between kill and release.
  */
 static inline void percpu_ref_kill(struct percpu_ref *ref)
 {
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 30e7dd8..9f96fa7 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -322,6 +322,8 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
  * This function normally doesn't block and can be called from any context
  * but it may block if @confirm_kill is specified and @ref is in the
  * process of switching to atomic mode by percpu_ref_switch_to_atomic().
+ *
+ * There are no implied RCU grace periods between kill and release.
  */
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-06 17:33 ` [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
                     ` (4 preceding siblings ...)
  2018-03-06 17:33   ` [PATCH 6/7] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods Tejun Heo
@ 2018-03-06 17:33   ` Tejun Heo
  2018-03-06 18:30     ` Linus Torvalds
  2018-03-07  2:49     ` Lai Jiangshan
  5 siblings, 2 replies; 26+ messages in thread
From: Tejun Heo @ 2018-03-06 17:33 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo

There are cases where an RCU callback needs to be bounced to a sleepable
context.  This is currently done by the RCU callback queueing a work
item, which can be cumbersome to write and confusing to read.

This patch introduces rcu_work, a workqueue work variant which gets
executed after an RCU grace period, and converts the open-coded
bouncing in fs/aio and kernel/cgroup.
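
The conversion shrinks the usual open-coded pattern (hypothetical
names) down to its last two lines:

  /* before: bounce by hand */
  static void obj_free_rcufn(struct rcu_head *head)
  {
          struct obj *obj = container_of(head, struct obj, rcu_head);

          INIT_WORK(&obj->free_work, obj_free_workfn);
          schedule_work(&obj->free_work);
  }
  ...
  call_rcu(&obj->rcu_head, obj_free_rcufn);

  /* after */
  INIT_RCU_WORK(&obj->free_rwork, obj_free_workfn);
  queue_rcu_work(system_wq, &obj->free_rwork);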

v2: Use rcu_barrier() instead of synchronize_rcu() to wait for
    completion of previously queued rcu callback as per Paul.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/aio.c                    | 21 +++++-------------
 include/linux/cgroup-defs.h |  2 +-
 include/linux/workqueue.h   | 38 +++++++++++++++++++++++++++++++
 kernel/cgroup/cgroup.c      | 21 ++++++------------
 kernel/workqueue.c          | 54 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 106 insertions(+), 30 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6bcd3fb..88d7927 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -115,8 +115,7 @@ struct kioctx {
 	struct page		**ring_pages;
 	long			nr_pages;
 
-	struct rcu_head		free_rcu;
-	struct work_struct	free_work;	/* see free_ioctx() */
+	struct rcu_work		free_rwork;	/* see free_ioctx() */
 
 	/*
 	 * signals when all in-flight requests are done
@@ -592,13 +591,12 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
 /*
  * free_ioctx() should be RCU delayed to synchronize against the RCU
  * protected lookup_ioctx() and also needs process context to call
- * aio_free_ring(), so the double bouncing through kioctx->free_rcu and
- * ->free_work.
+ * aio_free_ring().  Use rcu_work.
  */
 static void free_ioctx(struct work_struct *work)
 {
-	struct kioctx *ctx = container_of(work, struct kioctx, free_work);
-
+	struct kioctx *ctx = container_of(to_rcu_work(work), struct kioctx,
+					  free_rwork);
 	pr_debug("freeing %p\n", ctx);
 
 	aio_free_ring(ctx);
@@ -608,14 +606,6 @@ static void free_ioctx(struct work_struct *work)
 	kmem_cache_free(kioctx_cachep, ctx);
 }
 
-static void free_ioctx_rcufn(struct rcu_head *head)
-{
-	struct kioctx *ctx = container_of(head, struct kioctx, free_rcu);
-
-	INIT_WORK(&ctx->free_work, free_ioctx);
-	schedule_work(&ctx->free_work);
-}
-
 static void free_ioctx_reqs(struct percpu_ref *ref)
 {
 	struct kioctx *ctx = container_of(ref, struct kioctx, reqs);
@@ -625,7 +615,8 @@ static void free_ioctx_reqs(struct percpu_ref *ref)
 		complete(&ctx->rq_wait->comp);
 
 	/* Synchronize against RCU protected table->table[] dereferences */
-	call_rcu(&ctx->free_rcu, free_ioctx_rcufn);
+	INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
+	queue_rcu_work(system_wq, &ctx->free_rwork);
 }
 
 /*
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 9f242b8..92d7640 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -151,8 +151,8 @@ struct cgroup_subsys_state {
 	atomic_t online_cnt;
 
 	/* percpu_ref killing and RCU release */
-	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
+	struct rcu_work destroy_rwork;
 
 	/*
 	 * PI: the parent css.	Placed here for cache proximity to following
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index bc0cda1..b39f3a4 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -13,6 +13,7 @@
 #include <linux/threads.h>
 #include <linux/atomic.h>
 #include <linux/cpumask.h>
+#include <linux/rcupdate.h>
 
 struct workqueue_struct;
 
@@ -120,6 +121,15 @@ struct delayed_work {
 	int cpu;
 };
 
+struct rcu_work {
+	struct work_struct work;
+	struct rcu_head rcu;
+
+	/* target workqueue and CPU ->rcu uses to queue ->work */
+	struct workqueue_struct *wq;
+	int cpu;
+};
+
 /**
  * struct workqueue_attrs - A struct for workqueue attributes.
  *
@@ -151,6 +161,11 @@ static inline struct delayed_work *to_delayed_work(struct work_struct *work)
 	return container_of(work, struct delayed_work, work);
 }
 
+static inline struct rcu_work *to_rcu_work(struct work_struct *work)
+{
+	return container_of(work, struct rcu_work, work);
+}
+
 struct execute_work {
 	struct work_struct work;
 };
@@ -266,6 +281,12 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
 #define INIT_DEFERRABLE_WORK_ONSTACK(_work, _func)			\
 	__INIT_DELAYED_WORK_ONSTACK(_work, _func, TIMER_DEFERRABLE)
 
+#define INIT_RCU_WORK(_work, _func)					\
+	INIT_WORK(&(_work)->work, (_func))
+
+#define INIT_RCU_WORK_ONSTACK(_work, _func)				\
+	INIT_WORK_ONSTACK(&(_work)->work, (_func))
+
 /**
  * work_pending - Find out whether a work item is currently pending
  * @work: The work item in question
@@ -447,6 +468,8 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *work, unsigned long delay);
 extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *dwork, unsigned long delay);
+extern bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
+			struct rcu_work *rwork);
 
 extern void flush_workqueue(struct workqueue_struct *wq);
 extern void drain_workqueue(struct workqueue_struct *wq);
@@ -463,6 +486,8 @@ extern bool flush_delayed_work(struct delayed_work *dwork);
 extern bool cancel_delayed_work(struct delayed_work *dwork);
 extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
 
+extern bool flush_rcu_work(struct rcu_work *rwork);
+
 extern void workqueue_set_max_active(struct workqueue_struct *wq,
 				     int max_active);
 extern struct work_struct *current_work(void);
@@ -520,6 +545,19 @@ static inline bool mod_delayed_work(struct workqueue_struct *wq,
 }
 
 /**
+ * queue_rcu_work - queue work on a workqueue after a RCU grace period
+ * @wq: workqueue to use
+ * @rwork: RCU work to queue
+ *
+ * Equivalent to queue_rcu_work_on() but tries to use the local CPU.
+ */
+static inline bool queue_rcu_work(struct workqueue_struct *wq,
+				  struct rcu_work *rwork)
+{
+	return queue_rcu_work_on(WORK_CPU_UNBOUND, wq, rwork);
+}
+
+/**
  * schedule_work_on - put work task on a specific cpu
  * @cpu: cpu to put the work task on
  * @work: job to be done
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8cda3bc..4c5d4ca0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4514,10 +4514,10 @@ static struct cftype cgroup_base_files[] = {
  * and thus involve punting to css->destroy_work adding two additional
  * steps to the already complex sequence.
  */
-static void css_free_work_fn(struct work_struct *work)
+static void css_free_rwork_fn(struct work_struct *work)
 {
-	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+	struct cgroup_subsys_state *css = container_of(to_rcu_work(work),
+				struct cgroup_subsys_state, destroy_rwork);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
@@ -4563,15 +4563,6 @@ static void css_free_work_fn(struct work_struct *work)
 	}
 }
 
-static void css_free_rcu_fn(struct rcu_head *rcu_head)
-{
-	struct cgroup_subsys_state *css =
-		container_of(rcu_head, struct cgroup_subsys_state, rcu_head);
-
-	INIT_WORK(&css->destroy_work, css_free_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
-}
-
 static void css_release_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
@@ -4621,7 +4612,8 @@ static void css_release_work_fn(struct work_struct *work)
 
 	mutex_unlock(&cgroup_mutex);
 
-	call_rcu(&css->rcu_head, css_free_rcu_fn);
+	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
+	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
 }
 
 static void css_release(struct percpu_ref *ref)
@@ -4755,7 +4747,8 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 err_list_del:
 	list_del_rcu(&css->sibling);
 err_free_css:
-	call_rcu(&css->rcu_head, css_free_rcu_fn);
+	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
+	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
 	return ERR_PTR(err);
 }
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bb9a519..e26c2f4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1604,6 +1604,40 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(mod_delayed_work_on);
 
+static void rcu_work_rcufn(struct rcu_head *rcu)
+{
+	struct rcu_work *rwork = container_of(rcu, struct rcu_work, rcu);
+
+	/* read the comment in __queue_work() */
+	local_irq_disable();
+	__queue_work(rwork->cpu, rwork->wq, &rwork->work);
+	local_irq_enable();
+}
+
+/**
+ * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
+ * @cpu: CPU number to execute work on
+ * @wq: workqueue to use
+ * @rwork: work to queue
+ *
+ * Return: %false if @work was already on a queue, %true otherwise.
+ */
+bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
+		       struct rcu_work *rwork)
+{
+	struct work_struct *work = &rwork->work;
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		rwork->wq = wq;
+		rwork->cpu = cpu;
+		call_rcu(&rwork->rcu, rcu_work_rcufn);
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL(queue_rcu_work_on);
+
 /**
  * worker_enter_idle - enter idle state
  * @worker: worker which is entering idle state
@@ -3001,6 +3035,26 @@ bool flush_delayed_work(struct delayed_work *dwork)
 }
 EXPORT_SYMBOL(flush_delayed_work);
 
+/**
+ * flush_rcu_work - wait for a rwork to finish executing the last queueing
+ * @rwork: the rcu work to flush
+ *
+ * Return:
+ * %true if flush_rcu_work() waited for the work to finish execution,
+ * %false if it was already idle.
+ */
+bool flush_rcu_work(struct rcu_work *rwork)
+{
+	if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
+		rcu_barrier();
+		flush_work(&rwork->work);
+		return true;
+	} else {
+		return flush_work(&rwork->work);
+	}
+}
+EXPORT_SYMBOL(flush_rcu_work);
+
 static bool __cancel_work(struct work_struct *work, bool is_dwork)
 {
 	unsigned long flags;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter()
  2018-03-06 17:33   ` [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter() Tejun Heo
@ 2018-03-06 17:52     ` Bart Van Assche
  2018-03-14 18:46       ` tj
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2018-03-06 17:52 UTC (permalink / raw)
  To: torvalds, tj, viro, jannh, paulmck, bcrl, kent.overstreet
  Cc: security, linux-kernel, kernel-team, axboe

On Tue, 2018-03-06 at 09:33 -0800, Tejun Heo wrote:
> 3a0a529971ec ("block, scsi: Make SCSI quiesce and resume work
> reliably") added rcu_read_[un]lock_sched() to blk_queue_enter() along
> with other changes but it doesn't seem to be doing anything.
> 
> blk_queue_enter() is called with @q - the pointer to the target
> request_queue, so the caller obviously has to guarantee that @q can be
> dereferenced, and inside the RCU-sched protected area, there's nothing
> which needs further protection.
> 
> Let's remove the superfluous rcu_read_[un]lock_sched().
> 
> [ ... ]
>
> This came up while auditing percpu_ref users for problematic RCU
> usages.  I couldn't understand what the RCU protection was doing.
> It'd be great if you can take a look and tell me whether I missed
> something.

Hello Tejun,

I think the rcu_read_lock() and rcu_read_unlock() are really necessary in this
code.  The LWN article "The RCU-barrier menagerie"
(https://lwn.net/Articles/573497/) explains that RCU can be used to enforce
write ordering globally, provided the code that observes the ordered writes
does so inside an RCU read lock.  See also the following comment in
scsi_device_quiesce():

	/*
	 * Ensure that the effect of blk_set_preempt_only() will be visible
	 * for percpu_ref_tryget() callers that occur after the queue
	 * unfreeze even if the queue was already frozen before this function
	 * was called. See also https://lwn.net/Articles/573497/.
	 */
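
The pattern from that article, condensed (illustrative variables):

	/* updater */
	WRITE_ONCE(x, 1);
	synchronize_rcu();
	WRITE_ONCE(y, 1);

	/* reader, all within one RCU read-side critical section */
	rcu_read_lock();
	r1 = READ_ONCE(y);
	r2 = READ_ONCE(x);
	rcu_read_unlock();

	/* r1 == 1 implies r2 == 1 */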

Since this patch introduces a subtle and hard-to-debug race condition, please
drop this patch.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] HMM: Remove superfluous RCU protection around radix tree lookup
  2018-03-06 17:33     ` Tejun Heo
@ 2018-03-06 17:59       ` Jerome Glisse
  0 siblings, 0 replies; 26+ messages in thread
From: Jerome Glisse @ 2018-03-06 17:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team, linux-mm

On Tue, Mar 06, 2018 at 09:33:13AM -0800, Tejun Heo wrote:
> hmm_devmem_find() requires rcu_read_lock_held() but there's nothing
> which actually uses the RCU protection.  The only caller is
> hmm_devmem_pages_create() which already grabs the mutex and does
> superfluous rcu_read_lock/unlock() around the function.
> 
> This doesn't add anything and just adds to confusion.  Remove the RCU
> protection and open-code the radix tree lookup.  If this needs to
> become more sophisticated in the future, let's add them back when
> necessary.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> Cc: linux-mm@kvack.org
> Cc: Linus Torvalds <torvalds@linux-foundation.org>

> ---
> Hello, Jérôme.
> 
> This came up while auditing percpu_ref users for missing explicit RCU
> grace periods.  HMM doesn't seem to depend on RCU protection at all,
> so I thought it'd be better to remove it for now.  It's only compile
> tested.

Good catch, some leftover of the old logic.  I have more cleanups queued
up now that I am about to post the nouveau patches that use all this.
Thanks for fixing this.

> 
> Thanks.
> 
>  mm/hmm.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 320545b98..d4627c5 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -845,13 +845,6 @@ static void hmm_devmem_release(struct device *dev, void *data)
>  	hmm_devmem_radix_release(resource);
>  }
>  
> -static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
> -{
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> -
> -	return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
> -}
> -
>  static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  {
>  	resource_size_t key, align_start, align_size, align_end;
> @@ -892,9 +885,8 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
>  		struct hmm_devmem *dup;
>  
> -		rcu_read_lock();
> -		dup = hmm_devmem_find(key);
> -		rcu_read_unlock();
> +		dup = radix_tree_lookup(&hmm_devmem_radix,
> +					key >> PA_SECTION_SHIFT);
>  		if (dup) {
>  			dev_err(device, "%s: collides with mapping for %s\n",
>  				__func__, dev_name(dup->device));
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-06 17:33   ` [PATCH 7/7] RCU, workqueue: Implement rcu_work Tejun Heo
@ 2018-03-06 18:30     ` Linus Torvalds
  2018-03-09 15:37       ` Tejun Heo
  2018-03-07  2:49     ` Lai Jiangshan
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2018-03-06 18:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jann Horn, Paul McKenney, Benjamin LaHaise, Al Viro,
	Kent Overstreet, security, Linux Kernel Mailing List,
	kernel-team

On Tue, Mar 6, 2018 at 9:33 AM, Tejun Heo <tj@kernel.org> wrote:
>
> This patch introduces rcu_work, a workqueue work variant which gets
> executed after a RCU grace period, and converts the open coded
> bouncing in fs/aio and kernel/cgroup.

So I like the concept, but I have two comments:

 - can we split this patch up, so that if somebody bisects a problem
to it, we'll see if it's cgroup or aio that triggers it?

 - this feels wrong:

> +struct rcu_work {
> +       struct work_struct work;
> +       struct rcu_head rcu;
> +
> +       /* target workqueue and CPU ->rcu uses to queue ->work */
> +       struct workqueue_struct *wq;
> +       int cpu;
> +};

That "int cpu" really doesn't feel like it makes sense for an
rcu_work. The call_rcu() part fundamentally will happen on any CPU,
and sure, it could then schedule the work on something else, but that
doesn't sound like a particularly sound interface.

So I'd like to either just make the thing always just use
WORK_CPU_UNBOUND, or hear some kind of (handwaving ok) explanation for
why something else would ever make sense. If the action is
fundamentally delayed by RCU, why would it make a difference which CPU
it runs on?

One reason for that is that I get this feeling that the multiple
stages of waiting *might* be unnecessary. Are there no situations
where a "rcu_work" might just end up devolving to be just a regular
work? Or maybe situations where the rcu callback is done in process
context, and the work can just be done immediately? I'm a tiny bit
worried about queueing artifacts, where we end up having tons of
resources in flight.

But this really is just a "this feels wrong". I have no real hard
technical reason.

                Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-06 17:33   ` [PATCH 7/7] RCU, workqueue: Implement rcu_work Tejun Heo
  2018-03-06 18:30     ` Linus Torvalds
@ 2018-03-07  2:49     ` Lai Jiangshan
  2018-03-07 14:54       ` Paul E. McKenney
  1 sibling, 1 reply; 26+ messages in thread
From: Lai Jiangshan @ 2018-03-07  2:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jannh, Paul E. McKenney, bcrl, viro, kent.overstreet,
	security, LKML, kernel-team

On Wed, Mar 7, 2018 at 1:33 AM, Tejun Heo <tj@kernel.org> wrote:

> +/**
> + * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
> + * @cpu: CPU number to execute work on
> + * @wq: workqueue to use
> + * @rwork: work to queue

For many people, "RCU grace period" is clear enough, but not for all.

So please make it a little more clear that this just queues the work
after a *normal* RCU grace period; it supports only one RCU variant.


> + *
> + * Return: %false if @work was already on a queue, %true otherwise.
> + */

I'm afraid this will be a hard-to-use API.

The user can't find a plan B when it returns false, especially when
the user expects the work to be called at least once more
after an RCU grace period.

And the error-prone part is that, like the other queue_work() functions,
its return value is often ignored, which makes the problem worse.

So, since workqueue.c provides this API, it should handle this
problem, for example by calling call_rcu() again in this case.  But
everything would become much more complex: synchronization is needed
for "calling call_rcu() again", and allowing the work item to be
called twice after the last queue_rcu_work() is not workqueue style.

Some would argue that delayed_work has the same problem
when the user expects the work to be called at least once more
after a period of time.  But a time interval is easy to detect: the
user can check the time and call queue_delayed_work() again
when needed, which is also a frequent design pattern.  And
for RCU, it is hard to use this design pattern since it is hard
to detect a (new) RCU grace period without using call_rcu().

I would not provide this API.  This is not a NACK; I'm just trying
to express my thinking about the API.  I'd rather RCU be changed
so that RCU callbacks become sleepable.  But a complete
overhaul cleanup of the whole source tree for compatibility
would be needed first, an even more complex job.



> +bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
> +                      struct rcu_work *rwork)
> +{
> +       struct work_struct *work = &rwork->work;
> +
> +       if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> +               rwork->wq = wq;
> +               rwork->cpu = cpu;
> +               call_rcu(&rwork->rcu, rcu_work_rcufn);
> +               return true;
> +       }
> +
> +       return false;
> +}
> +EXPORT_SYMBOL(queue_rcu_work_on);
> +

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-07  2:49     ` Lai Jiangshan
@ 2018-03-07 14:54       ` Paul E. McKenney
  2018-03-07 16:23         ` Peter Zijlstra
  2018-03-08  0:29         ` Lai Jiangshan
  0 siblings, 2 replies; 26+ messages in thread
From: Paul E. McKenney @ 2018-03-07 14:54 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Tejun Heo, torvalds, jannh, bcrl, viro, kent.overstreet,
	security, LKML, kernel-team

On Wed, Mar 07, 2018 at 10:49:49AM +0800, Lai Jiangshan wrote:
> On Wed, Mar 7, 2018 at 1:33 AM, Tejun Heo <tj@kernel.org> wrote:
> 
> > +/**
> > + * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
> > + * @cpu: CPU number to execute work on
> > + * @wq: workqueue to use
> > + * @rwork: work to queue
> 
> For many people, "RCU grace period" is clear enough, but not for all.
> 
> So please make it a little more clear that this just queues the work
> after a *normal* RCU grace period; it supports only one RCU variant.
> 
> 
> > + *
> > + * Return: %false if @work was already on a queue, %true otherwise.
> > + */
> 
> I'm afraid this will be a hard-to-use API.
> 
> The user can't find a plan B when it returns false, especially when
> the user expects the work to be called at least once more
> after an RCU grace period.
> 
> And the error-prone part is that, like the other queue_work() functions,
> its return value is often ignored, which makes the problem worse.
> 
> So, since workqueue.c provides this API, it should handle this
> problem, for example by calling call_rcu() again in this case.  But
> everything would become much more complex: synchronization is needed
> for "calling call_rcu() again", and allowing the work item to be
> called twice after the last queue_rcu_work() is not workqueue style.

I confess that I had not thought of this aspect, but doesn't the
rcu_barrier() in v2 of this patchset guarantee that it has passed
the RCU portion of the overall wait time?  Given that, what I am
missing is how this differs from flush_work() on a normal work item.

So could you please show me the sequence of events that leads to this
problem?  (Again, against v2 of this patch, which replaces the
synchronize_rcu() with rcu_barrier().)

> Some would argue that delayed_work has the same problem
> when the user expects the work to be called at least once more
> after a period of time.  But a time interval is easy to detect: the
> user can check the time and call queue_delayed_work() again
> when needed, which is also a frequent design pattern.  And
> for RCU, it is hard to use this design pattern since it is hard
> to detect a (new) RCU grace period without using call_rcu().
> 
> I would not provide this API.  This is not a NACK; I'm just trying
> to express my thinking about the API.  I'd rather RCU be changed
> so that RCU callbacks become sleepable.  But a complete
> overhaul cleanup of the whole source tree for compatibility
> would be needed first, an even more complex job.

One downside of allowing RCU callback functions to sleep is that
one poorly written callback can block a bunch of other ones.
One advantage of Tejun's approach is that such delays only affect
the workqueues, which are already designed to handle such delays.

						Thanx, Paul

> > +bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
> > +                      struct rcu_work *rwork)
> > +{
> > +       struct work_struct *work = &rwork->work;
> > +
> > +       if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> > +               rwork->wq = wq;
> > +               rwork->cpu = cpu;
> > +               call_rcu(&rwork->rcu, rcu_work_rcufn);
> > +               return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +EXPORT_SYMBOL(queue_rcu_work_on);
> > +
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/7] RDMAVT: Fix synchronization around percpu_ref
  2018-03-06 17:33   ` [PATCH 3/7] RDMAVT: Fix synchronization around percpu_ref Tejun Heo
@ 2018-03-07 15:39     ` Dennis Dalessandro
  0 siblings, 0 replies; 26+ messages in thread
From: Dennis Dalessandro @ 2018-03-07 15:39 UTC (permalink / raw)
  To: Tejun Heo, torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Mike Marciniszyn, linux-rdma

On 3/6/2018 12:33 PM, Tejun Heo wrote:
> rvt_mregion uses percpu_ref for reference counting and RCU to protect
> accesses from lkey_table.  When a rvt_mregion needs to be freed, it
> first gets unregistered from lkey_table and then rvt_check_refs() is
> called to wait for in-flight usages before the rvt_mregion is freed.
> 
> rvt_check_refs() seems to have a couple issues.
> 
> * It has a fast exit path which tests percpu_ref_is_zero().  However,
>    a percpu_ref reading zero doesn't mean that the object can be
>    released.  In fact, the ->release() callback might not even have
>    started executing yet.  Proceeding with freeing can lead to
>    use-after-free.
> 
> * lkey_table is RCU protected but there is no RCU grace period in the
>    free path.  percpu_ref uses RCU internally but it's sched-RCU whose
>    grace periods are different from regular RCU.  Also, it generally
>    isn't a good idea to depend on internal behaviors like this.
> 
> To address the above issues, this patch removes the the fast exit and

Typo above too many "the".

> adds an explicit synchronize_rcu().
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: linux-rdma@vger.kernel.org
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> Hello, Dennis, Mike.
> 
> I don't know RDMA at all and this patch is only compile tested.  Can
> you please take a careful look?

Looks good to me, passes my basic sanity tests. Thanks!

Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-07 14:54       ` Paul E. McKenney
@ 2018-03-07 16:23         ` Peter Zijlstra
  2018-03-07 17:58           ` Paul E. McKenney
  2018-03-08  0:29         ` Lai Jiangshan
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2018-03-07 16:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, Tejun Heo, torvalds, jannh, bcrl, viro,
	kent.overstreet, security, LKML, kernel-team

On Wed, Mar 07, 2018 at 06:54:08AM -0800, Paul E. McKenney wrote:
> One downside of allowing RCU callback functions to sleep is that
> one poorly written callback can block a bunch of other ones.

Not to mention that we really want the RCU callbacks to be simple and
short. Needing fancy things in the callback really should be the
exception, not the rule.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-07 16:23         ` Peter Zijlstra
@ 2018-03-07 17:58           ` Paul E. McKenney
  0 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2018-03-07 17:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Tejun Heo, torvalds, jannh, bcrl, viro,
	kent.overstreet, security, LKML, kernel-team

On Wed, Mar 07, 2018 at 05:23:17PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 07, 2018 at 06:54:08AM -0800, Paul E. McKenney wrote:
> > One downside of allowing RCU callback functions to sleep is that
> > one poorly written callback can block a bunch of other ones.
> 
> Not to mention that we really want the RCU callbacks to be simple and
> short. Needing fancy things in the callback really should be the
> exception not the rule.

Very much agreed with that as well!

Besides, Tejun's queue_rcu_work() provides this functionality and
seems pretty straightforward to use.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-07 14:54       ` Paul E. McKenney
  2018-03-07 16:23         ` Peter Zijlstra
@ 2018-03-08  0:29         ` Lai Jiangshan
  2018-03-08 17:28           ` Paul E. McKenney
  2018-03-09 16:21           ` Tejun Heo
  1 sibling, 2 replies; 26+ messages in thread
From: Lai Jiangshan @ 2018-03-08  0:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, torvalds, Jann Horn, bcrl, viro, Kent Overstreet,
	security, LKML, kernel-team

On Wed, Mar 7, 2018 at 10:54 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Wed, Mar 07, 2018 at 10:49:49AM +0800, Lai Jiangshan wrote:
>> On Wed, Mar 7, 2018 at 1:33 AM, Tejun Heo <tj@kernel.org> wrote:
>>
>> > +/**
>> > + * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
>> > + * @cpu: CPU number to execute work on
>> > + * @wq: workqueue to use
>> > + * @rwork: work to queue
>>
>> For many people, "RCU grace period" is clear enough, but not for ALL.
>>
>> So please make it a little clearer that it queues the work after a
>> *normal* RCU grace period; it supports only that one RCU variant.
>>
>>
>> > + *
>> > + * Return: %false if @work was already on a queue, %true otherwise.
>> > + */
>>
>> I'm afraid this will be a hard-to-use API.
>>
>> The user has no plan B when it returns false, especially when the
>> user expects the work to be called at least once more after an RCU
>> grace period.
>>
>> And the error-prone part is that, as with the other queue_work()
>> functions, the return value is often ignored, which makes the
>> problem worse.
>>
>> So, since workqueue.c provides this API, it should handle this
>> problem, for example by calling call_rcu() again in this case.  But
>> everything would become much more complex: synchronization is needed
>> for "calling call_rcu() again", and allowing the work item to run
>> twice after the last queue_rcu_work() is not workqueue style.
>
> I confess that I had not thought of this aspect, but doesn't the
> rcu_barrier() in v2 of this patchset guarantee that it has passed
> the RCU portion of the overall wait time?  Given that, what I am
> missing is how this differs from flush_work() on a normal work item.
>
> So could you please show me the sequence of events that leads to this
> problem?  (Again, against v2 of this patch, which replaces the
> synchronize_rcu() with rcu_barrier().)

I mentioned a subtle use case that a user would think is supported,
since the comment doesn't disallow it.

It is clear that the user expects:
   the work must be called at least once after the API returns
   the work must be called after an RCU grace period

But when the user expects the work to be called at least once more
after "queue_rcu_work() + an RCU grace period", the API cannot
deliver that if the work is already queued.  Although the user can
detect the situation from the return value of queue_rcu_work(), the
user can hardly make that expectation hold by adding appropriate
code.
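
Concretely, the gap shows up at the call site like this (a sketch;
"obj" and its rwork member are made-up names):

	/*
	 * What the caller wants: "run the work once more, after a
	 * grace period that starts now."
	 */
	if (!queue_rcu_work(system_wq, &obj->rwork)) {
		/*
		 * Already pending: the work will run after the grace
		 * period started by the earlier call, not a fresh one.
		 * There is no obvious plan B short of open-coding
		 * call_rcu() + queue_work() with extra synchronization.
		 */
	}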

>
>> Some would argue that delayed_work has the same problem when the
>> user expects the work to be called at least once more after a period
>> of time.  But a time interval is easy to detect: the user can check
>> the time and call queue_delayed_work() again when needed, which is
>> also a frequent design pattern.  For RCU, this design pattern is
>> hard to use, since it is hard to detect a (new) RCU grace period
>> without using call_rcu().
>>
>> I would not provide this API.  It is not a NACK; I'm just trying to
>> express my thinking about the API.  I'd rather RCU be changed so
>> that RCU callbacks become sleepable.  But that would first require a
>> complete overhaul of the whole source tree for compatibility, an
>> even more complex job.
>
> One downside of allowing RCU callback functions to sleep is that
> one poorly written callback can block a bunch of other ones.
> One advantage of Tejun's approach is that such delays only affect
> the workqueues, which are already designed to handle such delays.
>
>                                                 Thanx, Paul
>
>> > +bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
>> > +                      struct rcu_work *rwork)
>> > +{
>> > +       struct work_struct *work = &rwork->work;
>> > +
>> > +       if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
>> > +               rwork->wq = wq;
>> > +               rwork->cpu = cpu;
>> > +               call_rcu(&rwork->rcu, rcu_work_rcufn);
>> > +               return true;
>> > +       }
>> > +
>> > +       return false;
>> > +}
>> > +EXPORT_SYMBOL(queue_rcu_work_on);
>> > +
>>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-08  0:29         ` Lai Jiangshan
@ 2018-03-08 17:28           ` Paul E. McKenney
  2018-03-09 16:21           ` Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2018-03-08 17:28 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Tejun Heo, torvalds, Jann Horn, bcrl, viro, Kent Overstreet,
	security, LKML, kernel-team

On Thu, Mar 08, 2018 at 08:29:53AM +0800, Lai Jiangshan wrote:
> On Wed, Mar 7, 2018 at 10:54 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Wed, Mar 07, 2018 at 10:49:49AM +0800, Lai Jiangshan wrote:
> >> On Wed, Mar 7, 2018 at 1:33 AM, Tejun Heo <tj@kernel.org> wrote:
> >>
> >> > +/**
> >> > + * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
> >> > + * @cpu: CPU number to execute work on
> >> > + * @wq: workqueue to use
> >> > + * @rwork: work to queue
> >>
> >> For many people, "RCU grace period" is clear enough, but not for ALL.
> >>
> >> So please make it a little clearer that it queues the work after a
> >> *normal* RCU grace period; it supports only that one RCU variant.
> >>
> >>
> >> > + *
> >> > + * Return: %false if @work was already on a queue, %true otherwise.
> >> > + */
> >>
> >> I'm afraid this will be a hard-to-use API.
> >>
> >> The user has no plan B when it returns false, especially when the
> >> user expects the work to be called at least once more after an RCU
> >> grace period.
> >>
> >> And the error-prone part is that, as with the other queue_work()
> >> functions, the return value is often ignored, which makes the
> >> problem worse.
> >>
> >> So, since workqueue.c provides this API, it should handle this
> >> problem, for example by calling call_rcu() again in this case.  But
> >> everything would become much more complex: synchronization is needed
> >> for "calling call_rcu() again", and allowing the work item to run
> >> twice after the last queue_rcu_work() is not workqueue style.
> >
> > I confess that I had not thought of this aspect, but doesn't the
> > rcu_barrier() in v2 of this patchset guarantee that it has passed
> > the RCU portion of the overall wait time?  Given that, what I am
> > missing is how this differs from flush_work() on a normal work item.
> >
> > So could you please show me the sequence of events that leads to this
> > problem?  (Again, against v2 of this patch, which replaces the
> > synchronize_rcu() with rcu_barrier().)
> 
> I mentioned a subtle use case that a user would think is supported,
> since the comment doesn't disallow it.
> 
> It is clear that the user expects:
>    the work must be called at least once after the API returns

I would have said "before the API returns" rather than "after the
API returns".  What am I missing here?

>    the work must be called after an RCU grace period
> 
> But when the user expects the work to be called at least once more
> after "queue_rcu_work() + an RCU grace period", the API cannot
> deliver that if the work is already queued.  Although the user can
> detect the situation from the return value of queue_rcu_work(), the
> user can hardly make that expectation hold by adding appropriate
> code.

Beyond a certain point, I need to defer to Tejun on this, but I thought
that a false return meant that the work executed before the flush call.
Here I am assuming that the caller knows that the queue_rcu_work()
already happened and that another queue_rcu_work() won't be invoked on
that same work item until the flush completes.  Without this sort of
assumption I agree that there could be problems, but then it seems to
me that you would have exactly the same problems with the non-RCU
APIs.

What am I missing?

						Thanx, Paul

> >> Some would argue that delayed_work has the same problem when the
> >> user expects the work to be called at least once more after a period
> >> of time.  But a time interval is easy to detect: the user can check
> >> the time and call queue_delayed_work() again when needed, which is
> >> also a frequent design pattern.  For RCU, this design pattern is
> >> hard to use, since it is hard to detect a (new) RCU grace period
> >> without using call_rcu().
> >>
> >> I would not provide this API.  It is not a NACK; I'm just trying to
> >> express my thinking about the API.  I'd rather RCU be changed so
> >> that RCU callbacks become sleepable.  But that would first require a
> >> complete overhaul of the whole source tree for compatibility, an
> >> even more complex job.
> >
> > One downside of allowing RCU callback functions to sleep is that
> > one poorly written callback can block a bunch of other ones.
> > One advantage of Tejun's approach is that such delays only affect
> > the workqueues, which are already designed to handle such delays.
> >
> >                                                 Thanx, Paul
> >
> >> > +bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
> >> > +                      struct rcu_work *rwork)
> >> > +{
> >> > +       struct work_struct *work = &rwork->work;
> >> > +
> >> > +       if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> >> > +               rwork->wq = wq;
> >> > +               rwork->cpu = cpu;
> >> > +               call_rcu(&rwork->rcu, rcu_work_rcufn);
> >> > +               return true;
> >> > +       }
> >> > +
> >> > +       return false;
> >> > +}
> >> > +EXPORT_SYMBOL(queue_rcu_work_on);
> >> > +
> >>
> >
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-06 18:30     ` Linus Torvalds
@ 2018-03-09 15:37       ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2018-03-09 15:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, Paul McKenney, Benjamin LaHaise, Al Viro,
	Kent Overstreet, security, Linux Kernel Mailing List,
	kernel-team

Hello, Linus.

On Tue, Mar 06, 2018 at 10:30:29AM -0800, Linus Torvalds wrote:
>  - can we split this patch up, so that if somebody bisects a problem
> to it, we'll see if it's cgroup or aio that triggers it?

Will do.

> So I'd like to either just make the thing always just use
> WORK_CPU_UNBOUND, or hear some kind of (handwaving ok) explanation for
> why something else would ever make sense. If the action is
> fundamentally delayed by RCU, why would it make a difference which CPU
> it runs on?

It was mostly for consistency with other interfaces.  Let's drop
queue_rcu_work_on() for now.  If ever necessary, we can easily add it
back later.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work
  2018-03-08  0:29         ` Lai Jiangshan
  2018-03-08 17:28           ` Paul E. McKenney
@ 2018-03-09 16:21           ` Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2018-03-09 16:21 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, torvalds, Jann Horn, bcrl, viro,
	Kent Overstreet, security, LKML, kernel-team

Hello,

On Thu, Mar 08, 2018 at 08:29:53AM +0800, Lai Jiangshan wrote:
> I mentioned a subtle use case that a user would think is supported,
> since the comment doesn't disallow it.
> 
> It is clear that the user expects:
>    the work must be called at least once after the API returns
>    the work must be called after an RCU grace period
> 
> But when the user expects the work to be called at least once more
> after "queue_rcu_work() + an RCU grace period", the API cannot
> deliver that if the work is already queued.  Although the user can
> detect the situation from the return value of queue_rcu_work(), the
> user can hardly make that expectation hold by adding appropriate
> code.

We should definitely document it better, but it isn't any different
from delayed_work, and I don't see a reason to deviate.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/7] block: Remove superflous rcu_read_[un]lock_sched() in blk_queue_enter()
  2018-03-06 17:52     ` Bart Van Assche
@ 2018-03-14 18:46       ` tj
  2018-03-14 20:05         ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: tj @ 2018-03-14 18:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: torvalds, viro, jannh, paulmck, bcrl, kent.overstreet, security,
	linux-kernel, kernel-team, axboe

Hello, Bart.

On Tue, Mar 06, 2018 at 05:52:50PM +0000, Bart Van Assche wrote:
> I think the rcu_read_lock() and rcu_read_unlock() are really necessary in this
> code. In the LWN article "The RCU-barrier menagerie" it is explained that RCU
> can be used to enforce write ordering globally if the code that reads the writes

Yeah, but for it to be meaningful, just like with barriers, it has to
be paired with something on the other side.

> that are ordered with an RCU read lock (https://lwn.net/Articles/573497/). See
> also the following comment in scsi_device_quiesce():
> 
> 	/*
> 	 * Ensure that the effect of blk_set_preempt_only() will be visible
> 	 * for percpu_ref_tryget() callers that occur after the queue
> 	 * unfreeze even if the queue was already frozen before this function
> 	 * was called. See also https://lwn.net/Articles/573497/.
> 	 */
> 
> Since this patch introduces a subtle and hard to debug race condition, please
> drop this patch.

Hah, the pairing is between scsi_device_quiesce() and
blk_queue_enter()?  But that doesn't make sense either because
scsi_device_quiesce() is doing regular synchronize_rcu() and
blk_queue_enter() is doing rcu_read_lock_sched().  They don't
interlock with each other in any way.

So, the right thing to do here would be to somehow move the RCU
synchronization into blk_set_preempt_only() and switch to regular
RCU protection in blk_queue_enter().  The code as-is isn't really
doing anything.
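
Concretely, the interlock would pair up like this (a sketch with
assumed helper and flag names, not the actual block-layer code):

	/* writer side, e.g. in blk_set_preempt_only(): */
	queue_flag_set(QUEUE_FLAG_PREEMPT_ONLY, q);
	synchronize_rcu();	/* wait out readers which missed the flag */

	/* reader side, in blk_queue_enter(): */
	bool success;

	rcu_read_lock();	/* regular RCU, matching the writer */
	if (blk_queue_preempt_only(q) && !preempt_requested)
		success = false;	/* must wait for unfreeze */
	else
		success = percpu_ref_tryget_live(&q->q_usage_counter);
	rcu_read_unlock();

Once both sides use the same RCU flavor, any reader that did not see
the flag is finished by the time synchronize_rcu() returns, so every
reader starting afterwards observes PREEMPT_ONLY.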

I'll drop this patch from the series for now.  Let's revisit it later.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/7] block: Remove superflous rcu_read_[un]lock_sched() in blk_queue_enter()
  2018-03-14 18:46       ` tj
@ 2018-03-14 20:05         ` Bart Van Assche
  2018-03-14 20:08           ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2018-03-14 20:05 UTC (permalink / raw)
  To: tj
  Cc: linux-kernel, jannh, bcrl, torvalds, kernel-team, security,
	kent.overstreet, viro, axboe, paulmck

On Wed, 2018-03-14 at 11:46 -0700, tj@kernel.org wrote:
> > that are ordered with an RCU read lock (https://lwn.net/Articles/573497/). See
> > also the following comment in scsi_device_quiesce():
> > 
> > 	/*
> > 	 * Ensure that the effect of blk_set_preempt_only() will be visible
> > 	 * for percpu_ref_tryget() callers that occur after the queue
> > 	 * unfreeze even if the queue was already frozen before this function
> > 	 * was called. See also https://lwn.net/Articles/573497/.
> > 	 */
> > 
> > Since this patch introduces a subtle and hard to debug race condition, please
> > drop this patch.
> 
> Hah, the pairing is between scsi_device_quiesce() and
> blk_queue_enter()?  But that doesn't make sense either because
> scsi_device_quiesce() is doing regular synchronize_rcu() and
> blk_queue_enter() is doing rcu_read_lock_sched().  They don't
> interlock with each other in any way.

Can you clarify this further? From <linux/rcupdate.h>:

static inline void synchronize_rcu(void)
{
	synchronize_sched();
}

> So, the right thing to do here would be somehow moving the RCU
> synchronization into blk_set_preempt_only() and switching to regular
> RCU protection in blk_queue_enter().  The code as-is isn't really
> doing anything.

Since the RCU protection in blk_queue_enter() surrounds a
percpu_ref_tryget_live() call, and since percpu_ref_tryget_live()
calls rcu_read_lock/unlock_sched() itself, I don't think it is allowed
to switch to regular RCU protection in blk_queue_enter() without
modifying the RCU implementation.

Bart.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/7] block: Remove superflous rcu_read_[un]lock_sched() in blk_queue_enter()
  2018-03-14 20:05         ` Bart Van Assche
@ 2018-03-14 20:08           ` Peter Zijlstra
  2018-03-14 20:14             ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2018-03-14 20:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tj, linux-kernel, jannh, bcrl, torvalds, kernel-team, security,
	kent.overstreet, viro, axboe, paulmck

On Wed, Mar 14, 2018 at 08:05:30PM +0000, Bart Van Assche wrote:
> Can you clarify this further? From <linux/rcupdate.h>:
> 
> static inline void synchronize_rcu(void)
> {
> 	synchronize_sched();
> }

You'll find that is for !CONFIG_PREEMPT_RCU.
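
That is, the identity quoted above exists only in one configuration.
Simplified from the 2018-era <linux/rcupdate.h>, before the RCU
flavors were consolidated:

	#ifdef CONFIG_PREEMPT_RCU
	/* preemptible RCU: a separate flavor with its own grace periods */
	void synchronize_rcu(void);
	#else
	/* non-preemptible kernels: RCU and sched-RCU happen to coincide */
	static inline void synchronize_rcu(void)
	{
		synchronize_sched();
	}
	#endif

Code must not rely on the two flavors matching; the coincidence
disappears as soon as CONFIG_PREEMPT_RCU is enabled.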

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/7] block: Remove superflous rcu_read_[un]lock_sched() in blk_queue_enter()
  2018-03-14 20:08           ` Peter Zijlstra
@ 2018-03-14 20:14             ` Bart Van Assche
  0 siblings, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2018-03-14 20:14 UTC (permalink / raw)
  To: peterz
  Cc: linux-kernel, jannh, bcrl, torvalds, kernel-team,
	kent.overstreet, security, viro, axboe, paulmck, tj

On Wed, 2018-03-14 at 21:08 +0100, Peter Zijlstra wrote:
> On Wed, Mar 14, 2018 at 08:05:30PM +0000, Bart Van Assche wrote:
> > Can you clarify this further? From <linux/rcupdate.h>:
> > 
> > static inline void synchronize_rcu(void)
> > {
> > 	synchronize_sched();
> > }
> 
> You'll find that is for !CONFIG_PREEMPT_RCU.

Thanks for the clarification. I will submit a fix for the SCSI code.

Bart.



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread

Thread overview: 26+ messages
2018-03-06 17:26 [PATCHSET] percpu_ref, RCU: Audit RCU usages in percpu_ref users Tejun Heo
2018-03-06 17:33 ` [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
2018-03-06 17:33   ` [PATCH 2/7] fs/aio: Use RCU accessors for kioctx_table->table[] Tejun Heo
2018-03-06 17:33   ` [PATCH 3/7] RDMAVT: Fix synchronization around percpu_ref Tejun Heo
2018-03-07 15:39     ` Dennis Dalessandro
2018-03-06 17:33   ` [PATCH 4/7] HMM: Remove superflous RCU protection around radix tree lookup Tejun Heo
2018-03-06 17:33     ` Tejun Heo
2018-03-06 17:59     ` Jerome Glisse
2018-03-06 17:59       ` Jerome Glisse
2018-03-06 17:33   ` [PATCH 5/7] block: Remove superflous rcu_read_[un]lock_sched() in blk_queue_enter() Tejun Heo
2018-03-06 17:52     ` Bart Van Assche
2018-03-14 18:46       ` tj
2018-03-14 20:05         ` Bart Van Assche
2018-03-14 20:08           ` Peter Zijlstra
2018-03-14 20:14             ` Bart Van Assche
2018-03-06 17:33   ` [PATCH 6/7] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods Tejun Heo
2018-03-06 17:33   ` [PATCH 7/7] RCU, workqueue: Implement rcu_work Tejun Heo
2018-03-06 18:30     ` Linus Torvalds
2018-03-09 15:37       ` Tejun Heo
2018-03-07  2:49     ` Lai Jiangshan
2018-03-07 14:54       ` Paul E. McKenney
2018-03-07 16:23         ` Peter Zijlstra
2018-03-07 17:58           ` Paul E. McKenney
2018-03-08  0:29         ` Lai Jiangshan
2018-03-08 17:28           ` Paul E. McKenney
2018-03-09 16:21           ` Tejun Heo
