* [PATCHSET v2] percpu_ref, RCU: Audit RCU usages in percpu_ref users
@ 2018-03-14 19:41 Tejun Heo
  2018-03-14 19:45 ` [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2018-03-14 19:41 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team

Hello,

Linus, if you want to pick up the first three fix patches, please pull
from the following branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git percpu_ref-rcu-audit-fixes

Changes from v1[L] are

* blk_queue_enter() patch dropped.  Needs further investigation.

* queue_rcu_work_on() dropped and documentation improved.

* rcu_work conversion patches separated out.

Original patchset description follows.

Jann Horn found that aio was depending on the internal RCU grace
periods of percpu-ref and that it's broken because aio uses regular
RCU while percpu_ref uses sched-RCU.

Depending on percpu_ref's internal grace periods isn't a good idea
because

 * The RCU type might not match.

 * percpu_ref's grace periods are used to switch to atomic mode.  They
   don't span the window between the last put and the invocation of the
   last release.  This is easy to get confused about and can lead to
   subtle bugs.

 * percpu_ref might not have grace periods at all depending on its
   current operation mode.

This patchset audits all percpu_ref users for their RCU usages,
clarifies percpu_ref documentation that the internal grace periods
must not be depended upon, and introduces rcu_work to simplify
bouncing to a workqueue after an RCU grace period.
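
As a rough illustration, here is a sketch of the kind of conversion
rcu_work enables; struct foo and its fields are hypothetical, while the
workqueue calls are the ones introduced by patch 0006.

  /* before: open-coded bouncing through an rcu_head and a work item */
  static void foo_free_workfn(struct work_struct *work)
  {
          struct foo *f = container_of(work, struct foo, free_work);

          kfree(f);                       /* needs process context */
  }

  static void foo_free_rcufn(struct rcu_head *head)
  {
          struct foo *f = container_of(head, struct foo, free_rcu);

          INIT_WORK(&f->free_work, foo_free_workfn);
          schedule_work(&f->free_work);
  }

  call_rcu(&f->free_rcu, foo_free_rcufn);

  /* after: one rcu_work replaces the rcu_head + work_struct pair */
  static void foo_free_workfn(struct work_struct *work)
  {
          struct foo *f = container_of(to_rcu_work(work),
                                       struct foo, free_rwork);

          kfree(f);
  }

  INIT_RCU_WORK(&f->free_rwork, foo_free_workfn);
  queue_rcu_work(system_wq, &f->free_rwork);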

This patchset contains the following eight patches.

 0001-fs-aio-Add-explicit-RCU-grace-period-when-freeing-ki.patch
 0002-fs-aio-Use-RCU-accessors-for-kioctx_table-table.patch
 0003-RDMAVT-Fix-synchronization-around-percpu_ref.patch
 0004-HMM-Remove-superflous-RCU-protection-around-radix-tr.patch
 0005-percpu_ref-Update-doc-to-dissuade-users-from-dependi.patch
 0006-RCU-workqueue-Implement-rcu_work.patch
 0007-cgroup-Use-rcu_work-instead-of-explicit-rcu-and-work.patch
 0008-fs-aio-Use-rcu_work-instead-of-explicit-rcu-and-work.patch

0001-0003 are fixes and tagged -stable.

0004 removes (seemingly) superfluous RCU read lock usages.

0005 updates the doc and 0006-0008 introduce rcu_work and use it
instead of open-coded rcu+work.

This patchset is also available in the following git tree.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git percpu_ref-rcu-audit-v2

diffstat follows.  Thanks.

 drivers/infiniband/sw/rdmavt/mr.c |   10 ++++---
 fs/aio.c                          |   39 ++++++++++++++++-----------
 include/linux/cgroup-defs.h       |    2 -
 include/linux/percpu-refcount.h   |   18 ++++++++----
 include/linux/workqueue.h         |   23 ++++++++++++++++
 kernel/cgroup/cgroup.c            |   21 ++++----------
 kernel/workqueue.c                |   54 ++++++++++++++++++++++++++++++++++++++
 lib/percpu-refcount.c             |    2 +
 mm/hmm.c                          |   12 +-------
 9 files changed, 130 insertions(+), 51 deletions(-)

--
tejun

[L] http://lkml.kernel.org/r/20180306172657.3060270-1-tj@kernel.org
    http://lkml.kernel.org/r/20180306173316.3088458-1-tj@kernel.org


* [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx
  2018-03-14 19:41 [PATCHSET v2] percpu_ref, RCU: Audit RCU usages in percpu_ref users Tejun Heo
@ 2018-03-14 19:45 ` Tejun Heo
  2018-03-14 19:45   ` [PATCH 2/8] fs/aio: Use RCU accessors for kioctx_table->table[] Tejun Heo
                     ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-14 19:45 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo, stable

While fixing refcounting, e34ecee2ae79 ("aio: Fix a trinity splat")
incorrectly removed the explicit RCU grace period before freeing
kioctx.  The intention seems to have been to depend on the internal
RCU grace periods of percpu_ref; however, percpu_ref uses a different
RCU flavor, sched-RCU.  This can lead to the kioctx being freed while
RCU-protected dereferences are still in progress.
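
For context, the RCU protected reader that this grace period must
synchronize against looks roughly as follows (loosely modeled on
lookup_ioctx(); simplified, not the exact kernel code):

  rcu_read_lock();
  table = rcu_dereference(mm->ioctx_table);
  ctx = table->table[id];                 /* may still observe @ctx */
  if (ctx && ctx->user_id == ctx_id)
          percpu_ref_get(&ctx->users);    /* must not race with free */
  rcu_read_unlock();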

Fix it by updating free_ioctx() to go through call_rcu() explicitly.

v2: Comment added to explain double bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: e34ecee2ae79 ("aio: Fix a trinity splat")
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
---
 fs/aio.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index a062d75..eb2e0cf 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -115,7 +115,8 @@ struct kioctx {
 	struct page		**ring_pages;
 	long			nr_pages;
 
-	struct work_struct	free_work;
+	struct rcu_head		free_rcu;
+	struct work_struct	free_work;	/* see free_ioctx() */
 
 	/*
 	 * signals when all in-flight requests are done
@@ -588,6 +589,12 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
 	return cancel(&kiocb->common);
 }
 
+/*
+ * free_ioctx() should be RCU delayed to synchronize against the RCU
+ * protected lookup_ioctx() and also needs process context to call
+ * aio_free_ring(), so the double bouncing through kioctx->free_rcu and
+ * ->free_work.
+ */
 static void free_ioctx(struct work_struct *work)
 {
 	struct kioctx *ctx = container_of(work, struct kioctx, free_work);
@@ -601,6 +608,14 @@ static void free_ioctx(struct work_struct *work)
 	kmem_cache_free(kioctx_cachep, ctx);
 }
 
+static void free_ioctx_rcufn(struct rcu_head *head)
+{
+	struct kioctx *ctx = container_of(head, struct kioctx, free_rcu);
+
+	INIT_WORK(&ctx->free_work, free_ioctx);
+	schedule_work(&ctx->free_work);
+}
+
 static void free_ioctx_reqs(struct percpu_ref *ref)
 {
 	struct kioctx *ctx = container_of(ref, struct kioctx, reqs);
@@ -609,8 +624,8 @@ static void free_ioctx_reqs(struct percpu_ref *ref)
 	if (ctx->rq_wait && atomic_dec_and_test(&ctx->rq_wait->count))
 		complete(&ctx->rq_wait->comp);
 
-	INIT_WORK(&ctx->free_work, free_ioctx);
-	schedule_work(&ctx->free_work);
+	/* Synchronize against RCU protected table->table[] dereferences */
+	call_rcu(&ctx->free_rcu, free_ioctx_rcufn);
 }
 
 /*
@@ -838,7 +853,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 	table->table[ctx->id] = NULL;
 	spin_unlock(&mm->ioctx_lock);
 
-	/* percpu_ref_kill() will do the necessary call_rcu() */
+	/* free_ioctx_reqs() will do the necessary RCU synchronization */
 	wake_up_all(&ctx->wait);
 
 	/*
-- 
2.9.5


* [PATCH 2/8] fs/aio: Use RCU accessors for kioctx_table->table[]
  2018-03-14 19:45 ` [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
@ 2018-03-14 19:45   ` Tejun Heo
  2018-03-14 19:45   ` [PATCH 3/8] RDMAVT: Fix synchronization around percpu_ref Tejun Heo
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-14 19:45 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo, stable

While converting ioctx index from a list to a table, db446a08c23d
("aio: convert the ioctx list to table lookup v3") missed tagging
kioctx_table->table[] as an array of RCU pointers and using the
appropriate RCU accessors.  This introduces a small window in the
lookup path where init and access may race.

Mark kioctx_table->table[] with __rcu and use the appropriate RCU
accessors when using the field.
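
For reference, the generic pattern looks as follows (names here are
illustrative, not the aio code): an __rcu annotated pointer is
published with rcu_assign_pointer() and read under RCU with
rcu_dereference(), which also lets sparse flag plain accesses.

  struct foo __rcu *slot;

  /* writer, typically serialized by a lock */
  rcu_assign_pointer(slot, new);

  /* reader */
  rcu_read_lock();
  f = rcu_dereference(slot);
  if (f)
          do_something(f);
  rcu_read_unlock();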

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3")
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
---
 fs/aio.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index eb2e0cf..6bcd3fb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -68,9 +68,9 @@ struct aio_ring {
 #define AIO_RING_PAGES	8
 
 struct kioctx_table {
-	struct rcu_head	rcu;
-	unsigned	nr;
-	struct kioctx	*table[];
+	struct rcu_head		rcu;
+	unsigned		nr;
+	struct kioctx __rcu	*table[];
 };
 
 struct kioctx_cpu {
@@ -330,7 +330,7 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
 	for (i = 0; i < table->nr; i++) {
 		struct kioctx *ctx;
 
-		ctx = table->table[i];
+		ctx = rcu_dereference(table->table[i]);
 		if (ctx && ctx->aio_ring_file == file) {
 			if (!atomic_read(&ctx->dead)) {
 				ctx->user_id = ctx->mmap_base = vma->vm_start;
@@ -666,9 +666,9 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
 	while (1) {
 		if (table)
 			for (i = 0; i < table->nr; i++)
-				if (!table->table[i]) {
+				if (!rcu_access_pointer(table->table[i])) {
 					ctx->id = i;
-					table->table[i] = ctx;
+					rcu_assign_pointer(table->table[i], ctx);
 					spin_unlock(&mm->ioctx_lock);
 
 					/* While kioctx setup is in progress,
@@ -849,8 +849,8 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 	}
 
 	table = rcu_dereference_raw(mm->ioctx_table);
-	WARN_ON(ctx != table->table[ctx->id]);
-	table->table[ctx->id] = NULL;
+	WARN_ON(ctx != rcu_access_pointer(table->table[ctx->id]));
+	RCU_INIT_POINTER(table->table[ctx->id], NULL);
 	spin_unlock(&mm->ioctx_lock);
 
 	/* free_ioctx_reqs() will do the necessary RCU synchronization */
@@ -895,7 +895,8 @@ void exit_aio(struct mm_struct *mm)
 
 	skipped = 0;
 	for (i = 0; i < table->nr; ++i) {
-		struct kioctx *ctx = table->table[i];
+		struct kioctx *ctx =
+			rcu_dereference_protected(table->table[i], true);
 
 		if (!ctx) {
 			skipped++;
@@ -1084,7 +1085,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 	if (!table || id >= table->nr)
 		goto out;
 
-	ctx = table->table[id];
+	ctx = rcu_dereference(table->table[id]);
 	if (ctx && ctx->user_id == ctx_id) {
 		percpu_ref_get(&ctx->users);
 		ret = ctx;
-- 
2.9.5


* [PATCH 3/8] RDMAVT: Fix synchronization around percpu_ref
  2018-03-14 19:45 ` [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
  2018-03-14 19:45   ` [PATCH 2/8] fs/aio: Use RCU accessors for kioctx_table->table[] Tejun Heo
@ 2018-03-14 19:45   ` Tejun Heo
  2018-03-15 22:24     ` Jason Gunthorpe
  2018-03-14 19:45     ` Tejun Heo
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2018-03-14 19:45 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo, Mike Marciniszyn,
	linux-rdma

rvt_mregion uses percpu_ref for reference counting and RCU to protect
accesses from lkey_table.  When a rvt_mregion needs to be freed, it
first gets unregistered from lkey_table and then rvt_check_refs() is
called to wait for in-flight usages before the rvt_mregion is freed.

rvt_check_refs() seems to have a couple issues.

* It has a fast exit path which tests percpu_ref_is_zero().  However,
  a percpu_ref reading zero doesn't mean that the object can be
  released.  In fact, the ->release() callback might not even have
  started executing yet.  Proceeding with freeing can lead to
  use-after-free.

* lkey_table is RCU protected but there is no RCU grace period in the
  free path.  percpu_ref uses RCU internally but it's sched-RCU whose
  grace periods are different from regular RCU.  Also, it generally
  isn't a good idea to depend on internal behaviors like this.

To address the above issues, this patch removes the fast exit and adds
an explicit synchronize_rcu().
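
A minimal sketch of the resulting teardown ordering (generic names;
the completion-based wait mirrors rvt's mr->comp, everything else is
hypothetical):

  static void obj_release(struct percpu_ref *ref)   /* ->release() */
  {
          struct obj *o = container_of(ref, struct obj, ref);

          complete(&o->comp);     /* the last put has fully run */
  }

  /* teardown; @o was already unpublished from the RCU protected table */
  synchronize_rcu();              /* in-flight RCU readers are done */
  percpu_ref_kill(&o->ref);       /* drop the initial reference */
  wait_for_completion(&o->comp);  /* unlike percpu_ref_is_zero(), this
                                     guarantees ->release() has finished */
  kfree(o);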

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: linux-rdma@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 drivers/infiniband/sw/rdmavt/mr.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/sw/rdmavt/mr.c b/drivers/infiniband/sw/rdmavt/mr.c
index 1b2e536..cc429b5 100644
--- a/drivers/infiniband/sw/rdmavt/mr.c
+++ b/drivers/infiniband/sw/rdmavt/mr.c
@@ -489,11 +489,13 @@ static int rvt_check_refs(struct rvt_mregion *mr, const char *t)
 	unsigned long timeout;
 	struct rvt_dev_info *rdi = ib_to_rvt(mr->pd->device);
 
-	if (percpu_ref_is_zero(&mr->refcount))
-		return 0;
-	/* avoid dma mr */
-	if (mr->lkey)
+	if (mr->lkey) {
+		/* avoid dma mr */
 		rvt_dereg_clean_qps(mr);
+		/* @mr was indexed on rcu protected @lkey_table */
+		synchronize_rcu();
+	}
+
 	timeout = wait_for_completion_timeout(&mr->comp, 5 * HZ);
 	if (!timeout) {
 		rvt_pr_err(rdi,
-- 
2.9.5


* [PATCH 4/8] HMM: Remove superfluous RCU protection around radix tree lookup
  2018-03-14 19:45 ` [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
@ 2018-03-14 19:45     ` Tejun Heo
  2018-03-14 19:45   ` [PATCH 3/8] RDMAVT: Fix synchronization around percpu_ref Tejun Heo
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-14 19:45 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo, linux-mm

hmm_devmem_find() requires rcu_read_lock_held() but there's nothing
which actually uses the RCU protection.  The only caller is
hmm_devmem_pages_create(), which already holds the mutex and wraps the
call in superfluous rcu_read_lock/unlock().

This protects nothing and just adds to confusion.  Remove the RCU
protection and open-code the radix tree lookup.  If this needs to
become more sophisticated in the future, let's add it back when
necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
Hello,

Jérôme, how do you want to route this patch?  If you prefer, I can
route it together with other patches.

Thanks.

 mm/hmm.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98..d4627c5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -845,13 +845,6 @@ static void hmm_devmem_release(struct device *dev, void *data)
 	hmm_devmem_radix_release(resource);
 }
 
-static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
-{
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
-}
-
 static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -892,9 +885,8 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
 		struct hmm_devmem *dup;
 
-		rcu_read_lock();
-		dup = hmm_devmem_find(key);
-		rcu_read_unlock();
+		dup = radix_tree_lookup(&hmm_devmem_radix,
+					key >> PA_SECTION_SHIFT);
 		if (dup) {
 			dev_err(device, "%s: collides with mapping for %s\n",
 				__func__, dev_name(dup->device));
-- 
2.9.5


* [PATCH 5/8] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods
  2018-03-14 19:45 ` [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
                     ` (2 preceding siblings ...)
  2018-03-14 19:45     ` Tejun Heo
@ 2018-03-14 19:45   ` Tejun Heo
  2018-03-19 17:10     ` Tejun Heo
  2018-03-14 19:45   ` [PATCH 6/8] RCU, workqueue: Implement rcu_work Tejun Heo
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2018-03-14 19:45 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo

percpu_ref internally uses sched-RCU to implement the percpu -> atomic
mode switching and the documentation suggested that this could be
depended upon.  This doesn't seem like a good idea.

 * percpu_ref uses sched-RCU, which has different grace periods from
   regular RCU.  Users may combine percpu_ref with regular RCU usage and
  incorrectly believe that regular RCU grace periods are performed by
  percpu_ref.  This can lead to, for example, use-after-free due to
  premature freeing.

* percpu_ref has a grace period when switching from percpu to atomic
  mode.  It doesn't have one between the last put and release.  This
  distinction is subtle and can lead to surprising bugs.

* percpu_ref allows starting in and switching to atomic mode manually
  for debugging and other purposes.  This means that there may not be
  any grace periods from kill to release.

This patch makes it clear that the grace periods are percpu_ref's
internal implementation detail and can't be depended upon by the
users.
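
To make the failure mode concrete, a sketch of both patterns (generic
names; obj_free_rcufn() is a hypothetical RCU callback doing the
actual freeing):

  /* WRONG: relies on percpu_ref's internal sched-RCU grace period */
  static void obj_release_bad(struct percpu_ref *ref)
  {
          /* regular RCU readers may still be dereferencing the object */
          kfree(container_of(ref, struct obj, ref));
  }

  /* RIGHT: the required grace period is inserted explicitly */
  static void obj_release_good(struct percpu_ref *ref)
  {
          struct obj *o = container_of(ref, struct obj, ref);

          call_rcu(&o->rcu, obj_free_rcufn);
  }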

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/percpu-refcount.h | 18 ++++++++++++------
 lib/percpu-refcount.c           |  2 ++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 864d167..009cdf3 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -30,10 +30,14 @@
  * calls io_destroy() or the process exits.
  *
  * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
- * calls percpu_ref_kill(), then hlist_del_rcu() and synchronize_rcu() to remove
- * the kioctx from the proccess's list of kioctxs - after that, there can't be
- * any new users of the kioctx (from lookup_ioctx()) and it's then safe to drop
- * the initial ref with percpu_ref_put().
+ * removes the kioctx from the process's table of kioctxs and kills percpu_ref.
+ * After that, there can't be any new users of the kioctx (from lookup_ioctx())
+ * and it's then safe to drop the initial ref with percpu_ref_put().
+ *
+ * Note that the free path, free_ioctx(), needs to go through explicit call_rcu()
+ * to synchronize with RCU protected lookup_ioctx().  percpu_ref operations don't
+ * imply RCU grace periods of any kind and if a user wants to combine percpu_ref
+ * with RCU protection, it must be done explicitly.
  *
  * Code that does a two stage shutdown like this often needs some kind of
  * explicit synchronization to ensure the initial refcount can only be dropped
@@ -113,8 +117,10 @@ void percpu_ref_reinit(struct percpu_ref *ref);
  * Must be used to drop the initial ref on a percpu refcount; must be called
  * precisely once before shutdown.
  *
- * Puts @ref in non percpu mode, then does a call_rcu() before gathering up the
- * percpu counters and dropping the initial ref.
+ * Switches @ref into atomic mode before gathering up the percpu counters
+ * and dropping the initial ref.
+ *
+ * There are no implied RCU grace periods between kill and release.
  */
 static inline void percpu_ref_kill(struct percpu_ref *ref)
 {
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 30e7dd8..9f96fa7 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -322,6 +322,8 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
  * This function normally doesn't block and can be called from any context
  * but it may block if @confirm_kill is specified and @ref is in the
  * process of switching to atomic mode by percpu_ref_switch_to_atomic().
+ *
+ * There are no implied RCU grace periods between kill and release.
  */
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill)
-- 
2.9.5


* [PATCH 6/8] RCU, workqueue: Implement rcu_work
  2018-03-14 19:45 ` [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
                     ` (3 preceding siblings ...)
  2018-03-14 19:45   ` [PATCH 5/8] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods Tejun Heo
@ 2018-03-14 19:45   ` Tejun Heo
  2018-03-14 20:13     ` Paul E. McKenney
  2018-03-16  6:01     ` Lai Jiangshan
  2018-03-14 19:45   ` [PATCH 7/8] cgroup: Use rcu_work instead of explicit rcu and work item Tejun Heo
  2018-03-14 19:45   ` [PATCH 8/8] fs/aio: " Tejun Heo
  6 siblings, 2 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-14 19:45 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo

There are cases where an RCU callback needs to be bounced to a sleepable
context.  This is currently done by the RCU callback queueing a work
item, which can be cumbersome to write and confusing to read.

This patch introduces rcu_work, a workqueue work variant which gets
executed after a RCU grace period, and converts the open coded
bouncing in fs/aio and kernel/cgroup.
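
A hedged usage sketch (struct and field names are hypothetical; the
calls are the API this patch adds):

  INIT_RCU_WORK(&f->rwork, foo_workfn);
  queue_rcu_work(system_wq, &f->rwork);   /* foo_workfn() runs after a
                                             full RCU grace period */
  ...
  flush_rcu_work(&f->rwork);              /* waits for both the grace
                                             period and the execution */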

v3: Dropped queue_rcu_work_on().  Documented rcu grace period behavior
    after queue_rcu_work().

v2: Use rcu_barrier() instead of synchronize_rcu() to wait for
    completion of previously queued rcu callback as per Paul.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/workqueue.h | 23 ++++++++++++++++++++
 kernel/workqueue.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index bc0cda1..d026f8f 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -13,6 +13,7 @@
 #include <linux/threads.h>
 #include <linux/atomic.h>
 #include <linux/cpumask.h>
+#include <linux/rcupdate.h>
 
 struct workqueue_struct;
 
@@ -120,6 +121,14 @@ struct delayed_work {
 	int cpu;
 };
 
+struct rcu_work {
+	struct work_struct work;
+	struct rcu_head rcu;
+
+	/* target workqueue ->rcu uses to queue ->work */
+	struct workqueue_struct *wq;
+};
+
 /**
  * struct workqueue_attrs - A struct for workqueue attributes.
  *
@@ -151,6 +160,11 @@ static inline struct delayed_work *to_delayed_work(struct work_struct *work)
 	return container_of(work, struct delayed_work, work);
 }
 
+static inline struct rcu_work *to_rcu_work(struct work_struct *work)
+{
+	return container_of(work, struct rcu_work, work);
+}
+
 struct execute_work {
 	struct work_struct work;
 };
@@ -266,6 +280,12 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
 #define INIT_DEFERRABLE_WORK_ONSTACK(_work, _func)			\
 	__INIT_DELAYED_WORK_ONSTACK(_work, _func, TIMER_DEFERRABLE)
 
+#define INIT_RCU_WORK(_work, _func)					\
+	INIT_WORK(&(_work)->work, (_func))
+
+#define INIT_RCU_WORK_ONSTACK(_work, _func)				\
+	INIT_WORK_ONSTACK(&(_work)->work, (_func))
+
 /**
  * work_pending - Find out whether a work item is currently pending
  * @work: The work item in question
@@ -447,6 +467,7 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *work, unsigned long delay);
 extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *dwork, unsigned long delay);
+extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
 
 extern void flush_workqueue(struct workqueue_struct *wq);
 extern void drain_workqueue(struct workqueue_struct *wq);
@@ -463,6 +484,8 @@ extern bool flush_delayed_work(struct delayed_work *dwork);
 extern bool cancel_delayed_work(struct delayed_work *dwork);
 extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
 
+extern bool flush_rcu_work(struct rcu_work *rwork);
+
 extern void workqueue_set_max_active(struct workqueue_struct *wq,
 				     int max_active);
 extern struct work_struct *current_work(void);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bb9a519..7df85fa 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1604,6 +1604,40 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(mod_delayed_work_on);
 
+static void rcu_work_rcufn(struct rcu_head *rcu)
+{
+	struct rcu_work *rwork = container_of(rcu, struct rcu_work, rcu);
+
+	/* read the comment in __queue_work() */
+	local_irq_disable();
+	__queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work);
+	local_irq_enable();
+}
+
+/**
+ * queue_rcu_work - queue work after a RCU grace period
+ * @wq: workqueue to use
+ * @rwork: work to queue
+ *
+ * Return: %false if @rwork was already pending, %true otherwise.  Note
+ * that a full RCU grace period is guaranteed only after a %true return.
+ * While @rwork is guaranteed to be executed after a %false return, the
+ * execution may happen before a full RCU grace period has passed.
+ */
+bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
+{
+	struct work_struct *work = &rwork->work;
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		rwork->wq = wq;
+		call_rcu(&rwork->rcu, rcu_work_rcufn);
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL(queue_rcu_work);
+
 /**
  * worker_enter_idle - enter idle state
  * @worker: worker which is entering idle state
@@ -3001,6 +3035,26 @@ bool flush_delayed_work(struct delayed_work *dwork)
 }
 EXPORT_SYMBOL(flush_delayed_work);
 
+/**
+ * flush_rcu_work - wait for a rwork to finish executing the last queueing
+ * @rwork: the rcu work to flush
+ *
+ * Return:
+ * %true if flush_rcu_work() waited for the work to finish execution,
+ * %false if it was already idle.
+ */
+bool flush_rcu_work(struct rcu_work *rwork)
+{
+	if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
+		rcu_barrier();
+		flush_work(&rwork->work);
+		return true;
+	} else {
+		return flush_work(&rwork->work);
+	}
+}
+EXPORT_SYMBOL(flush_rcu_work);
+
 static bool __cancel_work(struct work_struct *work, bool is_dwork)
 {
 	unsigned long flags;
-- 
2.9.5


* [PATCH 7/8] cgroup: Use rcu_work instead of explicit rcu and work item
  2018-03-14 19:45 ` [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
                     ` (4 preceding siblings ...)
  2018-03-14 19:45   ` [PATCH 6/8] RCU, workqueue: Implement rcu_work Tejun Heo
@ 2018-03-14 19:45   ` Tejun Heo
  2018-03-14 19:45   ` [PATCH 8/8] fs/aio: " Tejun Heo
  6 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-14 19:45 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo

Workqueue now has rcu_work.  Use it instead of open-coding rcu -> work
item bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
Hello,

If there's no objection, I'll route this together with rcu_work patch
through wq/for-4.17.

Thanks.

 include/linux/cgroup-defs.h |  2 +-
 kernel/cgroup/cgroup.c      | 21 +++++++--------------
 2 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 9f242b8..92d7640 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -151,8 +151,8 @@ struct cgroup_subsys_state {
 	atomic_t online_cnt;
 
 	/* percpu_ref killing and RCU release */
-	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
+	struct rcu_work destroy_rwork;
 
 	/*
 	 * PI: the parent css.	Placed here for cache proximity to following
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8cda3bc..4c5d4ca0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4514,10 +4514,10 @@ static struct cftype cgroup_base_files[] = {
  * and thus involve punting to css->destroy_work adding two additional
  * steps to the already complex sequence.
  */
-static void css_free_work_fn(struct work_struct *work)
+static void css_free_rwork_fn(struct work_struct *work)
 {
-	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+	struct cgroup_subsys_state *css = container_of(to_rcu_work(work),
+				struct cgroup_subsys_state, destroy_rwork);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
@@ -4563,15 +4563,6 @@ static void css_free_work_fn(struct work_struct *work)
 	}
 }
 
-static void css_free_rcu_fn(struct rcu_head *rcu_head)
-{
-	struct cgroup_subsys_state *css =
-		container_of(rcu_head, struct cgroup_subsys_state, rcu_head);
-
-	INIT_WORK(&css->destroy_work, css_free_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
-}
-
 static void css_release_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
@@ -4621,7 +4612,8 @@ static void css_release_work_fn(struct work_struct *work)
 
 	mutex_unlock(&cgroup_mutex);
 
-	call_rcu(&css->rcu_head, css_free_rcu_fn);
+	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
+	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
 }
 
 static void css_release(struct percpu_ref *ref)
@@ -4755,7 +4747,8 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 err_list_del:
 	list_del_rcu(&css->sibling);
 err_free_css:
-	call_rcu(&css->rcu_head, css_free_rcu_fn);
+	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
+	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
 	return ERR_PTR(err);
 }
 
-- 
2.9.5


* [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-14 19:45 ` [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
                     ` (5 preceding siblings ...)
  2018-03-14 19:45   ` [PATCH 7/8] cgroup: Use rcu_work instead of explicit rcu and work item Tejun Heo
@ 2018-03-14 19:45   ` Tejun Heo
  2018-03-19 17:12     ` Tejun Heo
  2018-03-21 15:58     ` Oleg Nesterov
  6 siblings, 2 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-14 19:45 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team, Tejun Heo

Workqueue now has rcu_work.  Use it instead of open-coding rcu -> work
item bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
Hello,

If there's no objection, I'll route this together with rcu_work patch
through wq/for-4.17.

Thanks.

 fs/aio.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6bcd3fb..88d7927 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -115,8 +115,7 @@ struct kioctx {
 	struct page		**ring_pages;
 	long			nr_pages;
 
-	struct rcu_head		free_rcu;
-	struct work_struct	free_work;	/* see free_ioctx() */
+	struct rcu_work		free_rwork;	/* see free_ioctx() */
 
 	/*
 	 * signals when all in-flight requests are done
@@ -592,13 +591,12 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
 /*
  * free_ioctx() should be RCU delayed to synchronize against the RCU
  * protected lookup_ioctx() and also needs process context to call
- * aio_free_ring(), so the double bouncing through kioctx->free_rcu and
- * ->free_work.
+ * aio_free_ring().  Use rcu_work.
  */
 static void free_ioctx(struct work_struct *work)
 {
-	struct kioctx *ctx = container_of(work, struct kioctx, free_work);
-
+	struct kioctx *ctx = container_of(to_rcu_work(work), struct kioctx,
+					  free_rwork);
 	pr_debug("freeing %p\n", ctx);
 
 	aio_free_ring(ctx);
@@ -608,14 +606,6 @@ static void free_ioctx(struct work_struct *work)
 	kmem_cache_free(kioctx_cachep, ctx);
 }
 
-static void free_ioctx_rcufn(struct rcu_head *head)
-{
-	struct kioctx *ctx = container_of(head, struct kioctx, free_rcu);
-
-	INIT_WORK(&ctx->free_work, free_ioctx);
-	schedule_work(&ctx->free_work);
-}
-
 static void free_ioctx_reqs(struct percpu_ref *ref)
 {
 	struct kioctx *ctx = container_of(ref, struct kioctx, reqs);
@@ -625,7 +615,8 @@ static void free_ioctx_reqs(struct percpu_ref *ref)
 		complete(&ctx->rq_wait->comp);
 
 	/* Synchronize against RCU protected table->table[] dereferences */
-	call_rcu(&ctx->free_rcu, free_ioctx_rcufn);
+	INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
+	queue_rcu_work(system_wq, &ctx->free_rwork);
 }
 
 /*
-- 
2.9.5


* Re: [PATCH 6/8] RCU, workqueue: Implement rcu_work
  2018-03-14 19:45   ` [PATCH 6/8] RCU, workqueue: Implement rcu_work Tejun Heo
@ 2018-03-14 20:13     ` Paul E. McKenney
  2018-03-16  6:01     ` Lai Jiangshan
  1 sibling, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2018-03-14 20:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jannh, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

On Wed, Mar 14, 2018 at 12:45:13PM -0700, Tejun Heo wrote:
> There are cases where an RCU callback needs to be bounced to a sleepable
> context.  This is currently done by the RCU callback queueing a work
> item, which can be cumbersome to write and confusing to read.
> 
> This patch introduces rcu_work, a workqueue work variant which gets
> executed after a RCU grace period, and converts the open coded
> bouncing in fs/aio and kernel/cgroup.
> 
> v3: Dropped queue_rcu_work_on().  Documented rcu grace period behavior
>     after queue_rcu_work().
> 
> v2: Use rcu_barrier() instead of synchronize_rcu() to wait for
>     completion of previously queued rcu callback as per Paul.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Looks good to me!

Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  include/linux/workqueue.h | 23 ++++++++++++++++++++
>  kernel/workqueue.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 77 insertions(+)
> 
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index bc0cda1..d026f8f 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -13,6 +13,7 @@
>  #include <linux/threads.h>
>  #include <linux/atomic.h>
>  #include <linux/cpumask.h>
> +#include <linux/rcupdate.h>
> 
>  struct workqueue_struct;
> 
> @@ -120,6 +121,14 @@ struct delayed_work {
>  	int cpu;
>  };
> 
> +struct rcu_work {
> +	struct work_struct work;
> +	struct rcu_head rcu;
> +
> +	/* target workqueue ->rcu uses to queue ->work */
> +	struct workqueue_struct *wq;
> +};
> +
>  /**
>   * struct workqueue_attrs - A struct for workqueue attributes.
>   *
> @@ -151,6 +160,11 @@ static inline struct delayed_work *to_delayed_work(struct work_struct *work)
>  	return container_of(work, struct delayed_work, work);
>  }
> 
> +static inline struct rcu_work *to_rcu_work(struct work_struct *work)
> +{
> +	return container_of(work, struct rcu_work, work);
> +}
> +
>  struct execute_work {
>  	struct work_struct work;
>  };
> @@ -266,6 +280,12 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
>  #define INIT_DEFERRABLE_WORK_ONSTACK(_work, _func)			\
>  	__INIT_DELAYED_WORK_ONSTACK(_work, _func, TIMER_DEFERRABLE)
> 
> +#define INIT_RCU_WORK(_work, _func)					\
> +	INIT_WORK(&(_work)->work, (_func))
> +
> +#define INIT_RCU_WORK_ONSTACK(_work, _func)				\
> +	INIT_WORK_ONSTACK(&(_work)->work, (_func))
> +
>  /**
>   * work_pending - Find out whether a work item is currently pending
>   * @work: The work item in question
> @@ -447,6 +467,7 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
>  			struct delayed_work *work, unsigned long delay);
>  extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
>  			struct delayed_work *dwork, unsigned long delay);
> +extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
> 
>  extern void flush_workqueue(struct workqueue_struct *wq);
>  extern void drain_workqueue(struct workqueue_struct *wq);
> @@ -463,6 +484,8 @@ extern bool flush_delayed_work(struct delayed_work *dwork);
>  extern bool cancel_delayed_work(struct delayed_work *dwork);
>  extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
> 
> +extern bool flush_rcu_work(struct rcu_work *rwork);
> +
>  extern void workqueue_set_max_active(struct workqueue_struct *wq,
>  				     int max_active);
>  extern struct work_struct *current_work(void);
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index bb9a519..7df85fa 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1604,6 +1604,40 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
>  }
>  EXPORT_SYMBOL_GPL(mod_delayed_work_on);
> 
> +static void rcu_work_rcufn(struct rcu_head *rcu)
> +{
> +	struct rcu_work *rwork = container_of(rcu, struct rcu_work, rcu);
> +
> +	/* read the comment in __queue_work() */
> +	local_irq_disable();
> +	__queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work);
> +	local_irq_enable();
> +}
> +
> +/**
> + * queue_rcu_work - queue work after a RCU grace period
> + * @wq: workqueue to use
> + * @rwork: work to queue
> + *
> + * Return: %false if @rwork was already pending, %true otherwise.  Note
> + * that a full RCU grace period is guaranteed only after a %true return.
> + * While @rwork is guaranteed to be executed after a %false return, the
> + * execution may happen before a full RCU grace period has passed.
> + */
> +bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
> +{
> +	struct work_struct *work = &rwork->work;
> +
> +	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> +		rwork->wq = wq;
> +		call_rcu(&rwork->rcu, rcu_work_rcufn);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL(queue_rcu_work);
> +
>  /**
>   * worker_enter_idle - enter idle state
>   * @worker: worker which is entering idle state
> @@ -3001,6 +3035,26 @@ bool flush_delayed_work(struct delayed_work *dwork)
>  }
>  EXPORT_SYMBOL(flush_delayed_work);
> 
> +/**
> + * flush_rcu_work - wait for a rwork to finish executing the last queueing
> + * @rwork: the rcu work to flush
> + *
> + * Return:
> + * %true if flush_rcu_work() waited for the work to finish execution,
> + * %false if it was already idle.
> + */
> +bool flush_rcu_work(struct rcu_work *rwork)
> +{
> +	if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
> +		rcu_barrier();
> +		flush_work(&rwork->work);
> +		return true;
> +	} else {
> +		return flush_work(&rwork->work);
> +	}
> +}
> +EXPORT_SYMBOL(flush_rcu_work);
> +
>  static bool __cancel_work(struct work_struct *work, bool is_dwork)
>  {
>  	unsigned long flags;
> -- 
> 2.9.5
> 


* Re: [PATCH 3/8] RDMAVT: Fix synchronization around percpu_ref
  2018-03-14 19:45   ` [PATCH 3/8] RDMAVT: Fix synchronization around percpu_ref Tejun Heo
@ 2018-03-15 22:24     ` Jason Gunthorpe
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2018-03-15 22:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team, Mike Marciniszyn, linux-rdma

On Wed, Mar 14, 2018 at 12:45:10PM -0700, Tejun Heo wrote:
> rvt_mregion uses percpu_ref for reference counting and RCU to protect
> accesses from lkey_table.  When a rvt_mregion needs to be freed, it
> first gets unregistered from lkey_table and then rvt_check_refs() is
> called to wait for in-flight usages before the rvt_mregion is freed.
> 
> rvt_check_refs() seems to have a couple issues.
> 
> * It has a fast exit path which tests percpu_ref_is_zero().  However,
>   a percpu_ref reading zero doesn't mean that the object can be
>   released.  In fact, the ->release() callback might not even have
>   started executing yet.  Proceeding with freeing can lead to
>   use-after-free.
> 
> * lkey_table is RCU protected but there is no RCU grace period in the
>   free path.  percpu_ref uses RCU internally but it's sched-RCU whose
>   grace periods are different from regular RCU.  Also, it generally
>   isn't a good idea to depend on internal behaviors like this.
> 
> To address the above issues, this patch removes the fast exit and adds
> an explicit synchronize_rcu().
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: linux-rdma@vger.kernel.org
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>  drivers/infiniband/sw/rdmavt/mr.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)

Applied to rdma for-next

Thanks,
Jason


* Re: [PATCH 6/8] RCU, workqueue: Implement rcu_work
  2018-03-14 19:45   ` [PATCH 6/8] RCU, workqueue: Implement rcu_work Tejun Heo
  2018-03-14 20:13     ` Paul E. McKenney
@ 2018-03-16  6:01     ` Lai Jiangshan
  2018-03-19 16:45       ` Tejun Heo
  1 sibling, 1 reply; 31+ messages in thread
From: Lai Jiangshan @ 2018-03-16  6:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Jann Horn, Paul E. McKenney, Benjamin LaHaise,
	Al Viro, Kent Overstreet, security, LKML, kernel-team

On Thu, Mar 15, 2018 at 3:45 AM, Tejun Heo <tj@kernel.org> wrote:
> There are cases where an RCU callback needs to be bounced to a sleepable
> context.  This is currently done by the RCU callback queueing a work
> item, which can be cumbersome to write and confusing to read.
>
> This patch introduces rcu_work, a workqueue work variant which gets
> executed after a RCU grace period, and converts the open coded
> bouncing in fs/aio and kernel/cgroup.
>
> v3: Dropped queue_rcu_work_on().  Documented rcu grace period behavior
>     after queue_rcu_work().
>
> v2: Use rcu_barrier() instead of synchronize_rcu() to wait for
>     completion of previously queued rcu callback as per Paul.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  include/linux/workqueue.h | 23 ++++++++++++++++++++
>  kernel/workqueue.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 77 insertions(+)
>
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index bc0cda1..d026f8f 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -13,6 +13,7 @@
>  #include <linux/threads.h>
>  #include <linux/atomic.h>
>  #include <linux/cpumask.h>
> +#include <linux/rcupdate.h>
>
>  struct workqueue_struct;
>
> @@ -120,6 +121,14 @@ struct delayed_work {
>         int cpu;
>  };
>
> +struct rcu_work {
> +       struct work_struct work;
> +       struct rcu_head rcu;
> +
> +       /* target workqueue ->rcu uses to queue ->work */
> +       struct workqueue_struct *wq;
> +};
> +
>  /**
>   * struct workqueue_attrs - A struct for workqueue attributes.
>   *
> @@ -151,6 +160,11 @@ static inline struct delayed_work *to_delayed_work(struct work_struct *work)
>         return container_of(work, struct delayed_work, work);
>  }
>
> +static inline struct rcu_work *to_rcu_work(struct work_struct *work)
> +{
> +       return container_of(work, struct rcu_work, work);
> +}
> +
>  struct execute_work {
>         struct work_struct work;
>  };
> @@ -266,6 +280,12 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
>  #define INIT_DEFERRABLE_WORK_ONSTACK(_work, _func)                     \
>         __INIT_DELAYED_WORK_ONSTACK(_work, _func, TIMER_DEFERRABLE)
>
> +#define INIT_RCU_WORK(_work, _func)                                    \
> +       INIT_WORK(&(_work)->work, (_func))
> +
> +#define INIT_RCU_WORK_ONSTACK(_work, _func)                            \
> +       INIT_WORK_ONSTACK(&(_work)->work, (_func))
> +
>  /**
>   * work_pending - Find out whether a work item is currently pending
>   * @work: The work item in question
> @@ -447,6 +467,7 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
>                         struct delayed_work *work, unsigned long delay);
>  extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
>                         struct delayed_work *dwork, unsigned long delay);
> +extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
>
>  extern void flush_workqueue(struct workqueue_struct *wq);
>  extern void drain_workqueue(struct workqueue_struct *wq);
> @@ -463,6 +484,8 @@ extern bool flush_delayed_work(struct delayed_work *dwork);
>  extern bool cancel_delayed_work(struct delayed_work *dwork);
>  extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
>
> +extern bool flush_rcu_work(struct rcu_work *rwork);
> +
>  extern void workqueue_set_max_active(struct workqueue_struct *wq,
>                                      int max_active);
>  extern struct work_struct *current_work(void);
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index bb9a519..7df85fa 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1604,6 +1604,40 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
>  }
>  EXPORT_SYMBOL_GPL(mod_delayed_work_on);
>
> +static void rcu_work_rcufn(struct rcu_head *rcu)
> +{
> +       struct rcu_work *rwork = container_of(rcu, struct rcu_work, rcu);
> +
> +       /* read the comment in __queue_work() */
> +       local_irq_disable();
> +       __queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work);
> +       local_irq_enable();
> +}
> +
> +/**
> + * queue_rcu_work - queue work after a RCU grace period
> + * @wq: workqueue to use
> + * @rwork: work to queue
> + *
> + * Return: %false if @rwork was already pending, %true otherwise.  Note
> + * that a full RCU grace period is guaranteed only after a %true return.
> + * While @rwork is guaranteed to be executed after a %false return, the
> + * execution may happen before a full RCU grace period has passed.
> + */

LGTM

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

> +bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
> +{
> +       struct work_struct *work = &rwork->work;
> +
> +       if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> +               rwork->wq = wq;
> +               call_rcu(&rwork->rcu, rcu_work_rcufn);
> +               return true;
> +       }
> +
> +       return false;
> +}
> +EXPORT_SYMBOL(queue_rcu_work);
> +
>  /**
>   * worker_enter_idle - enter idle state
>   * @worker: worker which is entering idle state
> @@ -3001,6 +3035,26 @@ bool flush_delayed_work(struct delayed_work *dwork)
>  }
>  EXPORT_SYMBOL(flush_delayed_work);
>
> +/**
> + * flush_rcu_work - wait for a rwork to finish executing the last queueing
> + * @rwork: the rcu work to flush
> + *
> + * Return:
> + * %true if flush_rcu_work() waited for the work to finish execution,
> + * %false if it was already idle.
> + */
> +bool flush_rcu_work(struct rcu_work *rwork)
> +{
> +       if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
> +               rcu_barrier();
> +               flush_work(&rwork->work);
> +               return true;

A possible tiny improvement: check if it was already queued on wq.
For example:

       if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
               if (!flush_work(&rwork->work)) {
                      rcu_barrier();
                      flush_work(&rwork->work);
               }
               return true;
> +       } else {
> +               return flush_work(&rwork->work);
> +       }
> +}
> +EXPORT_SYMBOL(flush_rcu_work);
> +
>  static bool __cancel_work(struct work_struct *work, bool is_dwork)
>  {
>         unsigned long flags;
> --
> 2.9.5
>


* Re: [PATCH 6/8] RCU, workqueue: Implement rcu_work
  2018-03-16  6:01     ` Lai Jiangshan
@ 2018-03-19 16:45       ` Tejun Heo
  2018-03-20 10:04         ` Lai Jiangshan
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2018-03-19 16:45 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Linus Torvalds, Jann Horn, Paul E. McKenney, Benjamin LaHaise,
	Al Viro, Kent Overstreet, security, LKML, kernel-team

Hello, Lai.

On Fri, Mar 16, 2018 at 02:01:35PM +0800, Lai Jiangshan wrote:
> > +bool flush_rcu_work(struct rcu_work *rwork)
> > +{
> > +       if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
> > +               rcu_barrier();
> > +               flush_work(&rwork->work);
> > +               return true;
> 
> A possible tiny improvement: check if it was already queued on wq.
> For example:
> 
>        if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
>                if (!flush_work(&rwork->work)) {
>                       rcu_barrier();
>                       flush_work(&rwork->work);
>                }
>                return true;

But this breaks the guarantee that flush_work waits for the latest
queueing instance.  Please consider the following scenario.


 1. rcu-work is queued
 2. rcu-work starts executing
				3. rcu-work is queued again
				4. rcu-work is flushed
 5. execution finishes
				6. flush finishes
				7. execution finishes

6 should happen after 7 but it didn't.

Thanks.

-- 
tejun


* Re: [PATCH 5/8] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods
  2018-03-14 19:45   ` [PATCH 5/8] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods Tejun Heo
@ 2018-03-19 17:10     ` Tejun Heo
  0 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-19 17:10 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team

On Wed, Mar 14, 2018 at 12:45:12PM -0700, Tejun Heo wrote:
> This patch makes it clear that the grace periods are percpu_ref's
> internal implementation detail and can't be depended upon by the
> users.

Applied to percpu/for-4.16-fixes.

Thanks.

-- 
tejun


* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-14 19:45   ` [PATCH 8/8] fs/aio: " Tejun Heo
@ 2018-03-19 17:12     ` Tejun Heo
  2018-03-21 15:58     ` Oleg Nesterov
  1 sibling, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-19 17:12 UTC (permalink / raw)
  To: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet
  Cc: security, linux-kernel, kernel-team

On Wed, Mar 14, 2018 at 12:45:15PM -0700, Tejun Heo wrote:
> Workqueue now has rcu_work.  Use it instead of open-coding rcu -> work
> item bouncing.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Applied 6-8 to wq/for-4.17.

Thanks.

-- 
tejun


* Re: [PATCH 6/8] RCU, workqueue: Implement rcu_work
  2018-03-19 16:45       ` Tejun Heo
@ 2018-03-20 10:04         ` Lai Jiangshan
  0 siblings, 0 replies; 31+ messages in thread
From: Lai Jiangshan @ 2018-03-20 10:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Jann Horn, Paul E. McKenney, Benjamin LaHaise,
	Al Viro, Kent Overstreet, security, LKML, kernel-team

On Tue, Mar 20, 2018 at 12:45 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Lai.
>
> On Fri, Mar 16, 2018 at 02:01:35PM +0800, Lai Jiangshan wrote:
>> > +bool flush_rcu_work(struct rcu_work *rwork)
>> > +{
>> > +       if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
>> > +               rcu_barrier();
>> > +               flush_work(&rwork->work);
>> > +               return true;
>>
>> A possible tiny improvement: check if it was already queued on wq.
>> For example:
>>
>>        if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
>>                if (!flush_work(&rwork->work)) {
>>                       rcu_barrier();
>>                       flush_work(&rwork->work);
>>                }
>>                return true;
>
> But this breaks the guarantee that flush_work waits for the latest
> queueing instance.  Please consider the following scenario.

Oh, I'm sorry, I was wrong.  It is evident that
"flush_work(&rwork->work) returned true"
doesn't equal "it has been queued on wq".

Detecting whether "it has been queued on wq" would require
a bunch of code.  It is not worth complicating
this function.

>
>
>  1. rcu-work is queued
>  2. rcu-work starts executing
>                                 3. rcu-work is queued again
>                                 4. rcu-work is flushed
>  5. execution finishes
>                                 6. flush finishes
>                                 7. execution finishes
>
> 6 should happen after 7 but it didn't.
>
> Thanks.
>
> --
> tejun


* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-14 19:45   ` [PATCH 8/8] fs/aio: " Tejun Heo
  2018-03-19 17:12     ` Tejun Heo
@ 2018-03-21 15:58     ` Oleg Nesterov
  2018-03-21 16:40       ` Tejun Heo
  1 sibling, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2018-03-21 15:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

Hi Tejun,

sorry for late reply.

On 03/14, Tejun Heo wrote:
>
> Workqueue now has rcu_work.  Use it instead of open-coding rcu -> work
> item bouncing.

Yes, but a bit of open-coding may be more efficient...

> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -115,8 +115,7 @@ struct kioctx {
>  	struct page		**ring_pages;
>  	long			nr_pages;
>  
> -	struct rcu_head		free_rcu;
> -	struct work_struct	free_work;	/* see free_ioctx() */
> +	struct rcu_work		free_rwork;	/* see free_ioctx() */

IIUC, you can't easily share rcu_work's, thus every kioctx needs its own
->free_rwork and this looks sub-optimal.

What do you think about the (untested) patch below?

Oleg.


--- a/fs/aio.c
+++ b/fs/aio.c
@@ -115,8 +115,10 @@ struct kioctx {
 	struct page		**ring_pages;
 	long			nr_pages;
 
-	struct rcu_head		free_rcu;
-	struct work_struct	free_work;	/* see free_ioctx() */
+	union {
+		struct rcu_head		free_rcu;
+		struct llist_node	free_llist;
+	};
 
 	/*
 	 * signals when all in-flight requests are done
@@ -589,31 +591,38 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
 	return cancel(&kiocb->common);
 }
 
+static struct llist_head free_ioctx_llist;
+
 /*
  * free_ioctx() should be RCU delayed to synchronize against the RCU
  * protected lookup_ioctx() and also needs process context to call
  * aio_free_ring(), so the double bouncing through kioctx->free_rcu and
  * ->free_work.
  */
-static void free_ioctx(struct work_struct *work)
+static void free_ioctx_workfn(struct work_struct *work)
 {
-	struct kioctx *ctx = container_of(work, struct kioctx, free_work);
+	struct llist_node *llist = llist_del_all(&free_ioctx_llist);
+	struct kioctx *ctx, *tmp;
 
-	pr_debug("freeing %p\n", ctx);
+	llist_for_each_entry_safe(ctx, tmp, llist, free_llist) {
+		pr_debug("freeing %p\n", ctx);
 
-	aio_free_ring(ctx);
-	free_percpu(ctx->cpu);
-	percpu_ref_exit(&ctx->reqs);
-	percpu_ref_exit(&ctx->users);
-	kmem_cache_free(kioctx_cachep, ctx);
+		aio_free_ring(ctx);
+		free_percpu(ctx->cpu);
+		percpu_ref_exit(&ctx->reqs);
+		percpu_ref_exit(&ctx->users);
+		kmem_cache_free(kioctx_cachep, ctx);
+	}
 }
 
+static DECLARE_WORK(free_ioctx_work, free_ioctx_workfn);
+
 static void free_ioctx_rcufn(struct rcu_head *head)
 {
 	struct kioctx *ctx = container_of(head, struct kioctx, free_rcu);
 
-	INIT_WORK(&ctx->free_work, free_ioctx);
-	schedule_work(&ctx->free_work);
+	if (llist_add(&ctx->free_llist, &free_ioctx_llist))
+		schedule_work(&free_ioctx_work);
 }
 
 static void free_ioctx_reqs(struct percpu_ref *ref)
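
The point of the llist above: llist_add() returns true only when the
node lands on an empty list, so only the first kioctx of a batch
schedules the single shared work item, and the work function then
drains the whole batch with llist_del_all(). The same idiom in a
generic, stand-alone form (hypothetical names, untested):

	struct item {
		struct llist_node	llnode;
		/* ... payload ... */
	};

	static LLIST_HEAD(pending_items);

	static void drain_workfn(struct work_struct *work)
	{
		struct llist_node *list = llist_del_all(&pending_items);
		struct item *it, *tmp;

		llist_for_each_entry_safe(it, tmp, list, llnode)
			free_one(it);		/* hypothetical release */
	}
	static DECLARE_WORK(drain_work, drain_workfn);

	static void queue_item(struct item *it)
	{
		/* true iff the list was empty -> we schedule the drain */
		if (llist_add(&it->llnode, &pending_items))
			schedule_work(&drain_work);
	}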

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-21 15:58     ` Oleg Nesterov
@ 2018-03-21 16:40       ` Tejun Heo
  2018-03-21 17:17         ` Oleg Nesterov
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2018-03-21 16:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

Hey, Oleg.

On Wed, Mar 21, 2018 at 04:58:13PM +0100, Oleg Nesterov wrote:
> > -	struct rcu_head		free_rcu;
> > -	struct work_struct	free_work;	/* see free_ioctx() */
> > +	struct rcu_work		free_rwork;	/* see free_ioctx() */
> 
> IIUC, you can't easily share rcu_work's, thus every kioctx needs its own
> ->free_rwork and this looks sub-optimal.
>
> What do you think about the (untested) patch below?
> 
> Oleg.
> 
> 
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -115,8 +115,10 @@ struct kioctx {
>  	struct page		**ring_pages;
>  	long			nr_pages;
>  
> -	struct rcu_head		free_rcu;
> -	struct work_struct	free_work;	/* see free_ioctx() */
> +	union {
> +		struct rcu_head		free_rcu;
> +		struct llist_node	free_llist;
> +	};

It really depends on how much we want to optimize.  Do you think it
matters enough?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-21 16:40       ` Tejun Heo
@ 2018-03-21 17:17         ` Oleg Nesterov
  2018-03-21 17:53           ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2018-03-21 17:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

On 03/21, Tejun Heo wrote:
>
> Hey, Oleg.
>
> On Wed, Mar 21, 2018 at 04:58:13PM +0100, Oleg Nesterov wrote:
> > > -	struct rcu_head		free_rcu;
> > > -	struct work_struct	free_work;	/* see free_ioctx() */
> > > +	struct rcu_work		free_rwork;	/* see free_ioctx() */
> >
> > IIUC, you can't easily share rcu_work's, thus every kioctx needs its own
> > ->free_rwork and this looks sub-optimal.
> >
> > What do you think about the (untested) patch below?
> >
> > Oleg.
> >
> >
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -115,8 +115,10 @@ struct kioctx {
> >  	struct page		**ring_pages;
> >  	long			nr_pages;
> >
> > -	struct rcu_head		free_rcu;
> > -	struct work_struct	free_work;	/* see free_ioctx() */
> > +	union {
> > +		struct rcu_head		free_rcu;
> > +		struct llist_node	free_llist;
> > +	};
>
> It really depends on how much we want to optimize.  Do you think it
> matters enough?

I have no idea, probably not.

Mostly I am asking because I do not really understand
"[PATCH 6/8] RCU, workqueue: Implement rcu_work".

I mean, the code looks simple and correct, but why does it play with
WORK_STRUCT_PENDING_BIT? IOW, I do not see a "good" use-case in which 2 or
more queue_rcu_work()'s can use the same rwork and hit work_pending() == T.
And what should the caller do if queue_rcu_work() returns false?
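
For context, IIUC the queueing side under discussion is roughly this
(paraphrased, not the verbatim code from the patch):

	bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
	{
		struct work_struct *work = &rwork->work;

		/* whoever wins PENDING owns the rcu -> wq bounce */
		if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
				      work_data_bits(work))) {
			rwork->wq = wq;
			call_rcu(&rwork->rcu, rcu_work_rcufn);
			return true;
		}
		return false;	/* already pending, we queued nothing */
	}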

Oleg.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-21 17:17         ` Oleg Nesterov
@ 2018-03-21 17:53           ` Tejun Heo
  2018-03-22 11:24             ` Oleg Nesterov
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2018-03-21 17:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

Hello,

On Wed, Mar 21, 2018 at 06:17:43PM +0100, Oleg Nesterov wrote:
> Mostly I am asking because I do not really understand
> "[PATCH 6/8] RCU, workqueue: Implement rcu_work".
> 
> I mean, the code looks simple and correct, but why does it play with
> WORK_STRUCT_PENDING_BIT? IOW, I do not see a "good" use-case in which 2 or
> more queue_rcu_work()'s can use the same rwork and hit work_pending() == T.
> And what should the caller do if queue_rcu_work() returns false?

It's just following standard workqueue conventions.  We can try to
make it more specialized but then flush_rcu_work()'s behavior would
have to be different too, and it gets confusing quickly.  Unless there
are overriding reasons to deviate, I'd like to keep it consistent.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-21 17:53           ` Tejun Heo
@ 2018-03-22 11:24             ` Oleg Nesterov
  2018-03-26 15:04               ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2018-03-22 11:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

On 03/21, Tejun Heo wrote:
>
> Hello,
>
> On Wed, Mar 21, 2018 at 06:17:43PM +0100, Oleg Nesterov wrote:
> > Mostly I am asking because I do not really understand
> > "[PATCH 6/8] RCU, workqueue: Implement rcu_work".
> >
> > I mean, the code looks simple and correct, but why does it play with
> > WORK_STRUCT_PENDING_BIT? IOW, I do not see a "good" use-case in which 2 or
> > more queue_rcu_work()'s can use the same rwork and hit work_pending() == T.
> > And what should the caller do if queue_rcu_work() returns false?
>
> It's just following standard workqueue conventions.

OK. And I agree that the usage of WORK_STRUCT_PENDING_BIT in queue_rcu_work() is
fine. If nothing else, the kernel won't crash if you call queue_rcu_work() twice.

But why can't flush_rcu_work() simply do flush_work()?

If WORK_STRUCT_PENDING_BIT was set by us (rcu_work_rcufn() succeeded), we do not
need rcu_barrier(). Why should we care about other rcu callbacks?

If rcu_work_rcufn() fails and someone else sets PENDING, how can this rcu_barrier()
help? We didn't even do call_rcu() in this case.

In short: once flush_work() returns, we know that the rcu callback which queued
this work is finished. It doesn't matter whether it was fired by us or not. And
if it was not fired by us, then rcu_barrier() doesn't imply a grace period anyway.
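
IOW, I would expect something as simple as (untested):

	bool flush_rcu_work(struct rcu_work *rwork)
	{
		return flush_work(&rwork->work);
	}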


> We can try to
> make it more specialized but then flush_rcu_work()'s behavior would
> have to be different too, and it gets confusing quickly.  Unless there
> are overriding reasons to deviate, I'd like to keep it consistent.

Perhaps this all is consistent, but I fail to understand this API :/

And again, at least for fs/aio.c it doesn't offer much and is sub-optimal
compared to call_rcu() + schedule_work() by hand.

Oleg.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/8] HMM: Remove superflous RCU protection around radix tree lookup
  2018-03-14 19:45     ` Tejun Heo
@ 2018-03-26 14:54       ` Tejun Heo
  0 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-26 14:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: security, linux-kernel, kernel-team, linux-mm, torvalds, jannh,
	paulmck, bcrl, viro, kent.overstreet

Hello, Andrew.

Do you mind picking up the following patch?  I can't find a good tree
to route this through.  The raw patch can be found at

  https://marc.info/?l=linux-mm&m=152105674112496&q=raw

Thank you very much.

On Wed, Mar 14, 2018 at 12:45:11PM -0700, Tejun Heo wrote:
> hmm_devmem_find() requires rcu_read_lock_held() but there's nothing
> which actually uses the RCU protection.  The only caller is
> hmm_devmem_pages_create() which already grabs the mutex and does
> superflous rcu_read_lock/unlock() around the function.
> 
> This doesn't add anything and just adds to confusion.  Remove the RCU
> protection and open-code the radix tree lookup.  If this needs to
> become more sophisticated in the future, let's add them back when
> necessary.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> Hello,
> 
> Jérôme, how do you want to route this patch?  If you prefer, I can
> route it together with other patches.
> 
> Thanks.
> 
>  mm/hmm.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 320545b98..d4627c5 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -845,13 +845,6 @@ static void hmm_devmem_release(struct device *dev, void *data)
>  	hmm_devmem_radix_release(resource);
>  }
>  
> -static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
> -{
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> -
> -	return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
> -}
> -
>  static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  {
>  	resource_size_t key, align_start, align_size, align_end;
> @@ -892,9 +885,8 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
>  		struct hmm_devmem *dup;
>  
> -		rcu_read_lock();
> -		dup = hmm_devmem_find(key);
> -		rcu_read_unlock();
> +		dup = radix_tree_lookup(&hmm_devmem_radix,
> +					key >> PA_SECTION_SHIFT);
>  		if (dup) {
>  			dev_err(device, "%s: collides with mapping for %s\n",
>  				__func__, dev_name(dup->device));
> -- 
> 2.9.5
> 

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-22 11:24             ` Oleg Nesterov
@ 2018-03-26 15:04               ` Tejun Heo
  2018-03-27 14:28                 ` Oleg Nesterov
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2018-03-26 15:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

Hey, Oleg.

On Thu, Mar 22, 2018 at 12:24:12PM +0100, Oleg Nesterov wrote:
> OK. And I agree that the usage of WORK_STRUCT_PENDING_BIT in queue_rcu_work() is
> fine. If nothing else, the kernel won't crash if you call queue_rcu_work() twice.
>
> But why can't flush_rcu_work() simply do flush_work()?
>
> If WORK_STRUCT_PENDING_BIT was set by us (rcu_work_rcufn() succeeded), we do not
> need rcu_barrier(). Why should we care about other rcu callbacks?
>
> If rcu_work_rcufn() fails and someone else sets PENDING, how can this rcu_barrier()
> help? We didn't even do call_rcu() in this case.
>
> In short: once flush_work() returns, we know that the rcu callback which queued
> this work is finished. It doesn't matter whether it was fired by us or not. And
> if it was not fired by us, then rcu_barrier() doesn't imply a grace period anyway.

flush_*work() guarantees to wait for the completion of the latest
instance of the work item which was visible to the caller.  We can't
guarantee that w/o rcu_barrier().

> > We can try to
> > make it more specialized but then flush_rcu_work()'s behavior would
> > have to be different too, and it gets confusing quickly.  Unless there
> > are overriding reasons to deviate, I'd like to keep it consistent.
> 
> Perhaps this all is consistent, but I fail to understand this API :/
> 
> And again, at least for fs/aio.c it doesn't offer much and is sub-optimal
> compared to call_rcu() + schedule_work() by hand.

Sure, this isn't about performance.  It's about making code less
painful on the eyes.  If performance matters, we sure can hand-craft
things, which doesn't seem to be the case, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-26 15:04               ` Tejun Heo
@ 2018-03-27 14:28                 ` Oleg Nesterov
  2018-03-27 15:55                   ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2018-03-27 14:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

Hi Tejun,

On 03/26, Tejun Heo wrote:
>
> On Thu, Mar 22, 2018 at 12:24:12PM +0100, Oleg Nesterov wrote:
> >
> > But why can't flush_rcu_work() simply do flush_work()?
> >
> > If WORK_STRUCT_PENDING_BIT was set by us (rcu_work_rcufn() succeeded), we do not
> > need rcu_barrier(). Why should we care about other rcu callbacks?
> >
> > If rcu_work_rcufn() fails and someone else sets PENDING, how can this rcu_barrier()
> > help? We didn't even do call_rcu() in this case.
> >
> > In short: once flush_work() returns, we know that the rcu callback which queued
> > this work is finished. It doesn't matter whether it was fired by us or not. And
> > if it was not fired by us, then rcu_barrier() doesn't imply a grace period anyway.
>
> flush_*work() guarantees to wait for the completion of the latest
> instance of the work item which was visible to the caller.  We can't
> guarantee that w/o rcu_barrier().

And this is what I can't understand.

So let me repeat: could you please describe a use-case which needs
flush_rcu_work() with rcu_barrier()?


> > And again, at least for fs/aio.c it doesn't offer much and is sub-optimal
> > compared to call_rcu() + schedule_work() by hand.
>
> Sure, this isn't about performance.  It's about making code less
> painful on the eyes.  If performance matters, we sure can hand-craft
> things, which doesn't seem to be the case, right?

OK, I won't insist.

Oleg.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-27 14:28                 ` Oleg Nesterov
@ 2018-03-27 15:55                   ` Tejun Heo
  2018-03-29 16:49                     ` Oleg Nesterov
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2018-03-27 15:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

Hey,

On Tue, Mar 27, 2018 at 04:28:48PM +0200, Oleg Nesterov wrote:
> > flush_*work() guarantees to wait for the completion of the latest
> > instance of the work item which was visible to the caller.  We can't
> > guarantee that w/o rcu_barrier().
> 
> And this is what I can't understand.
> 
> So let me repeat: could you please describe a use-case which needs
> flush_rcu_work() with rcu_barrier()?

So, if you skip that, flush_work() by itself won't wait for the PENDING
bit at all.  It'll return right away if the work item is still waiting
for the RCU grace period.
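
IOW, without the rcu_barrier() the following can happen (a sketch):

	/*
	 *   CPU A                          CPU B
	 *   queue_rcu_work(wq, rwork)
	 *     sets PENDING
	 *     call_rcu(rcu_work_rcufn)
	 *                                  flush_rcu_work(rwork)
	 *                                    flush_work() finds nothing
	 *                                    queued or running -> returns
	 *   <RCU grace period elapses>
	 *   rcu_work_rcufn() queues work
	 *   work item runs after "flush"
	 *
	 * rcu_barrier() waits for the pending rcu_work_rcufn() first, so
	 * the flush_work() that follows is guaranteed to see the work.
	 */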

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/8] HMM: Remove superflous RCU protection around radix tree lookup
  2018-03-26 14:54       ` Tejun Heo
@ 2018-03-27 16:12         ` Jerome Glisse
  0 siblings, 0 replies; 31+ messages in thread
From: Jerome Glisse @ 2018-03-27 16:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, security, linux-kernel, kernel-team, linux-mm,
	torvalds, jannh, paulmck, bcrl, viro, kent.overstreet

On Mon, Mar 26, 2018 at 07:54:31AM -0700, Tejun Heo wrote:
> Hello, Andrew.
> 
> Do you mind picking up the following patch?  I can't find a good tree
> to route this through.  The raw patch can be found at
> 
>   https://marc.info/?l=linux-mm&m=152105674112496&q=raw
> 
> Thank you very much.

I am fine with whichever route has a low probability of conflict
when merging HMM through a different tree.


> 
> On Wed, Mar 14, 2018 at 12:45:11PM -0700, Tejun Heo wrote:
> > hmm_devmem_find() requires rcu_read_lock_held() but there's nothing
> > which actually uses the RCU protection.  The only caller is
> > hmm_devmem_pages_create() which already grabs the mutex and does
> > superflous rcu_read_lock/unlock() around the function.
> > 
> > This doesn't add anything and just adds to confusion.  Remove the RCU
> > protection and open-code the radix tree lookup.  If this needs to
> > become more sophisticated in the future, let's add them back when
> > necessary.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: linux-mm@kvack.org
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > ---
> > Hello,
> > 
> > Jérôme, how do you want to route this patch?  If you prefer, I can
> > route it together with other patches.
> > 
> > Thanks.
> > 
> >  mm/hmm.c | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 320545b98..d4627c5 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -845,13 +845,6 @@ static void hmm_devmem_release(struct device *dev, void *data)
> >  	hmm_devmem_radix_release(resource);
> >  }
> >  
> > -static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
> > -{
> > -	WARN_ON_ONCE(!rcu_read_lock_held());
> > -
> > -	return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
> > -}
> > -
> >  static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> >  {
> >  	resource_size_t key, align_start, align_size, align_end;
> > @@ -892,9 +885,8 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> >  	for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
> >  		struct hmm_devmem *dup;
> >  
> > -		rcu_read_lock();
> > -		dup = hmm_devmem_find(key);
> > -		rcu_read_unlock();
> > +		dup = radix_tree_lookup(&hmm_devmem_radix,
> > +					key >> PA_SECTION_SHIFT);
> >  		if (dup) {
> >  			dev_err(device, "%s: collides with mapping for %s\n",
> >  				__func__, dev_name(dup->device));
> > -- 
> > 2.9.5
> > 
> 
> -- 
> tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-27 15:55                   ` Tejun Heo
@ 2018-03-29 16:49                     ` Oleg Nesterov
  2018-03-29 17:41                       ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2018-03-29 16:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

On 03/27, Tejun Heo wrote:
>
> On Tue, Mar 27, 2018 at 04:28:48PM +0200, Oleg Nesterov wrote:
> > > flush_*work() guarantees to wait for the completion of the latest
> > > instance of the work item which was visible to the caller.  We can't
> > > guarantee that w/o rcu_barrier().
> >
> > And this is what I can't understand.
> >
> > So let me repeat: could you please describe a use-case which needs
> > flush_rcu_work() with rcu_barrier()?
>
> So, if you skip that, flush_work() by itself won't wait for the PENDING
> bit at all.  It'll return right away if the work item is still waiting
> for the RCU grace period.

Still no use-case... but yes, I forgot that this is needed for correctness.

OK, thanks for your patience. But FYI, now I hate this interface even more,
exactly because I was technically wrong in this discussion ;)

Oleg.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] fs/aio: Use rcu_work instead of explicit rcu and work item
  2018-03-29 16:49                     ` Oleg Nesterov
@ 2018-03-29 17:41                       ` Tejun Heo
  0 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2018-03-29 17:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, jannh, paulmck, bcrl, viro, kent.overstreet, security,
	linux-kernel, kernel-team

Hello,

On Thu, Mar 29, 2018 at 06:49:25PM +0200, Oleg Nesterov wrote:
> Still no use-case... But yes, I forgot this is needed for correctness.
> 
> OK, thanks for your patience. But fyi now I hate this interface even more,
> exactly because I was technically wrong in this discussion ;)

Yeah, we might as well omit it for now and add it later when it's
actually needed.  It's just that given that queue and flush are the
two core interfaces, I didn't want to leave it out, and the current
implementation is the lowest-effort one.  I'd be happy to make it
better if you have good suggestions.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-03-29 17:41 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-14 19:41 [PATCHSET v2] percpu_ref, RCU: Audit RCU usages in percpu_ref users Tejun Heo
2018-03-14 19:45 ` [PATCH 1/8] fs/aio: Add explicit RCU grace period when freeing kioctx Tejun Heo
2018-03-14 19:45   ` [PATCH 2/8] fs/aio: Use RCU accessors for kioctx_table->table[] Tejun Heo
2018-03-14 19:45   ` [PATCH 3/8] RDMAVT: Fix synchronization around percpu_ref Tejun Heo
2018-03-15 22:24     ` Jason Gunthorpe
2018-03-14 19:45   ` [PATCH 4/8] HMM: Remove superflous RCU protection around radix tree lookup Tejun Heo
2018-03-26 14:54     ` Tejun Heo
2018-03-27 16:12       ` Jerome Glisse
2018-03-14 19:45   ` [PATCH 5/8] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods Tejun Heo
2018-03-19 17:10     ` Tejun Heo
2018-03-14 19:45   ` [PATCH 6/8] RCU, workqueue: Implement rcu_work Tejun Heo
2018-03-14 20:13     ` Paul E. McKenney
2018-03-16  6:01     ` Lai Jiangshan
2018-03-19 16:45       ` Tejun Heo
2018-03-20 10:04         ` Lai Jiangshan
2018-03-14 19:45   ` [PATCH 7/8] cgroup: Use rcu_work instead of explicit rcu and work item Tejun Heo
2018-03-14 19:45   ` [PATCH 8/8] fs/aio: " Tejun Heo
2018-03-19 17:12     ` Tejun Heo
2018-03-21 15:58     ` Oleg Nesterov
2018-03-21 16:40       ` Tejun Heo
2018-03-21 17:17         ` Oleg Nesterov
2018-03-21 17:53           ` Tejun Heo
2018-03-22 11:24             ` Oleg Nesterov
2018-03-26 15:04               ` Tejun Heo
2018-03-27 14:28                 ` Oleg Nesterov
2018-03-27 15:55                   ` Tejun Heo
2018-03-29 16:49                     ` Oleg Nesterov
2018-03-29 17:41                       ` Tejun Heo
