* [RFC v2 00/18] kthread: Use kthread worker API more widely
@ 2015-09-21 13:03 Petr Mladek
  2015-09-21 13:03 ` [RFC v2 01/18] kthread: Allow to call __kthread_create_on_node() with va_list args Petr Mladek
                   ` (19 more replies)
  0 siblings, 20 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

My intention is to make it easier to manipulate kthreads. This RFC tries
to use the kthread worker API more widely. It is based on comments on
the first attempt; see https://lkml.org/lkml/2015/7/28/648 and
the list of changes below.

1st..8th patches: improve the existing kthread worker API

9th, 12th, 17th patches: convert three kthreads into the new API,
     namely: khugepaged, ring buffer benchmark, RCU gp kthreads[*]

10th, 11th patches: fix potential problems in the ring buffer
      benchmark; also sent separately

13th patch: small fix for RCU kthread; also sent separately;
     being tested by Paul

14th..16th patches: preparation steps for the RCU threads
     conversion; they are needed _only_ if we split GP start
     and QS handling into separate works[*]

18th patch: proposes a possible improvement of the kthread worker API;
     it adds an extra parameter to the create*() functions, so I
     would rather keep it in this draft
     

[*] IMPORTANT: I tried to split RCU GP start and QS state handling
    into separate works this time. But there is a problem with
    a race in rcu_gp_kthread_worker_poke(). It might queue
    the wrong work. It can be detected and fixed by the work
    itself but it is a bit ugly. An alternative solution is to
    do both operations in one work, but then we sleep too much
    in the work, which is ugly as well. Any idea is appreciated.
    

Changes against v1:

+ remove wrappers to manipulate the scheduling policy and priority

+ remove questionable wakeup_and_destroy_kthread_worker() variant

+ do not check for chained work when draining the queue

+ allocate struct kthread_worker in create_kthread_worker() and
  use simpler checks for a running worker

+ add support for delayed kthread works and use them instead
  of waiting inside the works

+ rework the "unrelated" fixes for the ring buffer benchmark
  as discussed in the 1st RFC; also sent separately

+ convert also the consumer in the ring buffer benchmark


I have tested this patch set against the stable Linus tree
for 4.3-rc2.

Petr Mladek (18):
  kthread: Allow to call __kthread_create_on_node() with va_list args
  kthread: Add create_kthread_worker*()
  kthread: Add drain_kthread_worker()
  kthread: Add destroy_kthread_worker()
  kthread: Add pending flag to kthread work
  kthread: Initial support for delayed kthread work
  kthread: Allow to cancel kthread work
  kthread: Allow to modify delayed kthread work
  mm/huge_page: Convert khugepaged() into kthread worker API
  ring_buffer: Do not complete benchmark reader too early
  ring_buffer: Fix more races when terminating the producer in the
    benchmark
  ring_buffer: Convert benchmark kthreads into kthread worker API
  rcu: Finish folding ->fqs_state into ->gp_state
  rcu: Store first_gp_fqs into struct rcu_state
  rcu: Clean up timeouts for forcing the quiescent state
  rcu: Check actual RCU_GP_FLAG_FQS when handling quiescent state
  rcu: Convert RCU gp kthreads into kthread worker API
  kthread: Better support freezable kthread workers

 include/linux/kthread.h              |  67 +++++
 kernel/kthread.c                     | 544 ++++++++++++++++++++++++++++++++---
 kernel/rcu/tree.c                    | 407 ++++++++++++++++----------
 kernel/rcu/tree.h                    |  24 +-
 kernel/rcu/tree_plugin.h             |  16 +-
 kernel/rcu/tree_trace.c              |   2 +-
 kernel/trace/ring_buffer_benchmark.c | 194 ++++++-------
 mm/huge_memory.c                     | 116 ++++----
 8 files changed, 1017 insertions(+), 353 deletions(-)

-- 
1.8.5.6



* [RFC v2 01/18] kthread: Allow to call __kthread_create_on_node() with va_list args
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 02/18] kthread: Add create_kthread_worker*() Petr Mladek
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

kthread_create_on_node() implements a bunch of logic to create
the kthread. It is already called by kthread_create_on_cpu().

We are going to add a new API that will allow us to standardize kthreads
and define safe points for termination, freezing, parking, and even
signal handling. It will need to call kthread_create_on_node()
with va_list args.

This patch is only a refactoring and does not modify the existing
behavior.
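
For illustration, a minimal sketch of the kind of caller that needs the
va_list variant. Note that __kthread_create_on_node() stays static in
kernel/kthread.c, so such a helper would live there as well; the helper
name is made up for the example:

    /* Hypothetical printf-style helper reusing the va_list variant. */
    static struct task_struct *
    example_create(int (*threadfn)(void *data), void *data, int node,
                   const char namefmt[], ...)
    {
            struct task_struct *task;
            va_list args;

            va_start(args, namefmt);
            task = __kthread_create_on_node(threadfn, data, node,
                                            namefmt, args);
            va_end(args);

            return task;
    }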

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/kthread.c | 72 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 30 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 9ff173dca1ae..1d29e675fe44 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -244,33 +244,10 @@ static void create_kthread(struct kthread_create_info *create)
 	}
 }
 
-/**
- * kthread_create_on_node - create a kthread.
- * @threadfn: the function to run until signal_pending(current).
- * @data: data ptr for @threadfn.
- * @node: task and thread structures for the thread are allocated on this node
- * @namefmt: printf-style name for the thread.
- *
- * Description: This helper function creates and names a kernel
- * thread.  The thread will be stopped: use wake_up_process() to start
- * it.  See also kthread_run().  The new thread has SCHED_NORMAL policy and
- * is affine to all CPUs.
- *
- * If thread is going to be bound on a particular cpu, give its node
- * in @node, to get NUMA affinity for kthread stack, or else give NUMA_NO_NODE.
- * When woken, the thread will run @threadfn() with @data as its
- * argument. @threadfn() can either call do_exit() directly if it is a
- * standalone thread for which no one will call kthread_stop(), or
- * return when 'kthread_should_stop()' is true (which means
- * kthread_stop() has been called).  The return value should be zero
- * or a negative error number; it will be passed to kthread_stop().
- *
- * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
- */
-struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
-					   void *data, int node,
-					   const char namefmt[],
-					   ...)
+static struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
+						    void *data, int node,
+						    const char namefmt[],
+						    va_list args)
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct task_struct *task;
@@ -311,11 +288,8 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 	task = create->result;
 	if (!IS_ERR(task)) {
 		static const struct sched_param param = { .sched_priority = 0 };
-		va_list args;
 
-		va_start(args, namefmt);
 		vsnprintf(task->comm, sizeof(task->comm), namefmt, args);
-		va_end(args);
 		/*
 		 * root may have changed our (kthreadd's) priority or CPU mask.
 		 * The kernel thread should not inherit these properties.
@@ -326,6 +300,44 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 	kfree(create);
 	return task;
 }
+
+/**
+ * kthread_create_on_node - create a kthread.
+ * @threadfn: the function to run until signal_pending(current).
+ * @data: data ptr for @threadfn.
+ * @node: task and thread structures for the thread are allocated on this node
+ * @namefmt: printf-style name for the thread.
+ *
+ * Description: This helper function creates and names a kernel
+ * thread.  The thread will be stopped: use wake_up_process() to start
+ * it.  See also kthread_run().  The new thread has SCHED_NORMAL policy and
+ * is affine to all CPUs.
+ *
+ * If thread is going to be bound on a particular cpu, give its node
+ * in @node, to get NUMA affinity for kthread stack, or else give NUMA_NO_NODE.
+ * When woken, the thread will run @threadfn() with @data as its
+ * argument. @threadfn() can either call do_exit() directly if it is a
+ * standalone thread for which no one will call kthread_stop(), or
+ * return when 'kthread_should_stop()' is true (which means
+ * kthread_stop() has been called).  The return value should be zero
+ * or a negative error number; it will be passed to kthread_stop().
+ *
+ * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
+ */
+struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
+					   void *data, int node,
+					   const char namefmt[],
+					   ...)
+{
+	struct task_struct *task;
+	va_list args;
+
+	va_start(args, namefmt);
+	task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
+	va_end(args);
+
+	return task;
+}
 EXPORT_SYMBOL(kthread_create_on_node);
 
 static void __kthread_bind_mask(struct task_struct *p, const struct cpumask *mask, long state)
-- 
1.8.5.6



* [RFC v2 02/18] kthread: Add create_kthread_worker*()
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
  2015-09-21 13:03 ` [RFC v2 01/18] kthread: Allow to call __kthread_create_on_node() with va_list args Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-22 18:20   ` Tejun Heo
  2015-09-21 13:03 ` [RFC v2 03/18] kthread: Add drain_kthread_worker() Petr Mladek
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

Kthread workers are currently created using the classic kthread API,
namely kthread_run(). kthread_worker_fn() is passed as the @threadfn
parameter.

This patch defines create_kthread_worker_on_node() and
create_kthread_worker() functions that hide implementation details.

It enforces using kthread_worker_fn() for the main thread. But I doubt
that there are any plans to create an alternative. In fact, I think
that we do not want any alternative main thread function because it
would be hard to keep it consistent with the rest of the kthread
worker API.

The naming and functionality are inspired by the workqueue API, like
the rest of the kthread worker API.

This patch does _not_ convert existing kthread workers. The kthread worker
API needs more improvements first, e.g. a function to destroy the worker.
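
For illustration, a minimal sketch of the intended usage; the worker
pointer and the error handling are illustrative, not taken from an
existing user:

    static struct kthread_worker *example_worker;

    static int example_start(void)
    {
            example_worker = create_kthread_worker("example");
            if (IS_ERR(example_worker))
                    return PTR_ERR(example_worker);

            /* the worker kthread is already running kthread_worker_fn() */
            return 0;
    }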

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 include/linux/kthread.h |  7 +++++++
 kernel/kthread.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index e691b6a23f72..e390069a3f68 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -124,6 +124,13 @@ extern void __init_kthread_worker(struct kthread_worker *worker,
 
 int kthread_worker_fn(void *worker_ptr);
 
+__printf(2, 3)
+struct kthread_worker *
+create_kthread_worker_on_node(int node, const char namefmt[], ...);
+
+#define create_kthread_worker(namefmt, arg...)				\
+	create_kthread_worker_on_node(-1, namefmt, ##arg)
+
 bool queue_kthread_work(struct kthread_worker *worker,
 			struct kthread_work *work);
 void flush_kthread_work(struct kthread_work *work);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 1d29e675fe44..8f8813b42632 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -579,7 +579,11 @@ int kthread_worker_fn(void *worker_ptr)
 	struct kthread_worker *worker = worker_ptr;
 	struct kthread_work *work;
 
-	WARN_ON(worker->task);
+	/*
+	 * FIXME: Update the check and remove the assignment when all kthread
+	 * worker users are created using create_kthread_worker*() functions.
+	 */
+	WARN_ON(worker->task && worker->task != current);
 	worker->task = current;
 repeat:
 	set_current_state(TASK_INTERRUPTIBLE);	/* mb paired w/ kthread_stop */
@@ -613,6 +617,51 @@ repeat:
 }
 EXPORT_SYMBOL_GPL(kthread_worker_fn);
 
+/**
+ * create_kthread_worker_on_node - create a kthread worker.
+ * @node: memory node number.
+ * @namefmt: printf-style name for the kthread worker (task).
+ *
+ * If the worker is going to be bound on a particular CPU, give its node
+ * in @node, to get NUMA affinity for kthread stack, or else give -1.
+ *
+ * Returns pointer to allocated worker on success, ERR_PTR(-ENOMEM) when
+ * the needed structures could not get allocated, and ERR_PTR(-EINTR)
+ * when the worker was SIGKILLed.
+ */
+struct kthread_worker *
+create_kthread_worker_on_node(int node, const char namefmt[], ...)
+{
+	struct kthread_worker *worker;
+	struct task_struct *task;
+	va_list args;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+	if (!worker)
+		return ERR_PTR(-ENOMEM);
+
+	init_kthread_worker(worker);
+
+	va_start(args, namefmt);
+	task = __kthread_create_on_node(kthread_worker_fn, worker, node,
+					namefmt, args);
+	va_end(args);
+
+	if (IS_ERR(task))
+		goto fail_task;
+
+	worker->task = task;
+	wake_up_process(task);
+
+	return worker;
+
+fail_task:
+	kfree(worker);
+	return ERR_CAST(task);
+
+}
+EXPORT_SYMBOL(create_kthread_worker_on_node);
+
 /* insert @work before @pos in @worker */
 static void insert_kthread_work(struct kthread_worker *worker,
 			       struct kthread_work *work,
-- 
1.8.5.6



* [RFC v2 03/18] kthread: Add drain_kthread_worker()
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
  2015-09-21 13:03 ` [RFC v2 01/18] kthread: Allow to call __kthread_create_on_node() with va_list args Petr Mladek
  2015-09-21 13:03 ` [RFC v2 02/18] kthread: Add create_kthread_worker*() Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-22 18:26   ` Tejun Heo
  2015-09-21 13:03 ` [RFC v2 04/18] kthread: Add destroy_kthread_worker() Petr Mladek
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

flush_kthread_worker() returns when the currently queued works are
processed. But some other works might have been queued in the meantime.

This patch adds drain_kthread_worker() that is inspired by
drain_workqueue(). It returns when the queue is completely empty and
warns when it takes too long.

The initial implementation does not block queuing new works while
draining. This makes things much easier. The blocking would be useful
for debugging potential problems but it is not clear whether it is
worth the complication at the moment.
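
For illustration, a sketch of a caller that wants the queue to stay
empty; example_block_new_work() stands for whatever mechanism the
caller uses to stop its own queuing paths:

    static void example_quiesce(struct kthread_worker *worker)
    {
            /* the caller must prevent further queuing first */
            example_block_new_work();

            /* returns once the queue is empty; warns after too many flushes */
            drain_kthread_worker(worker);
    }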

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/kthread.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8f8813b42632..e6424cf17cbd 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -770,3 +770,42 @@ void flush_kthread_worker(struct kthread_worker *worker)
 	wait_for_completion(&fwork.done);
 }
 EXPORT_SYMBOL_GPL(flush_kthread_worker);
+
+/**
+ * drain_kthread_worker - drain a kthread worker
+ * @worker: worker to be drained
+ *
+ * Wait until there is no work queued for the given kthread worker.
+ * Only currently running work on @worker can queue further work items
+ * on it.  @worker is flushed repeatedly until it becomes empty.
+ * The number of flushes is determined by the depth of chaining
+ * and should be relatively short.  Whine if it takes too long.
+ *
+ * The caller is responsible for blocking all existing works
+ * from an infinite re-queuing!
+ *
+ * Also the caller is responsible for blocking all the kthread
+ * worker users from queuing any new work. It is especially
+ * important if the queue has to stay empty once this function
+ * finishes.
+ */
+void drain_kthread_worker(struct kthread_worker *worker)
+{
+	int flush_cnt = 0;
+
+	spin_lock_irq(&worker->lock);
+
+	while (!list_empty(&worker->work_list)) {
+		spin_unlock_irq(&worker->lock);
+
+		flush_kthread_worker(worker);
+		WARN_ONCE(flush_cnt++ > 10,
+			  "kthread worker %s: drain_kthread_worker() isn't complete after %u tries\n",
+			  worker->task->comm, flush_cnt);
+
+		spin_lock_irq(&worker->lock);
+	}
+
+	spin_unlock_irq(&worker->lock);
+}
+EXPORT_SYMBOL(drain_kthread_worker);
-- 
1.8.5.6



* [RFC v2 04/18] kthread: Add destroy_kthread_worker()
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (2 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 03/18] kthread: Add drain_kthread_worker() Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-22 18:30   ` Tejun Heo
  2015-09-21 13:03 ` [RFC v2 05/18] kthread: Add pending flag to kthread work Petr Mladek
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

The current kthread worker users call flush() and stop() explicitly.
The new function makes this easier: it drains the queue, stops the
kthread, and also frees the kthread_worker struct that has been
allocated by create_kthread_worker().

Note that flush() does not guarantee that the queue is empty. drain()
is safer: it returns only when the queue is really empty. Also it
warns when too much work keeps being queued while draining.
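
A user created by create_kthread_worker() can then shut down with
a single call; a minimal sketch with an illustrative worker pointer:

    static struct kthread_worker *example_worker;

    static void example_stop(void)
    {
            /* drains the queue, stops the kthread and frees the struct */
            destroy_kthread_worker(example_worker);
            example_worker = NULL;
    }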

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 include/linux/kthread.h |  2 ++
 kernel/kthread.c        | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index e390069a3f68..bef97e06d2b6 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -136,4 +136,6 @@ bool queue_kthread_work(struct kthread_worker *worker,
 void flush_kthread_work(struct kthread_work *work);
 void flush_kthread_worker(struct kthread_worker *worker);
 
+void destroy_kthread_worker(struct kthread_worker *worker);
+
 #endif /* _LINUX_KTHREAD_H */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index e6424cf17cbd..65c263336b8b 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -809,3 +809,23 @@ void drain_kthread_worker(struct kthread_worker *worker)
 	spin_unlock_irq(&worker->lock);
 }
 EXPORT_SYMBOL(drain_kthread_worker);
+
+/**
+ * destroy_kthread_worker - destroy a kthread worker
+ * @worker: worker to be destroyed
+ *
+ * Destroy @worker. It should be idle when this is called.
+ */
+void destroy_kthread_worker(struct kthread_worker *worker)
+{
+	struct task_struct *task;
+
+	task = worker->task;
+	if (WARN_ON(!task))
+		return;
+
+	drain_kthread_worker(worker);
+	kthread_stop(task);
+	kfree(worker);
+}
+EXPORT_SYMBOL(destroy_kthread_worker);
-- 
1.8.5.6



* [RFC v2 05/18] kthread: Add pending flag to kthread work
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (3 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 04/18] kthread: Add destroy_kthread_worker() Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 06/18] kthread: Initial support for delayed " Petr Mladek
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

This is a preparation step for delayed kthread works. It will use
a timer to queue the work with the requested delay. We need to
somehow mark the work in the meantime.

The implementation is inspired by workqueues. It adds a flag that
is manipulated using bit operations. If the flag is set, it means
that the work is going to be queued and any new attempt to queue
the work should fail. As a side effect, queue_kthread_work() can
test for pending work even without taking the lock.

In comparison with workqueues, the flag is stored in a separate bitmap
instead of being shared with the worker pointer. Kthread workers do not
use pools of kthreads and the handling is much easier here. I did
not find a situation where we would need to manipulate both the flag
and the worker pointer atomically.
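
The visible effect on queuing can be sketched as follows; the worker,
the work, and its callback are illustrative names:

    static struct kthread_worker *example_worker;

    static void example_func(struct kthread_work *work)
    {
            /* process the work */
    }
    static DEFINE_KTHREAD_WORK(example_work, example_func);

    static void example_queue_twice(void)
    {
            bool first, second;

            /* grabs the PENDING bit and queues the work */
            first = queue_kthread_work(example_worker, &example_work);

            /* fails without taking worker->lock while still pending */
            second = queue_kthread_work(example_worker, &example_work);

            pr_info("first=%d second=%d\n", first, second);
    }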

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 include/linux/kthread.h |  6 ++++++
 kernel/kthread.c        | 30 ++++++++++++++++++++++++++----
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index bef97e06d2b6..aabb105d3d4b 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -71,7 +71,13 @@ struct kthread_worker {
 	struct kthread_work	*current_work;
 };
 
+enum {
+	/* work item is pending execution */
+	KTHREAD_WORK_PENDING_BIT	= 0,
+};
+
 struct kthread_work {
+	DECLARE_BITMAP(flags, 8);
 	struct list_head	node;
 	kthread_work_func_t	func;
 	struct kthread_worker	*worker;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 65c263336b8b..fe1510e7ad04 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -602,6 +602,7 @@ repeat:
 		work = list_first_entry(&worker->work_list,
 					struct kthread_work, node);
 		list_del_init(&work->node);
+		clear_bit(KTHREAD_WORK_PENDING_BIT, work->flags);
 	}
 	worker->current_work = work;
 	spin_unlock_irq(&worker->lock);
@@ -675,6 +676,27 @@ static void insert_kthread_work(struct kthread_worker *worker,
 		wake_up_process(worker->task);
 }
 
+/*
+ * Queue @work without the check for the pending flag.
+ * Must be called with IRQs disabled.
+ */
+static void __queue_kthread_work(struct kthread_worker *worker,
+			  struct kthread_work *work)
+{
+	/*
+	 * While a work item is PENDING && off queue, a task trying to
+	 * steal the PENDING will busy-loop waiting for it to either get
+	 * queued or lose PENDING.  Grabbing PENDING and queuing should
+	 * happen with IRQ disabled.
+	 */
+	WARN_ON_ONCE(!irqs_disabled());
+	WARN_ON_ONCE(!list_empty(&work->node));
+
+	spin_lock(&worker->lock);
+	insert_kthread_work(worker, work, &worker->work_list);
+	spin_unlock(&worker->lock);
+}
+
 /**
  * queue_kthread_work - queue a kthread_work
  * @worker: target kthread_worker
@@ -690,12 +712,12 @@ bool queue_kthread_work(struct kthread_worker *worker,
 	bool ret = false;
 	unsigned long flags;
 
-	spin_lock_irqsave(&worker->lock, flags);
-	if (list_empty(&work->node)) {
-		insert_kthread_work(worker, work, &worker->work_list);
+	local_irq_save(flags);
+	if (!test_and_set_bit(KTHREAD_WORK_PENDING_BIT, work->flags)) {
+		__queue_kthread_work(worker, work);
 		ret = true;
 	}
-	spin_unlock_irqrestore(&worker->lock, flags);
+	local_irq_restore(flags);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(queue_kthread_work);
-- 
1.8.5.6



* [RFC v2 06/18] kthread: Initial support for delayed kthread work
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (4 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 05/18] kthread: Add pending flag to kthread work Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 07/18] kthread: Allow to cancel " Petr Mladek
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

We are going to use kthread_worker more widely and delayed works
will be pretty useful.

The implementation is inspired by workqueues. It uses a timer to
queue the work after the requested delay. If the delay is zero,
the work is queued immediately.

In comparison with workqueues, the worker is stored directly in the
associated kthread_work struct. Kthread workers do not use a pool of
kthreads, so the value can stay the same all the time.
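
For illustration, a minimal sketch of the new interface; the worker,
the work, and the delay are illustrative:

    static struct kthread_worker *example_worker;

    static void example_timeout_func(struct kthread_work *work)
    {
            /* handle the timeout */
    }
    static DEFINE_DELAYED_KTHREAD_WORK(example_dwork, example_timeout_func);

    static void example_arm_timeout(void)
    {
            /* run example_timeout_func() on the worker in about 100ms */
            queue_delayed_kthread_work(example_worker, &example_dwork,
                                       msecs_to_jiffies(100));
    }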

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 include/linux/kthread.h | 31 ++++++++++++++++++
 kernel/kthread.c        | 84 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 115 insertions(+)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabb105d3d4b..64fb9796ab69 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -63,6 +63,7 @@ extern int tsk_fork_get_node(struct task_struct *tsk);
  */
 struct kthread_work;
 typedef void (*kthread_work_func_t)(struct kthread_work *work);
+void delayed_kthread_work_timer_fn(unsigned long __data);
 
 struct kthread_worker {
 	spinlock_t		lock;
@@ -83,6 +84,11 @@ struct kthread_work {
 	struct kthread_worker	*worker;
 };
 
+struct delayed_kthread_work {
+	struct kthread_work work;
+	struct timer_list timer;
+};
+
 #define KTHREAD_WORKER_INIT(worker)	{				\
 	.lock = __SPIN_LOCK_UNLOCKED((worker).lock),			\
 	.work_list = LIST_HEAD_INIT((worker).work_list),		\
@@ -93,12 +99,23 @@ struct kthread_work {
 	.func = (fn),							\
 	}
 
+#define DELAYED_KTHREAD_WORK_INIT(dwork, fn) {				\
+	.work = KTHREAD_WORK_INIT((dwork).work, (fn)),			\
+	.timer = __TIMER_INITIALIZER(delayed_kthread_work_timer_fn,	\
+				     0, (unsigned long)&(dwork),	\
+				     TIMER_IRQSAFE),			\
+	}
+
 #define DEFINE_KTHREAD_WORKER(worker)					\
 	struct kthread_worker worker = KTHREAD_WORKER_INIT(worker)
 
 #define DEFINE_KTHREAD_WORK(work, fn)					\
 	struct kthread_work work = KTHREAD_WORK_INIT(work, fn)
 
+#define DEFINE_DELAYED_KTHREAD_WORK(dwork, fn)				\
+	struct delayed_kthread_work dwork =				\
+		DELAYED_KTHREAD_WORK_INIT(dwork, fn)
+
 /*
  * kthread_worker.lock needs its own lockdep class key when defined on
  * stack with lockdep enabled.  Use the following macros in such cases.
@@ -128,6 +145,15 @@ extern void __init_kthread_worker(struct kthread_worker *worker,
 		(work)->func = (fn);					\
 	} while (0)
 
+#define init_delayed_kthread_work(dwork, fn)				\
+	do {								\
+		init_kthread_work(&(dwork)->work, (fn));		\
+		__setup_timer(&(dwork)->timer,				\
+			      delayed_kthread_work_timer_fn,		\
+			      (unsigned long)(dwork),			\
+			      TIMER_IRQSAFE);				\
+	} while (0)
+
 int kthread_worker_fn(void *worker_ptr);
 
 __printf(2, 3)
@@ -139,6 +165,11 @@ create_kthread_worker_on_node(int node, const char namefmt[], ...);
 
 bool queue_kthread_work(struct kthread_worker *worker,
 			struct kthread_work *work);
+
+bool queue_delayed_kthread_work(struct kthread_worker *worker,
+				struct delayed_kthread_work *dwork,
+				unsigned long delay);
+
 void flush_kthread_work(struct kthread_work *work);
 void flush_kthread_worker(struct kthread_worker *worker);
 
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fe1510e7ad04..eba6e061bda5 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -722,6 +722,90 @@ bool queue_kthread_work(struct kthread_worker *worker,
 }
 EXPORT_SYMBOL_GPL(queue_kthread_work);
 
+/**
+ * delayed_kthread_work_timer_fn - callback that queues the associated delayed
+ *	kthread work when the timer expires.
+ * @__data: pointer to the data associated with the timer
+ *
+ * The format of the function is defined by struct timer_list.
+ */
+void delayed_kthread_work_timer_fn(unsigned long __data)
+{
+	struct delayed_kthread_work *dwork =
+		(struct delayed_kthread_work *)__data;
+
+	/* should have been called from irqsafe timer with irq already off */
+	__queue_kthread_work(dwork->work.worker, &dwork->work);
+}
+EXPORT_SYMBOL(delayed_kthread_work_timer_fn);
+
+void __queue_delayed_kthread_work(struct kthread_worker *worker,
+				struct delayed_kthread_work *dwork,
+				unsigned long delay)
+{
+	struct timer_list *timer = &dwork->timer;
+	struct kthread_work *work = &dwork->work;
+
+	WARN_ON_ONCE(timer->function != delayed_kthread_work_timer_fn ||
+		     timer->data != (unsigned long)dwork);
+	WARN_ON_ONCE(timer_pending(timer));
+	WARN_ON_ONCE(!list_empty(&work->node));
+
+	/*
+	 * If @delay is 0, queue @dwork->work immediately.  This is for
+	 * both optimization and correctness.  The earliest @timer can
+	 * expire is on the closest next tick and delayed_work users depend
+	 * on that there's no such delay when @delay is 0.
+	 */
+	if (!delay) {
+		__queue_kthread_work(worker, work);
+		return;
+	}
+
+	timer_stats_timer_set_start_info(&dwork->timer);
+
+	work->worker = worker;
+	timer->expires = jiffies + delay;
+
+	add_timer(timer);
+}
+
+/**
+ * queue_delayed_kthread_work - queue the associated kthread work
+ *	after a delay.
+ * @worker: target kthread_worker
+ * @dwork: delayed kthread_work to queue
+ * @delay: number of jiffies to wait before queuing
+ *
+ * If the work has not been pending it starts a timer that will queue
+ * the work after the given @delay. If @delay is zero, it queues the
+ * work immediately.
+ *
+ * Return: %false if the @work has already been pending. It means that
+ * either the timer was running or the work was queued. It returns %true
+ * otherwise.
+ */
+bool queue_delayed_kthread_work(struct kthread_worker *worker,
+				struct delayed_kthread_work *dwork,
+				unsigned long delay)
+{
+	struct kthread_work *work = &dwork->work;
+	unsigned long flags;
+	bool ret = false;
+
+	local_irq_save(flags);
+
+	if (!test_and_set_bit(KTHREAD_WORK_PENDING_BIT, work->flags)) {
+		__queue_delayed_kthread_work(worker, dwork, delay);
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(queue_delayed_kthread_work);
+
 struct kthread_flush_work {
 	struct kthread_work	work;
 	struct completion	done;
-- 
1.8.5.6



* [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (5 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 06/18] kthread: Initial support for delayed " Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-22 19:35   ` Tejun Heo
  2015-09-21 13:03 ` [RFC v2 08/18] kthread: Allow to modify delayed " Petr Mladek
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

We are going to use kthread workers more widely and we will need
to cancel pending work in some situations.

The implementation is inspired by workqueues. There are four basic
situations: the work might be idle, running, pending with a running
timer, or pending and already in the queue.

In all cases we try to grab the PENDING flag and protect others from
queuing. Once we have the PENDING flag, we try to remove the pending
work from the queue and wait until a potentially running work finishes.

The most complicated situation is when several cancel_*kthread_work()
calls run in parallel. Only one can grab the PENDING flag using a busy
wait. The others need to wait until the first one flushes the work.
This might take an arbitrarily long time and a busy wait is not an
option here. The situation is detected using the new CANCELING flag
and the less successful callers sleep in a wait queue. They are
woken when the winner finishes its job.
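
A typical use is a teardown path that must be sure that nothing is
pending or running before data used by the callbacks is freed;
a sketch with illustrative names:

    static void example_func(struct kthread_work *work) { }
    static DEFINE_KTHREAD_WORK(example_work, example_func);
    static DEFINE_DELAYED_KTHREAD_WORK(example_dwork, example_func);

    static void example_teardown(void)
    {
            /* remove a pending timer/work and wait for a running callback */
            cancel_delayed_kthread_work_sync(&example_dwork);
            cancel_kthread_work_sync(&example_work);

            /* now it is safe to free anything the callbacks were using */
    }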

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 include/linux/kthread.h |  11 +++
 kernel/kthread.c        | 198 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 209 insertions(+)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 64fb9796ab69..327d82875410 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -75,6 +75,8 @@ struct kthread_worker {
 enum {
 	/* work item is pending execution */
 	KTHREAD_WORK_PENDING_BIT	= 0,
+	/* work item is canceling */
+	KTHREAD_WORK_CANCELING_BIT	= 2,
 };
 
 struct kthread_work {
@@ -89,6 +91,12 @@ struct delayed_kthread_work {
 	struct timer_list timer;
 };
 
+static inline struct delayed_kthread_work *
+to_delayed_kthread_work(struct kthread_work *work)
+{
+	return container_of(work, struct delayed_kthread_work, work);
+}
+
 #define KTHREAD_WORKER_INIT(worker)	{				\
 	.lock = __SPIN_LOCK_UNLOCKED((worker).lock),			\
 	.work_list = LIST_HEAD_INIT((worker).work_list),		\
@@ -173,6 +181,9 @@ bool queue_delayed_kthread_work(struct kthread_worker *worker,
 void flush_kthread_work(struct kthread_work *work);
 void flush_kthread_worker(struct kthread_worker *worker);
 
+bool cancel_kthread_work_sync(struct kthread_work *work);
+bool cancel_delayed_kthread_work_sync(struct delayed_kthread_work *work);
+
 void destroy_kthread_worker(struct kthread_worker *worker);
 
 #endif /* _LINUX_KTHREAD_H */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index eba6e061bda5..8c6160eece72 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -859,6 +859,204 @@ retry:
 EXPORT_SYMBOL_GPL(flush_kthread_work);
 
 /**
+ * try_to_grab_pending_kthread_work - steal kthread work item from worklist,
+ *	and disable irq
+ * @work: work item to steal
+ * @is_dwork: @work is a delayed_work
+ * @flags: place to store irq state
+ *
+ * Try to grab PENDING bit of @work.  This function can handle @work in any
+ * stable state - idle, on timer or on worklist.
+ *
+ * Return:
+ *  1		if @work was pending and we successfully stole PENDING
+ *  0		if @work was idle and we claimed PENDING
+ *  -EAGAIN	if PENDING couldn't be grabbed at the moment, safe to busy-retry
+ *  -ENOENT	if someone else is canceling @work, this state may persist
+ *		for arbitrarily long
+ *
+ * Note:
+ * On >= 0 return, the caller owns @work's PENDING bit.  To avoid getting
+ * interrupted while holding PENDING and @work off queue, irq must be
+ * disabled on return.  This, combined with delayed_work->timer being
+ * irqsafe, ensures that we return -EAGAIN for finite short period of time.
+ *
+ * On successful return, >= 0, irq is disabled and the caller is
+ * responsible for releasing it using local_irq_restore(*@flags).
+ *
+ * This function is safe to call from any context including IRQ handler.
+ */
+static int
+try_to_grab_pending_kthread_work(struct kthread_work *work,  bool is_dwork,
+				 unsigned long *flags)
+{
+	struct kthread_worker *worker;
+
+	local_irq_save(*flags);
+retry:
+	/* try to steal the timer if it exists */
+	if (is_dwork) {
+		struct delayed_kthread_work *dwork =
+			to_delayed_kthread_work(work);
+
+		/*
+		 * dwork->timer is irqsafe.  If del_timer() fails, it's
+		 * guaranteed that the timer is not queued anywhere and not
+		 * running on the local CPU.
+		 */
+		if (likely(del_timer(&dwork->timer)))
+			return 1;
+	}
+
+	/* try to claim PENDING the normal way */
+	if (!test_and_set_bit(KTHREAD_WORK_PENDING_BIT, work->flags))
+		return 0;
+
+	/*
+	 * The queuing is in progress, or it is already queued. Try to
+	 * steal it from ->worklist without clearing KTHREAD_WORK_PENDING.
+	 */
+	worker = work->worker;
+	if (!worker)
+		goto fail;
+
+	spin_lock(&worker->lock);
+
+	if (work->worker != worker) {
+		spin_unlock(&worker->lock);
+		goto retry;
+	}
+
+	/* try to grab queued work before it is being executed */
+	if (!list_empty(&work->node)) {
+		list_del_init(&work->node);
+		spin_unlock(&worker->lock);
+		return 1;
+	}
+
+	spin_unlock(&worker->lock);
+fail:
+	local_irq_restore(*flags);
+	if (test_bit(KTHREAD_WORK_CANCELING_BIT, work->flags))
+		return -ENOENT;
+	cpu_relax();
+	return -EAGAIN;
+}
+
+/* custom wait for canceling a kthread work */
+struct cktw_wait {
+	wait_queue_t		wait;
+	struct kthread_work	*work;
+};
+
+static int cktw_wakefn(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct cktw_wait *cwait = container_of(wait, struct cktw_wait, wait);
+
+	if (cwait->work != key)
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, key);
+}
+
+static bool __cancel_kthread_work_sync(struct kthread_work *work, bool is_dwork)
+{
+	static DECLARE_WAIT_QUEUE_HEAD(cancel_waitq);
+	unsigned long flags;
+	int ret;
+
+	do {
+		ret = try_to_grab_pending_kthread_work(work, is_dwork, &flags);
+		/*
+		 * If someone else is already canceling, wait for it to finish.
+		 * flush_work() doesn't work for PREEMPT_NONE because we may
+		 * get scheduled between @work's completion and the other
+		 * canceling task resuming and clearing CANCELING -
+		 * flush_work() will return false immediately as @work is
+		 * no longer busy, try_to_grab_pending_kthread_work() will
+		 * return -ENOENT as @work is still being canceled and the
+		 * other canceling task won't be able to clear CANCELING as
+		 * we're hogging the CPU.
+		 *
+		 * Let's wait for completion using a waitqueue.  As this
+		 * may lead to the thundering herd problem, use a custom
+		 * wake function which matches @work along with exclusive
+		 * wait and wakeup.
+		 */
+		if (unlikely(ret == -ENOENT)) {
+			struct cktw_wait cwait;
+
+			init_wait(&cwait.wait);
+			cwait.wait.func = cktw_wakefn;
+			cwait.work = work;
+
+			prepare_to_wait_exclusive(&cancel_waitq, &cwait.wait,
+						  TASK_UNINTERRUPTIBLE);
+			if (test_bit(KTHREAD_WORK_CANCELING_BIT, work->flags))
+				schedule();
+			finish_wait(&cancel_waitq, &cwait.wait);
+		}
+	} while (unlikely(ret < 0));
+
+	/* tell other tasks trying to grab @work to back off */
+	set_bit(KTHREAD_WORK_CANCELING_BIT, work->flags);
+	local_irq_restore(flags);
+
+	flush_kthread_work(work);
+	/* clear both PENDING and CANCELING flags atomically */
+	memset(work->flags, 0, sizeof(work->flags));
+	/*
+	 * Paired with prepare_to_wait() above so that either
+	 * waitqueue_active() is visible here or CANCELING bit is
+	 * visible there.
+	 */
+	smp_mb();
+	if (waitqueue_active(&cancel_waitq))
+		__wake_up(&cancel_waitq, TASK_NORMAL, 1, work);
+
+	return ret;
+}
+
+/**
+ * cancel_kthread_work_sync - cancel a kthread work and wait for it to finish
+ * @work: the kthread work to cancel
+ *
+ * Cancel @work and wait for its execution to finish.  This function
+ * can be used even if the work re-queues itself or migrates to
+ * another worker.  On return from this function, @work is
+ * guaranteed to be not pending or executing on any CPU.
+ *
+ * cancel_kthread_work_sync(&delayed_work->work) must not be used for
+ * delayed_work's.  Use cancel_delayed_kthread_work_sync() instead.
+ *
+ * The caller must ensure that the worker on which @work was last
+ * queued can't be destroyed before this function returns.
+ *
+ * Return:
+ * %true if @work was pending, %false otherwise.
+ */
+bool cancel_kthread_work_sync(struct kthread_work *work)
+{
+	return __cancel_kthread_work_sync(work, false);
+}
+EXPORT_SYMBOL_GPL(cancel_kthread_work_sync);
+
+/**
+ * cancel_delayed_kthread_work_sync - cancel a delayed kthread work and
+ *	wait for it to finish
+ * @dwork: the delayed kthread work to cancel
+ *
+ * This is cancel_kthread_work_sync() for delayed works.
+ *
+ * Return:
+ * %true if @dwork was pending, %false otherwise.
+ */
+bool cancel_delayed_kthread_work_sync(struct delayed_kthread_work *dwork)
+{
+	return __cancel_kthread_work_sync(&dwork->work, true);
+}
+EXPORT_SYMBOL_GPL(cancel_delayed_kthread_work_sync);
+
+/**
  * flush_kthread_worker - flush all current works on a kthread_worker
  * @worker: worker to flush
  *
-- 
1.8.5.6



* [RFC v2 08/18] kthread: Allow to modify delayed kthread work
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (6 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 07/18] kthread: Allow to cancel " Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 09/18] mm/huge_page: Convert khugepaged() into kthread worker API Petr Mladek
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

There are situations when we need to modify the delay of a delayed
kthread work. Typically, the work depends on an event and the initial
delay is used as a timeout. In this case, we want to queue the work
immediately when the event happens.

The implementation of mod_delayed_kthread_work() is inspired by
the similar function from workqueues.

The function must also work in IRQ context. Therefore it must not
sleep. It has to give up when cancel_delayed_kthread_work_sync()
is flushing the work in parallel. But this happens only when the two
operations are not synchronized in any other way, and we would get
the same result if the cancel() was called just a bit later.
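
The typical pattern is a timeout that gets expedited when the awaited
event arrives; a sketch with illustrative names:

    static struct kthread_worker *example_worker;
    static void example_timeout_func(struct kthread_work *work) { }
    static DEFINE_DELAYED_KTHREAD_WORK(example_dwork, example_timeout_func);

    static void example_arm_timeout(void)
    {
            /* (re)arm the timeout */
            mod_delayed_kthread_work(example_worker, &example_dwork,
                                     msecs_to_jiffies(5000));
    }

    /* may be called even from an IRQ handler when the event arrives */
    static void example_event(void)
    {
            /* expedite: queue the work immediately */
            mod_delayed_kthread_work(example_worker, &example_dwork, 0);
    }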

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 include/linux/kthread.h |  4 ++++
 kernel/kthread.c        | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 327d82875410..2110a55bd769 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -178,6 +178,10 @@ bool queue_delayed_kthread_work(struct kthread_worker *worker,
 				struct delayed_kthread_work *dwork,
 				unsigned long delay);
 
+bool mod_delayed_kthread_work(struct kthread_worker *worker,
+			      struct delayed_kthread_work *dwork,
+			      unsigned long delay);
+
 void flush_kthread_work(struct kthread_work *work);
 void flush_kthread_worker(struct kthread_worker *worker);
 
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8c6160eece72..27bf242064d1 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -943,6 +943,49 @@ fail:
 	return -EAGAIN;
 }
 
+/**
+ * mod_delayed_kthread_work - modify delay of or queue a delayed kthread work
+ * @worker: kthread worker to use
+ * @dwork: delayed kthread work to queue
+ * @delay: number of jiffies to wait before queuing
+ *
+ * If @dwork is idle, equivalent to queue_delayed_kthread_work(); otherwise,
+ * modify @dwork's timer so that it expires after @delay.  If @delay is
+ * zero, @work is guaranteed to be queued immediately.
+ *
+ * Return: %false if @dwork was idle and queued, %true if @dwork was
+ * pending and its timer was modified.
+ *
+ * It returns %true also when cancel_kthread_work_sync() is flushing
+ * the work, see below. We are not able to queue the work in this case.
+ * But it happens only when the two calls are not synchronized. We would
+ * get the same result if cancel() was called just a bit later.
+ *
+ * This function is safe to call from any context including IRQ handler.
+ * See try_to_grab_pending_kthread_work() for details.
+ */
+bool mod_delayed_kthread_work(struct kthread_worker *worker,
+			      struct delayed_kthread_work *dwork,
+			      unsigned long delay)
+{
+	unsigned long flags;
+	int ret;
+
+	do {
+		ret = try_to_grab_pending_kthread_work(&dwork->work,
+						       true, &flags);
+	} while (unlikely(ret == -EAGAIN));
+
+	if (likely(ret >= 0)) {
+		__queue_delayed_kthread_work(worker, dwork, delay);
+		local_irq_restore(flags);
+	}
+
+	/* -ENOENT from try_to_grab_pending() becomes %true. */
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mod_delayed_kthread_work);
+
 /* custom wait for canceling a kthread work */
 struct cktw_wait {
 	wait_queue_t		wait;
-- 
1.8.5.6



* [RFC v2 09/18] mm/huge_page: Convert khugepaged() into kthread worker API
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (7 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 08/18] kthread: Allow to modify delayed " Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-22 20:26   ` Tejun Heo
  2015-09-21 13:03 ` [RFC v2 10/18] ring_buffer: Do not complete benchmark reader too early Petr Mladek
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

Kthreads are currently implemented as an infinite loop. Each
has its own variant of checks for terminating, freezing,
awakening. In many cases it is hard to say in which state
the kthread is and sometimes it is done the wrong way.

The plan is to convert kthreads into the kthread worker or workqueue
API. It allows splitting the functionality into separate operations
and gives a better structure. Also it defines a clean state
where no locks are taken, no IRQs are blocked, and the kthread might
sleep or even be safely migrated.

The kthread worker API is useful when we want to have a dedicated
single thread for the work. It helps to make sure that it is
available when needed. Also it allows better control, e.g.
defining a scheduling priority.

This patch converts khugepaged() into the kthread worker API
because it modifies the scheduling.

It keeps the functionality except that we do not wake up the worker
when it is already created and someone calls start() once again.

The scan work is queued only when the list of scanned pages is
not empty. The delay between scans is implemented using a delayed work.

Note that the @khugepaged_wait waitqueue had two purposes. It was used
to wait between scans and when an allocation failed. It is still used
for the second purpose. Therefore it was renamed to better describe
the current use.

Also note that we can no longer check for kthread_should_stop()
in the works. The kthread used by the worker has to stay alive
until all queued works are finished. Instead, we use the existing
check khugepaged_enabled() that returns false when we are going down.
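
The conversion follows a general pattern: the body of the old loop
becomes a self-requeuing delayed work and the setup/cleanup become
one-shot works. A condensed sketch of that pattern (not the khugepaged
code itself; the helpers are illustrative):

    static struct kthread_worker *example_worker;
    static void example_scan_func(struct kthread_work *dummy);
    static DEFINE_DELAYED_KTHREAD_WORK(example_scan_work, example_scan_func);

    static void example_scan_func(struct kthread_work *dummy)
    {
            example_do_one_scan_pass();     /* former loop body */

            /* former wait_event(): re-queue itself while there is work */
            if (example_has_work())
                    queue_delayed_kthread_work(example_worker,
                                               &example_scan_work,
                                               example_scan_delay());
    }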

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 mm/huge_memory.c | 116 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 62 insertions(+), 54 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b06b8db9df2..d5030fe7b687 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -57,10 +57,19 @@ static unsigned int khugepaged_full_scans;
 static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 /* during fragmentation poll the hugepage allocator once every minute */
 static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
-static struct task_struct *khugepaged_thread __read_mostly;
+
+static void khugepaged_init_func(struct kthread_work *dummy);
+static void khugepaged_do_scan_func(struct kthread_work *dummy);
+static void khugepaged_cleanup_func(struct kthread_work *dummy);
+static struct kthread_worker *khugepaged_worker;
+static DEFINE_KTHREAD_WORK(khugepaged_init_work, khugepaged_init_func);
+static DEFINE_DELAYED_KTHREAD_WORK(khugepaged_do_scan_work,
+				   khugepaged_do_scan_func);
+static DEFINE_KTHREAD_WORK(khugepaged_cleanup_work, khugepaged_cleanup_func);
+
 static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
-static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
+static DECLARE_WAIT_QUEUE_HEAD(khugepaged_alloc_wait);
 /*
  * default collapse hugepages if there is at least one pte mapped like
  * it would have happened if the vma was large enough during page
@@ -68,7 +77,6 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
  */
 static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
 
-static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
 static void khugepaged_slab_exit(void);
 
@@ -144,29 +152,43 @@ static void set_recommended_min_free_kbytes(void)
 	setup_per_zone_wmarks();
 }
 
+static int khugepaged_has_work(void)
+{
+	return !list_empty(&khugepaged_scan.mm_head) &&
+		khugepaged_enabled();
+}
+
 static int start_stop_khugepaged(void)
 {
 	int err = 0;
 	if (khugepaged_enabled()) {
-		if (!khugepaged_thread)
-			khugepaged_thread = kthread_run(khugepaged, NULL,
-							"khugepaged");
-		if (unlikely(IS_ERR(khugepaged_thread))) {
-			pr_err("khugepaged: kthread_run(khugepaged) failed\n");
-			err = PTR_ERR(khugepaged_thread);
-			khugepaged_thread = NULL;
-			goto fail;
+		if (khugepaged_worker)
+			goto out;
+
+		khugepaged_worker = create_kthread_worker("khugepaged");
+
+		if (unlikely(IS_ERR(khugepaged_worker))) {
+			pr_err("khugepaged: failed to create kthread worker\n");
+			khugepaged_worker = NULL;
+			goto out;
 		}
 
+		queue_kthread_work(khugepaged_worker,
+				   &khugepaged_init_work);
+
 		if (!list_empty(&khugepaged_scan.mm_head))
-			wake_up_interruptible(&khugepaged_wait);
+			queue_delayed_kthread_work(khugepaged_worker,
+						   &khugepaged_do_scan_work,
+						   0);
 
 		set_recommended_min_free_kbytes();
-	} else if (khugepaged_thread) {
-		kthread_stop(khugepaged_thread);
-		khugepaged_thread = NULL;
+	} else if (khugepaged_worker) {
+		cancel_delayed_kthread_work_sync(&khugepaged_do_scan_work);
+		queue_kthread_work(khugepaged_worker, &khugepaged_cleanup_work);
+		destroy_kthread_worker(khugepaged_worker);
+		khugepaged_worker = NULL;
 	}
-fail:
+out:
 	return err;
 }
 
@@ -425,7 +447,10 @@ static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
 		return -EINVAL;
 
 	khugepaged_scan_sleep_millisecs = msecs;
-	wake_up_interruptible(&khugepaged_wait);
+	if (khugepaged_has_work())
+		mod_delayed_kthread_work(khugepaged_worker,
+					 &khugepaged_do_scan_work,
+					 0);
 
 	return count;
 }
@@ -452,7 +477,7 @@ static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
 		return -EINVAL;
 
 	khugepaged_alloc_sleep_millisecs = msecs;
-	wake_up_interruptible(&khugepaged_wait);
+	wake_up_interruptible(&khugepaged_alloc_wait);
 
 	return count;
 }
@@ -2120,7 +2145,9 @@ int __khugepaged_enter(struct mm_struct *mm)
 
 	atomic_inc(&mm->mm_count);
 	if (wakeup)
-		wake_up_interruptible(&khugepaged_wait);
+		mod_delayed_kthread_work(khugepaged_worker,
+					 &khugepaged_do_scan_work,
+					 0);
 
 	return 0;
 }
@@ -2335,10 +2362,10 @@ static void khugepaged_alloc_sleep(void)
 {
 	DEFINE_WAIT(wait);
 
-	add_wait_queue(&khugepaged_wait, &wait);
+	add_wait_queue(&khugepaged_alloc_wait, &wait);
 	freezable_schedule_timeout_interruptible(
 		msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
-	remove_wait_queue(&khugepaged_wait, &wait);
+	remove_wait_queue(&khugepaged_alloc_wait, &wait);
 }
 
 static int khugepaged_node_load[MAX_NUMNODES];
@@ -2849,19 +2876,13 @@ breakouterloop_mmap_sem:
 	return progress;
 }
 
-static int khugepaged_has_work(void)
-{
-	return !list_empty(&khugepaged_scan.mm_head) &&
-		khugepaged_enabled();
-}
-
-static int khugepaged_wait_event(void)
+static void khugepaged_init_func(struct kthread_work *dummy)
 {
-	return !list_empty(&khugepaged_scan.mm_head) ||
-		kthread_should_stop();
+	set_freezable();
+	set_user_nice(current, MAX_NICE);
 }
 
-static void khugepaged_do_scan(void)
+static void khugepaged_do_scan_func(struct kthread_work *dummy)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
@@ -2876,7 +2897,7 @@ static void khugepaged_do_scan(void)
 
 		cond_resched();
 
-		if (unlikely(kthread_should_stop() || try_to_freeze()))
+		if (unlikely(!khugepaged_enabled() || try_to_freeze()))
 			break;
 
 		spin_lock(&khugepaged_mm_lock);
@@ -2893,43 +2914,30 @@ static void khugepaged_do_scan(void)
 
 	if (!IS_ERR_OR_NULL(hpage))
 		put_page(hpage);
-}
 
-static void khugepaged_wait_work(void)
-{
 	if (khugepaged_has_work()) {
-		if (!khugepaged_scan_sleep_millisecs)
-			return;
 
-		wait_event_freezable_timeout(khugepaged_wait,
-					     kthread_should_stop(),
-			msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
-		return;
-	}
+		unsigned long delay = 0;
 
-	if (khugepaged_enabled())
-		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
+		if (khugepaged_scan_sleep_millisecs)
+			delay = msecs_to_jiffies(khugepaged_scan_sleep_millisecs);
+
+		queue_delayed_kthread_work(khugepaged_worker,
+					   &khugepaged_do_scan_work,
+					   delay);
+	}
 }
 
-static int khugepaged(void *none)
+static void khugepaged_cleanup_func(struct kthread_work *dummy)
 {
 	struct mm_slot *mm_slot;
 
-	set_freezable();
-	set_user_nice(current, MAX_NICE);
-
-	while (!kthread_should_stop()) {
-		khugepaged_do_scan();
-		khugepaged_wait_work();
-	}
-
 	spin_lock(&khugepaged_mm_lock);
 	mm_slot = khugepaged_scan.mm_slot;
 	khugepaged_scan.mm_slot = NULL;
 	if (mm_slot)
 		collect_mm_slot(mm_slot);
 	spin_unlock(&khugepaged_mm_lock);
-	return 0;
 }
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-- 
1.8.5.6



* [RFC v2 10/18] ring_buffer: Do not complete benchmark reader too early
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (8 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 09/18] mm/huge_page: Convert khugepaged() into kthread worker API Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 11/18] ring_buffer: Fix more races when terminating the producer in the benchmark Petr Mladek
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

It seems that complete(&read_done) might be called too early
in some situations.

1st scenario:
-------------

CPU0					CPU1

ring_buffer_producer_thread()
  wake_up_process(consumer);
  wait_for_completion(&read_start);

					ring_buffer_consumer_thread()
					  complete(&read_start);

  ring_buffer_producer()
    # producing data in
    # the do-while cycle

					  ring_buffer_consumer();
					    # reading data
					    # got error
					    # set kill_test = 1;
					    set_current_state(
						TASK_INTERRUPTIBLE);
					    if (reader_finish)  # false
					    schedule();

    # producer still in the middle of
    # do-while cycle
    if (consumer && !(cnt % wakeup_interval))
      wake_up_process(consumer);

					    # spurious wakeup
					    while (!reader_finish &&
						   !kill_test)
					    # leaving because
					    # kill_test == 1
					    reader_finish = 0;
					    complete(&read_done);

1st BANG: We might access an uninitialized "read_done" if this is
	  the first round.

    # producer finally leaving
    # the do-while cycle because kill_test == 1;

    if (consumer) {
      reader_finish = 1;
      wake_up_process(consumer);
      wait_for_completion(&read_done);

2nd BANG: This will never complete because the consumer has
	  already done the completion.

2nd scenario:
-------------

CPU0					CPU1

ring_buffer_producer_thread()
  wake_up_process(consumer);
  wait_for_completion(&read_start);

					ring_buffer_consumer_thread()
					  complete(&read_start);

  ring_buffer_producer()
    # CPU3 removes the module	  <--- difference from
    # and stops producer          <--- the 1st scenario
    if (kthread_should_stop())
      kill_test = 1;

					  ring_buffer_consumer();
					    while (!reader_finish &&
						   !kill_test)
					    # kill_test == 1 => we never go
					    # into the top level while()
					    reader_finish = 0;
					    complete(&read_done);

    # producer still in the middle of
    # do-while cycle
    if (consumer && !(cnt % wakeup_interval))
      wake_up_process(consumer);

					    # spurious wakeup
					    while (!reader_finish &&
						   !kill_test)
					    # leaving because kill_test == 1
					    reader_finish = 0;
					    complete(&read_done);

BANG: We are in the same "bang" situations as in the 1st scenario.

Root of the problem:
--------------------

ring_buffer_consumer() must complete "read_done" only when the
"reader_finish" variable is set. The completion must not be skipped
due to other conditions.

Note that we still must keep the check for "reader_finish" in a loop
because there might be spurious wakeups as described in the
above scenarios.

Solution:
----------

The top-level cycle in ring_buffer_consumer() will finish only when
"reader_finish" is set. The data will be read in a "while-do" cycle
so that it is not read after an error (kill_test == 1)
or a spurious wakeup.

In addition, "reader_finish" is manipulated by the producer thread.
Therefore we add READ_ONCE() to make sure that a fresh value is
read in each cycle. We also add the corresponding barrier
to synchronize the sleep check.

Next we set the state back to TASK_RUNNING for the situation where we
did not sleep.

Just out of paranoia, we initialize both completions statically.
This is safer in case there are other races that we are unaware of.

As a side effect, we can remove the memory barrier from
ring_buffer_producer_thread(). IMHO, this was the reason for
the barrier: ring_buffer_reset() uses spin locks that should
provide the needed memory barriers for using the buffer.
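
To make the intended invariant concrete, below is a minimal userspace
sketch (C11 atomics plus POSIX threads, not the kernel code itself):
the consumer may stop reading because of an error, but it signals
completion only after the producer has explicitly set reader_finish,
and spurious wakeups just loop back into the wait.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int reader_finish;
static atomic_int kill_test;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;
static pthread_cond_t done = PTHREAD_COND_INITIALIZER;
static int done_flag;

static int read_one_event(void) { return 0; }	/* stub: no more data */

static void *consumer(void *arg)
{
	while (!atomic_load(&reader_finish)) {
		int found = 1;

		/* Drain data, but stop draining on an error. */
		while (found && !atomic_load(&kill_test))
			found = read_one_event();

		/* Sleep until poked; re-check reader_finish first. */
		pthread_mutex_lock(&lock);
		if (!atomic_load(&reader_finish))
			pthread_cond_wait(&wake, &lock);
		pthread_mutex_unlock(&lock);
	}

	/* Reached only when the producer explicitly asked us to finish. */
	pthread_mutex_lock(&lock);
	done_flag = 1;
	pthread_cond_signal(&done);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, consumer, NULL);
	atomic_store(&kill_test, 1);		/* simulate a test error */

	/* Producer shutdown: set reader_finish, poke, wait for "done". */
	pthread_mutex_lock(&lock);
	atomic_store(&reader_finish, 1);
	pthread_cond_signal(&wake);
	while (!done_flag)
		pthread_cond_wait(&done, &lock);
	pthread_mutex_unlock(&lock);

	pthread_join(t, NULL);
	puts("consumer completed only on request");
	return 0;
}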

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/trace/ring_buffer_benchmark.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index a1503a027ee2..9ea7949366b3 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -24,8 +24,8 @@ struct rb_page {
 static int wakeup_interval = 100;
 
 static int reader_finish;
-static struct completion read_start;
-static struct completion read_done;
+static DECLARE_COMPLETION(read_start);
+static DECLARE_COMPLETION(read_done);
 
 static struct ring_buffer *buffer;
 static struct task_struct *producer;
@@ -178,10 +178,14 @@ static void ring_buffer_consumer(void)
 	read_events ^= 1;
 
 	read = 0;
-	while (!reader_finish && !kill_test) {
-		int found;
+	/*
+	 * Continue running until the producer specifically asks to stop
+	 * and is ready for the completion.
+	 */
+	while (!READ_ONCE(reader_finish)) {
+		int found = 1;
 
-		do {
+		while (found && !kill_test) {
 			int cpu;
 
 			found = 0;
@@ -195,17 +199,23 @@ static void ring_buffer_consumer(void)
 
 				if (kill_test)
 					break;
+
 				if (stat == EVENT_FOUND)
 					found = 1;
+
 			}
-		} while (found && !kill_test);
+		}
 
+		/* Wait till the producer wakes us up when there is more data
+		 * available or when the producer wants us to finish reading.
+		 */
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (reader_finish)
 			break;
 
 		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 	reader_finish = 0;
 	complete(&read_done);
 }
@@ -389,13 +399,10 @@ static int ring_buffer_consumer_thread(void *arg)
 
 static int ring_buffer_producer_thread(void *arg)
 {
-	init_completion(&read_start);
-
 	while (!kthread_should_stop() && !kill_test) {
 		ring_buffer_reset(buffer);
 
 		if (consumer) {
-			smp_wmb();
 			wake_up_process(consumer);
 			wait_for_completion(&read_start);
 		}
-- 
1.8.5.6



* [RFC v2 11/18] ring_buffer: Fix more races when terminating the producer in the benchmark
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (9 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 10/18] ring_buffer: Do no not complete benchmark reader too early Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 12/18] ring_buffer: Convert benchmark kthreads into kthread worker API Petr Mladek
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

The commit b44754d8262d3aab8 ("ring_buffer: Allow to exit the ring
buffer benchmark immediately") added a hack into ring_buffer_producer()
that set @kill_test when kthread_should_stop() returned true. It improved
the situation a lot. It stopped the kthread in most cases because
the producer spent most of the time in the patched while cycle.

But there are still a few possible races when kthread_should_stop()
becomes true outside of the cycle. Then we do not set @kill_test and
some other checks pass.

This patch adds a better fix. It renames @kill_test/KILL_TEST() to
the more descriptive @test_error/TEST_ERROR(). It also introduces
a break_test() function that checks both @test_error and
kthread_should_stop().

The new function is used in the producer when the check for @test_error
is not enough. It is not used in the consumer because its state
is manipulated by the producer via the "reader_finish" variable.

We also add a missing check to ring_buffer_producer_thread()
between setting TASK_INTERRUPTIBLE and calling schedule_timeout().
Otherwise, we might miss a wakeup from kthread_stop().
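
A stop request that races with the top-of-loop check would otherwise
leave the producer sleeping for the full 10 seconds. Below is a
userspace sketch (POSIX threads, not the kernel code) of the same
pattern: the stop condition is re-checked right before sleeping, which
plays the role of the break_test() call added between
set_current_state() and schedule_timeout().

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool stop_requested;
static bool test_error;

/* Userspace analogue of the new break_test() helper. */
static bool break_test(void)
{
	return test_error || stop_requested;
}

static void *producer_loop(void *arg)
{
	pthread_mutex_lock(&lock);
	while (!break_test()) {
		struct timespec deadline;

		pthread_mutex_unlock(&lock);
		/* ... hammer the ring buffer for a while ... */
		pthread_mutex_lock(&lock);

		/* The added check: re-test right before sleeping. */
		if (break_test())
			break;

		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_sec += 10;		/* "Sleeping for 10 secs" */
		pthread_cond_timedwait(&cond, &lock, &deadline);
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, producer_loop, NULL);
	sleep(1);

	/* The "kthread_stop()" side: set the flag and wake the sleeper. */
	pthread_mutex_lock(&lock);
	stop_requested = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);

	pthread_join(t, NULL);
	puts("producer stopped without sleeping out the full timeout");
	return 0;
}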

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/trace/ring_buffer_benchmark.c | 54 +++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 9ea7949366b3..9e00fd178226 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -60,12 +60,12 @@ MODULE_PARM_DESC(consumer_fifo, "fifo prio for consumer");
 
 static int read_events;
 
-static int kill_test;
+static int test_error;
 
-#define KILL_TEST()				\
+#define TEST_ERROR()				\
 	do {					\
-		if (!kill_test) {		\
-			kill_test = 1;		\
+		if (!test_error) {		\
+			test_error = 1;		\
 			WARN_ON(1);		\
 		}				\
 	} while (0)
@@ -75,6 +75,11 @@ enum event_status {
 	EVENT_DROPPED,
 };
 
+static bool break_test(void)
+{
+	return test_error || kthread_should_stop();
+}
+
 static enum event_status read_event(int cpu)
 {
 	struct ring_buffer_event *event;
@@ -87,7 +92,7 @@ static enum event_status read_event(int cpu)
 
 	entry = ring_buffer_event_data(event);
 	if (*entry != cpu) {
-		KILL_TEST();
+		TEST_ERROR();
 		return EVENT_DROPPED;
 	}
 
@@ -115,10 +120,10 @@ static enum event_status read_page(int cpu)
 		rpage = bpage;
 		/* The commit may have missed event flags set, clear them */
 		commit = local_read(&rpage->commit) & 0xfffff;
-		for (i = 0; i < commit && !kill_test; i += inc) {
+		for (i = 0; i < commit && !test_error ; i += inc) {
 
 			if (i >= (PAGE_SIZE - offsetof(struct rb_page, data))) {
-				KILL_TEST();
+				TEST_ERROR();
 				break;
 			}
 
@@ -128,7 +133,7 @@ static enum event_status read_page(int cpu)
 			case RINGBUF_TYPE_PADDING:
 				/* failed writes may be discarded events */
 				if (!event->time_delta)
-					KILL_TEST();
+					TEST_ERROR();
 				inc = event->array[0] + 4;
 				break;
 			case RINGBUF_TYPE_TIME_EXTEND:
@@ -137,12 +142,12 @@ static enum event_status read_page(int cpu)
 			case 0:
 				entry = ring_buffer_event_data(event);
 				if (*entry != cpu) {
-					KILL_TEST();
+					TEST_ERROR();
 					break;
 				}
 				read++;
 				if (!event->array[0]) {
-					KILL_TEST();
+					TEST_ERROR();
 					break;
 				}
 				inc = event->array[0] + 4;
@@ -150,17 +155,17 @@ static enum event_status read_page(int cpu)
 			default:
 				entry = ring_buffer_event_data(event);
 				if (*entry != cpu) {
-					KILL_TEST();
+					TEST_ERROR();
 					break;
 				}
 				read++;
 				inc = ((event->type_len + 1) * 4);
 			}
-			if (kill_test)
+			if (test_error)
 				break;
 
 			if (inc <= 0) {
-				KILL_TEST();
+				TEST_ERROR();
 				break;
 			}
 		}
@@ -185,7 +190,7 @@ static void ring_buffer_consumer(void)
 	while (!READ_ONCE(reader_finish)) {
 		int found = 1;
 
-		while (found && !kill_test) {
+		while (found && !test_error) {
 			int cpu;
 
 			found = 0;
@@ -197,7 +202,7 @@ static void ring_buffer_consumer(void)
 				else
 					stat = read_page(cpu);
 
-				if (kill_test)
+				if (test_error)
 					break;
 
 				if (stat == EVENT_FOUND)
@@ -273,10 +278,7 @@ static void ring_buffer_producer(void)
 		if (cnt % wakeup_interval)
 			cond_resched();
 #endif
-		if (kthread_should_stop())
-			kill_test = 1;
-
-	} while (ktime_before(end_time, timeout) && !kill_test);
+	} while (ktime_before(end_time, timeout) && !break_test());
 	trace_printk("End ring buffer hammer\n");
 
 	if (consumer) {
@@ -297,7 +299,7 @@ static void ring_buffer_producer(void)
 	entries = ring_buffer_entries(buffer);
 	overruns = ring_buffer_overruns(buffer);
 
-	if (kill_test && !kthread_should_stop())
+	if (test_error)
 		trace_printk("ERROR!\n");
 
 	if (!disable_reader) {
@@ -378,15 +380,14 @@ static void wait_to_die(void)
 
 static int ring_buffer_consumer_thread(void *arg)
 {
-	while (!kthread_should_stop() && !kill_test) {
+	while (!break_test()) {
 		complete(&read_start);
 
 		ring_buffer_consumer();
 
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (kthread_should_stop() || kill_test)
+		if (break_test())
 			break;
-
 		schedule();
 	}
 	__set_current_state(TASK_RUNNING);
@@ -399,7 +400,7 @@ static int ring_buffer_consumer_thread(void *arg)
 
 static int ring_buffer_producer_thread(void *arg)
 {
-	while (!kthread_should_stop() && !kill_test) {
+	while (!break_test()) {
 		ring_buffer_reset(buffer);
 
 		if (consumer) {
@@ -408,15 +409,18 @@ static int ring_buffer_producer_thread(void *arg)
 		}
 
 		ring_buffer_producer();
-		if (kill_test)
+		if (break_test())
 			goto out_kill;
 
 		trace_printk("Sleeping for 10 secs\n");
 		set_current_state(TASK_INTERRUPTIBLE);
+		if (break_test())
+			goto out_kill;
 		schedule_timeout(HZ * SLEEP_TIME);
 	}
 
 out_kill:
+	__set_current_state(TASK_RUNNING);
 	if (!kthread_should_stop())
 		wait_to_die();
 
-- 
1.8.5.6



* [RFC v2 12/18] ring_buffer: Convert benchmark kthreads into kthread worker API
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (10 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 11/18] ring_buffer: Fix more races when terminating the producer in the benchmark Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 13/18] rcu: Finish folding ->fqs_state into ->gp_state Petr Mladek
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

Kthreads are currently implemented as an infinite loop. Each
has its own variant of checks for terminating, freezing, and
awakening. In many cases it is hard to say which state a kthread
is in, and sometimes it is done the wrong way.

The plan is to convert kthreads to the kthread_worker or workqueue
API. It allows the functionality to be split into separate operations
and gives the code a better structure. It also defines a clean state
where no locks are held and no IRQs are blocked, so the kthread might
sleep or even be safely migrated.

The kthread worker API is useful when we want a dedicated single
thread for the work. It helps to make sure that the thread is
available when needed. It also allows better control, e.g. setting
a scheduling priority.

This patch converts the ring buffer benchmark producer into a kthread
worker because it modifies the scheduling priority and policy.
Also, it is a benchmark: it makes the CPU very busy and will most
likely run only for a limited time. IMHO, it does not make sense to
clutter the system workqueues with it.

The thread is split into two independent works. It might look more
complicated but it helped me to find a race in the sleeping part
that was fixed separately.

kthread_should_stop() can no longer be used inside the works
because it defines the life of the worker, which needs to stay
usable until all works are done. Instead, we add a @test_end
global variable. It is set during normal termination, in contrast
with @test_error.
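
For orientation, here is a condensed sketch of the resulting pattern,
written against the kthread worker API proposed earlier in this series
(create_kthread_worker(), DEFINE_DELAYED_KTHREAD_WORK(),
queue_delayed_kthread_work(), ...), which is not mainline; the names
below are illustrative, not the benchmark's. Instead of sleeping inside
an endless loop, the work does one round and re-queues itself with a
delay, and @test_end stops the cycle while the worker stays usable
until destroy_kthread_worker().

#include <linux/module.h>
#include <linux/kthread.h>

static struct kthread_worker *bench_worker;
static int test_end;

static void bench_iteration_func(struct kthread_work *dummy);
static DEFINE_DELAYED_KTHREAD_WORK(bench_iteration_work,
				   bench_iteration_func);

static void bench_iteration_func(struct kthread_work *dummy)
{
	if (test_end)
		return;

	/* ... one round of producing and measuring ... */

	/* Re-arm instead of schedule_timeout() in an infinite loop. */
	queue_delayed_kthread_work(bench_worker, &bench_iteration_work,
				   10 * HZ);
}

static int __init bench_init(void)
{
	bench_worker = create_kthread_worker("bench");
	if (IS_ERR(bench_worker))
		return PTR_ERR(bench_worker);

	queue_delayed_kthread_work(bench_worker, &bench_iteration_work, 0);
	return 0;
}

static void __exit bench_exit(void)
{
	test_end = 1;
	cancel_delayed_kthread_work_sync(&bench_iteration_work);
	destroy_kthread_worker(bench_worker);
}

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");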

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/trace/ring_buffer_benchmark.c | 133 ++++++++++++++++-------------------
 1 file changed, 59 insertions(+), 74 deletions(-)

diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 9e00fd178226..3f27ff6debd3 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -26,10 +26,17 @@ static int wakeup_interval = 100;
 static int reader_finish;
 static DECLARE_COMPLETION(read_start);
 static DECLARE_COMPLETION(read_done);
-
 static struct ring_buffer *buffer;
-static struct task_struct *producer;
-static struct task_struct *consumer;
+
+static void rb_producer_hammer_func(struct kthread_work *dummy);
+static struct kthread_worker *rb_producer_worker;
+static DEFINE_DELAYED_KTHREAD_WORK(rb_producer_hammer_work,
+				   rb_producer_hammer_func);
+
+static void rb_consumer_func(struct kthread_work *dummy);
+static struct kthread_worker *rb_consumer_worker;
+static DEFINE_KTHREAD_WORK(rb_consumer_work, rb_consumer_func);
+
 static unsigned long read;
 
 static unsigned int disable_reader;
@@ -61,6 +68,7 @@ MODULE_PARM_DESC(consumer_fifo, "fifo prio for consumer");
 static int read_events;
 
 static int test_error;
+static int test_end;
 
 #define TEST_ERROR()				\
 	do {					\
@@ -77,7 +85,7 @@ enum event_status {
 
 static bool break_test(void)
 {
-	return test_error || kthread_should_stop();
+	return test_error || test_end;
 }
 
 static enum event_status read_event(int cpu)
@@ -262,8 +270,8 @@ static void ring_buffer_producer(void)
 		end_time = ktime_get();
 
 		cnt++;
-		if (consumer && !(cnt % wakeup_interval))
-			wake_up_process(consumer);
+		if (rb_consumer_worker && !(cnt % wakeup_interval))
+			wake_up_process(rb_consumer_worker->task);
 
 #ifndef CONFIG_PREEMPT
 		/*
@@ -281,7 +289,7 @@ static void ring_buffer_producer(void)
 	} while (ktime_before(end_time, timeout) && !break_test());
 	trace_printk("End ring buffer hammer\n");
 
-	if (consumer) {
+	if (rb_consumer_worker) {
 		/* Init both completions here to avoid races */
 		init_completion(&read_start);
 		init_completion(&read_done);
@@ -290,7 +298,7 @@ static void ring_buffer_producer(void)
 		reader_finish = 1;
 		/* finish var visible before waking up the consumer */
 		smp_wmb();
-		wake_up_process(consumer);
+		wake_up_process(rb_consumer_worker->task);
 		wait_for_completion(&read_done);
 	}
 
@@ -368,68 +376,39 @@ static void ring_buffer_producer(void)
 	}
 }
 
-static void wait_to_die(void)
-{
-	set_current_state(TASK_INTERRUPTIBLE);
-	while (!kthread_should_stop()) {
-		schedule();
-		set_current_state(TASK_INTERRUPTIBLE);
-	}
-	__set_current_state(TASK_RUNNING);
-}
-
-static int ring_buffer_consumer_thread(void *arg)
+static void rb_consumer_func(struct kthread_work *dummy)
 {
-	while (!break_test()) {
-		complete(&read_start);
-
-		ring_buffer_consumer();
+	complete(&read_start);
 
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (break_test())
-			break;
-		schedule();
-	}
-	__set_current_state(TASK_RUNNING);
-
-	if (!kthread_should_stop())
-		wait_to_die();
-
-	return 0;
+	ring_buffer_consumer();
 }
 
-static int ring_buffer_producer_thread(void *arg)
+static void rb_producer_hammer_func(struct kthread_work *dummy)
 {
-	while (!break_test()) {
-		ring_buffer_reset(buffer);
+	if (break_test())
+		return;
 
-		if (consumer) {
-			wake_up_process(consumer);
-			wait_for_completion(&read_start);
-		}
-
-		ring_buffer_producer();
-		if (break_test())
-			goto out_kill;
+	ring_buffer_reset(buffer);
 
-		trace_printk("Sleeping for 10 secs\n");
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (break_test())
-			goto out_kill;
-		schedule_timeout(HZ * SLEEP_TIME);
+	if (rb_consumer_worker) {
+		queue_kthread_work(rb_consumer_worker, &rb_consumer_work);
+		wait_for_completion(&read_start);
 	}
 
-out_kill:
-	__set_current_state(TASK_RUNNING);
-	if (!kthread_should_stop())
-		wait_to_die();
+	ring_buffer_producer();
 
-	return 0;
+	if (break_test())
+		return;
+
+	trace_printk("Sleeping for 10 secs\n");
+	queue_delayed_kthread_work(rb_producer_worker,
+				   &rb_producer_hammer_work,
+				   HZ * SLEEP_TIME);
 }
 
 static int __init ring_buffer_benchmark_init(void)
 {
-	int ret;
+	int ret = 0;
 
 	/* make a one meg buffer in overwite mode */
 	buffer = ring_buffer_alloc(1000000, RB_FL_OVERWRITE);
@@ -437,19 +416,21 @@ static int __init ring_buffer_benchmark_init(void)
 		return -ENOMEM;
 
 	if (!disable_reader) {
-		consumer = kthread_create(ring_buffer_consumer_thread,
-					  NULL, "rb_consumer");
-		ret = PTR_ERR(consumer);
-		if (IS_ERR(consumer))
+		rb_consumer_worker = create_kthread_worker("rb_consumer");
+		if (IS_ERR(rb_consumer_worker)) {
+			ret = PTR_ERR(rb_consumer_worker);
 			goto out_fail;
+		}
 	}
 
-	producer = kthread_run(ring_buffer_producer_thread,
-			       NULL, "rb_producer");
-	ret = PTR_ERR(producer);
-
-	if (IS_ERR(producer))
+	rb_producer_worker = create_kthread_worker("rb_producer");
+	if (IS_ERR(rb_producer_worker)) {
+		ret = PTR_ERR(rb_producer_worker);
 		goto out_kill;
+	}
+
+	queue_delayed_kthread_work(rb_producer_worker,
+				   &rb_producer_hammer_work, 0);
 
 	/*
 	 * Run them as low-prio background tasks by default:
@@ -459,24 +440,26 @@ static int __init ring_buffer_benchmark_init(void)
 			struct sched_param param = {
 				.sched_priority = consumer_fifo
 			};
-			sched_setscheduler(consumer, SCHED_FIFO, &param);
+			sched_setscheduler(rb_consumer_worker->task,
+					   SCHED_FIFO, &param);
 		} else
-			set_user_nice(consumer, consumer_nice);
+			set_user_nice(rb_consumer_worker->task, consumer_nice);
 	}
 
 	if (producer_fifo >= 0) {
 		struct sched_param param = {
 			.sched_priority = producer_fifo
 		};
-		sched_setscheduler(producer, SCHED_FIFO, &param);
+		sched_setscheduler(rb_producer_worker->task,
+				   SCHED_FIFO, &param);
 	} else
-		set_user_nice(producer, producer_nice);
+		set_user_nice(rb_producer_worker->task, producer_nice);
 
 	return 0;
 
  out_kill:
-	if (consumer)
-		kthread_stop(consumer);
+	if (rb_consumer_worker)
+		destroy_kthread_worker(rb_consumer_worker);
 
  out_fail:
 	ring_buffer_free(buffer);
@@ -485,9 +468,11 @@ static int __init ring_buffer_benchmark_init(void)
 
 static void __exit ring_buffer_benchmark_exit(void)
 {
-	kthread_stop(producer);
-	if (consumer)
-		kthread_stop(consumer);
+	test_end = 1;
+	cancel_delayed_kthread_work_sync(&rb_producer_hammer_work);
+	destroy_kthread_worker(rb_producer_worker);
+	if (rb_consumer_worker)
+		destroy_kthread_worker(rb_consumer_worker);
 	ring_buffer_free(buffer);
 }
 
-- 
1.8.5.6



* [RFC v2 13/18] rcu: Finish folding ->fqs_state into ->gp_state
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (11 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 12/18] ring_buffer: Convert benchmark kthreads into kthread worker API Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 14/18] rcu: Store first_gp_fqs into struct rcu_state Petr Mladek
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

Commit 4cdfc175c25c89ee ("rcu: Move quiescent-state forcing
into kthread") started the process of folding the old ->fqs_state into
->gp_state, but did not complete it.  This situation does not cause
any malfunction, but can result in extremely confusing trace output.
This commit completes this task of eliminating ->fqs_state in favor
of ->gp_state.

The old ->fqs_state was also used to decide when to collect dyntick-idle
snapshots.  For this purpose, we add a boolean variable to the kthread,
which is true for the first call to rcu_gp_fqs() in a given grace period
and false otherwise.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c       | 18 ++++++++----------
 kernel/rcu/tree.h       | 14 +++-----------
 kernel/rcu/tree_trace.c |  2 +-
 3 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9f75f25cc5d9..5413d87a67c6 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -98,7 +98,7 @@ struct rcu_state sname##_state = { \
 	.level = { &sname##_state.node[0] }, \
 	.rda = &sname##_data, \
 	.call = cr, \
-	.fqs_state = RCU_GP_IDLE, \
+	.gp_state = RCU_GP_IDLE, \
 	.gpnum = 0UL - 300UL, \
 	.completed = 0UL - 300UL, \
 	.orphan_lock = __RAW_SPIN_LOCK_UNLOCKED(&sname##_state.orphan_lock), \
@@ -1927,16 +1927,15 @@ static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp, int *gfp)
 /*
  * Do one round of quiescent-state forcing.
  */
-static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
+static void rcu_gp_fqs(struct rcu_state *rsp, bool first_time)
 {
-	int fqs_state = fqs_state_in;
 	bool isidle = false;
 	unsigned long maxj;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	WRITE_ONCE(rsp->gp_activity, jiffies);
 	rsp->n_force_qs++;
-	if (fqs_state == RCU_SAVE_DYNTICK) {
+	if (first_time) {
 		/* Collect dyntick-idle snapshots. */
 		if (is_sysidle_rcu_state(rsp)) {
 			isidle = true;
@@ -1945,7 +1944,6 @@ static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
 		force_qs_rnp(rsp, dyntick_save_progress_counter,
 			     &isidle, &maxj);
 		rcu_sysidle_report_gp(rsp, isidle, maxj);
-		fqs_state = RCU_FORCE_QS;
 	} else {
 		/* Handle dyntick-idle and offline CPUs. */
 		isidle = true;
@@ -1959,7 +1957,6 @@ static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
 			   READ_ONCE(rsp->gp_flags) & ~RCU_GP_FLAG_FQS);
 		raw_spin_unlock_irq(&rnp->lock);
 	}
-	return fqs_state;
 }
 
 /*
@@ -2023,7 +2020,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	/* Declare grace period done. */
 	WRITE_ONCE(rsp->completed, rsp->gpnum);
 	trace_rcu_grace_period(rsp->name, rsp->completed, TPS("end"));
-	rsp->fqs_state = RCU_GP_IDLE;
+	rsp->gp_state = RCU_GP_IDLE;
 	rdp = this_cpu_ptr(rsp->rda);
 	/* Advance CBs to reduce false positives below. */
 	needgp = rcu_advance_cbs(rsp, rnp, rdp) || needgp;
@@ -2041,7 +2038,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
  */
 static int __noreturn rcu_gp_kthread(void *arg)
 {
-	int fqs_state;
+	bool first_gp_fqs;
 	int gf;
 	unsigned long j;
 	int ret;
@@ -2073,7 +2070,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 		}
 
 		/* Handle quiescent-state forcing. */
-		fqs_state = RCU_SAVE_DYNTICK;
+		first_gp_fqs = true;
 		j = jiffies_till_first_fqs;
 		if (j > HZ) {
 			j = HZ;
@@ -2101,7 +2098,8 @@ static int __noreturn rcu_gp_kthread(void *arg)
 				trace_rcu_grace_period(rsp->name,
 						       READ_ONCE(rsp->gpnum),
 						       TPS("fqsstart"));
-				fqs_state = rcu_gp_fqs(rsp, fqs_state);
+				rcu_gp_fqs(rsp, first_gp_fqs);
+				first_gp_fqs = false;
 				trace_rcu_grace_period(rsp->name,
 						       READ_ONCE(rsp->gpnum),
 						       TPS("fqsend"));
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2e991f8361e4..de370b611837 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -412,13 +412,6 @@ struct rcu_data {
 	struct rcu_state *rsp;
 };
 
-/* Values for fqs_state field in struct rcu_state. */
-#define RCU_GP_IDLE		0	/* No grace period in progress. */
-#define RCU_GP_INIT		1	/* Grace period being initialized. */
-#define RCU_SAVE_DYNTICK	2	/* Need to scan dyntick state. */
-#define RCU_FORCE_QS		3	/* Need to force quiescent state. */
-#define RCU_SIGNAL_INIT		RCU_SAVE_DYNTICK
-
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOGP_WAKE_NOT	0
 #define RCU_NOGP_WAKE		1
@@ -469,9 +462,8 @@ struct rcu_state {
 
 	/* The following fields are guarded by the root rcu_node's lock. */
 
-	u8	fqs_state ____cacheline_internodealigned_in_smp;
-						/* Force QS state. */
-	u8	boost;				/* Subject to priority boost. */
+	u8	boost ____cacheline_internodealigned_in_smp;
+						/* Subject to priority boost. */
 	unsigned long gpnum;			/* Current gp number. */
 	unsigned long completed;		/* # of last completed gp. */
 	struct task_struct *gp_kthread;		/* Task for grace periods. */
@@ -539,7 +531,7 @@ struct rcu_state {
 #define RCU_GP_FLAG_FQS  0x2	/* Need grace-period quiescent-state forcing. */
 
 /* Values for rcu_state structure's gp_flags field. */
-#define RCU_GP_WAIT_INIT 0	/* Initial state. */
+#define RCU_GP_IDLE	 0	/* Initial state and no GP in progress. */
 #define RCU_GP_WAIT_GPS  1	/* Wait for grace-period start. */
 #define RCU_GP_DONE_GPS  2	/* Wait done for grace-period start. */
 #define RCU_GP_WAIT_FQS  3	/* Wait for force-quiescent-state time. */
diff --git a/kernel/rcu/tree_trace.c b/kernel/rcu/tree_trace.c
index 6fc4c5ff3bb5..1d61f5ba4641 100644
--- a/kernel/rcu/tree_trace.c
+++ b/kernel/rcu/tree_trace.c
@@ -268,7 +268,7 @@ static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp)
 	gpnum = rsp->gpnum;
 	seq_printf(m, "c=%ld g=%ld s=%d jfq=%ld j=%x ",
 		   ulong2long(rsp->completed), ulong2long(gpnum),
-		   rsp->fqs_state,
+		   rsp->gp_state,
 		   (long)(rsp->jiffies_force_qs - jiffies),
 		   (int)(jiffies & 0xffff));
 	seq_printf(m, "nfqs=%lu/nfqsng=%lu(%lu) fqlh=%lu oqlen=%ld/%ld\n",
-- 
1.8.5.6



* [RFC v2 14/18] rcu: Store first_gp_fqs into struct rcu_state
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (12 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 13/18] rcu: Finish folding ->fqs_state into ->gp_state Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 15/18] rcu: Clean up timeouts for forcing the quiescent state Petr Mladek
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

We are going to try to split the rcu kthread into a few kthread works.
We will not stay in the function all the time, so the "first_gp_fqs"
variable will no longer preserve the state across invocations. Let's
store it in struct rcu_state.

Note that this change is needed only when the split into more
kthread works is accepted.

The patch does not change the existing behavior.
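
The general point is that once a kthread loop is split into kthread
works, automatic variables no longer survive between invocations, so
per-grace-period state has to live in a shared structure. A minimal
sketch with hypothetical names (only first_gp_fqs mirrors the patch):

#include <linux/kthread.h>

struct gp_ctx {
	struct kthread_work work;
	bool first_gp_fqs;	/* was a local variable in the kthread loop */
};

static void gp_work_func(struct kthread_work *work)
{
	struct gp_ctx *ctx = container_of(work, struct gp_ctx, work);

	if (ctx->first_gp_fqs) {
		/* ... collect dyntick-idle snapshots ... */
		ctx->first_gp_fqs = false;
	} else {
		/* ... handle dyntick-idle and offline CPUs ... */
	}
}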

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/rcu/tree.c | 11 +++++------
 kernel/rcu/tree.h |  2 ++
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5413d87a67c6..5a3e70a21df8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1927,7 +1927,7 @@ static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp, int *gfp)
 /*
  * Do one round of quiescent-state forcing.
  */
-static void rcu_gp_fqs(struct rcu_state *rsp, bool first_time)
+static void rcu_gp_fqs(struct rcu_state *rsp)
 {
 	bool isidle = false;
 	unsigned long maxj;
@@ -1935,7 +1935,7 @@ static void rcu_gp_fqs(struct rcu_state *rsp, bool first_time)
 
 	WRITE_ONCE(rsp->gp_activity, jiffies);
 	rsp->n_force_qs++;
-	if (first_time) {
+	if (rsp->first_gp_fqs) {
 		/* Collect dyntick-idle snapshots. */
 		if (is_sysidle_rcu_state(rsp)) {
 			isidle = true;
@@ -1944,6 +1944,7 @@ static void rcu_gp_fqs(struct rcu_state *rsp, bool first_time)
 		force_qs_rnp(rsp, dyntick_save_progress_counter,
 			     &isidle, &maxj);
 		rcu_sysidle_report_gp(rsp, isidle, maxj);
+		rsp->first_gp_fqs = false;
 	} else {
 		/* Handle dyntick-idle and offline CPUs. */
 		isidle = true;
@@ -2038,7 +2039,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
  */
 static int __noreturn rcu_gp_kthread(void *arg)
 {
-	bool first_gp_fqs;
 	int gf;
 	unsigned long j;
 	int ret;
@@ -2070,7 +2070,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 		}
 
 		/* Handle quiescent-state forcing. */
-		first_gp_fqs = true;
+		rsp->first_gp_fqs = true;
 		j = jiffies_till_first_fqs;
 		if (j > HZ) {
 			j = HZ;
@@ -2098,8 +2098,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 				trace_rcu_grace_period(rsp->name,
 						       READ_ONCE(rsp->gpnum),
 						       TPS("fqsstart"));
-				rcu_gp_fqs(rsp, first_gp_fqs);
-				first_gp_fqs = false;
+				rcu_gp_fqs(rsp);
 				trace_rcu_grace_period(rsp->name,
 						       READ_ONCE(rsp->gpnum),
 						       TPS("fqsend"));
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index de370b611837..f16578a5eefe 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -470,6 +470,8 @@ struct rcu_state {
 	wait_queue_head_t gp_wq;		/* Where GP task waits. */
 	short gp_flags;				/* Commands for GP task. */
 	short gp_state;				/* GP kthread sleep state. */
+	bool first_gp_fqs;			/* Do we force QS for */
+						/* the first time? */
 
 	/* End of fields guarded by root rcu_node's lock. */
 
-- 
1.8.5.6



* [RFC v2 15/18] rcu: Clean up timeouts for forcing the quiescent state
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (13 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 14/18] rcu: Store first_gp_fqs into struct rcu_state Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 16/18] rcu: Check actual RCU_GP_FLAG_FQS when handling " Petr Mladek
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

This patch does some code refactoring that will help us split the rcu
kthread into more kthread works. It fixes a rather theoretical and
innocent race. Otherwise, the changes should not be visible to the user.

First, it moves the code that limits the maximal timeout into separate
functions, see normalize_jiffies*().

Commit 88d6df612cc3c99f5 ("rcu: Prevent spurious-wakeup DoS attack
on rcu_gp_kthread()") suggests that a spurious wakeup is possible.
In this case, the thread continues waiting and
wait_event_interruptible_timeout() should be called with the remaining
timeout. It is now computed in the new variable "timeout".

wait_event_interruptible_timeout() returns "1" when the condition becomes
true after the timeout has elapsed. This might happen when there is a race
between fulfilling the condition and the wakeup. Therefore, it is cleaner
to update "rsp->jiffies_force_qs" when QS is forced and not rely on
the "ret" value.

Finally, the patch moves cond_resched_rcu_qs() to a single place.
It changes the order of the check for a pending signal. But there should
never be a pending signal. If there were, we would have bigger problems
because wait_event() would never sleep again until someone flushed
the signal.
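
A minimal sketch of the timeout bookkeeping, with a hypothetical wait
queue and condition (the patch itself uses rsp->gp_wq, ULONG_CMP_GE()
and rsp->jiffies_force_qs): the deadline is stored once when forcing is
(re)armed, and after any wakeup the remaining time is recomputed from
it, so a spurious wakeup simply continues the same wait.

#include <linux/wait.h>
#include <linux/jiffies.h>

static DECLARE_WAIT_QUEUE_HEAD(fqs_wq);

static bool fqs_needed(void);		/* hypothetical condition */
static void do_force_qs(void);		/* hypothetical FQS handler */

static void fqs_wait_loop(unsigned long timeout)
{
	unsigned long deadline = jiffies + timeout;

	for (;;) {
		wait_event_interruptible_timeout(fqs_wq, fqs_needed(),
						 timeout);

		/* Work is due: the condition holds or the deadline passed. */
		if (fqs_needed() || time_after_eq(jiffies, deadline)) {
			do_force_qs();
			return;
		}

		/* Spurious wakeup: continue the same wait. */
		timeout = deadline - jiffies;
	}
}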

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/rcu/tree.c | 77 ++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5a3e70a21df8..286e300794f0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2035,13 +2035,45 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 }
 
 /*
+ * Normalize, update, and return the first timeout.
+ */
+static unsigned long normalize_jiffies_till_first_fqs(void)
+{
+	unsigned long j = jiffies_till_first_fqs;
+
+	if (unlikely(j > HZ)) {
+		j = HZ;
+		jiffies_till_first_fqs = HZ;
+	}
+
+	return j;
+}
+
+/*
+ * Normalize, update, and return the next timeout.
+ */
+static unsigned long normalize_jiffies_till_next_fqs(void)
+{
+	unsigned long j = jiffies_till_next_fqs;
+
+	if (unlikely(j > HZ)) {
+		j = HZ;
+		jiffies_till_next_fqs = HZ;
+	} else if (unlikely(j < 1)) {
+		j = 1;
+		jiffies_till_next_fqs = 1;
+	}
+
+	return j;
+}
+
+/*
  * Body of kthread that handles grace periods.
  */
 static int __noreturn rcu_gp_kthread(void *arg)
 {
 	int gf;
-	unsigned long j;
-	int ret;
+	unsigned long timeout, j;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
@@ -2071,22 +2103,18 @@ static int __noreturn rcu_gp_kthread(void *arg)
 
 		/* Handle quiescent-state forcing. */
 		rsp->first_gp_fqs = true;
-		j = jiffies_till_first_fqs;
-		if (j > HZ) {
-			j = HZ;
-			jiffies_till_first_fqs = HZ;
-		}
-		ret = 0;
+		timeout = normalize_jiffies_till_first_fqs();
+		rsp->jiffies_force_qs = jiffies + timeout;
 		for (;;) {
-			if (!ret)
-				rsp->jiffies_force_qs = jiffies + j;
 			trace_rcu_grace_period(rsp->name,
 					       READ_ONCE(rsp->gpnum),
 					       TPS("fqswait"));
 			rsp->gp_state = RCU_GP_WAIT_FQS;
-			ret = wait_event_interruptible_timeout(rsp->gp_wq,
-					rcu_gp_fqs_check_wake(rsp, &gf), j);
+			wait_event_interruptible_timeout(rsp->gp_wq,
+					rcu_gp_fqs_check_wake(rsp, &gf),
+					timeout);
 			rsp->gp_state = RCU_GP_DOING_FQS;
+try_again:
 			/* Locking provides needed memory barriers. */
 			/* If grace period done, leave loop. */
 			if (!READ_ONCE(rnp->qsmask) &&
@@ -2099,28 +2127,29 @@ static int __noreturn rcu_gp_kthread(void *arg)
 						       READ_ONCE(rsp->gpnum),
 						       TPS("fqsstart"));
 				rcu_gp_fqs(rsp);
+				timeout = normalize_jiffies_till_next_fqs();
+				rsp->jiffies_force_qs = jiffies + timeout;
 				trace_rcu_grace_period(rsp->name,
 						       READ_ONCE(rsp->gpnum),
 						       TPS("fqsend"));
-				cond_resched_rcu_qs();
-				WRITE_ONCE(rsp->gp_activity, jiffies);
 			} else {
 				/* Deal with stray signal. */
-				cond_resched_rcu_qs();
-				WRITE_ONCE(rsp->gp_activity, jiffies);
 				WARN_ON(signal_pending(current));
 				trace_rcu_grace_period(rsp->name,
 						       READ_ONCE(rsp->gpnum),
 						       TPS("fqswaitsig"));
 			}
-			j = jiffies_till_next_fqs;
-			if (j > HZ) {
-				j = HZ;
-				jiffies_till_next_fqs = HZ;
-			} else if (j < 1) {
-				j = 1;
-				jiffies_till_next_fqs = 1;
-			}
+			cond_resched_rcu_qs();
+			WRITE_ONCE(rsp->gp_activity, jiffies);
+			/*
+			 * Count the remaining timeout when it was a spurious
+			 * wakeup. Well, it is useful also when we have slept
+			 * in the cond_resched().
+			 */
+			j = jiffies;
+			if (ULONG_CMP_GE(j, rsp->jiffies_force_qs))
+				goto try_again;
+			timeout = rsp->jiffies_force_qs - j;
 		}
 
 		/* Handle grace-period end. */
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v2 16/18] rcu: Check actual RCU_GP_FLAG_FQS when handling quiescent state
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (14 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 15/18] rcu: Clean up timeouts for forcing the quiescent state Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-21 13:03 ` [RFC v2 17/18] rcu: Convert RCU gp kthreads into kthread worker API Petr Mladek
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

This change will help a lot if we want to split the RCU kthreads into
a few kthread works.

The variable "qf" was added by the commit 88d6df612cc3c99f56 ("rcu: Prevent
spurious-wakeup DoS attack on rcu_gp_kthread()"). IMHO, the primary fix
is the "ret" handling.

If I read the code correctly, RCU_GP_FLAG_FQS should not get lost
at this stage. rsp->gp_flags is written in these functions:

  + rcu_gp_init()
  + rcu_gp_cleanup()
    + Both are safe. They are called from other parts of the kthread
      that cannot be reached while "gf" is used.

  + rcu_gp_fqs()
    + Safe. It is the function that we call when the flag is set.

  + rcu_report_qs_rsp()
  + force_quiescent_state()
    + Both should be safe. They set RCU_GP_FLAG_FQS and therefore they
      cannot cause the rcu_gp_fqs() call to be lost.

  + rcu_start_gp_advanced()
    + This should be safe. It is called only when the last grace period
      was completed and we are opening a new one. Therefore it should
      not happen when "gf" is set and tested because the grace period
      is closed by the kthread.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/rcu/tree.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 286e300794f0..08d1d3e63b9b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1908,13 +1908,12 @@ static int rcu_gp_init(struct rcu_state *rsp)
  * Helper function for wait_event_interruptible_timeout() wakeup
  * at force-quiescent-state time.
  */
-static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp, int *gfp)
+static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp)
 {
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	/* Someone like call_rcu() requested a force-quiescent-state scan. */
-	*gfp = READ_ONCE(rsp->gp_flags);
-	if (*gfp & RCU_GP_FLAG_FQS)
+	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
 		return true;
 
 	/* The current grace period has completed. */
@@ -2072,7 +2071,6 @@ static unsigned long normalize_jiffies_till_next_fqs(void)
  */
 static int __noreturn rcu_gp_kthread(void *arg)
 {
-	int gf;
 	unsigned long timeout, j;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
@@ -2111,7 +2109,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 					       TPS("fqswait"));
 			rsp->gp_state = RCU_GP_WAIT_FQS;
 			wait_event_interruptible_timeout(rsp->gp_wq,
-					rcu_gp_fqs_check_wake(rsp, &gf),
+					rcu_gp_fqs_check_wake(rsp),
 					timeout);
 			rsp->gp_state = RCU_GP_DOING_FQS;
 try_again:
@@ -2122,7 +2120,7 @@ try_again:
 				break;
 			/* If time for quiescent-state forcing, do it. */
 			if (ULONG_CMP_GE(jiffies, rsp->jiffies_force_qs) ||
-			    (gf & RCU_GP_FLAG_FQS)) {
+			    (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)) {
 				trace_rcu_grace_period(rsp->name,
 						       READ_ONCE(rsp->gpnum),
 						       TPS("fqsstart"));
-- 
1.8.5.6



* [RFC v2 17/18] rcu: Convert RCU gp kthreads into kthread worker API
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (15 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 16/18] rcu: Check actual RCU_GP_FLAG_FQS when handling " Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-28 17:14   ` Paul E. McKenney
  2015-09-21 13:03 ` [RFC v2 18/18] kthread: Better support freezable kthread workers Petr Mladek
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

Kthreads are currently implemented as an infinite loop. Each
has its own variant of checks for terminating, freezing, and
awakening. In many cases it is hard to say which state a kthread
is in, and sometimes it is done the wrong way.

The plan is to convert kthreads to the kthread_worker or workqueue
API. It allows the functionality to be split into separate operations
and gives the code a better structure. It also defines a clean state
where no locks are held and no IRQs are blocked, so the kthread might
sleep or even be safely migrated.

The kthread worker API is useful when we want a dedicated single
kthread for the work. It helps to make sure that the kthread is
available when needed. It also allows better control, e.g. setting
a scheduling priority.

This patch converts the RCU gp kthreads to the kthread worker API.
They modify the scheduling and have their own logic to bind the process.
They provide functions that are critical for the system to work
and thus deserve a dedicated kthread.

This patch tries to split the start of the grace period and the
quiescent-state handling into separate works. The motivation is to avoid
wait_events inside the work. Instead, it queues the works when
appropriate, which is more typical for this API.

On one hand, it should reduce spurious wakeups where the condition
in the wait_event failed and the kthread went to sleep again.

On the other hand, there is a small race window where the other
work might get queued. We can detect and fix this situation
at the beginning of the work, but it is a bit ugly.

The patch renames rcu_gp_kthread_wake() to rcu_gp_kthread_worker_poke(),
which sounds more appropriate.

Otherwise, the logic should stay the same. I did a lot of torture testing
and did not see any problem with the current patch. But of course,
it would deserve much more testing and reviewing before being applied.
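
A condensed model of the handover with simplified, hypothetical names
(the real code below works on struct rcu_state and the rcu_gp_*_func()
works): a stale poke may queue the wrong work, so each work re-checks
the state on entry and, if it lost the race, queues its sibling and
returns.

#include <linux/kthread.h>

struct gp_ctx {
	struct kthread_worker *worker;
	struct kthread_work start_work;
	struct delayed_kthread_work qs_work;
	/* ... grace-period state ... */
};

static bool gp_in_progress(struct gp_ctx *ctx);	/* hypothetical check */

static void gp_start_func(struct kthread_work *work)
{
	struct gp_ctx *ctx = container_of(work, struct gp_ctx, start_work);

	if (gp_in_progress(ctx)) {
		/* Queued by a stale poke: let the QS work take over. */
		mod_delayed_kthread_work(ctx->worker, &ctx->qs_work, 0);
		return;
	}

	/* ... initialize the new grace period, arm the FQS timeout ... */
}

static void gp_handle_qs_func(struct kthread_work *work)
{
	struct gp_ctx *ctx = container_of(work, struct gp_ctx, qs_work.work);

	if (!gp_in_progress(ctx)) {
		/* Queued by a stale poke: a new GP must be started first. */
		queue_kthread_work(ctx->worker, &ctx->start_work);
		return;
	}

	/* ... force the quiescent state or clean up the grace period ... */
}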

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/rcu/tree.c        | 349 ++++++++++++++++++++++++++++++-----------------
 kernel/rcu/tree.h        |   8 +-
 kernel/rcu/tree_plugin.h |  16 +--
 3 files changed, 237 insertions(+), 136 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 08d1d3e63b9b..e115c3aee65d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -482,7 +482,7 @@ void show_rcu_gp_kthreads(void)
 
 	for_each_rcu_flavor(rsp) {
 		pr_info("%s: wait state: %d ->state: %#lx\n",
-			rsp->name, rsp->gp_state, rsp->gp_kthread->state);
+			rsp->name, rsp->gp_state, rsp->gp_worker->task->state);
 		/* sched_show_task(rsp->gp_kthread); */
 	}
 }
@@ -1179,7 +1179,7 @@ static void rcu_check_gp_kthread_starvation(struct rcu_state *rsp)
 		       rsp->name, j - gpa,
 		       rsp->gpnum, rsp->completed,
 		       rsp->gp_flags, rsp->gp_state,
-		       rsp->gp_kthread ? rsp->gp_kthread->state : 0);
+		       rsp->gp_worker ? rsp->gp_worker->task->state : 0);
 }
 
 /*
@@ -1577,19 +1577,66 @@ static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
 }
 
 /*
- * Awaken the grace-period kthread for the specified flavor of RCU.
- * Don't do a self-awaken, and don't bother awakening when there is
- * nothing for the grace-period kthread to do (as in several CPUs
- * raced to awaken, and we lost), and finally don't try to awaken
- * a kthread that has not yet been created.
+ * Check if it makes sense to queue the kthread work that would
+ * start a new grace period.
  */
-static void rcu_gp_kthread_wake(struct rcu_state *rsp)
+static bool rcu_gp_start_ready(struct rcu_state *rsp)
 {
-	if (current == rsp->gp_kthread ||
-	    !READ_ONCE(rsp->gp_flags) ||
-	    !rsp->gp_kthread)
+	/* Someone like call_rcu() requested a new grace period. */
+	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT)
+		return true;
+
+	return false;
+
+}
+
+/*
+ * Check if it makes sense to immediately queue the kthread work
+ * that would handle quiescent state.
+ *
+ * It does not check the timeout for forcing the quiescent state
+ * because the delayed kthread work should be scheduled at this
+ * time.
+ */
+static bool rcu_gp_handle_qs_ready(struct rcu_state *rsp)
+{
+	struct rcu_node *rnp = rcu_get_root(rsp);
+
+	/* Someone like call_rcu() requested a force-quiescent-state scan. */
+	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
+		return true;
+
+	/* The current grace period has completed. */
+	if (!READ_ONCE(rnp->qsmask) && !rcu_preempt_blocked_readers_cgp(rnp))
+		return true;
+
+	return false;
+}
+
+/*
+ * Poke the kthread worker that handles grace periods for the specified
+ * flavor of RCU. Return when there is nothing for the grace-period kthread
+ * worker to do (as in several CPUs raced to awaken, and we lost). Also
+ * don't try to use the kthread worker that has not been created yet.
+ * Finally, ignore requests from the kthread servicing the worker itself.
+ */
+static void rcu_gp_kthread_worker_poke(struct rcu_state *rsp)
+{
+	if (!READ_ONCE(rsp->gp_flags) ||
+	    !rsp->gp_worker ||
+	    rsp->gp_worker->task == current)
 		return;
-	wake_up(&rsp->gp_wq);
+
+	if (!rcu_gp_in_progress(rsp)) {
+		if (rcu_gp_start_ready(rsp))
+			queue_kthread_work(rsp->gp_worker, &rsp->gp_start_work);
+		return;
+	}
+
+	if (rcu_gp_handle_qs_ready(rsp))
+		mod_delayed_kthread_work(rsp->gp_worker,
+					 &rsp->gp_handle_qs_work,
+					 0);
 }
 
 /*
@@ -1756,7 +1803,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
 static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
 {
 	unsigned long flags;
-	bool needwake;
+	bool needpoke;
 	struct rcu_node *rnp;
 
 	local_irq_save(flags);
@@ -1769,10 +1816,10 @@ static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
 		return;
 	}
 	smp_mb__after_unlock_lock();
-	needwake = __note_gp_changes(rsp, rnp, rdp);
+	needpoke = __note_gp_changes(rsp, rnp, rdp);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
-	if (needwake)
-		rcu_gp_kthread_wake(rsp);
+	if (needpoke)
+		rcu_gp_kthread_worker_poke(rsp);
 }
 
 static void rcu_gp_slow(struct rcu_state *rsp, int delay)
@@ -1905,25 +1952,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
 }
 
 /*
- * Helper function for wait_event_interruptible_timeout() wakeup
- * at force-quiescent-state time.
- */
-static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp)
-{
-	struct rcu_node *rnp = rcu_get_root(rsp);
-
-	/* Someone like call_rcu() requested a force-quiescent-state scan. */
-	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
-		return true;
-
-	/* The current grace period has completed. */
-	if (!READ_ONCE(rnp->qsmask) && !rcu_preempt_blocked_readers_cgp(rnp))
-		return true;
-
-	return false;
-}
-
-/*
  * Do one round of quiescent-state forcing.
  */
 static void rcu_gp_fqs(struct rcu_state *rsp)
@@ -2067,94 +2095,157 @@ static unsigned long normalize_jiffies_till_next_fqs(void)
 }
 
 /*
- * Body of kthread that handles grace periods.
+ * Initialize kthread worker for handling grace periods.
  */
-static int __noreturn rcu_gp_kthread(void *arg)
+static void rcu_gp_init_func(struct kthread_work *work)
 {
-	unsigned long timeout, j;
-	struct rcu_state *rsp = arg;
-	struct rcu_node *rnp = rcu_get_root(rsp);
+	struct rcu_state *rsp = container_of(work, struct rcu_state,
+					     gp_init_work);
 
 	rcu_bind_gp_kthread();
-	for (;;) {
 
-		/* Handle grace-period start. */
-		for (;;) {
-			trace_rcu_grace_period(rsp->name,
-					       READ_ONCE(rsp->gpnum),
-					       TPS("reqwait"));
-			rsp->gp_state = RCU_GP_WAIT_GPS;
-			wait_event_interruptible(rsp->gp_wq,
-						 READ_ONCE(rsp->gp_flags) &
-						 RCU_GP_FLAG_INIT);
-			rsp->gp_state = RCU_GP_DONE_GPS;
-			/* Locking provides needed memory barrier. */
-			if (rcu_gp_init(rsp))
-				break;
-			cond_resched_rcu_qs();
-			WRITE_ONCE(rsp->gp_activity, jiffies);
-			WARN_ON(signal_pending(current));
-			trace_rcu_grace_period(rsp->name,
-					       READ_ONCE(rsp->gpnum),
-					       TPS("reqwaitsig"));
+	trace_rcu_grace_period(rsp->name,
+			       READ_ONCE(rsp->gpnum),
+			       TPS("reqwait"));
+	rsp->gp_state = RCU_GP_WAIT_GPS;
+}
+
+/*
+ * Function for RCU kthread work that starts a new grace period.
+ */
+static void rcu_gp_start_func(struct kthread_work *work)
+{
+	unsigned long timeout;
+	struct rcu_state *rsp = container_of(work, struct rcu_state,
+					     gp_start_work);
+
+	/*
+	 * There is a small race window in rcu_gp_kthread_worker_poke().
+	 * Check if the grace period has already started and the quiescent
+	 * state should get handled instead.
+	 */
+	if (rcu_gp_in_progress(rsp)) {
+		if (rcu_gp_handle_qs_ready(rsp)) {
+			mod_delayed_kthread_work(rsp->gp_worker,
+						 &rsp->gp_handle_qs_work,
+						 0);
 		}
+		return;
+	}
 
+	rsp->gp_state = RCU_GP_DONE_GPS;
+	if (rcu_gp_init(rsp)) {
 		/* Handle quiescent-state forcing. */
 		rsp->first_gp_fqs = true;
 		timeout = normalize_jiffies_till_first_fqs();
 		rsp->jiffies_force_qs = jiffies + timeout;
-		for (;;) {
-			trace_rcu_grace_period(rsp->name,
-					       READ_ONCE(rsp->gpnum),
-					       TPS("fqswait"));
-			rsp->gp_state = RCU_GP_WAIT_FQS;
-			wait_event_interruptible_timeout(rsp->gp_wq,
-					rcu_gp_fqs_check_wake(rsp),
-					timeout);
-			rsp->gp_state = RCU_GP_DOING_FQS;
-try_again:
-			/* Locking provides needed memory barriers. */
-			/* If grace period done, leave loop. */
-			if (!READ_ONCE(rnp->qsmask) &&
-			    !rcu_preempt_blocked_readers_cgp(rnp))
-				break;
-			/* If time for quiescent-state forcing, do it. */
-			if (ULONG_CMP_GE(jiffies, rsp->jiffies_force_qs) ||
-			    (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)) {
-				trace_rcu_grace_period(rsp->name,
-						       READ_ONCE(rsp->gpnum),
-						       TPS("fqsstart"));
-				rcu_gp_fqs(rsp);
-				timeout = normalize_jiffies_till_next_fqs();
-				rsp->jiffies_force_qs = jiffies + timeout;
-				trace_rcu_grace_period(rsp->name,
-						       READ_ONCE(rsp->gpnum),
-						       TPS("fqsend"));
-			} else {
-				/* Deal with stray signal. */
-				WARN_ON(signal_pending(current));
-				trace_rcu_grace_period(rsp->name,
-						       READ_ONCE(rsp->gpnum),
-						       TPS("fqswaitsig"));
-			}
-			cond_resched_rcu_qs();
-			WRITE_ONCE(rsp->gp_activity, jiffies);
-			/*
-			 * Count the remaining timeout when it was a spurious
-			 * wakeup. Well, it is useful also when we have slept
-			 * in the cond_resched().
-			 */
-			j = jiffies;
-			if (ULONG_CMP_GE(j, rsp->jiffies_force_qs))
-				goto try_again;
-			timeout = rsp->jiffies_force_qs - j;
-		}
+		trace_rcu_grace_period(rsp->name,
+				       READ_ONCE(rsp->gpnum),
+				       TPS("fqswait"));
+		rsp->gp_state = RCU_GP_WAIT_FQS;
+		queue_delayed_kthread_work(rsp->gp_worker,
+					   &rsp->gp_handle_qs_work,
+					   timeout);
+		return;
+	}
+
+	cond_resched_rcu_qs();
+	WRITE_ONCE(rsp->gp_activity, jiffies);
+	WARN_ON(signal_pending(current));
+	trace_rcu_grace_period(rsp->name,
+			       READ_ONCE(rsp->gpnum),
+			       TPS("reqwaitsig"));
+	trace_rcu_grace_period(rsp->name,
+			       READ_ONCE(rsp->gpnum),
+			       TPS("reqwait"));
+}
+
+/*
+ * Function for RCU kthread work that handles a quiescent state
+ * and closes the related grace period.
+ */
+static void rcu_gp_handle_qs_func(struct kthread_work *work)
+{
+	unsigned long timeout, j;
+	struct rcu_state *rsp = container_of(work, struct rcu_state,
+					     gp_handle_qs_work.work);
+	struct rcu_node *rnp = rcu_get_root(rsp);
+
 
+	/*
+	 * There is a small race window in rcu_gp_kthread_worker_poke()
+	 * when the work might be queued more times. First, check if
+	 * we are already waiting for the GP start instead.
+	 */
+	if (!rcu_gp_in_progress(rsp)) {
+		if (rcu_gp_start_ready(rsp))
+			queue_kthread_work(rsp->gp_worker, &rsp->gp_start_work);
+		return;
+	}
+
+	/*
+	 * Second, we might have been queued more times to force QS.
+	 * Just continue waiting if we have already forced it.
+	 */
+	if (!rcu_gp_handle_qs_ready(rsp) &&
+	    ULONG_CMP_LT(jiffies, rsp->jiffies_force_qs))
+		goto wait_continue;
+
+	rsp->gp_state = RCU_GP_DOING_FQS;
+try_again:
+	/* Locking provides needed memory barriers. */
+	/* If grace period done, leave loop. */
+	if (!READ_ONCE(rnp->qsmask) &&
+	    !rcu_preempt_blocked_readers_cgp(rnp)) {
 		/* Handle grace-period end. */
 		rsp->gp_state = RCU_GP_CLEANUP;
 		rcu_gp_cleanup(rsp);
 		rsp->gp_state = RCU_GP_CLEANED;
+		trace_rcu_grace_period(rsp->name,
+				       READ_ONCE(rsp->gpnum),
+				       TPS("reqwait"));
+		rsp->gp_state = RCU_GP_WAIT_GPS;
+		return;
+	}
+
+	/* If time for quiescent-state forcing, do it. */
+	if (ULONG_CMP_GE(jiffies, rsp->jiffies_force_qs) ||
+	    (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)) {
+		trace_rcu_grace_period(rsp->name,
+				       READ_ONCE(rsp->gpnum),
+				       TPS("fqsstart"));
+		rcu_gp_fqs(rsp);
+		timeout = normalize_jiffies_till_next_fqs();
+		rsp->jiffies_force_qs = jiffies + timeout;
+		trace_rcu_grace_period(rsp->name,
+				       READ_ONCE(rsp->gpnum),
+				       TPS("fqsend"));
+	} else {
+		/* Deal with stray signal. */
+		WARN_ON(signal_pending(current));
+		trace_rcu_grace_period(rsp->name,
+				       READ_ONCE(rsp->gpnum),
+				       TPS("fqswaitsig"));
 	}
+wait_continue:
+	cond_resched_rcu_qs();
+	WRITE_ONCE(rsp->gp_activity, jiffies);
+	/*
+	 * Count the remaining timeout when it was a spurious
+	 * wakeup. Well, it is useful also when we have slept
+	 * in the cond_resched().
+	 */
+	j = jiffies;
+	if (ULONG_CMP_GE(j, rsp->jiffies_force_qs))
+		goto try_again;
+	timeout = rsp->jiffies_force_qs - j;
+
+	trace_rcu_grace_period(rsp->name,
+			       READ_ONCE(rsp->gpnum),
+			       TPS("fqswait"));
+	rsp->gp_state = RCU_GP_WAIT_FQS;
+	queue_delayed_kthread_work(rsp->gp_worker, &rsp->gp_handle_qs_work,
+				   timeout);
 }
 
 /*
@@ -2172,7 +2263,7 @@ static bool
 rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
 		      struct rcu_data *rdp)
 {
-	if (!rsp->gp_kthread || !cpu_needs_another_gp(rsp, rdp)) {
+	if (!rsp->gp_worker || !cpu_needs_another_gp(rsp, rdp)) {
 		/*
 		 * Either we have not yet spawned the grace-period
 		 * task, this CPU does not need another grace period,
@@ -2234,7 +2325,7 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
 	WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
 	WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
 	raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
-	rcu_gp_kthread_wake(rsp);
+	rcu_gp_kthread_worker_poke(rsp);
 }
 
 /*
@@ -2355,7 +2446,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
 {
 	unsigned long flags;
 	unsigned long mask;
-	bool needwake;
+	bool needpoke;
 	struct rcu_node *rnp;
 
 	rnp = rdp->mynode;
@@ -2387,12 +2478,12 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
 		 * This GP can't end until cpu checks in, so all of our
 		 * callbacks can be processed during the next GP.
 		 */
-		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+		needpoke = rcu_accelerate_cbs(rsp, rnp, rdp);
 
 		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gpnum, flags);
 		/* ^^^ Released rnp->lock */
-		if (needwake)
-			rcu_gp_kthread_wake(rsp);
+		if (needpoke)
+			rcu_gp_kthread_worker_poke(rsp);
 	}
 }
 
@@ -2895,7 +2986,7 @@ static void force_quiescent_state(struct rcu_state *rsp)
 	}
 	WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
 	raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
-	rcu_gp_kthread_wake(rsp);
+	rcu_gp_kthread_worker_poke(rsp);
 }
 
 /*
@@ -2907,7 +2998,7 @@ static void
 __rcu_process_callbacks(struct rcu_state *rsp)
 {
 	unsigned long flags;
-	bool needwake;
+	bool needpoke;
 	struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
 
 	WARN_ON_ONCE(rdp->beenonline == 0);
@@ -2919,10 +3010,10 @@ __rcu_process_callbacks(struct rcu_state *rsp)
 	local_irq_save(flags);
 	if (cpu_needs_another_gp(rsp, rdp)) {
 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
-		needwake = rcu_start_gp(rsp);
+		needpoke = rcu_start_gp(rsp);
 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
-		if (needwake)
-			rcu_gp_kthread_wake(rsp);
+		if (needpoke)
+			rcu_gp_kthread_worker_poke(rsp);
 	} else {
 		local_irq_restore(flags);
 	}
@@ -2980,7 +3071,7 @@ static void invoke_rcu_core(void)
 static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
 			    struct rcu_head *head, unsigned long flags)
 {
-	bool needwake;
+	bool needpoke;
 
 	/*
 	 * If called from an extended quiescent state, invoke the RCU
@@ -3011,10 +3102,10 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
 
 			raw_spin_lock(&rnp_root->lock);
 			smp_mb__after_unlock_lock();
-			needwake = rcu_start_gp(rsp);
+			needpoke = rcu_start_gp(rsp);
 			raw_spin_unlock(&rnp_root->lock);
-			if (needwake)
-				rcu_gp_kthread_wake(rsp);
+			if (needpoke)
+				rcu_gp_kthread_worker_poke(rsp);
 		} else {
 			/* Give the grace period a kick. */
 			rdp->blimit = LONG_MAX;
@@ -4044,7 +4135,7 @@ static int __init rcu_spawn_gp_kthread(void)
 	struct rcu_node *rnp;
 	struct rcu_state *rsp;
 	struct sched_param sp;
-	struct task_struct *t;
+	struct kthread_worker *w;
 
 	/* Force priority into range. */
 	if (IS_ENABLED(CONFIG_RCU_BOOST) && kthread_prio < 1)
@@ -4059,16 +4150,20 @@ static int __init rcu_spawn_gp_kthread(void)
 
 	rcu_scheduler_fully_active = 1;
 	for_each_rcu_flavor(rsp) {
-		t = kthread_create(rcu_gp_kthread, rsp, "%s", rsp->name);
-		BUG_ON(IS_ERR(t));
+		init_kthread_work(&rsp->gp_init_work, rcu_gp_init_func);
+		init_kthread_work(&rsp->gp_start_work, rcu_gp_start_func);
+		init_delayed_kthread_work(&rsp->gp_handle_qs_work,
+					  rcu_gp_handle_qs_func);
+		w = create_kthread_worker("%s", rsp->name);
+		BUG_ON(IS_ERR(w));
 		rnp = rcu_get_root(rsp);
 		raw_spin_lock_irqsave(&rnp->lock, flags);
-		rsp->gp_kthread = t;
+		rsp->gp_worker = w;
 		if (kthread_prio) {
 			sp.sched_priority = kthread_prio;
-			sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+			sched_setscheduler_nocheck(w->task, SCHED_FIFO, &sp);
 		}
-		wake_up_process(t);
+		queue_kthread_work(w, &rsp->gp_init_work);
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	}
 	rcu_spawn_nocb_kthreads();
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index f16578a5eefe..b9490e975dd7 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -25,6 +25,7 @@
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
+#include <linux/kthread.h>
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
 #include <linux/stop_machine.h>
@@ -466,7 +467,12 @@ struct rcu_state {
 						/* Subject to priority boost. */
 	unsigned long gpnum;			/* Current gp number. */
 	unsigned long completed;		/* # of last completed gp. */
-	struct task_struct *gp_kthread;		/* Task for grace periods. */
+	struct kthread_worker *gp_worker;	/* Worker for grace periods */
+	struct kthread_work gp_init_work;	/* Init work for handling gp */
+	struct kthread_work gp_start_work;	/* Work for starting gp */
+	struct delayed_kthread_work
+		gp_handle_qs_work;		/* Work for QS state handling */
+
 	wait_queue_head_t gp_wq;		/* Where GP task waits. */
 	short gp_flags;				/* Commands for GP task. */
 	short gp_state;				/* GP kthread sleep state. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b2bf3963a0ae..55ae68530b7a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1476,7 +1476,7 @@ int rcu_needs_cpu(u64 basemono, u64 *nextevt)
  */
 static void rcu_prepare_for_idle(void)
 {
-	bool needwake;
+	bool needpoke;
 	struct rcu_data *rdp;
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
 	struct rcu_node *rnp;
@@ -1528,10 +1528,10 @@ static void rcu_prepare_for_idle(void)
 		rnp = rdp->mynode;
 		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
 		smp_mb__after_unlock_lock();
-		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+		needpoke = rcu_accelerate_cbs(rsp, rnp, rdp);
 		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
-		if (needwake)
-			rcu_gp_kthread_wake(rsp);
+		if (needpoke)
+			rcu_gp_kthread_worker_poke(rsp);
 	}
 }
 
@@ -2020,15 +2020,15 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 	unsigned long c;
 	bool d;
 	unsigned long flags;
-	bool needwake;
+	bool needpoke;
 	struct rcu_node *rnp = rdp->mynode;
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
 	smp_mb__after_unlock_lock();
-	needwake = rcu_start_future_gp(rnp, rdp, &c);
+	needpoke = rcu_start_future_gp(rnp, rdp, &c);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
-	if (needwake)
-		rcu_gp_kthread_wake(rdp->rsp);
+	if (needpoke)
+		rcu_gp_kthread_worker_poke(rdp->rsp);
 
 	/*
 	 * Wait for the grace period.  Do so interruptibly to avoid messing
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v2 18/18] kthread: Better support freezable kthread workers
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (16 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 17/18] rcu: Convert RCU gp kthreads into kthread worker API Petr Mladek
@ 2015-09-21 13:03 ` Petr Mladek
  2015-09-22 20:32 ` [RFC v2 00/18] kthread: Use kthread worker API more widely Tejun Heo
  2015-09-30  5:08 ` Paul E. McKenney
  19 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-21 13:03 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel, Petr Mladek

This patch allows making a kthread worker freezable via a new @flags
parameter. It will allow some kthreads to avoid a dedicated init work.

It currently does not affect the operation of kthread_worker_fn()
but it might help with some optimizations or fixes eventually.

I currently do not know about any other use for the @flags
parameter but I believe that we will want more flags
in the future.
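
For illustration, a minimal usage sketch (the khugepaged conversion below
does exactly this; "my_worker" is just a placeholder name):

	struct kthread_worker *worker;

	/* pass 0 for a plain worker, or KTW_FREEZABLE to freeze on suspend */
	worker = create_kthread_worker(KTW_FREEZABLE, "my_worker");
	if (IS_ERR(worker))
		return PTR_ERR(worker);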

Finally, I hope that it will not cause confusion with the @flags member
in struct kthread. Well, I guess that we will want to rework the
basic kthreads implementation once all kthreads are converted into
kthread workers or workqueues. It is possible that we will merge
the two structures.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 include/linux/kthread.h              | 14 ++++++++++----
 kernel/kthread.c                     |  9 ++++++++-
 kernel/rcu/tree.c                    |  2 +-
 kernel/trace/ring_buffer_benchmark.c |  4 ++--
 mm/huge_memory.c                     |  4 ++--
 5 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 2110a55bd769..5f27013edd29 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -65,7 +65,12 @@ struct kthread_work;
 typedef void (*kthread_work_func_t)(struct kthread_work *work);
 void delayed_kthread_work_timer_fn(unsigned long __data);
 
+enum {
+	KTW_FREEZABLE		= 1 << 2,	/* freeze during suspend */
+};
+
 struct kthread_worker {
+	unsigned int		flags;
 	spinlock_t		lock;
 	struct list_head	work_list;
 	struct task_struct	*task;
@@ -164,12 +169,13 @@ extern void __init_kthread_worker(struct kthread_worker *worker,
 
 int kthread_worker_fn(void *worker_ptr);
 
-__printf(2, 3)
+__printf(3, 4)
 struct kthread_worker *
-create_kthread_worker_on_node(int node, const char namefmt[], ...);
+create_kthread_worker_on_node(unsigned int flags, int node,
+			      const char namefmt[], ...);
 
-#define create_kthread_worker(namefmt, arg...)				\
-	create_kthread_worker_on_node(-1, namefmt, ##arg)
+#define create_kthread_worker(flags, namefmt, arg...)			\
+	create_kthread_worker_on_node(flags, -1, namefmt, ##arg)
 
 bool queue_kthread_work(struct kthread_worker *worker,
 			struct kthread_work *work);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 27bf242064d1..3d726acb3103 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -552,6 +552,7 @@ void __init_kthread_worker(struct kthread_worker *worker,
 				const char *name,
 				struct lock_class_key *key)
 {
+	worker->flags = 0;
 	spin_lock_init(&worker->lock);
 	lockdep_set_class_and_name(&worker->lock, key, name);
 	INIT_LIST_HEAD(&worker->work_list);
@@ -585,6 +586,10 @@ int kthread_worker_fn(void *worker_ptr)
 	 */
 	WARN_ON(worker->task && worker->task != current);
 	worker->task = current;
+
+	if (worker->flags & KTW_FREEZABLE)
+		set_freezable();
+
 repeat:
 	set_current_state(TASK_INTERRUPTIBLE);	/* mb paired w/ kthread_stop */
 
@@ -631,7 +636,8 @@ EXPORT_SYMBOL_GPL(kthread_worker_fn);
  * when the worker was SIGKILLed.
  */
 struct kthread_worker *
-create_kthread_worker_on_node(int node, const char namefmt[], ...)
+create_kthread_worker_on_node(unsigned int flags, int node,
+			      const char namefmt[], ...)
 {
 	struct kthread_worker *worker;
 	struct task_struct *task;
@@ -651,6 +657,7 @@ create_kthread_worker_on_node(int node, const char namefmt[], ...)
 	if (IS_ERR(task))
 		goto fail_task;
 
+	worker->flags = flags;
 	worker->task = task;
 	wake_up_process(task);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e115c3aee65d..211a473e295b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4154,7 +4154,7 @@ static int __init rcu_spawn_gp_kthread(void)
 		init_kthread_work(&rsp->gp_start_work, rcu_gp_start_func);
 		init_delayed_kthread_work(&rsp->gp_handle_qs_work,
 					  rcu_gp_handle_qs_func);
-		w = create_kthread_worker("%s", rsp->name);
+		w = create_kthread_worker(0, "%s", rsp->name);
 		BUG_ON(IS_ERR(w));
 		rnp = rcu_get_root(rsp);
 		raw_spin_lock_irqsave(&rnp->lock, flags);
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 3f27ff6debd3..1613c40f636b 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -416,14 +416,14 @@ static int __init ring_buffer_benchmark_init(void)
 		return -ENOMEM;
 
 	if (!disable_reader) {
-		rb_consumer_worker = create_kthread_worker("rb_consumer");
+		rb_consumer_worker = create_kthread_worker(0, "rb_consumer");
 		if (IS_ERR(rb_consumer_worker)) {
 			ret = PTR_ERR(rb_consumer_worker);
 			goto out_fail;
 		}
 	}
 
-	rb_producer_worker = create_kthread_worker("rb_producer");
+	rb_producer_worker = create_kthread_worker(0, "rb_producer");
 	if (IS_ERR(rb_producer_worker)) {
 		ret = PTR_ERR(rb_producer_worker);
 		goto out_kill;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5030fe7b687..8651c822c3cd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -165,7 +165,8 @@ static int start_stop_khugepaged(void)
 		if (khugepaged_worker)
 			goto out;
 
-		khugepaged_worker = create_kthread_worker("khugepaged");
+		khugepaged_worker =
+			create_kthread_worker(KTW_FREEZABLE, "khugepaged");
 
 		if (unlikely(IS_ERR(khugepaged_worker))) {
 			pr_err("khugepaged: failed to create kthread worker\n");
@@ -2878,7 +2879,6 @@ breakouterloop_mmap_sem:
 
 static void khugepaged_init_func(struct kthread_work *dummy)
 {
-	set_freezable();
 	set_user_nice(current, MAX_NICE);
 }
 
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFC v2 02/18] kthread: Add create_kthread_worker*()
  2015-09-21 13:03 ` [RFC v2 02/18] kthread: Add create_kthread_worker*() Petr Mladek
@ 2015-09-22 18:20   ` Tejun Heo
  0 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2015-09-22 18:20 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

Hello, Petr.

On Mon, Sep 21, 2015 at 03:03:43PM +0200, Petr Mladek wrote:
> It enforces using kthread_worker_fn() for the main thread. But I doubt
> that there are any plans to create any alternative. In fact, I think
> that we do not want any alternative main thread because it would be
> hard to support consistency with the rest of the kthread worker API.

The original intention was to allow users to use a wrapper function
which sets up kthread attributes and then calls kthread_worker_fn().
That can be done by a work item too but is more cumbersome.  Just
wanted to note that.  Will keep reading on.
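
Roughly something like this (completely untested sketch with made-up
names, using the already existing API):

	static int my_worker_thread_fn(void *worker_ptr)
	{
		/* set up the kthread attributes before processing work */
		set_user_nice(current, MAX_NICE);
		set_freezable();

		return kthread_worker_fn(worker_ptr);
	}

	/* the creation site then spawns the wrapper instead */
	init_kthread_worker(&my_worker);
	task = kthread_run(my_worker_thread_fn, &my_worker, "my_worker");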

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 03/18] kthread: Add drain_kthread_worker()
  2015-09-21 13:03 ` [RFC v2 03/18] kthread: Add drain_kthread_worker() Petr Mladek
@ 2015-09-22 18:26   ` Tejun Heo
  0 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2015-09-22 18:26 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Mon, Sep 21, 2015 at 03:03:44PM +0200, Petr Mladek wrote:
> flush_kthread_worker() returns when the currently queued works are proceed.
								     ^
								   processed?

> But some other works might have been queued in the meantime.
...
> +/**
> + * drain_kthread_worker - drain a kthread worker
> + * @worker: worker to be drained
> + *
> + * Wait until there is none work queued for the given kthread worker.
                          ^
                          no

> + * Only currently running work on @worker can queue further work items
                                             ^^^^^^^^^
				 should be queueing is prolly more accurate

> + * on it.  @worker is flushed repeatedly until it becomes empty.
> + * The number of flushing is determined by the depth of chaining
> + * and should be relatively short.  Whine if it takes too long.
> + *
> + * The caller is responsible for blocking all existing works
> + * from an infinite re-queuing!
           
The caller is responsible for preventing the existing work items from
requeueing themselves indefinitely.

> + *
> + * Also the caller is responsible for blocking all the kthread
> + * worker users from queuing any new work. It is especially
> + * important if the queue has to stay empty once this function
> + * finishes.

The last sentence reads a bit weird to me.  New work items just aren't
allowed while draining.  It isn't "especially important" for certain
cases.  It's just buggy otherwise.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 04/18] kthread: Add destroy_kthread_worker()
  2015-09-21 13:03 ` [RFC v2 04/18] kthread: Add destroy_kthread_worker() Petr Mladek
@ 2015-09-22 18:30   ` Tejun Heo
  0 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2015-09-22 18:30 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

Hello,

On Mon, Sep 21, 2015 at 03:03:45PM +0200, Petr Mladek wrote:
...
> Note that flush() does not guarantee that the queue is empty. drain()
> is more safe. It returns when the queue is really empty. Also it warns

Maybe it'd be better to be a bit more specific.  drain() is safer
because it can handle self-requeueing work items.

> when too many work is being queued when draining.
...
> +/**
> + * destroy_kthread_worker - destroy a kthread worker
> + * @worker: worker to be destroyed
> + *
> + * Destroy @worker. It should be idle when this is called.

So, no new work item should be queued from this point on but @worker
is allowed to be not idle.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-09-21 13:03 ` [RFC v2 07/18] kthread: Allow to cancel " Petr Mladek
@ 2015-09-22 19:35   ` Tejun Heo
  2015-09-25 11:26     ` Petr Mladek
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2015-09-22 19:35 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Mon, Sep 21, 2015 at 03:03:48PM +0200, Petr Mladek wrote:
>  /**
> + * try_to_grab_pending_kthread_work - steal kthread work item from worklist,
> + *	and disable irq
> + * @work: work item to steal
> + * @is_dwork: @work is a delayed_work
> + * @flags: place to store irq state
> + *
> + * Try to grab PENDING bit of @work.  This function can handle @work in any
> + * stable state - idle, on timer or on worklist.
> + *
> + * Return:
> + *  1		if @work was pending and we successfully stole PENDING
> + *  0		if @work was idle and we claimed PENDING
> + *  -EAGAIN	if PENDING couldn't be grabbed at the moment, safe to busy-retry
> + *  -ENOENT	if someone else is canceling @work, this state may persist
> + *		for arbitrarily long
> + *
> + * Note:
> + * On >= 0 return, the caller owns @work's PENDING bit.  To avoid getting
> + * interrupted while holding PENDING and @work off queue, irq must be
> + * disabled on return.  This, combined with delayed_work->timer being
> + * irqsafe, ensures that we return -EAGAIN for finite short period of time.
> + *
> + * On successful return, >= 0, irq is disabled and the caller is
> + * responsible for releasing it using local_irq_restore(*@flags).
> + *
> + * This function is safe to call from any context including IRQ handler.
> + */

Ugh... I think this is way too much for kthread_worker.  Workqueue is
as complex as it is partly for historical reasons and partly because
it's used so widely and heavily.  kthread_worker is always guaranteed
to have a single worker and in most cases maybe several work items.
There's no reason to bring this level of complexity onto it.
Providing similar semantics is fine but it should be possible to do
this in a lot simpler way if the requirements on space and concurrency
are this much lower.

e.g. always embed timer_list in a work item and use per-worker
spinlock to synchronize access to both the work item and timer and use
per-work-item mutex to synchronize multiple cancelers.  Let's please
keep it simple.
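
Data-structure-wise, something like this hand-wavy sketch (field names
made up):

	struct kthread_work {
		struct list_head	node;
		kthread_work_func_t	func;
		struct kthread_worker	*worker;	/* set while queued */
		struct mutex		cancel_mutex;	/* serializes cancelers */
	};

	struct delayed_kthread_work {
		struct kthread_work	work;
		struct timer_list	timer;		/* always embedded */
	};

The existing worker->lock spinlock would then protect both the work item
and its timer, and cancel_kthread_work_sync() callers would serialize on
the per-work-item mutex and wait for a running work item by flushing.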

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 09/18] mm/huge_page: Convert khugepaged() into kthread worker API
  2015-09-21 13:03 ` [RFC v2 09/18] mm/huge_page: Convert khugepaged() into kthread worker API Petr Mladek
@ 2015-09-22 20:26   ` Tejun Heo
  2015-09-23  9:50     ` Petr Mladek
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2015-09-22 20:26 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

Hello,

On Mon, Sep 21, 2015 at 03:03:50PM +0200, Petr Mladek wrote:
> +static int khugepaged_has_work(void)
> +{
> +	return !list_empty(&khugepaged_scan.mm_head) &&
> +		khugepaged_enabled();
> +}

Hmmm... no biggie but this is a bit bothering.

> @@ -425,7 +447,10 @@ static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
>  		return -EINVAL;
>  
>  	khugepaged_scan_sleep_millisecs = msecs;
> -	wake_up_interruptible(&khugepaged_wait);
> +	if (khugepaged_has_work())
> +		mod_delayed_kthread_work(khugepaged_worker,
> +					 &khugepaged_do_scan_work,
> +					 0);

What's wrong with just doing the following?

	if (khugepaged_enabled())
		mod_delayed_kthread_work(...);

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 00/18] kthread: Use kthread worker API more widely
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (17 preceding siblings ...)
  2015-09-21 13:03 ` [RFC v2 18/18] kthread: Better support freezable kthread workers Petr Mladek
@ 2015-09-22 20:32 ` Tejun Heo
  2015-09-30  5:08 ` Paul E. McKenney
  19 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2015-09-22 20:32 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

Hello, Petr.

On Mon, Sep 21, 2015 at 03:03:41PM +0200, Petr Mladek wrote:
> 9th, 12th, 17th patches: convert three kthreads into the new API,
>      namely: khugepaged, ring buffer benchmark, RCU gp kthreads[*]

I haven't gone through each conversion in detail but they generally
look good to me.

Thanks a lot for doing this!

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 09/18] mm/huge_page: Convert khugepaged() into kthread worker API
  2015-09-22 20:26   ` Tejun Heo
@ 2015-09-23  9:50     ` Petr Mladek
  0 siblings, 0 replies; 44+ messages in thread
From: Petr Mladek @ 2015-09-23  9:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Tue 2015-09-22 16:26:04, Tejun Heo wrote:
> Hello,
> 
> On Mon, Sep 21, 2015 at 03:03:50PM +0200, Petr Mladek wrote:
> > +static int khugepaged_has_work(void)
> > +{
> > +	return !list_empty(&khugepaged_scan.mm_head) &&
> > +		khugepaged_enabled();
> > +}
> 
> Hmmm... no biggie but this is a bit bothering.

This function has been there even before this patch and is used in
several places. I have just moved the definition.

> > @@ -425,7 +447,10 @@ static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
> >  		return -EINVAL;
> >  
> >  	khugepaged_scan_sleep_millisecs = msecs;
> > -	wake_up_interruptible(&khugepaged_wait);
> > +	if (khugepaged_has_work())
> > +		mod_delayed_kthread_work(khugepaged_worker,
> > +					 &khugepaged_do_scan_work,
> > +					 0);
> 
> What's wrong with just doing the following?
> 
> 	if (khugepaged_enabled())
> 		mod_delayed_kthread_work(...);

It was just an optimization. It does not make sense to queue the work
if there is nothing to do.

Note that the timeout between the scans is there to throttle the work.
If all pages are scanned, the work stops re-queuing itself until
__khugepaged_enter() adds a new job.

Thanks a lot for the review. I am going to update the patchset according
to the other comments.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-09-22 19:35   ` Tejun Heo
@ 2015-09-25 11:26     ` Petr Mladek
  2015-09-28 17:03       ` Tejun Heo
  0 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-09-25 11:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Tue 2015-09-22 15:35:13, Tejun Heo wrote:
> On Mon, Sep 21, 2015 at 03:03:48PM +0200, Petr Mladek wrote:
> >  /**
> > + * try_to_grab_pending_kthread_work - steal kthread work item from worklist,
> > + *	and disable irq
> > + * @work: work item to steal
> > + * @is_dwork: @work is a delayed_work
> > + * @flags: place to store irq state
> > + *
> > + * Try to grab PENDING bit of @work.  This function can handle @work in any
> > + * stable state - idle, on timer or on worklist.
> > + *
> > + * Return:
> > + *  1		if @work was pending and we successfully stole PENDING
> > + *  0		if @work was idle and we claimed PENDING
> > + *  -EAGAIN	if PENDING couldn't be grabbed at the moment, safe to busy-retry
> > + *  -ENOENT	if someone else is canceling @work, this state may persist
> > + *		for arbitrarily long
> > + *
> > + * Note:
> > + * On >= 0 return, the caller owns @work's PENDING bit.  To avoid getting
> > + * interrupted while holding PENDING and @work off queue, irq must be
> > + * disabled on return.  This, combined with delayed_work->timer being
> > + * irqsafe, ensures that we return -EAGAIN for finite short period of time.
> > + *
> > + * On successful return, >= 0, irq is disabled and the caller is
> > + * responsible for releasing it using local_irq_restore(*@flags).
> > + *
> > + * This function is safe to call from any context including IRQ handler.
> > + */
> 
> Ugh... I think this is way too much for kthread_worker.  Workqueue is
> as complex as it is partly for historical reasons and partly because
> it's used so widely and heavily.  kthread_worker is always guaranteed
> to have a single worker and in most cases maybe several work items.
> There's no reason to bring this level of complexity onto it.
> Providing similar semantics is fine but it should be possible to do
> this in a lot simpler way if the requirements on space and concurrency
> are this much lower.
> 
> e.g. always embed timer_list in a work item and use per-worker
> spinlock to synchronize access to both the work item and timer and use
> per-work-item mutex to synchronize multiple cancelers.  Let's please
> keep it simple.

I thought about it a lot and I do not see a way to make it simpler
using the locks.

I guess that you are primarily interested in the two rather
complicated things:


1) PENDING state plus -EAGAIN/busy loop cycle
---------------------------------------------

IMHO, we want to use the timer because it is an elegant solution.
Then we must release the lock when the timer is running. The lock
must be taken by the timer->function(). And there is a small window
when the timer is no longer pending but timer->function is not running:

CPU0                            CPU1

run_timer_softirq()
  __run_timers()
    detach_expired_timer()
      detach_timer()
	#clear_pending

				try_to_grab_pending_kthread_work()
				  del_timer()
				    # fails because not pending

				  test_and_set_bit(KTHREAD_WORK_PENDING_BIT)
				    # fails because already set

				  if (!list_empty(&work->node))
				    # fails because still not queued

			!!! problematic window !!!

    call_timer_fn()
     queue_kthread_work()


2) CANCEL state plus custom waitqueue
-------------------------------------

cancel_kthread_work_sync() has to wait for the running work. It might take
quite some time. Therefore we cannot block others with a spinlock.
Also others cannot busy-wait on the spinlock for that long.


IMHO, the proposed and rather complex solutions are needed in both cases.

Or did I miss a possible trick, please?

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-09-25 11:26     ` Petr Mladek
@ 2015-09-28 17:03       ` Tejun Heo
  2015-10-02 15:43         ` Petr Mladek
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2015-09-28 17:03 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

Hello, Petr.

On Fri, Sep 25, 2015 at 01:26:17PM +0200, Petr Mladek wrote:
> 1) PENDING state plus -EAGAIN/busy loop cycle
> ---------------------------------------------
> 
> IMHO, we want to use the timer because it is an elegant solution.
> Then we must release the lock when the timer is running. The lock
> must be taken by the timer->function(). And there is a small window
> when the timer is no longer pending but timer->function is not running:
> 
> CPU0                            CPU1
> 
> run_timer_softirq()
>   __run_timers()
>     detach_expired_timer()
>       detach_timer()
> 	#clear_pending
> 
> 				try_to_grab_pending_kthread_work()
> 				  del_timer()
> 				    # fails because not pending
> 
> 				  test_and_set_bit(KTHREAD_WORK_PENDING_BIT)
> 				    # fails because already set
> 
> 				  if (!list_empty(&work->node))
> 				    # fails because still not queued
> 
> 			!!! problematic window !!!
> 
>     call_timer_fn()
>      queue_kthread_work()

Let's say each work item has a state variable which is protected by a
lock and the state can be one of IDLE, PENDING, CANCELING.  Let's also
assume that all cancelers synchronize with each other via mutex, so we
only have to worry about a single canceler.  Wouldn't something like
the following work while being a lot simpler?  (A rough code sketch
follows the steps below.)

Delayed queueing and execution.

1. Lock and check whether state is IDLE.  If not, nothing to do.

2. Set state to PENDING and schedule the timer and unlock.

3. On expiration, timer_fn grabs the lock and see whether state is
   still PENDING.  If so, schedule the work item for execution;
   otherwise, nothing to do.

4. After dequeueing from execution queue with lock held, the worker is
   marked as executing the work item and state is reset to IDLE.

Canceling

1. Lock, dequeue and set the state to CANCELING.

2. Unlock and perform del_timer_sync().

3. Flush the work item.

4. Lock and reset the state to IDLE and unlock.
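
Very roughly, in code (untested sketch, made-up names; waking up the
worker and the canceling path are omitted):

	enum { KTW_IDLE, KTW_PENDING, KTW_CANCELING };

	static bool simple_queue_delayed_work(struct kthread_worker *worker,
					      struct delayed_kthread_work *dwork,
					      unsigned long delay)
	{
		bool queued = false;

		spin_lock_irq(&worker->lock);
		if (dwork->state == KTW_IDLE) {
			dwork->state = KTW_PENDING;
			dwork->worker = worker;
			mod_timer(&dwork->timer, jiffies + delay);
			queued = true;
		}
		spin_unlock_irq(&worker->lock);

		return queued;
	}

	static void simple_delayed_work_timer_fn(unsigned long data)
	{
		struct delayed_kthread_work *dwork = (void *)data;
		struct kthread_worker *worker = dwork->worker;

		spin_lock_irq(&worker->lock);
		/* a canceler might have won the race; then there is nothing to do */
		if (dwork->state == KTW_PENDING)
			list_add_tail(&dwork->work.node, &worker->work_list);
		spin_unlock_irq(&worker->lock);
	}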


> 2) CANCEL state plus custom waitqueue
> -------------------------------------
> 
> cancel_kthread_work_sync() has to wait for the running work. It might take
> quite some time. Therefore we could not block others by a spinlock.
> Also others could not wait for the spin lock in a busy wait.

Hmmm?  Cancelers can synchronize amongst themselves using a mutex and
the actual work-item wait can use flushing.

> IMHO, the proposed and rather complex solutions are needed in both cases.
> 
> Or did I miss a possible trick, please?

I probably have missed something in the above and it is not completely
correct but I do think it can be way simpler than how workqueue does
it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 17/18] rcu: Convert RCU gp kthreads into kthread worker API
  2015-09-21 13:03 ` [RFC v2 17/18] rcu: Convert RCU gp kthreads into kthread worker API Petr Mladek
@ 2015-09-28 17:14   ` Paul E. McKenney
  2015-10-01 15:43     ` Petr Mladek
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2015-09-28 17:14 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Mon, Sep 21, 2015 at 03:03:58PM +0200, Petr Mladek wrote:
> Kthreads are currently implemented as an infinite loop. Each
> has its own variant of checks for terminating, freezing,
> awakening. In many cases it is unclear to say in which state
> it is and sometimes it is done a wrong way.
> 
> The plan is to convert kthreads into kthread_worker or workqueues
> API. It allows to split the functionality into separate operations.
> It helps to make a better structure. Also it defines a clean state
> where no locks are taken, IRQs blocked, the kthread might sleep
> or even be safely migrated.
> 
> The kthread worker API is useful when we want to have a dedicated
> single kthread for the work. It helps to make sure that it is
> available when needed. Also it allows a better control, e.g.
> define a scheduling priority.
> 
> This patch converts RCU gp threads into the kthread worker API.
> They modify the scheduling, have their own logic to bind the process.
> They provide functions that are critical for the system to work
> and thus deserve a dedicated kthread.
> 
> This patch tries to split start of the grace period and the quiescent
> state handling into separate works. The motivation is to avoid
> wait_events inside the work. Instead it queues the works when
> appropriate which is more typical for this API.
> 
> On one hand, it should reduce spurious wakeups where the condition
> in the wait_event failed and the kthread went to sleep again.
> 
> On the other hand, there is a small race window when the other
> work might get queued. We could detect and fix this situation
> at the beginning of the work but it is a bit ugly.
> 
> The patch renames the functions kthread_wake() to kthread_worker_poke()
> that sounds more appropriate.
> 
> Otherwise, the logic should stay the same. I did a lot of torturing
> and I did not see any problem with the current patch. But of course,
> it would deserve much more testing and reviewing before applying.

Suppose I later need to add helper kthreads to parallelize grace-period
initialization.  How would I implement that in a freeze-friendly way?

							Thanx, Paul

> Signed-off-by: Petr Mladek <pmladek@suse.com>
> ---
>  kernel/rcu/tree.c        | 349 ++++++++++++++++++++++++++++++-----------------
>  kernel/rcu/tree.h        |   8 +-
>  kernel/rcu/tree_plugin.h |  16 +--
>  3 files changed, 237 insertions(+), 136 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 08d1d3e63b9b..e115c3aee65d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -482,7 +482,7 @@ void show_rcu_gp_kthreads(void)
> 
>  	for_each_rcu_flavor(rsp) {
>  		pr_info("%s: wait state: %d ->state: %#lx\n",
> -			rsp->name, rsp->gp_state, rsp->gp_kthread->state);
> +			rsp->name, rsp->gp_state, rsp->gp_worker->task->state);
>  		/* sched_show_task(rsp->gp_kthread); */
>  	}
>  }
> @@ -1179,7 +1179,7 @@ static void rcu_check_gp_kthread_starvation(struct rcu_state *rsp)
>  		       rsp->name, j - gpa,
>  		       rsp->gpnum, rsp->completed,
>  		       rsp->gp_flags, rsp->gp_state,
> -		       rsp->gp_kthread ? rsp->gp_kthread->state : 0);
> +		       rsp->gp_worker ? rsp->gp_worker->task->state : 0);
>  }
> 
>  /*
> @@ -1577,19 +1577,66 @@ static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
>  }
> 
>  /*
> - * Awaken the grace-period kthread for the specified flavor of RCU.
> - * Don't do a self-awaken, and don't bother awakening when there is
> - * nothing for the grace-period kthread to do (as in several CPUs
> - * raced to awaken, and we lost), and finally don't try to awaken
> - * a kthread that has not yet been created.
> + * Check if it makes sense to queue the kthread work that would
> + * start a new grace period.
>   */
> -static void rcu_gp_kthread_wake(struct rcu_state *rsp)
> +static bool rcu_gp_start_ready(struct rcu_state *rsp)
>  {
> -	if (current == rsp->gp_kthread ||
> -	    !READ_ONCE(rsp->gp_flags) ||
> -	    !rsp->gp_kthread)
> +	/* Someone like call_rcu() requested a new grace period. */
> +	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT)
> +		return true;
> +
> +	return false;
> +
> +}
> +
> +/*
> + * Check if it makes sense to immediately queue the kthread work
> + * that would handle quiescent state.
> + *
> + * It does not check the timeout for forcing the quiescent state
> + * because the delayed kthread work should be scheduled at this
> + * time.
> + */
> +static bool rcu_gp_handle_qs_ready(struct rcu_state *rsp)
> +{
> +	struct rcu_node *rnp = rcu_get_root(rsp);
> +
> +	/* Someone like call_rcu() requested a force-quiescent-state scan. */
> +	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
> +		return true;
> +
> +	/* The current grace period has completed. */
> +	if (!READ_ONCE(rnp->qsmask) && !rcu_preempt_blocked_readers_cgp(rnp))
> +		return true;
> +
> +	return false;
> +}
> +
> +/*
> + * Poke the kthread worker that handles grace periods for the specified
> + * flavor of RCU. Return when there is nothing for the grace-period kthread
> + * worker to do (as in several CPUs raced to awaken, and we lost). Also
> + * don't try to use the kthread worker that has not been created yet.
> + * Finally, ignore requests from the kthread servicing the worker itself.
> + */
> +static void rcu_gp_kthread_worker_poke(struct rcu_state *rsp)
> +{
> +	if (!READ_ONCE(rsp->gp_flags) ||
> +	    !rsp->gp_worker ||
> +	    rsp->gp_worker->task == current)
>  		return;
> -	wake_up(&rsp->gp_wq);
> +
> +	if (!rcu_gp_in_progress(rsp)) {
> +		if (rcu_gp_start_ready(rsp))
> +			queue_kthread_work(rsp->gp_worker, &rsp->gp_start_work);
> +		return;
> +	}
> +
> +	if (rcu_gp_handle_qs_ready(rsp))
> +		mod_delayed_kthread_work(rsp->gp_worker,
> +					 &rsp->gp_handle_qs_work,
> +					 0);
>  }
> 
>  /*
> @@ -1756,7 +1803,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
>  static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
>  {
>  	unsigned long flags;
> -	bool needwake;
> +	bool needpoke;
>  	struct rcu_node *rnp;
> 
>  	local_irq_save(flags);
> @@ -1769,10 +1816,10 @@ static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
>  		return;
>  	}
>  	smp_mb__after_unlock_lock();
> -	needwake = __note_gp_changes(rsp, rnp, rdp);
> +	needpoke = __note_gp_changes(rsp, rnp, rdp);
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -	if (needwake)
> -		rcu_gp_kthread_wake(rsp);
> +	if (needpoke)
> +		rcu_gp_kthread_worker_poke(rsp);
>  }
> 
>  static void rcu_gp_slow(struct rcu_state *rsp, int delay)
> @@ -1905,25 +1952,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  }
> 
>  /*
> - * Helper function for wait_event_interruptible_timeout() wakeup
> - * at force-quiescent-state time.
> - */
> -static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp)
> -{
> -	struct rcu_node *rnp = rcu_get_root(rsp);
> -
> -	/* Someone like call_rcu() requested a force-quiescent-state scan. */
> -	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
> -		return true;
> -
> -	/* The current grace period has completed. */
> -	if (!READ_ONCE(rnp->qsmask) && !rcu_preempt_blocked_readers_cgp(rnp))
> -		return true;
> -
> -	return false;
> -}
> -
> -/*
>   * Do one round of quiescent-state forcing.
>   */
>  static void rcu_gp_fqs(struct rcu_state *rsp)
> @@ -2067,94 +2095,157 @@ static unsigned long normalize_jiffies_till_next_fqs(void)
>  }
> 
>  /*
> - * Body of kthread that handles grace periods.
> + * Initialize kthread worker for handling grace periods.
>   */
> -static int __noreturn rcu_gp_kthread(void *arg)
> +static void rcu_gp_init_func(struct kthread_work *work)
>  {
> -	unsigned long timeout, j;
> -	struct rcu_state *rsp = arg;
> -	struct rcu_node *rnp = rcu_get_root(rsp);
> +	struct rcu_state *rsp = container_of(work, struct rcu_state,
> +					     gp_init_work);
> 
>  	rcu_bind_gp_kthread();
> -	for (;;) {
> 
> -		/* Handle grace-period start. */
> -		for (;;) {
> -			trace_rcu_grace_period(rsp->name,
> -					       READ_ONCE(rsp->gpnum),
> -					       TPS("reqwait"));
> -			rsp->gp_state = RCU_GP_WAIT_GPS;
> -			wait_event_interruptible(rsp->gp_wq,
> -						 READ_ONCE(rsp->gp_flags) &
> -						 RCU_GP_FLAG_INIT);
> -			rsp->gp_state = RCU_GP_DONE_GPS;
> -			/* Locking provides needed memory barrier. */
> -			if (rcu_gp_init(rsp))
> -				break;
> -			cond_resched_rcu_qs();
> -			WRITE_ONCE(rsp->gp_activity, jiffies);
> -			WARN_ON(signal_pending(current));
> -			trace_rcu_grace_period(rsp->name,
> -					       READ_ONCE(rsp->gpnum),
> -					       TPS("reqwaitsig"));
> +	trace_rcu_grace_period(rsp->name,
> +			       READ_ONCE(rsp->gpnum),
> +			       TPS("reqwait"));
> +	rsp->gp_state = RCU_GP_WAIT_GPS;
> +}
> +
> +/*
> + * Function for RCU kthread work that starts a new grace period.
> + */
> +static void rcu_gp_start_func(struct kthread_work *work)
> +{
> +	unsigned long timeout;
> +	struct rcu_state *rsp = container_of(work, struct rcu_state,
> +					     gp_start_work);
> +
> +	/*
> +	 * There is a small race window in rcu_gp_kthread_worker_poke().
> +	 * Check if the grace period has already started and the quiescent
> +	 * state should get handled instead.
> +	 */
> +	if (rcu_gp_in_progress(rsp)) {
> +		if (rcu_gp_handle_qs_ready(rsp)) {
> +			mod_delayed_kthread_work(rsp->gp_worker,
> +						 &rsp->gp_handle_qs_work,
> +						 0);
>  		}
> +		return;
> +	}
> 
> +	rsp->gp_state = RCU_GP_DONE_GPS;
> +	if (rcu_gp_init(rsp)) {
>  		/* Handle quiescent-state forcing. */
>  		rsp->first_gp_fqs = true;
>  		timeout = normalize_jiffies_till_first_fqs();
>  		rsp->jiffies_force_qs = jiffies + timeout;
> -		for (;;) {
> -			trace_rcu_grace_period(rsp->name,
> -					       READ_ONCE(rsp->gpnum),
> -					       TPS("fqswait"));
> -			rsp->gp_state = RCU_GP_WAIT_FQS;
> -			wait_event_interruptible_timeout(rsp->gp_wq,
> -					rcu_gp_fqs_check_wake(rsp),
> -					timeout);
> -			rsp->gp_state = RCU_GP_DOING_FQS;
> -try_again:
> -			/* Locking provides needed memory barriers. */
> -			/* If grace period done, leave loop. */
> -			if (!READ_ONCE(rnp->qsmask) &&
> -			    !rcu_preempt_blocked_readers_cgp(rnp))
> -				break;
> -			/* If time for quiescent-state forcing, do it. */
> -			if (ULONG_CMP_GE(jiffies, rsp->jiffies_force_qs) ||
> -			    (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)) {
> -				trace_rcu_grace_period(rsp->name,
> -						       READ_ONCE(rsp->gpnum),
> -						       TPS("fqsstart"));
> -				rcu_gp_fqs(rsp);
> -				timeout = normalize_jiffies_till_next_fqs();
> -				rsp->jiffies_force_qs = jiffies + timeout;
> -				trace_rcu_grace_period(rsp->name,
> -						       READ_ONCE(rsp->gpnum),
> -						       TPS("fqsend"));
> -			} else {
> -				/* Deal with stray signal. */
> -				WARN_ON(signal_pending(current));
> -				trace_rcu_grace_period(rsp->name,
> -						       READ_ONCE(rsp->gpnum),
> -						       TPS("fqswaitsig"));
> -			}
> -			cond_resched_rcu_qs();
> -			WRITE_ONCE(rsp->gp_activity, jiffies);
> -			/*
> -			 * Count the remaining timeout when it was a spurious
> -			 * wakeup. Well, it is useful also when we have slept
> -			 * in the cond_resched().
> -			 */
> -			j = jiffies;
> -			if (ULONG_CMP_GE(j, rsp->jiffies_force_qs))
> -				goto try_again;
> -			timeout = rsp->jiffies_force_qs - j;
> -		}
> +		trace_rcu_grace_period(rsp->name,
> +				       READ_ONCE(rsp->gpnum),
> +				       TPS("fqswait"));
> +		rsp->gp_state = RCU_GP_WAIT_FQS;
> +		queue_delayed_kthread_work(rsp->gp_worker,
> +					   &rsp->gp_handle_qs_work,
> +					   timeout);
> +		return;
> +	}
> +
> +	cond_resched_rcu_qs();
> +	WRITE_ONCE(rsp->gp_activity, jiffies);
> +	WARN_ON(signal_pending(current));
> +	trace_rcu_grace_period(rsp->name,
> +			       READ_ONCE(rsp->gpnum),
> +			       TPS("reqwaitsig"));
> +	trace_rcu_grace_period(rsp->name,
> +			       READ_ONCE(rsp->gpnum),
> +			       TPS("reqwait"));
> +}
> +
> +/*
> + * Function for RCU kthread work that handles a quiescent state
> + * and closes the related grace period.
> + */
> +static void rcu_gp_handle_qs_func(struct kthread_work *work)
> +{
> +	unsigned long timeout, j;
> +	struct rcu_state *rsp = container_of(work, struct rcu_state,
> +					     gp_handle_qs_work.work);
> +	struct rcu_node *rnp = rcu_get_root(rsp);
> +
> 
> +	/*
> +	 * There is a small race window in rcu_gp_kthread_worker_poke()
> +	 * when the work might be queued more times. First, check if
> +	 * we are already waiting for the GP start instead.
> +	 */
> +	if (!rcu_gp_in_progress(rsp)) {
> +		if (rcu_gp_start_ready(rsp))
> +			queue_kthread_work(rsp->gp_worker, &rsp->gp_start_work);
> +		return;
> +	}
> +
> +	/*
> +	 * Second, we might have been queued more times to force QS.
> +	 * Just continue waiting if we have already forced it.
> +	 */
> +	if (!rcu_gp_handle_qs_ready(rsp) &&
> +	    ULONG_CMP_LT(jiffies, rsp->jiffies_force_qs))
> +		goto wait_continue;
> +
> +	rsp->gp_state = RCU_GP_DOING_FQS;
> +try_again:
> +	/* Locking provides needed memory barriers. */
> +	/* If grace period done, leave loop. */
> +	if (!READ_ONCE(rnp->qsmask) &&
> +	    !rcu_preempt_blocked_readers_cgp(rnp)) {
>  		/* Handle grace-period end. */
>  		rsp->gp_state = RCU_GP_CLEANUP;
>  		rcu_gp_cleanup(rsp);
>  		rsp->gp_state = RCU_GP_CLEANED;
> +		trace_rcu_grace_period(rsp->name,
> +				       READ_ONCE(rsp->gpnum),
> +				       TPS("reqwait"));
> +		rsp->gp_state = RCU_GP_WAIT_GPS;
> +		return;
> +	}
> +
> +	/* If time for quiescent-state forcing, do it. */
> +	if (ULONG_CMP_GE(jiffies, rsp->jiffies_force_qs) ||
> +	    (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)) {
> +		trace_rcu_grace_period(rsp->name,
> +				       READ_ONCE(rsp->gpnum),
> +				       TPS("fqsstart"));
> +		rcu_gp_fqs(rsp);
> +		timeout = normalize_jiffies_till_next_fqs();
> +		rsp->jiffies_force_qs = jiffies + timeout;
> +		trace_rcu_grace_period(rsp->name,
> +				       READ_ONCE(rsp->gpnum),
> +				       TPS("fqsend"));
> +	} else {
> +		/* Deal with stray signal. */
> +		WARN_ON(signal_pending(current));
> +		trace_rcu_grace_period(rsp->name,
> +				       READ_ONCE(rsp->gpnum),
> +				       TPS("fqswaitsig"));
>  	}
> +wait_continue:
> +	cond_resched_rcu_qs();
> +	WRITE_ONCE(rsp->gp_activity, jiffies);
> +	/*
> +	 * Count the remaining timeout when it was a spurious
> +	 * wakeup. Well, it is useful also when we have slept
> +	 * in the cond_resched().
> +	 */
> +	j = jiffies;
> +	if (ULONG_CMP_GE(j, rsp->jiffies_force_qs))
> +		goto try_again;
> +	timeout = rsp->jiffies_force_qs - j;
> +
> +	trace_rcu_grace_period(rsp->name,
> +			       READ_ONCE(rsp->gpnum),
> +			       TPS("fqswait"));
> +	rsp->gp_state = RCU_GP_WAIT_FQS;
> +	queue_delayed_kthread_work(rsp->gp_worker, &rsp->gp_handle_qs_work,
> +				   timeout);
>  }
> 
>  /*
> @@ -2172,7 +2263,7 @@ static bool
>  rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
>  		      struct rcu_data *rdp)
>  {
> -	if (!rsp->gp_kthread || !cpu_needs_another_gp(rsp, rdp)) {
> +	if (!rsp->gp_worker || !cpu_needs_another_gp(rsp, rdp)) {
>  		/*
>  		 * Either we have not yet spawned the grace-period
>  		 * task, this CPU does not need another grace period,
> @@ -2234,7 +2325,7 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
>  	WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
>  	WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
>  	raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> -	rcu_gp_kthread_wake(rsp);
> +	rcu_gp_kthread_worker_poke(rsp);
>  }
> 
>  /*
> @@ -2355,7 +2446,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
>  {
>  	unsigned long flags;
>  	unsigned long mask;
> -	bool needwake;
> +	bool needpoke;
>  	struct rcu_node *rnp;
> 
>  	rnp = rdp->mynode;
> @@ -2387,12 +2478,12 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
>  		 * This GP can't end until cpu checks in, so all of our
>  		 * callbacks can be processed during the next GP.
>  		 */
> -		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> +		needpoke = rcu_accelerate_cbs(rsp, rnp, rdp);
> 
>  		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gpnum, flags);
>  		/* ^^^ Released rnp->lock */
> -		if (needwake)
> -			rcu_gp_kthread_wake(rsp);
> +		if (needpoke)
> +			rcu_gp_kthread_worker_poke(rsp);
>  	}
>  }
> 
> @@ -2895,7 +2986,7 @@ static void force_quiescent_state(struct rcu_state *rsp)
>  	}
>  	WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
>  	raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
> -	rcu_gp_kthread_wake(rsp);
> +	rcu_gp_kthread_worker_poke(rsp);
>  }
> 
>  /*
> @@ -2907,7 +2998,7 @@ static void
>  __rcu_process_callbacks(struct rcu_state *rsp)
>  {
>  	unsigned long flags;
> -	bool needwake;
> +	bool needpoke;
>  	struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
> 
>  	WARN_ON_ONCE(rdp->beenonline == 0);
> @@ -2919,10 +3010,10 @@ __rcu_process_callbacks(struct rcu_state *rsp)
>  	local_irq_save(flags);
>  	if (cpu_needs_another_gp(rsp, rdp)) {
>  		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> -		needwake = rcu_start_gp(rsp);
> +		needpoke = rcu_start_gp(rsp);
>  		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> -		if (needwake)
> -			rcu_gp_kthread_wake(rsp);
> +		if (needpoke)
> +			rcu_gp_kthread_worker_poke(rsp);
>  	} else {
>  		local_irq_restore(flags);
>  	}
> @@ -2980,7 +3071,7 @@ static void invoke_rcu_core(void)
>  static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
>  			    struct rcu_head *head, unsigned long flags)
>  {
> -	bool needwake;
> +	bool needpoke;
> 
>  	/*
>  	 * If called from an extended quiescent state, invoke the RCU
> @@ -3011,10 +3102,10 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
> 
>  			raw_spin_lock(&rnp_root->lock);
>  			smp_mb__after_unlock_lock();
> -			needwake = rcu_start_gp(rsp);
> +			needpoke = rcu_start_gp(rsp);
>  			raw_spin_unlock(&rnp_root->lock);
> -			if (needwake)
> -				rcu_gp_kthread_wake(rsp);
> +			if (needpoke)
> +				rcu_gp_kthread_worker_poke(rsp);
>  		} else {
>  			/* Give the grace period a kick. */
>  			rdp->blimit = LONG_MAX;
> @@ -4044,7 +4135,7 @@ static int __init rcu_spawn_gp_kthread(void)
>  	struct rcu_node *rnp;
>  	struct rcu_state *rsp;
>  	struct sched_param sp;
> -	struct task_struct *t;
> +	struct kthread_worker *w;
> 
>  	/* Force priority into range. */
>  	if (IS_ENABLED(CONFIG_RCU_BOOST) && kthread_prio < 1)
> @@ -4059,16 +4150,20 @@ static int __init rcu_spawn_gp_kthread(void)
> 
>  	rcu_scheduler_fully_active = 1;
>  	for_each_rcu_flavor(rsp) {
> -		t = kthread_create(rcu_gp_kthread, rsp, "%s", rsp->name);
> -		BUG_ON(IS_ERR(t));
> +		init_kthread_work(&rsp->gp_init_work, rcu_gp_init_func);
> +		init_kthread_work(&rsp->gp_start_work, rcu_gp_start_func);
> +		init_delayed_kthread_work(&rsp->gp_handle_qs_work,
> +					  rcu_gp_handle_qs_func);
> +		w = create_kthread_worker("%s", rsp->name);
> +		BUG_ON(IS_ERR(w));
>  		rnp = rcu_get_root(rsp);
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		rsp->gp_kthread = t;
> +		rsp->gp_worker = w;
>  		if (kthread_prio) {
>  			sp.sched_priority = kthread_prio;
> -			sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> +			sched_setscheduler_nocheck(w->task, SCHED_FIFO, &sp);
>  		}
> -		wake_up_process(t);
> +		queue_kthread_work(w, &rsp->gp_init_work);
>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	}
>  	rcu_spawn_nocb_kthreads();
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index f16578a5eefe..b9490e975dd7 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -25,6 +25,7 @@
>  #include <linux/cache.h>
>  #include <linux/spinlock.h>
>  #include <linux/threads.h>
> +#include <linux/kthread.h>
>  #include <linux/cpumask.h>
>  #include <linux/seqlock.h>
>  #include <linux/stop_machine.h>
> @@ -466,7 +467,12 @@ struct rcu_state {
>  						/* Subject to priority boost. */
>  	unsigned long gpnum;			/* Current gp number. */
>  	unsigned long completed;		/* # of last completed gp. */
> -	struct task_struct *gp_kthread;		/* Task for grace periods. */
> +	struct kthread_worker *gp_worker;	/* Worker for grace periods */
> +	struct kthread_work gp_init_work;	/* Init work for handling gp */
> +	struct kthread_work gp_start_work;	/* Work for starting gp */
> +	struct delayed_kthread_work
> +		gp_handle_qs_work;		/* Work for QS state handling */
> +
>  	wait_queue_head_t gp_wq;		/* Where GP task waits. */
>  	short gp_flags;				/* Commands for GP task. */
>  	short gp_state;				/* GP kthread sleep state. */
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index b2bf3963a0ae..55ae68530b7a 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1476,7 +1476,7 @@ int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>   */
>  static void rcu_prepare_for_idle(void)
>  {
> -	bool needwake;
> +	bool needpoke;
>  	struct rcu_data *rdp;
>  	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>  	struct rcu_node *rnp;
> @@ -1528,10 +1528,10 @@ static void rcu_prepare_for_idle(void)
>  		rnp = rdp->mynode;
>  		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
>  		smp_mb__after_unlock_lock();
> -		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> +		needpoke = rcu_accelerate_cbs(rsp, rnp, rdp);
>  		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> -		if (needwake)
> -			rcu_gp_kthread_wake(rsp);
> +		if (needpoke)
> +			rcu_gp_kthread_worker_poke(rsp);
>  	}
>  }
> 
> @@ -2020,15 +2020,15 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
>  	unsigned long c;
>  	bool d;
>  	unsigned long flags;
> -	bool needwake;
> +	bool needpoke;
>  	struct rcu_node *rnp = rdp->mynode;
> 
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
>  	smp_mb__after_unlock_lock();
> -	needwake = rcu_start_future_gp(rnp, rdp, &c);
> +	needpoke = rcu_start_future_gp(rnp, rdp, &c);
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -	if (needwake)
> -		rcu_gp_kthread_wake(rdp->rsp);
> +	if (needpoke)
> +		rcu_gp_kthread_worker_poke(rdp->rsp);
> 
>  	/*
>  	 * Wait for the grace period.  Do so interruptibly to avoid messing
> -- 
> 1.8.5.6
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 00/18] kthread: Use kthread worker API more widely
  2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
                   ` (18 preceding siblings ...)
  2015-09-22 20:32 ` [RFC v2 00/18] kthread: Use kthread worker API more widely Tejun Heo
@ 2015-09-30  5:08 ` Paul E. McKenney
  2015-10-01 15:59   ` Petr Mladek
  19 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2015-09-30  5:08 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Mon, Sep 21, 2015 at 03:03:41PM +0200, Petr Mladek wrote:
> My intention is to make it easier to manipulate kthreads. This RFC tries
> to use the kthread worker API. It is based on comments from the
> first attempt. See https://lkml.org/lkml/2015/7/28/648 and
> the list of changes below.
> 
> 1st..8th patches: improve the existing kthread worker API
> 
> 9th, 12th, 17th patches: convert three kthreads into the new API,
>      namely: khugepaged, ring buffer benchmark, RCU gp kthreads[*]
> 
> 10th, 11th patches: fix potential problems in the ring buffer
>       benchmark; also sent separately
> 
> 13th patch: small fix for RCU kthread; also sent separately;
>      being tested by Paul
> 
> 14th..16th patches: preparation steps for the RCU threads
>      conversion; they are needed _only_ if we split GP start
>      and QS handling into separate works[*]
> 
> 18th patch: does a possible improvement of the kthread worker API;
>      it adds an extra parameter to the create*() functions, so I
>      rather put it into this draft
>      
> 
> [*] IMPORTANT: I tried to split RCU GP start and GS state handling
>     into separate works this time. But there is a problem with
>     a race in rcu_gp_kthread_worker_poke(). It might queue
>     the wrong work. It can be detected and fixed by the work
>     itself but it is a bit ugly. Alternative solution is to
>     do both operations in one work. But then we sleep too much
>     in the work which is ugly as well. Any idea is appreciated.

I think that the kernel is trying really hard to tell you that splitting
up the RCU grace-period kthreads in this manner is not such a good idea.

So what are we really trying to accomplish here?  I am guessing something
like the following:

1.	Get each grace-period kthread to a known safe state within a
	short time of having requested a safe state.  If I recall
	correctly, the point of this is to allow no-downtime kernel
	patches to the functions executed by the grace-period kthreads.

2.	At the same time, if someone suddenly needs a grace period
	at some point in this process, the grace period kthreads are
	going to have to wake back up and handle the grace period.
	Or do you have some tricky way to guarantee that no one is
	going to need a grace period beyond the time you freeze
	the grace-period kthreads?

3.	The boost kthreads should not be a big problem because failing
	to boost simply lets the grace period run longer.

4.	The callback-offload kthreads are likely to be a big problem,
	because in systems configured with them, they need to be running
	to invoke the callbacks, and if the callbacks are not invoked,
	the grace period might just as well have failed to end.

5.	The per-CPU kthreads are in the same boat as the callback-offload
	kthreads.  One approach is to offline all the CPUs but one, and
	that will park all but the last per-CPU kthread.  But handling
	that last per-CPU kthread would likely be "good clean fun"...

6.	Other requirements?

One approach would be to simply say that the top-level rcu_gp_kthread()
function cannot be patched, and arrange for the grace-period kthreads
to park at some point within this function.  Or is there some requirement
that I am missing?

							Thanx, Paul

> Changes against v1:
> 
> + remove wrappers to manipulate the scheduling policy and priority
> 
> + remove questionable wakeup_and_destroy_kthread_worker() variant
> 
> + do not check for chained work when draining the queue
> 
> + allocate struct kthread worker in create_kthread_work() and
>   use more simple checks for running worker
> 
> + add support for delayed kthread works and use them instead
>   of waiting inside the works
> 
> + rework the "unrelated" fixes for the ring buffer benchmark
>   as discussed in the 1st RFC; also sent separately
> 
> + convert also the consumer in the ring buffer benchmark
> 
> 
> I have tested this patch set against the stable Linus tree
> for 4.3-rc2.
> 
> Petr Mladek (18):
>   kthread: Allow to call __kthread_create_on_node() with va_list args
>   kthread: Add create_kthread_worker*()
>   kthread: Add drain_kthread_worker()
>   kthread: Add destroy_kthread_worker()
>   kthread: Add pending flag to kthread work
>   kthread: Initial support for delayed kthread work
>   kthread: Allow to cancel kthread work
>   kthread: Allow to modify delayed kthread work
>   mm/huge_page: Convert khugepaged() into kthread worker API
>   ring_buffer: Do no not complete benchmark reader too early
>   ring_buffer: Fix more races when terminating the producer in the
>     benchmark
>   ring_buffer: Convert benchmark kthreads into kthread worker API
>   rcu: Finish folding ->fqs_state into ->gp_state
>   rcu: Store first_gp_fqs into struct rcu_state
>   rcu: Clean up timeouts for forcing the quiescent state
>   rcu: Check actual RCU_GP_FLAG_FQS when handling quiescent state
>   rcu: Convert RCU gp kthreads into kthread worker API
>   kthread: Better support freezable kthread workers
> 
>  include/linux/kthread.h              |  67 +++++
>  kernel/kthread.c                     | 544 ++++++++++++++++++++++++++++++++---
>  kernel/rcu/tree.c                    | 407 ++++++++++++++++----------
>  kernel/rcu/tree.h                    |  24 +-
>  kernel/rcu/tree_plugin.h             |  16 +-
>  kernel/rcu/tree_trace.c              |   2 +-
>  kernel/trace/ring_buffer_benchmark.c | 194 ++++++-------
>  mm/huge_memory.c                     | 116 ++++----
>  8 files changed, 1017 insertions(+), 353 deletions(-)
> 
> -- 
> 1.8.5.6
> 



* Re: [RFC v2 17/18] rcu: Convert RCU gp kthreads into kthread worker API
  2015-09-28 17:14   ` Paul E. McKenney
@ 2015-10-01 15:43     ` Petr Mladek
  2015-10-01 16:33       ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-10-01 15:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Mon 2015-09-28 10:14:37, Paul E. McKenney wrote:
> On Mon, Sep 21, 2015 at 03:03:58PM +0200, Petr Mladek wrote:
> > Kthreads are currently implemented as an infinite loop. Each
> > has its own variant of checks for terminating, freezing,
> > awakening. In many cases it is unclear to say in which state
> > it is and sometimes it is done a wrong way.
> > 
> > The plan is to convert kthreads into kthread_worker or workqueues
> > API. It allows to split the functionality into separate operations.
> > It helps to make a better structure. Also it defines a clean state
> > where no locks are taken, IRQs blocked, the kthread might sleep
> > or even be safely migrated.
> > 
> > The kthread worker API is useful when we want to have a dedicated
> > single kthread for the work. It helps to make sure that it is
> > available when needed. Also it allows a better control, e.g.
> > define a scheduling priority.
> > 
> > This patch converts RCU gp threads into the kthread worker API.
> > They modify the scheduling, have their own logic to bind the process.
> > They provide functions that are critical for the system to work
> > and thus deserve a dedicated kthread.
> > 
> > This patch tries to split start of the grace period and the quiescent
> > state handling into separate works. The motivation is to avoid
> > wait_events inside the work. Instead it queues the works when
> > appropriate which is more typical for this API.
> > 
> > On one hand, it should reduce spurious wakeups where the condition
> > in the wait_event failed and the kthread went to sleep again.
> > 
> > On the other hand, there is a small race window when the other
> > work might get queued. We could detect and fix this situation
> > at the beginning of the work but it is a bit ugly.
> > 
> > The patch renames the functions kthread_wake() to kthread_worker_poke()
> > that sounds more appropriate.
> > 
> > Otherwise, the logic should stay the same. I did a lot of torturing
> > and I did not see any problem with the current patch. But of course,
> > it would deserve much more testing and reviewing before applying.
> 
> Suppose I later need to add helper kthreads to parallelize grace-period
> initialization.  How would I implement that in a freeze-friendly way?

I have been convinced that there are only a few kthreads that really need
freezing. See the discussion around my first attempt at
https://lkml.org/lkml/2015/6/13/190

In fact, RCU is a good example of kthreads that should not get
frozen because they are needed even later when the system
is suspended.

If I understand it correctly, they will do their job until most devices
and all non-boot CPUs are disabled. Then the task doing the suspend
will get scheduled. It will write the image and stop the machine.
RCU should not be needed for this very last step.

In other words, RCU should not need to be much concerned about freezing.

If you are concerned about adding more kthreads, it should be
possible to just add more workers if we agree on using the
kthread worker API.


Best Regards,
Petr


* Re: [RFC v2 00/18] kthread: Use kthread worker API more widely
  2015-09-30  5:08 ` Paul E. McKenney
@ 2015-10-01 15:59   ` Petr Mladek
  2015-10-01 17:00     ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-10-01 15:59 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Tue 2015-09-29 22:08:33, Paul E. McKenney wrote:
> On Mon, Sep 21, 2015 at 03:03:41PM +0200, Petr Mladek wrote:
> > My intention is to make it easier to manipulate kthreads. This RFC tries
> > to use the kthread worker API. It is based on comments from the
> > first attempt. See https://lkml.org/lkml/2015/7/28/648 and
> > the list of changes below.
> > 
> > 1st..8th patches: improve the existing kthread worker API
> > 
> > 9th, 12th, 17th patches: convert three kthreads into the new API,
> >      namely: khugepaged, ring buffer benchmark, RCU gp kthreads[*]
> > 
> > 10th, 11th patches: fix potential problems in the ring buffer
> >       benchmark; also sent separately
> > 
> > 13th patch: small fix for RCU kthread; also sent separately;
> >      being tested by Paul
> > 
> > 14th..16th patches: preparation steps for the RCU threads
> >      conversion; they are needed _only_ if we split GP start
> >      and QS handling into separate works[*]
> > 
> > 18th patch: does a possible improvement of the kthread worker API;
> >      it adds an extra parameter to the create*() functions, so I
> >      rather put it into this draft
> >      
> > 
> > [*] IMPORTANT: I tried to split RCU GP start and GS state handling
> >     into separate works this time. But there is a problem with
> >     a race in rcu_gp_kthread_worker_poke(). It might queue
> >     the wrong work. It can be detected and fixed by the work
> >     itself but it is a bit ugly. Alternative solution is to
> >     do both operations in one work. But then we sleep too much
> >     in the work which is ugly as well. Any idea is appreciated.
> 
> I think that the kernel is trying really hard to tell you that splitting
> up the RCU grace-period kthreads in this manner is not such a good idea.

Yup, I guess that it would be better to stay with the approach taken
in the previous RFC, i.e. to start the grace period and handle
the quiescent state in a single work, see
https://lkml.org/lkml/2015/7/28/650. It basically keeps the
existing functionality. The only difference is that we regularly leave
the RCU-specific function, so it will be possible to live patch it.

The RCU kthreads are very special because they basically ignore the
freezer and never stop. They do not show the advantages of any new
API well. I tried to convert them primarily because they were
so sensitive. I thought that it was good for testing the limits
of the API.


> So what are we really trying to accomplish here?  I am guessing something
> like the following:
> 
> 1.	Get each grace-period kthread to a known safe state within a
> 	short time of having requested a safe state.  If I recall
> 	correctly, the point of this is to allow no-downtime kernel
> 	patches to the functions executed by the grace-period kthreads.
> 
> 2.	At the same time, if someone suddenly needs a grace period
> 	at some point in this process, the grace period kthreads are
> 	going to have to wake back up and handle the grace period.
> 	Or do you have some tricky way to guarantee that no one is
> 	going to need a grace period beyond the time you freeze
> 	the grace-period kthreads?
> 
> 3.	The boost kthreads should not be a big problem because failing
> 	to boost simply lets the grace period run longer.
> 
> 4.	The callback-offload kthreads are likely to be a big problem,
> 	because in systems configured with them, they need to be running
> 	to invoke the callbacks, and if the callbacks are not invoked,
> 	the grace period might just as well have failed to end.
> 
> 5.	The per-CPU kthreads are in the same boat as the callback-offload
> 	kthreads.  One approach is to offline all the CPUs but one, and
> 	that will park all but the last per-CPU kthread.  But handling
> 	that last per-CPU kthread would likely be "good clean fun"...
> 
> 6.	Other requirements?
> 
> One approach would be to simply say that the top-level rcu_gp_kthread()
> function cannot be patched, and arrange for the grace-period kthreads
> to park at some point within this function.  Or is there some requirement
> that I am missing?

I am a bit confused by the above paragraphs because they mix patching,
stopping, and parking. Note that we do not need to stop any process
when live patching.

I hope that it is clearer after my response in the other mail about
freezing. Or maybe I am missing something.

Anyway, thanks a lot for looking at the patches and for the feedback.


Best Regards,
Petr


* Re: [RFC v2 17/18] rcu: Convert RCU gp kthreads into kthread worker API
  2015-10-01 15:43     ` Petr Mladek
@ 2015-10-01 16:33       ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2015-10-01 16:33 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Thu, Oct 01, 2015 at 05:43:00PM +0200, Petr Mladek wrote:
> On Mon 2015-09-28 10:14:37, Paul E. McKenney wrote:
> > On Mon, Sep 21, 2015 at 03:03:58PM +0200, Petr Mladek wrote:
> > > Kthreads are currently implemented as an infinite loop. Each
> > > has its own variant of checks for terminating, freezing,
> > > awakening. In many cases it is unclear to say in which state
> > > it is and sometimes it is done a wrong way.
> > > 
> > > The plan is to convert kthreads into kthread_worker or workqueues
> > > API. It allows to split the functionality into separate operations.
> > > It helps to make a better structure. Also it defines a clean state
> > > where no locks are taken, IRQs blocked, the kthread might sleep
> > > or even be safely migrated.
> > > 
> > > The kthread worker API is useful when we want to have a dedicated
> > > single kthread for the work. It helps to make sure that it is
> > > available when needed. Also it allows a better control, e.g.
> > > define a scheduling priority.
> > > 
> > > This patch converts RCU gp threads into the kthread worker API.
> > > They modify the scheduling, have their own logic to bind the process.
> > > They provide functions that are critical for the system to work
> > > and thus deserve a dedicated kthread.
> > > 
> > > This patch tries to split start of the grace period and the quiescent
> > > state handling into separate works. The motivation is to avoid
> > > wait_events inside the work. Instead it queues the works when
> > > appropriate which is more typical for this API.
> > > 
> > > On one hand, it should reduce spurious wakeups where the condition
> > > in the wait_event failed and the kthread went to sleep again.
> > > 
> > > On the other hand, there is a small race window when the other
> > > work might get queued. We could detect and fix this situation
> > > at the beginning of the work but it is a bit ugly.
> > > 
> > > The patch renames the functions kthread_wake() to kthread_worker_poke()
> > > that sounds more appropriate.
> > > 
> > > Otherwise, the logic should stay the same. I did a lot of torturing
> > > and I did not see any problem with the current patch. But of course,
> > > it would deserve much more testing and reviewing before applying.
> > 
> > Suppose I later need to add helper kthreads to parallelize grace-period
> > initialization.  How would I implement that in a freeze-friendly way?
> 
> I have been convinced that there only few kthreads that really need
> freezing. See the discussion around my first attempt at
> https://lkml.org/lkml/2015/6/13/190
> 
> In fact, RCU is a good example of kthreads that should not get
> frozen because they are needed even later when the system
> is suspended.
> 
> If I understand it correctly, they will do the job until most devices
> and all non-boot CPUs are disabled. Then the task doing the suspend
> will get scheduled. It will write the image and stop the machine.
> RCU should not be needed by this very last step.
> 
> By other words. RCU should not be much concerned about freezing.
> 
> If you are concerned about adding more kthreads, it should be
> possible to just add more workers if we agree on using the
> kthreads worker API.

OK, I will bite.  If RCU should not be much concerned about freezing,
why can't it retain its current simpler implementation using the current
kthread APIs?

							Thanx, Paul



* Re: [RFC v2 00/18] kthread: Use kthread worker API more widely
  2015-10-01 15:59   ` Petr Mladek
@ 2015-10-01 17:00     ` Paul E. McKenney
  2015-10-02 12:00       ` Petr Mladek
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2015-10-01 17:00 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Thu, Oct 01, 2015 at 05:59:43PM +0200, Petr Mladek wrote:
> On Tue 2015-09-29 22:08:33, Paul E. McKenney wrote:
> > On Mon, Sep 21, 2015 at 03:03:41PM +0200, Petr Mladek wrote:
> > > My intention is to make it easier to manipulate kthreads. This RFC tries
> > > to use the kthread worker API. It is based on comments from the
> > > first attempt. See https://lkml.org/lkml/2015/7/28/648 and
> > > the list of changes below.
> > > 
> > > 1st..8th patches: improve the existing kthread worker API
> > > 
> > > 9th, 12th, 17th patches: convert three kthreads into the new API,
> > >      namely: khugepaged, ring buffer benchmark, RCU gp kthreads[*]
> > > 
> > > 10th, 11th patches: fix potential problems in the ring buffer
> > >       benchmark; also sent separately
> > > 
> > > 13th patch: small fix for RCU kthread; also sent separately;
> > >      being tested by Paul
> > > 
> > > 14th..16th patches: preparation steps for the RCU threads
> > >      conversion; they are needed _only_ if we split GP start
> > >      and QS handling into separate works[*]
> > > 
> > > 18th patch: does a possible improvement of the kthread worker API;
> > >      it adds an extra parameter to the create*() functions, so I
> > >      rather put it into this draft
> > >      
> > > 
> > > [*] IMPORTANT: I tried to split RCU GP start and GS state handling
> > >     into separate works this time. But there is a problem with
> > >     a race in rcu_gp_kthread_worker_poke(). It might queue
> > >     the wrong work. It can be detected and fixed by the work
> > >     itself but it is a bit ugly. Alternative solution is to
> > >     do both operations in one work. But then we sleep too much
> > >     in the work which is ugly as well. Any idea is appreciated.
> > 
> > I think that the kernel is trying really hard to tell you that splitting
> > up the RCU grace-period kthreads in this manner is not such a good idea.
> 
> Yup, I guess that it would be better to stay with the approach taken
> in the previous RFC. I mean to start the grace period and handle
> the quiescent state in a single work. See
> https://lkml.org/lkml/2015/7/28/650  It basically keeps the
> functionality. The only difference is that we regularly leave
> the RCU-specific function, so it will be possible to patch it.
> 
> The RCU kthreads are very special because they basically ignore
> freezer and they never stop. They do not show well the advantage
> of any new API. I tried to convert them primary because they were
> so sensitive. I thought that it was good for testing limits
> of the API.

OK, I am all for stress-testing new APIs.  But I don't yet see a reason
to push this RCU kthread stress-test of your new API upstream.  ;-)

And yes, trying to freeze the RCU grace-period kthreads mid-way through
a grace period does not appear to me to be a strategy to win.  I suggest
trying that at the end of a grace period.  Even then, halting grace
periods at pretty much any time risks hanging the system.

> > So what are we really trying to accomplish here?  I am guessing something
> > like the following:
> > 
> > 1.	Get each grace-period kthread to a known safe state within a
> > 	short time of having requested a safe state.  If I recall
> > 	correctly, the point of this is to allow no-downtime kernel
> > 	patches to the functions executed by the grace-period kthreads.
> > 
> > 2.	At the same time, if someone suddenly needs a grace period
> > 	at some point in this process, the grace period kthreads are
> > 	going to have to wake back up and handle the grace period.
> > 	Or do you have some tricky way to guarantee that no one is
> > 	going to need a grace period beyond the time you freeze
> > 	the grace-period kthreads?
> > 
> > 3.	The boost kthreads should not be a big problem because failing
> > 	to boost simply lets the grace period run longer.
> > 
> > 4.	The callback-offload kthreads are likely to be a big problem,
> > 	because in systems configured with them, they need to be running
> > 	to invoke the callbacks, and if the callbacks are not invoked,
> > 	the grace period might just as well have failed to end.
> > 
> > 5.	The per-CPU kthreads are in the same boat as the callback-offload
> > 	kthreads.  One approach is to offline all the CPUs but one, and
> > 	that will park all but the last per-CPU kthread.  But handling
> > 	that last per-CPU kthread would likely be "good clean fun"...
> > 
> > 6.	Other requirements?
> > 
> > One approach would be to simply say that the top-level rcu_gp_kthread()
> > function cannot be patched, and arrange for the grace-period kthreads
> > to park at some point within this function.  Or is there some requirement
> > that I am missing?
> 
> I am a bit confused by the above paragraphs because they mix patching,
> stopping, and parking. Note that we do not need to stop any process
> when live patching.
> 
> I hope that it is more clear after my response in the other mail about
> freezing. Or maybe, I am missing something.
> 
> Anyway, thanks a lot for looking at the patches and feedback.

If the point of these patches was simply to test your API, and if you are
not looking to get them upstream, we are OK.  If you want them upstream,
you need to explain to me why the patches help something.  And also how
the patches avoid breaking things.  I am currently coming up empty-handed
on both counts:  I don't understand how these patches help, and I don't
see how they can possibly be safe.

							Thanx, Paul



* Re: [RFC v2 00/18] kthread: Use kthread worker API more widely
  2015-10-01 17:00     ` Paul E. McKenney
@ 2015-10-02 12:00       ` Petr Mladek
  2015-10-02 13:59         ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-10-02 12:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Thu 2015-10-01 10:00:53, Paul E. McKenney wrote:
> On Thu, Oct 01, 2015 at 05:59:43PM +0200, Petr Mladek wrote:
> > On Tue 2015-09-29 22:08:33, Paul E. McKenney wrote:
> > > On Mon, Sep 21, 2015 at 03:03:41PM +0200, Petr Mladek wrote:
> > > > My intention is to make it easier to manipulate kthreads. This RFC tries
> > > > to use the kthread worker API. It is based on comments from the
> > > > first attempt. See https://lkml.org/lkml/2015/7/28/648 and
> > > > the list of changes below.
> > > > 
> If the point of these patches was simply to test your API, and if you are
> not looking to get them upstream, we are OK.

I would like to eventually transform all kthreads into an API that
will better define the kthread workflow. It need not be this one,
though. I am still looking for a good API that will be acceptable[*]

One of the reasons that I played with RCU, khugepaged, and ring buffer
kthreads is that they are maintained by core developers. I hope that
it will help to reach a better consensus.


> If you want them upstream, you need to explain to me why the patches
> help something.

As I said, RCU kthreads do not show a big win because they ignore the
freezer, are not parked, never stop, and do not handle signals. But the
change will make it possible to live patch them because they regularly
leave the main function at a safe place.

The ring buffer benchmark is a much better example. The conversion
reduced the main function of the consumer kthread to two lines.
It removed some error-prone code that modified the task state,
called the scheduler, and handled kthread_should_stop(). IMHO,
the workflow is better and safer now.

I am going to prepare and send more examples where the change makes
the workflow easier.


> And also how the patches avoid breaking things.

I do my best to keep the original functionality. If we decide to use
the kthread worker API, my first attempt is much safer, see
https://lkml.org/lkml/2015/7/28/650. It basically replaces the
top-level for cycle with one self-queuing work; see the sketch below.
There are a few more instructions needed to go back into the cycle,
but they define a common safe point that is maintained in a single
location for all kthread workers.
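
To illustrate, here is a minimal sketch of the self-queuing pattern
(only queue_kthread_work() and the work callback prototype come from
the existing API; do_one_gp_step() is a made-up placeholder for one
iteration of the original cycle):

static struct kthread_worker *rcu_gp_worker;

static void rcu_gp_work_fn(struct kthread_work *work)
{
	/* one iteration of the original top-level for (;;) cycle */
	do_one_gp_step();

	/*
	 * Queue itself again. Between two invocations the kthread sits
	 * in kthread_worker_fn() with no locks held, which is the common
	 * safe point mentioned above.
	 */
	queue_kthread_work(rcu_gp_worker, work);
}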


[*] I have played with two APIs so far. They define a safe point
for freezing, parking, stopping, signal handling, and live patching.
Also, some non-trivial logic of the main cycle is maintained
in a single location.

Here are some details:

1. iterant API
--------------

  It allows defining three callbacks that are called the following
  way:

     init();
     while (!stop)
	func();
     destroy();

  See also https://lkml.org/lkml/2015/6/5/556.

  Advantages:
    + simple and clear workflow
    + simple use
    + simple conversion from the current kthreads API

  Disadvantages:
    + problematic solution of sleeping between events
    + completely new API


2. kthread worker API
---------------------

  It is similar to workqueues. The difference is that the works
  have a dedicated kthread, so we could better control the resources,
  e.g. priority, scheduling policy, ...

  Advantages:
    + already in use
    + design proven to work (workqueues)
    + natural way to wait for work in the common code (worker)
      using event-driven works and delayed works
    + easy to convert to/from workqueues API

  Disadvantages:
    + more code needed to define, initialize, and queue works
    + more complicated conversion from the current API
      if we want to do it in a clean way (event driven)
    + might need more synchronization in some cases[**]

  Questionable:
    + event driven vs. procedural programming style
    + allows a more fine-grained split of the functionality into
      separate units (works) that might be queued
      as needed


[**] wake_up() is a nop for an empty waitqueue. But queuing a work
     into a non-existing worker might cause a crash. Well, this is
     usually already synchronized.


Any thoughts or preferences are highly appreciated.

Best Regards,
Petr


* Re: [RFC v2 00/18] kthread: Use kthread worker API more widely
  2015-10-02 12:00       ` Petr Mladek
@ 2015-10-02 13:59         ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2015-10-02 13:59 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Tejun Heo, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Fri, Oct 02, 2015 at 02:00:14PM +0200, Petr Mladek wrote:
> On Thu 2015-10-01 10:00:53, Paul E. McKenney wrote:
> > On Thu, Oct 01, 2015 at 05:59:43PM +0200, Petr Mladek wrote:
> > > On Tue 2015-09-29 22:08:33, Paul E. McKenney wrote:
> > > > On Mon, Sep 21, 2015 at 03:03:41PM +0200, Petr Mladek wrote:
> > > > > My intention is to make it easier to manipulate kthreads. This RFC tries
> > > > > to use the kthread worker API. It is based on comments from the
> > > > > first attempt. See https://lkml.org/lkml/2015/7/28/648 and
> > > > > the list of changes below.
> > > > > 
> > If the point of these patches was simply to test your API, and if you are
> > not looking to get them upstream, we are OK.
> 
> I would like to eventually transform all kthreads into an API that
> will better define the kthread workflow. It need not be this one,
> though. I am still looking for a good API that will be acceptable[*]
> 
> One of the reason that I played with RCU, khugepaged, and ring buffer
> kthreads is that they are maintained by core developers. I hope that
> it will help to get better consensus.
> 
> 
> > If you want them upstream, you need to explain to me why the patches
> > help something.
> 
> As I said, RCU kthreads do not show a big win because they ignore
> freezer, are not parked, never stop, do not handle signals. But the
> change will allow to live patch them because they leave the main
> function on a safe place.
> 
> The ring buffer benchmark is much better example. It reduced
> the main function of the consumer kthread to two lines.
> It removed some error prone code that modified task state,
> called scheduler, handled kthread_should_stop. IMHO, the workflow
> is better and more safe now.
> 
> I am going to prepare and send more examples where the change makes
> the workflow easier.
> 
> 
> > And also how the patches avoid breaking things.
> 
> I do my best to keep the original functionality. If we decide to use
> the kthread worker API, my first attempt is much more safe, see
> https://lkml.org/lkml/2015/7/28/650. It basically replaces the
> top level for cycle with one self-queuing work. There are some more
> instructions to go back to the cycle but they define a common
> safe point that will be maintained on a single location for
> all kthread workers.
> 
> 
> [*] I have played with two APIs yet. They define a safe point
> for freezing, parking, stopping, signal handling, live patching.
> Also some non-trivial logic of the main cycle is maintained
> on a single location.
> 
> Here are some details:
> 
> 1. iterant API
> --------------
> 
>   It allows to define three callbacks that are called the following
>   way:
> 
>      init();
>      while (!stop)
> 	func();
>      destroy();
> 
>   See also https://lkml.org/lkml/2015/6/5/556.
> 
>   Advantages:
>     + simple and clear workflow
>     + simple use
>     + simple conversion from the current kthreads API
> 
>   Disadvantages:
>     + problematic solution of sleeping between events
>     + completely new API
> 
> 
> 2. kthread worker API
> ---------------------
> 
>   It is similar to workqueues. The difference is that the works
>   have a dedicated kthread, so we could better control the resources,
>   e.g. priority, scheduling policy, ...
> 
>   Advantages:
>     + already in use
>     + design proven to work (workqueues)
>     + nature way to wait for work in the common code (worker)
>       using event driven works and delayed works
>     + easy to convert to/from workqueues API
> 
>   Disadvantages:
>     + more code needed to define, initialize, and queue works
>     + more complicated conversion from the current API
>       if we want to make it a clean way (event driven)
>     + might need more synchronization in some cases[**]
> 
>   Questionable:
>     + event driven vs. procedural programming style
>     + allows more grained split of the functionality into
>       separate units (works) that might be queued
>       as needed
> 
> 
> [**] wake_up() is nope for empty waitqueue. But queuing a work
>      into non-existing worker might cause a crash. Well, this is
>      usually already synchronized.
> 
> 
> Any thoughts or preferences are highly appreciated.

For the RCU grace-period kthreads, I am not seeing the advantage of
either API over the current approach.

							Thanx, Paul



* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-09-28 17:03       ` Tejun Heo
@ 2015-10-02 15:43         ` Petr Mladek
  2015-10-02 19:24           ` Tejun Heo
  0 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-10-02 15:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Mon 2015-09-28 13:03:14, Tejun Heo wrote:
> Hello, Petr.
> 
> On Fri, Sep 25, 2015 at 01:26:17PM +0200, Petr Mladek wrote:
> > 1) PENDING state plus -EAGAIN/busy loop cycle
> > ---------------------------------------------
> > 
> > IMHO, we want to use the timer because it is an elegant solution.
> > Then we must release the lock when the timer is running. The lock
> > must be taken by the timer->function(). And there is a small window
> > when the timer is not longer pending but timer->function is not running:
> > 
> > CPU0                            CPU1
> > 
> > run_timer_softirq()
> >   __run_timers()
> >     detach_expired_timer()
> >       detach_timer()
> > 	#clear_pending
> > 
> > 				try_to_grab_pending_kthread_work()
> > 				  del_timer()
> > 				    # fails because not pending
> > 
> > 				  test_and_set_bit(KTHREAD_WORK_PENDING_BIT)
> > 				    # fails because already set
> > 
> > 				  if (!list_empty(&work->node))
> > 				    # fails because still not queued
> > 
> > 			!!! problematic window !!!
> > 
> >     call_timer_fn()
> >      queue_kthraed_work()
> 
> Let's say each work item has a state variable which is protected by a
> lock and the state can be one of IDLE, PENDING, CANCELING.  Let's also
> assume that all cancelers synchronize with each other via mutex, so we
> only have to worry about a single canceler.  Wouldn't something like
> the following work while being a lot simpler?
> 
> Delayed queueing and execution.
> 
> 1. Lock and check whether state is IDLE.  If not, nothing to do.
> 
> 2. Set state to PENDING and schedule the timer and unlock.
> 
> 3. On expiration, timer_fn grabs the lock and see whether state is
>    still PENDING.  If so, schedule the work item for execution;
>    otherwise, nothing to do.
> 
> 4. After dequeueing from execution queue with lock held, the worker is
>    marked as executing the work item and state is reset to IDLE.
> 
> Canceling
> 
> 1. Lock, dequeue and set the state to CANCELING.
> 
> 2. Unlock and perform del_timer_sync().
> 
> 3. Flush the work item.
> 
> 4. Lock and reset the state to IDLE and unlock.
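
If I read the steps above correctly, a minimal sketch could look like
the following. All names are made up, and the worker side (step 4) as
well as the timer setup are left out; it only illustrates the state
transitions:

enum { KWORK_IDLE, KWORK_PENDING, KWORK_CANCELING };

struct sketch_work {
	spinlock_t		lock;	/* protects state and node */
	int			state;
	struct timer_list	timer;
	struct list_head	node;	/* entry in the worker queue */
};

/* made-up helpers: insert into the worker queue / wait until executed */
static void sketch_insert_work(struct sketch_work *w);
static void sketch_flush_work(struct sketch_work *w);

static void sketch_queue_delayed(struct sketch_work *w, unsigned long delay)
{
	spin_lock_irq(&w->lock);
	if (w->state == KWORK_IDLE) {			/* 1. */
		w->state = KWORK_PENDING;		/* 2. */
		mod_timer(&w->timer, jiffies + delay);
	}
	spin_unlock_irq(&w->lock);
}

static void sketch_timer_fn(unsigned long data)		/* 3. */
{
	struct sketch_work *w = (struct sketch_work *)data;

	spin_lock(&w->lock);
	if (w->state == KWORK_PENDING)
		sketch_insert_work(w);
	spin_unlock(&w->lock);
}

static void sketch_cancel_sync(struct sketch_work *w)
{
	spin_lock_irq(&w->lock);
	list_del_init(&w->node);			/* canceling 1. */
	w->state = KWORK_CANCELING;
	spin_unlock_irq(&w->lock);

	del_timer_sync(&w->timer);			/* canceling 2. */
	sketch_flush_work(w);				/* canceling 3. */

	spin_lock_irq(&w->lock);
	w->state = KWORK_IDLE;				/* canceling 4. */
	spin_unlock_irq(&w->lock);
}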
> 
> 
> > 2) CANCEL state plus custom waitqueue
> > -------------------------------------
> > 
> > cancel_kthread_work_sync() has to wait for the running work. It might take
> > quite some time. Therefore we could not block others by a spinlock.
> > Also others could not wait for the spin lock in a busy wait.
> 
> Hmmm?  Cancelers can synchronize amongst them using a mutex and the
> actual work item wait can use flushing.
> 
> > IMHO, the proposed and rather complex solutions are needed in both cases.
> > 
> > Or did I miss a possible trick, please?
> 
> I probably have missed something in the above and it is not completley
> correct but I do think it can be way simpler than how workqueue does
> it.

I have played with this idea, but it opens a can of worms with locking
problems and looks even more complicated.

Let me show this with a snippet of code:

struct kthread_worker {
	spinlock_t		lock;
	struct list_head	work_list;
	struct kthread_work	*current_work;
};

enum {
	KTHREAD_WORK_IDLE,
	KTHREAD_WORK_PENDING,
	KTHREAD_WORK_CANCELING,
};

struct kthread_work {
	unsigned int		flags;
	spinlock_t		lock;
	struct list_head	node;
	kthread_work_func_t	func;
	struct kthread_worker	*worker;
};


/* the main kthread worker cycle */
int kthread_worker_fn(void *worker_ptr)
{
	struct kthread_worker *worker = worker_ptr;
	struct kthread_work *work;

repeat:
	/* do not lose a wakeup between the queue check and schedule() */
	set_current_state(TASK_INTERRUPTIBLE);

	work = NULL;
	spin_lock_irq(&worker->lock);
	if (!list_empty(&worker->work_list)) {
		work = list_first_entry(&worker->work_list,
					struct kthread_work, node);
		spin_lock(&work->lock);
		list_del_init(&work->node);
		work->flags = KTHREAD_WORK_IDLE;
		spin_unlock(&work->lock);
	}
	worker->current_work = work;
	spin_unlock_irq(&worker->lock);

	if (work) {
		__set_current_state(TASK_RUNNING);
		work->func(work);
	} else if (!freezing(current))
		schedule();

	goto repeat;
}
EXPORT_SYMBOL_GPL(kthread_worker_fn);


static void __queue_kthread_work(struct kthread_worker *worker,
			  struct kthread_work *work)
{
	list_add_tail(&work->node, &worker->work_list);
	work->worker = worker;
}


bool queue_kthread_work(struct kthread_worker *worker,
			struct kthread_work *work)
{
	bool ret = false;
	unsigned long flags;

	/*
	 * Lock worker first to avoid ABBA deadlock.
	 * What if the work is already queued to another worker?
	 */
	spin_lock_irqsave(&worker->lock, flags);
	spin_lock(&work->lock);

	if (work->flags != KTHREAD_WORK_IDLE)
		goto unlock_out;

	__queue_kthread_work(worker, work);
	ret = true;

unlock_out:
	spin_unlock(&work->lock);
	spin_unlock_irqrestore(&worker->lock, flags);
	return ret;
}

bool cancel_kthread_work_sync(struct kthread_work *work)
{
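	/*
	 * Note: the declarations of the shared cancel_waitq waitqueue and
	 * of the cancel wait entry cwait are omitted from this snippet.
	 */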
	struct kthread_worker *worker;
	bool flush = true;
	unsigned long flags;

again:
	worker = ACCESS_ONCE(work->worker);

	if (worker)
		spin_lock_irqsave(&worker->lock, flags);
	else
		local_irq_save(flags);

	spin_lock(&work->lock);

	if (worker && worker != work->worker) {
		spin_unlock(&work->lock);
		spin_unlock_irqrestore(&worker->lock, flags);
		goto again;
	}

	switch (work->flags) {
	case KTHREAD_WORK_PENDING:
		list_del_init(&work->node);
	case KTHREAD_WORK_IDLE:
		work->flags = KTHREAD_WORK_CANCELING;
		break;
	case KTHREAD_WORK_CANCELING:
		/*
		 * Some other work is canceling. Let's wait in a
		 * custom waitqueue until the work is flushed.
		 */
		prepare_to_wait_exclusive(&cancel_waitq, &cwait.wait,
						  TASK_UNINTERRUPTIBLE);
		flush = false;
		break;
	}

	spin_unlock(&work->lock);
	if (worker)
		spin_unlock_irqrestore(&worker->lock, flags);
	else
		local_irq_restore(flags);

	if (flush) {
		flush_kthread_work(work);
		/*
		 * We are the cancel leader. Nobody else could manipulate the
		 * work.
		 */
		work->flags = KTHREAD_WORK_IDLE;
		/*
		 * Paired with prepare_to_wait() above so that either
		 * waitqueue_active() is visible here or CANCELING bit is
		 * visible there.
		 */
		smp_mb();
		if (waitqueue_active(&cancel_waitq))
			__wake_up(&cancel_waitq, TASK_NORMAL, 1, work);
	} else if (READ_ONCE(work->flags) == KTHREAD_WORK_CANCELING) {
		schedule();
	}

	return flush;
}


IMHO, we need both locks. The worker manipulates more works and
needs its own lock. We need a work-specific lock because the work
might be assigned to different workers and we need to be sure
that the operations are really serialized, e.g. queuing.

Now, worker_fn() needs to get the first work from the worker list
and then manipulate the work => it needs to take the worker->lock
and then work->lock.

It means that most other functions would need to take both locks
in this order to avoid an ABBA deadlock. This causes quite some
complications, e.g. in cancel_work().

There might be even more problems if we try to queue the same work
into more workers and we start mixing the locks from different
workers.

I do not know. It is possible that you prefer this solution to
the atomic bit operations. Also, it is possible that there is
a trick that would make it easier.

I would personally prefer to reuse the tricks from the workqueues.
They are proven to work and they make the code faster. I think
that we might need to queue works from a sensitive fast path.

IMHO, the code is already much simpler than the workqueues because
there are no pools of kthreads behind each worker.

Best Regards,
Petr


* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-10-02 15:43         ` Petr Mladek
@ 2015-10-02 19:24           ` Tejun Heo
  2015-10-05 10:07             ` Petr Mladek
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2015-10-02 19:24 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

Hello,

On Fri, Oct 02, 2015 at 05:43:36PM +0200, Petr Mladek wrote:
> IMHO, we need both locks. The worker manipulates more works and
> need its own lock. We need work-specific lock because the work
> might be assigned to different workers and we need to be sure
> that the operations are really serialized, e.g. queuing.

I don't think we need per-work lock.  Do we have such usage in kernel
at all?  If you're worried, let the first queueing record the worker
and trigger warning if someone tries to queue it anywhere else.  This
doesn't need to be full-on general like workqueue.  Let's make
reasonable trade-offs where possible.
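
Something along these lines, for example (a rough, untested sketch;
queuing_blocked() is a made-up name and it assumes the work carries
a "worker" pointer as in this patch set):

/* called under worker->lock before inserting the work */
static bool queuing_blocked(struct kthread_worker *worker,
			    struct kthread_work *work)
{
	lockdep_assert_held(&worker->lock);

	/* already queued (possibly on this worker)? */
	if (!list_empty(&work->node))
		return true;

	/* the first queueing records the worker; warn on anything else */
	if (WARN_ON_ONCE(work->worker && work->worker != worker))
		return true;

	work->worker = worker;
	return false;
}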

Thanks.

-- 
tejun


* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-10-02 19:24           ` Tejun Heo
@ 2015-10-05 10:07             ` Petr Mladek
  2015-10-05 11:09               ` Petr Mladek
  0 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-10-05 10:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Fri 2015-10-02 15:24:53, Tejun Heo wrote:
> Hello,
> 
> On Fri, Oct 02, 2015 at 05:43:36PM +0200, Petr Mladek wrote:
> > IMHO, we need both locks. The worker manipulates more works and
> > need its own lock. We need work-specific lock because the work
> > might be assigned to different workers and we need to be sure
> > that the operations are really serialized, e.g. queuing.
> 
> I don't think we need per-work lock.  Do we have such usage in kernel
> at all?  If you're worried, let the first queueing record the worker
> and trigger warning if someone tries to queue it anywhere else.  This
> doesn't need to be full-on general like workqueue.  Let's make
> reasonable trade-offs where possible.

I actually thought about this simplification as well. But then I have
doubts about the API. It would make sense to assign the worker
when the work is being initialized and avoid the duplicate information
when the work is being queued:

	init_kthread_work(work, fn, worker);
	queue_kthread_work(work);

Or would you prefer to keep the API similar to workqueues even when
it makes less sense here?


In either case, we need a way to switch the worker if the old one
is destroyed and a new one is started later. We would need
something like:

	reset_work(work, worker)
or
	reinit_work(work, fn, worker)


Thanks for feedback.

Best Regards,
Petr


* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-10-05 10:07             ` Petr Mladek
@ 2015-10-05 11:09               ` Petr Mladek
  2015-10-07  9:21                 ` Petr Mladek
  0 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-10-05 11:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Mon 2015-10-05 12:07:58, Petr Mladek wrote:
> On Fri 2015-10-02 15:24:53, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Oct 02, 2015 at 05:43:36PM +0200, Petr Mladek wrote:
> > > IMHO, we need both locks. The worker manipulates more works and
> > > need its own lock. We need work-specific lock because the work
> > > might be assigned to different workers and we need to be sure
> > > that the operations are really serialized, e.g. queuing.
> > 
> > I don't think we need per-work lock.  Do we have such usage in kernel
> > at all?  If you're worried, let the first queueing record the worker
> > and trigger warning if someone tries to queue it anywhere else.  This
> > doesn't need to be full-on general like workqueue.  Let's make
> > reasonable trade-offs where possible.
> 
> I actually thought about this simplification as well. But then I am
> in doubts about the API. It would make sense to assign the worker
> when the work is being initialized and avoid the duplicate information
> when the work is being queued:
> 
> 	init_kthread_work(work, fn, worker);
> 	queue_work(work);
> 
> Or would you prefer to keep the API similar to workqueues even when
> it makes less sense here?
> 
> 
> In each case, we need a way to switch the worker if the old one
> is destroyed and a new one is started later. We would need
> something like:
> 
> 	reset_work(work, worker)
> or
> 	reinit_work(work, fn, worker)

I was too fast. We could set "work->worker = NULL" when the work
finishes and is not pending. It means that the work will be connected
to a particular worker only while it is used. Then we could keep the
workqueue-like API and would not need reset_work().

I am going to play with this. I feel that it might work.

Best Regards,
Petr


* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-10-05 11:09               ` Petr Mladek
@ 2015-10-07  9:21                 ` Petr Mladek
  2015-10-07 14:24                   ` Tejun Heo
  0 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-10-07  9:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Mon 2015-10-05 13:09:24, Petr Mladek wrote:
> On Mon 2015-10-05 12:07:58, Petr Mladek wrote:
> > On Fri 2015-10-02 15:24:53, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Fri, Oct 02, 2015 at 05:43:36PM +0200, Petr Mladek wrote:
> > > > IMHO, we need both locks. The worker manipulates more works and
> > > > need its own lock. We need work-specific lock because the work
> > > > might be assigned to different workers and we need to be sure
> > > > that the operations are really serialized, e.g. queuing.
> > > 
> > > I don't think we need per-work lock.  Do we have such usage in kernel
> > > at all?  If you're worried, let the first queueing record the worker
> > > and trigger warning if someone tries to queue it anywhere else.  This
> > > doesn't need to be full-on general like workqueue.  Let's make
> > > reasonable trade-offs where possible.
> > 
> > I actually thought about this simplification as well. But then I am
> > in doubts about the API. It would make sense to assign the worker
> > when the work is being initialized and avoid the duplicate information
> > when the work is being queued:
> > 
> > 	init_kthread_work(work, fn, worker);
> > 	queue_work(work);
> > 
> > Or would you prefer to keep the API similar to workqueues even when
> > it makes less sense here?
> > 
> > 
> > In each case, we need a way to switch the worker if the old one
> > is destroyed and a new one is started later. We would need
> > something like:
> > 
> > 	reset_work(work, worker)
> > or
> > 	reinit_work(work, fn, worker)
> 
> I was too fast. We could set "work->worker = NULL" when the work
> finishes and it is not pending. It means that it will be connected
> to the particular worker only when used. Then we could keep the
> workqueues-like API and do not need reset_work().

I have played with this idea and the result is not satisfactory.
I am not able to make the code easier using a single lock.

First, the worker lock is not enough to safely queue the work
without a test_and_set() atomic operation. Let me show this with
some pseudo code:

bool queue_kthread_work(worker, work)
{
	bool ret = false;

	lock(&worker->lock);

	if (test_bit(WORK_PENDING, work->flags))
		goto out;

	if (WARN(work->worker != worker,
		 "Work could not be used by two workers at the same time\n"))
		goto out;

	set_bit(WORK_PENDING, work->flags);
	work->worker = worker;
	insert_work(worker->work_list, work);
	ret = true;

out:
	unlock(&worker->lock);
	return ret;
}

Now, let's have one work: W, two workers: A, B, and try to queue
the same work to the two workers at the same time:

CPU0					CPU1

queue_kthread_work(A, W);		queue_kthread_work(B, W);
  lock(&A->lock);			lock(&B->lock);
  test_bit(WORK_PENDING, W->flags)      test_bit(WORK_PENDING, W->flags)
    # false				  # false
  WARN(W->worker != A);			WARN(W->worker != B);
    # false				  # false

  set_bit(WORK_PENDING, W->flags);	set_bit(WORK_PENDING, W->flags);
  W->worker = A;			W->worker = B;
  insert_work(A->work_list, W);		insert_work(B->work_list, W);

  unlock(&A->lock);			unlock(&B->lock);

=> This is possible, and the result is unpredictable.

We would need to set either the WORK_PENDING flag or work->worker
using an atomic test_and_set operation and bail out if it fails;
see the sketch below. But then we are back at the original code.
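
For completeness, this is roughly what I mean by the atomic variant
(a simplified sketch; KTHREAD_WORK_PENDING_BIT is the pending bit from
this patch set, work->flags is assumed to be an unsigned long bitmask,
and the wakeup details are approximate):

bool queue_kthread_work(struct kthread_worker *worker,
			struct kthread_work *work)
{
	unsigned long flags;

	/* the atomic bit, not the worker lock, decides who queues it */
	if (test_and_set_bit(KTHREAD_WORK_PENDING_BIT, &work->flags))
		return false;

	spin_lock_irqsave(&worker->lock, flags);
	work->worker = worker;
	list_add_tail(&work->node, &worker->work_list);
	wake_up_process(worker->task);
	spin_unlock_irqrestore(&worker->lock, flags);

	return true;
}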


Second, we still need to busy wait for a pending timer callback.
Yes, we could set some flag so that the callback does not queue
the work, but cancel_kthread_work_sync() still has to wait.
It cannot return while there is still some pending operation
on the struct kthread_work. Otherwise, the work could never
be freed in a safe way.

Also note that we still need the WORK_PENDING flag. Otherwise, we
would not be able to detect the race when the timer has been removed
but the callback has not run yet.


Let me repeat that using both a per-work and a per-worker lock is not
an option either. We would need some crazy hacks to avoid ABBA deadlocks.


All in all, I would prefer to keep the original approach that is
heavily inspired by the workqueues. I think that it is actually
an advantage to reuse a working concept rather than reinvent the wheel.


Best Regards,
Petr


* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-10-07  9:21                 ` Petr Mladek
@ 2015-10-07 14:24                   ` Tejun Heo
  2015-10-14 10:20                     ` Petr Mladek
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2015-10-07 14:24 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

Hello, Petr.

On Wed, Oct 07, 2015 at 11:21:30AM +0200, Petr Mladek wrote:
> Now, let's have one work: W, two workers: A, B, and try to queue
> the same work to the two workers at the same time:

It's a debug WARN condition to catch silly mistakes.  It can have
minor race conditions.

...
> Second, we still need the busy waiting for the pending timer callback.

Isn't that del_timer_sync()?

> Yes, we could set some flag so that the call back does not queue
> the work. But cancel_kthread_work_sync() still has to wait.
> It could not return if there is still some pending operation
> with the struct kthread_work. Otherwise, it never could
> be freed a safe way.
> 
> Also note that we still need the WORK_PENDING flag. Otherwise, we
> would not be able to detect the race when timer is removed but
> the callback has not run yet.

Yeah, just use a state field as I wrote before.

> Let me to repeat that using per-work and per-worker lock is not an
> option either. We would need some crazy hacks to avoid ABBA deadlocks.
> 
> 
> All in all, I would prefer to keep the original approach that is
> heavily inspired by the workqueues. I think that it is actually
> an advantage to reuse some working concept that reinventing wheels.

At each turn, you come up with non-issues and declare that it needs to
be full workqueue-like implementation but the issues you're raising
seem all rather irrelevant.  Can you please try to take a step back
and put some distance from the implementation details of workqueue?

Thanks.

-- 
tejun


* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-10-07 14:24                   ` Tejun Heo
@ 2015-10-14 10:20                     ` Petr Mladek
  2015-10-14 17:30                       ` Tejun Heo
  0 siblings, 1 reply; 44+ messages in thread
From: Petr Mladek @ 2015-10-14 10:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

On Wed 2015-10-07 07:24:46, Tejun Heo wrote:
>  At each turn, you come up with non-issues and declare that it needs to
> be full workqueue-like implementation but the issues you're raising
> seem all rather irrelevant.  Can you please try to take a step back
> and put some distance from the implementation details of workqueue?

JFYI, I took a step back and am trying to convert more kthreads to
the kthread worker API. It helps me to get a better insight into
the problem space.

I am still not sure where you see the difference between
workqueues and the kthread worker API. My view is that
the main differences are:

Workqueues			Kthread worker

  + pool of kthreads		  + dedicated kthread

  + kthreads created and	  + kthread created and
    destroyed on demand		    destroyed with the worker

  + can process more works	  + one work is processed at a time
    in parallel from one queue

Otherwise, a similar basic set of operations would be useful:

  + create_worker
  + queue_work, queue_delayed_work
  + mod_delayed_work
  + cancel_work, cancel_delayed_work
  + flush_work
  + flush_worker
  + drain_worker
  + destroy_worker

, where the queue, mod, and cancel operations should also work from IRQ
context; see the sketch below.
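
For illustration, a typical user of these operations might look like
the following sketch. The names follow the list above; the exact
create_kthread_worker() signature and the error handling are my
approximation, not necessarily the final API:

static struct kthread_worker *my_worker;
static struct kthread_work my_work;

static void my_work_fn(struct kthread_work *work)
{
	/* process one event */
}

static int my_start(void)
{
	my_worker = create_kthread_worker("my_worker");
	if (IS_ERR(my_worker))
		return PTR_ERR(my_worker);

	init_kthread_work(&my_work, my_work_fn);
	return 0;
}

/* typically called from IRQ or another event context */
static void my_handle_event(void)
{
	queue_kthread_work(my_worker, &my_work);
}

static void my_stop(void)
{
	cancel_kthread_work_sync(&my_work);
	destroy_kthread_worker(my_worker);
}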

There are a few potentially complicated and sensitive users of the
kthread worker API, e.g. handling NFS callbacks, some kthreads
used for handling network packets, and possibly the RCU stuff.
Here the operations need to be safe and rather fast.

IMHO, it would be great if it were easy to convert between the
kthread worker and workqueue APIs. It would allow choosing
the most effective variant for a given purpose. This is
sometimes hard to tell without real-life testing.

I wonder if I am missing some important angle of view.


In any case, it is still not clear if the API will be acceptable
to the affected parties. Therefore I do not want to spend too
much time on perfecting the API implementation at this
point. Is that OK, please?

Thanks for feedback.

Best Regards,
Petr


PS: I am not convinced that all my concerns were non-issues.
For example, I agree that a race when queuing the same work
to more kthread workers might look theoretical. On the other
hand, the API allows it and such a bug might be hard to debug. IMHO,
it might be an acceptable trade-off if the implementation is
much easier and safer in other areas. But my draft
implementation did not suggest this.

For example, there were several situations where I needed to double
check that the work was still connected with the locked worker
after taking the lock. I know that this will not happen when
the API is used in a reasonable way but...

Ah, I am back in the details. I have to stop for now ;-)


* Re: [RFC v2 07/18] kthread: Allow to cancel kthread work
  2015-10-14 10:20                     ` Petr Mladek
@ 2015-10-14 17:30                       ` Tejun Heo
  0 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2015-10-14 17:30 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Linus Torvalds, Jiri Kosina, Borislav Petkov, Michal Hocko,
	linux-mm, Vlastimil Babka, live-patching, linux-api,
	linux-kernel

Hello,

On Wed, Oct 14, 2015 at 12:20:22PM +0200, Petr Mladek wrote:
> IMHO, it would be great if it is easy to convert between the
> kthread worker and workqueues API. It will allow to choose

Sure, keep the APIs similar so that they can be easily converted back
and forth but that doesn't mean kthread_worker should be as complex as
workqueue.  Workqueue is *really* complex partly for historical
reasons and partly because it has to serve all corner cases.  Please
make something simple which is similar enough to enable easy migration.
That amount of complexity simply isn't necessary for kthread_worker.

...
> PS: I am not convinced that all my concerns were non-issues.
> For example, I agree that a race when queuing the same work
> to more kthread workers might look theoretical. On the other
> hand, the API allows it and it might be hard to debug. IMHO,

There are big differences in terms of complexity between ensuring that
something like the above works correctly under all circumstances and
implementing a warning trap which would trigger well enough to warn
against unsupported usages.  These are active trade-offs to make and
not particularly hard ones either.  Let's please keep kthread_worker
simple.

Thanks.

-- 
tejun



Thread overview: 44+ messages
2015-09-21 13:03 [RFC v2 00/18] kthread: Use kthread worker API more widely Petr Mladek
2015-09-21 13:03 ` [RFC v2 01/18] kthread: Allow to call __kthread_create_on_node() with va_list args Petr Mladek
2015-09-21 13:03 ` [RFC v2 02/18] kthread: Add create_kthread_worker*() Petr Mladek
2015-09-22 18:20   ` Tejun Heo
2015-09-21 13:03 ` [RFC v2 03/18] kthread: Add drain_kthread_worker() Petr Mladek
2015-09-22 18:26   ` Tejun Heo
2015-09-21 13:03 ` [RFC v2 04/18] kthread: Add destroy_kthread_worker() Petr Mladek
2015-09-22 18:30   ` Tejun Heo
2015-09-21 13:03 ` [RFC v2 05/18] kthread: Add pending flag to kthread work Petr Mladek
2015-09-21 13:03 ` [RFC v2 06/18] kthread: Initial support for delayed " Petr Mladek
2015-09-21 13:03 ` [RFC v2 07/18] kthread: Allow to cancel " Petr Mladek
2015-09-22 19:35   ` Tejun Heo
2015-09-25 11:26     ` Petr Mladek
2015-09-28 17:03       ` Tejun Heo
2015-10-02 15:43         ` Petr Mladek
2015-10-02 19:24           ` Tejun Heo
2015-10-05 10:07             ` Petr Mladek
2015-10-05 11:09               ` Petr Mladek
2015-10-07  9:21                 ` Petr Mladek
2015-10-07 14:24                   ` Tejun Heo
2015-10-14 10:20                     ` Petr Mladek
2015-10-14 17:30                       ` Tejun Heo
2015-09-21 13:03 ` [RFC v2 08/18] kthread: Allow to modify delayed " Petr Mladek
2015-09-21 13:03 ` [RFC v2 09/18] mm/huge_page: Convert khugepaged() into kthread worker API Petr Mladek
2015-09-22 20:26   ` Tejun Heo
2015-09-23  9:50     ` Petr Mladek
2015-09-21 13:03 ` [RFC v2 10/18] ring_buffer: Do no not complete benchmark reader too early Petr Mladek
2015-09-21 13:03 ` [RFC v2 11/18] ring_buffer: Fix more races when terminating the producer in the benchmark Petr Mladek
2015-09-21 13:03 ` [RFC v2 12/18] ring_buffer: Convert benchmark kthreads into kthread worker API Petr Mladek
2015-09-21 13:03 ` [RFC v2 13/18] rcu: Finish folding ->fqs_state into ->gp_state Petr Mladek
2015-09-21 13:03 ` [RFC v2 14/18] rcu: Store first_gp_fqs into struct rcu_state Petr Mladek
2015-09-21 13:03 ` [RFC v2 15/18] rcu: Clean up timeouts for forcing the quiescent state Petr Mladek
2015-09-21 13:03 ` [RFC v2 16/18] rcu: Check actual RCU_GP_FLAG_FQS when handling " Petr Mladek
2015-09-21 13:03 ` [RFC v2 17/18] rcu: Convert RCU gp kthreads into kthread worker API Petr Mladek
2015-09-28 17:14   ` Paul E. McKenney
2015-10-01 15:43     ` Petr Mladek
2015-10-01 16:33       ` Paul E. McKenney
2015-09-21 13:03 ` [RFC v2 18/18] kthread: Better support freezable kthread workers Petr Mladek
2015-09-22 20:32 ` [RFC v2 00/18] kthread: Use kthread worker API more widely Tejun Heo
2015-09-30  5:08 ` Paul E. McKenney
2015-10-01 15:59   ` Petr Mladek
2015-10-01 17:00     ` Paul E. McKenney
2015-10-02 12:00       ` Petr Mladek
2015-10-02 13:59         ` Paul E. McKenney
