io-uring.vger.kernel.org archive mirror
* [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption
@ 2020-08-19 12:37 Sebastian Andrzej Siewior
  2020-08-19 13:15 ` peterz
  0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-08-19 12:37 UTC (permalink / raw)
  To: linux-kernel, io-uring
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Jens Axboe, Thomas Gleixner

During a context switch the scheduler invokes wq_worker_sleeping() with
disabled preemption. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of a possible wake up (hence
the preempt_enable_no_resched() afterwards).
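
For reference, this is the pattern being described, taken (slightly trimmed)
from the sched_submit_work() hunk further down:

	preempt_disable();
	if (tsk->flags & PF_WQ_WORKER)
		wq_worker_sleeping(tsk);
	else
		io_wq_worker_sleeping(tsk);
	preempt_enable_no_resched();	/* no reschedule here, schedule() follows anyway */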

The io-wq has been added to the mix in the same section with disabled
preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also, within schedule() the spinlock_t must be
acquired after the tsk_is_pi_blocked() check, otherwise the task will block
on the sleeping lock again while scheduling out.

While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.

I don't see a significant reason why this lock should become a
raw_spinlock_t, therefore I suggest moving it after the
tsk_is_pi_blocked() check.
The io_worker::flags are usually modified under the lock except in the
scheduler path. Ideally the lock is always acquired, since the
IO_WORKER_F_UP flag is set early during startup and IO_WORKER_F_RUNNING
should be set unless the task loops within schedule(). I *think* ::flags
requires the same protection as workqueue's ::sleeping and therefore I
moved the checks into the locked section.

Any feedback on this vs raw_spinlock_t?

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/io-wq.c          |  8 ++++----
 kernel/sched/core.c | 10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index e92c4724480ca..a7e07b3ac5b95 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -623,15 +623,15 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 	struct io_worker *worker = kthread_data(tsk);
 	struct io_wqe *wqe = worker->wqe;
 
+	spin_lock_irq(&wqe->lock);
 	if (!(worker->flags & IO_WORKER_F_UP))
-		return;
+		goto out;
 	if (!(worker->flags & IO_WORKER_F_RUNNING))
-		return;
+		goto out;
 
 	worker->flags &= ~IO_WORKER_F_RUNNING;
-
-	spin_lock_irq(&wqe->lock);
 	io_wqe_dec_running(wqe, worker);
+out:
 	spin_unlock_irq(&wqe->lock);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bbb60b97c73c..b76c0f27bd95e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4694,18 +4694,18 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * in the possible wakeup of a kworker and because wq_worker_sleeping()
 	 * requires it.
 	 */
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (tsk->flags & PF_WQ_WORKER) {
 		preempt_disable();
-		if (tsk->flags & PF_WQ_WORKER)
-			wq_worker_sleeping(tsk);
-		else
-			io_wq_worker_sleeping(tsk);
+		wq_worker_sleeping(tsk);
 		preempt_enable_no_resched();
 	}
 
 	if (tsk_is_pi_blocked(tsk))
 		return;
 
+	if (tsk->flags & PF_IO_WORKER)
+		io_wq_worker_sleeping(tsk);
+
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
-- 
2.28.0



* Re: [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption
  2020-08-19 12:37 [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption Sebastian Andrzej Siewior
@ 2020-08-19 13:15 ` peterz
  2020-08-19 13:18   ` Jens Axboe
  2020-08-19 13:33   ` [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption Sebastian Andrzej Siewior
  0 siblings, 2 replies; 12+ messages in thread
From: peterz @ 2020-08-19 13:15 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, io-uring, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Jens Axboe, Thomas Gleixner

On Wed, Aug 19, 2020 at 02:37:58PM +0200, Sebastian Andrzej Siewior wrote:

> I don't see a significant reason why this lock should become a
> raw_spinlock_t therefore I suggest to move it after the
> tsk_is_pi_blocked() check.

> Any feedback on this vs raw_spinlock_t?
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  fs/io-wq.c          |  8 ++++----
>  kernel/sched/core.c | 10 +++++-----
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3bbb60b97c73c..b76c0f27bd95e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4694,18 +4694,18 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  	 * in the possible wakeup of a kworker and because wq_worker_sleeping()
>  	 * requires it.
>  	 */
> -	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> +	if (tsk->flags & PF_WQ_WORKER) {
>  		preempt_disable();
> -		if (tsk->flags & PF_WQ_WORKER)
> -			wq_worker_sleeping(tsk);
> -		else
> -			io_wq_worker_sleeping(tsk);
> +		wq_worker_sleeping(tsk);
>  		preempt_enable_no_resched();
>  	}
>  
>  	if (tsk_is_pi_blocked(tsk))
>  		return;
>  
> +	if (tsk->flags & PF_IO_WORKER)
> +		io_wq_worker_sleeping(tsk);
> +

Urgh, so this adds a branch in what is normally considered a fairly hot
path.

I'm thinking that the raw_spinlock_t option would permit leaving that
single:

	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER))

branch intact?


* Re: [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption
  2020-08-19 13:15 ` peterz
@ 2020-08-19 13:18   ` Jens Axboe
  2020-08-19 19:44     ` [PATCH] io_wq: Make io_wqe::lock a raw_spinlock_t Sebastian Andrzej Siewior
  2020-08-19 13:33   ` [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption Sebastian Andrzej Siewior
  1 sibling, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2020-08-19 13:18 UTC (permalink / raw)
  To: peterz, Sebastian Andrzej Siewior
  Cc: linux-kernel, io-uring, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Thomas Gleixner

On 8/19/20 6:15 AM, peterz@infradead.org wrote:
> On Wed, Aug 19, 2020 at 02:37:58PM +0200, Sebastian Andrzej Siewior wrote:
> 
>> I don't see a significant reason why this lock should become a
>> raw_spinlock_t therefore I suggest to move it after the
>> tsk_is_pi_blocked() check.
> 
>> Any feedback on this vs raw_spinlock_t?
>>
>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> ---
>>  fs/io-wq.c          |  8 ++++----
>>  kernel/sched/core.c | 10 +++++-----
>>  2 files changed, 9 insertions(+), 9 deletions(-)
>>
> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 3bbb60b97c73c..b76c0f27bd95e 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4694,18 +4694,18 @@ static inline void sched_submit_work(struct task_struct *tsk)
>>  	 * in the possible wakeup of a kworker and because wq_worker_sleeping()
>>  	 * requires it.
>>  	 */
>> -	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
>> +	if (tsk->flags & PF_WQ_WORKER) {
>>  		preempt_disable();
>> -		if (tsk->flags & PF_WQ_WORKER)
>> -			wq_worker_sleeping(tsk);
>> -		else
>> -			io_wq_worker_sleeping(tsk);
>> +		wq_worker_sleeping(tsk);
>>  		preempt_enable_no_resched();
>>  	}
>>  
>>  	if (tsk_is_pi_blocked(tsk))
>>  		return;
>>  
>> +	if (tsk->flags & PF_IO_WORKER)
>> +		io_wq_worker_sleeping(tsk);
>> +
> 
> Urgh, so this adds a branch in what is normally considered a fairly hot
> path.
>
> 
> I'm thinking that the raw_spinlock_t option would permit leaving that
> single:
> 
> 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER))
> 
> branch intact?

Yes, the raw spinlock would do it, and leave the single branch intact
in the hot path. I'd be fine with going that route for io-wq.


-- 
Jens Axboe



* Re: [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption
  2020-08-19 13:15 ` peterz
  2020-08-19 13:18   ` Jens Axboe
@ 2020-08-19 13:33   ` Sebastian Andrzej Siewior
  2020-08-19 14:21     ` peterz
  1 sibling, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-08-19 13:33 UTC (permalink / raw)
  To: peterz
  Cc: linux-kernel, io-uring, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Jens Axboe, Thomas Gleixner

On 2020-08-19 15:15:07 [+0200], peterz@infradead.org wrote:

> > -	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> > +	if (tsk->flags & PF_WQ_WORKER) {
> >  		preempt_disable();
> > -		if (tsk->flags & PF_WQ_WORKER)
> > -			wq_worker_sleeping(tsk);
> > -		else
> > -			io_wq_worker_sleeping(tsk);
> > +		wq_worker_sleeping(tsk);
> >  		preempt_enable_no_resched();
> >  	}
> >  
> >  	if (tsk_is_pi_blocked(tsk))
> >  		return;
> >  
> > +	if (tsk->flags & PF_IO_WORKER)
> > +		io_wq_worker_sleeping(tsk);
> > +
> 
> Urgh, so this adds a branch in what is normally considered a fairly hot
> path.
> 
> I'm thinking that the raw_spinlock_t option would permit leaving that
> single:
> 
> 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER))
> 
> branch intact?

The compiler generates code which tests both flags at once. If neither of
the two flags is set then there is a single branch (get out and fall
through to the tsk_is_pi… check).
And yes, with raw_spinlock_t we could keep that one branch.

If you want to optimize further, we could move PF_IO_WORKER to a lower
bit. x86 can test for both via
(gcc-10)
|         testl   $536870944, 44(%rbp)    #, _11->flags
|         jne     .L1635  #,

(clang-9)
|         testl   $536870944, 44(%rbx)    # imm = 0x20000020
|         je      .LBB112_6


but ARM can't and does
|          ldr     r1, [r5, #16]   @ tsk_3->flags, tsk_3->flags
|         mov     r2, #32 @ tmp157,
|         movt    r2, 8192        @ tmp157,
|         tst     r2, r1  @ tmp157, tsk_3->flags
|         beq     .L998           @,

same ARM64
|         ldr     w0, [x20, 60]   //, _11->flags
|         and     w0, w0, 1073741792      // tmp117, _11->flags,
|         and     w0, w0, -536870849      // tmp117, tmp117,
|         cbnz    w0, .L453       // tmp117,

using 0x10 for PF_IO_WORKER instead will turn this into:
|         ldr     w0, [x20, 60]   //, _11->flags
|         tst     w0, 48  // _11->flags,
|         bne     .L453           //,

ARM:
|         ldr     r2, [r5, #16]   @ tsk_3->flags, tsk_3->flags
|         tst     r2, #48 @ tsk_3->flags,
|         beq     .L998           @,

Sebastian


* Re: [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption
  2020-08-19 13:33   ` [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption Sebastian Andrzej Siewior
@ 2020-08-19 14:21     ` peterz
  2020-08-19 19:55       ` [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together Sebastian Andrzej Siewior
  0 siblings, 1 reply; 12+ messages in thread
From: peterz @ 2020-08-19 14:21 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, io-uring, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Jens Axboe, Thomas Gleixner

On Wed, Aug 19, 2020 at 03:33:20PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-08-19 15:15:07 [+0200], peterz@infradead.org wrote:

> If you want to optimize further, we could move PF_IO_WORKER to an lower
> bit. x86 can test for both via
> (gcc-10)
> |         testl   $536870944, 44(%rbp)    #, _11->flags
> |         jne     .L1635  #,
> 
> (clang-9)
> |         testl   $536870944, 44(%rbx)    # imm = 0x20000020
> |         je      .LBB112_6
> 
> 
> but ARM can't and does
> |          ldr     r1, [r5, #16]   @ tsk_3->flags, tsk_3->flags
> |         mov     r2, #32 @ tmp157,
> |         movt    r2, 8192        @ tmp157,
> |         tst     r2, r1  @ tmp157, tsk_3->flags
> |         beq     .L998           @,
> 
> same ARM64
> |         ldr     w0, [x20, 60]   //, _11->flags
> |         and     w0, w0, 1073741792      // tmp117, _11->flags,
> |         and     w0, w0, -536870849      // tmp117, tmp117,
> |         cbnz    w0, .L453       // tmp117,
> 
> using 0x10 for PF_IO_WORKER instead will turn this into:
> |         ldr     w0, [x20, 60]   //, _11->flags
> |         tst     w0, 48  // _11->flags,
> |         bne     .L453           //,
> 
> ARM:
> |         ldr     r2, [r5, #16]   @ tsk_3->flags, tsk_3->flags
> |         tst     r2, #48 @ tsk_3->flags,
> |         beq     .L998           @,

Good point. AFAICT there's a number of low bits still open (and we can
shuffle if we have to), so sure, put a patch in to that effect while
you're at it.




* [PATCH] io_wq: Make io_wqe::lock a raw_spinlock_t
  2020-08-19 13:18   ` Jens Axboe
@ 2020-08-19 19:44     ` Sebastian Andrzej Siewior
  2020-09-01  8:41       ` [PATCH v2] " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-08-19 19:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: peterz, linux-kernel, io-uring, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Thomas Gleixner

During a context switch the scheduler invokes wq_worker_sleeping() with
disabled preemption. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of a possible wake up (hence
the preempt_enable_no_resched() afterwards).

The io-wq has been added to the mix in the same section with disabled
preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also, within schedule() the spinlock_t must be
acquired after the tsk_is_pi_blocked() check, otherwise the task will block
on the sleeping lock again while scheduling out.

While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.

In order to keep the spinlock_t it would have to be moved after the
tsk_is_pi_blocked() check, which would introduce a branch instruction
into the hot path.

The lock is used to maintain the `work_list' and wakes up at most one
task.
Should io_wqe_cancel_pending_work() cause latency spikes while searching
for a specific item, then it would need to drop the lock during the
iteration.
revert_creds() is also invoked under the lock. According to debugging,
cred::non_rcu is 0. Otherwise it would have to be moved outside of the
locked section because put_cred_rcu()->free_uid() acquires a sleeping lock.

Convert io_wqe::lock to a raw_spinlock_t.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/io-wq.c | 52 ++++++++++++++++++++++++++--------------------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index e92c4724480ca..16dd3a5534a06 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -87,7 +87,7 @@ enum {
  */
 struct io_wqe {
 	struct {
-		spinlock_t lock;
+		raw_spinlock_t lock;
 		struct io_wq_work_list work_list;
 		unsigned long hash_map;
 		unsigned flags;
@@ -148,7 +148,7 @@ static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker)
 
 	if (current->files != worker->restore_files) {
 		__acquire(&wqe->lock);
-		spin_unlock_irq(&wqe->lock);
+		raw_spin_unlock_irq(&wqe->lock);
 		dropped_lock = true;
 
 		task_lock(current);
@@ -166,7 +166,7 @@ static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker)
 	if (worker->mm) {
 		if (!dropped_lock) {
 			__acquire(&wqe->lock);
-			spin_unlock_irq(&wqe->lock);
+			raw_spin_unlock_irq(&wqe->lock);
 			dropped_lock = true;
 		}
 		__set_current_state(TASK_RUNNING);
@@ -220,17 +220,17 @@ static void io_worker_exit(struct io_worker *worker)
 	worker->flags = 0;
 	preempt_enable();
 
-	spin_lock_irq(&wqe->lock);
+	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_del_rcu(&worker->nulls_node);
 	list_del_rcu(&worker->all_list);
 	if (__io_worker_unuse(wqe, worker)) {
 		__release(&wqe->lock);
-		spin_lock_irq(&wqe->lock);
+		raw_spin_lock_irq(&wqe->lock);
 	}
 	acct->nr_workers--;
 	nr_workers = wqe->acct[IO_WQ_ACCT_BOUND].nr_workers +
 			wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers;
-	spin_unlock_irq(&wqe->lock);
+	raw_spin_unlock_irq(&wqe->lock);
 
 	/* all workers gone, wq exit can proceed */
 	if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
@@ -504,7 +504,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 		else if (!wq_list_empty(&wqe->work_list))
 			wqe->flags |= IO_WQE_FLAG_STALLED;
 
-		spin_unlock_irq(&wqe->lock);
+		raw_spin_unlock_irq(&wqe->lock);
 		if (!work)
 			break;
 		io_assign_current_work(worker, work);
@@ -538,17 +538,17 @@ static void io_worker_handle_work(struct io_worker *worker)
 				io_wqe_enqueue(wqe, linked);
 
 			if (hash != -1U && !next_hashed) {
-				spin_lock_irq(&wqe->lock);
+				raw_spin_lock_irq(&wqe->lock);
 				wqe->hash_map &= ~BIT_ULL(hash);
 				wqe->flags &= ~IO_WQE_FLAG_STALLED;
 				/* skip unnecessary unlock-lock wqe->lock */
 				if (!work)
 					goto get_next;
-				spin_unlock_irq(&wqe->lock);
+				raw_spin_unlock_irq(&wqe->lock);
 			}
 		} while (work);
 
-		spin_lock_irq(&wqe->lock);
+		raw_spin_lock_irq(&wqe->lock);
 	} while (1);
 }
 
@@ -563,7 +563,7 @@ static int io_wqe_worker(void *data)
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		set_current_state(TASK_INTERRUPTIBLE);
 loop:
-		spin_lock_irq(&wqe->lock);
+		raw_spin_lock_irq(&wqe->lock);
 		if (io_wqe_run_queue(wqe)) {
 			__set_current_state(TASK_RUNNING);
 			io_worker_handle_work(worker);
@@ -574,7 +574,7 @@ static int io_wqe_worker(void *data)
 			__release(&wqe->lock);
 			goto loop;
 		}
-		spin_unlock_irq(&wqe->lock);
+		raw_spin_unlock_irq(&wqe->lock);
 		if (signal_pending(current))
 			flush_signals(current);
 		if (schedule_timeout(WORKER_IDLE_TIMEOUT))
@@ -586,11 +586,11 @@ static int io_wqe_worker(void *data)
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
-		spin_lock_irq(&wqe->lock);
+		raw_spin_lock_irq(&wqe->lock);
 		if (!wq_list_empty(&wqe->work_list))
 			io_worker_handle_work(worker);
 		else
-			spin_unlock_irq(&wqe->lock);
+			raw_spin_unlock_irq(&wqe->lock);
 	}
 
 	io_worker_exit(worker);
@@ -630,9 +630,9 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 
 	worker->flags &= ~IO_WORKER_F_RUNNING;
 
-	spin_lock_irq(&wqe->lock);
+	raw_spin_lock_irq(&wqe->lock);
 	io_wqe_dec_running(wqe, worker);
-	spin_unlock_irq(&wqe->lock);
+	raw_spin_unlock_irq(&wqe->lock);
 }
 
 static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
@@ -656,7 +656,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 		return false;
 	}
 
-	spin_lock_irq(&wqe->lock);
+	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
 	list_add_tail_rcu(&worker->all_list, &wqe->all_list);
 	worker->flags |= IO_WORKER_F_FREE;
@@ -665,7 +665,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	if (!acct->nr_workers && (worker->flags & IO_WORKER_F_BOUND))
 		worker->flags |= IO_WORKER_F_FIXED;
 	acct->nr_workers++;
-	spin_unlock_irq(&wqe->lock);
+	raw_spin_unlock_irq(&wqe->lock);
 
 	if (index == IO_WQ_ACCT_UNBOUND)
 		atomic_inc(&wq->user->processes);
@@ -720,12 +720,12 @@ static int io_wq_manager(void *data)
 			if (!node_online(node))
 				continue;
 
-			spin_lock_irq(&wqe->lock);
+			raw_spin_lock_irq(&wqe->lock);
 			if (io_wqe_need_worker(wqe, IO_WQ_ACCT_BOUND))
 				fork_worker[IO_WQ_ACCT_BOUND] = true;
 			if (io_wqe_need_worker(wqe, IO_WQ_ACCT_UNBOUND))
 				fork_worker[IO_WQ_ACCT_UNBOUND] = true;
-			spin_unlock_irq(&wqe->lock);
+			raw_spin_unlock_irq(&wqe->lock);
 			if (fork_worker[IO_WQ_ACCT_BOUND])
 				create_io_worker(wq, wqe, IO_WQ_ACCT_BOUND);
 			if (fork_worker[IO_WQ_ACCT_UNBOUND])
@@ -821,10 +821,10 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
 	}
 
 	work_flags = work->flags;
-	spin_lock_irqsave(&wqe->lock, flags);
+	raw_spin_lock_irqsave(&wqe->lock, flags);
 	io_wqe_insert_work(wqe, work);
 	wqe->flags &= ~IO_WQE_FLAG_STALLED;
-	spin_unlock_irqrestore(&wqe->lock, flags);
+	raw_spin_unlock_irqrestore(&wqe->lock, flags);
 
 	if ((work_flags & IO_WQ_WORK_CONCURRENT) ||
 	    !atomic_read(&acct->nr_running))
@@ -933,14 +933,14 @@ static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
 	unsigned long flags;
 
 retry:
-	spin_lock_irqsave(&wqe->lock, flags);
+	raw_spin_lock_irqsave(&wqe->lock, flags);
 	wq_list_for_each(node, prev, &wqe->work_list) {
 		work = container_of(node, struct io_wq_work, list);
 		if (!match->fn(work, match->data))
 			continue;
 
 		wq_list_del(&wqe->work_list, node, prev);
-		spin_unlock_irqrestore(&wqe->lock, flags);
+		raw_spin_unlock_irqrestore(&wqe->lock, flags);
 		io_run_cancel(work, wqe);
 		match->nr_pending++;
 		if (!match->cancel_all)
@@ -949,7 +949,7 @@ static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
 		/* not safe to continue after unlock */
 		goto retry;
 	}
-	spin_unlock_irqrestore(&wqe->lock, flags);
+	raw_spin_unlock_irqrestore(&wqe->lock, flags);
 }
 
 static void io_wqe_cancel_running_work(struct io_wqe *wqe,
@@ -1057,7 +1057,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		}
 		atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0);
 		wqe->wq = wq;
-		spin_lock_init(&wqe->lock);
+		raw_spin_lock_init(&wqe->lock);
 		INIT_WQ_LIST(&wqe->work_list);
 		INIT_HLIST_NULLS_HEAD(&wqe->free_list, 0);
 		INIT_LIST_HEAD(&wqe->all_list);
-- 
2.28.0



* [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together
  2020-08-19 14:21     ` peterz
@ 2020-08-19 19:55       ` Sebastian Andrzej Siewior
  2020-08-19 20:00         ` [PATCH 2/2] sched: Cache task_struct::flags in sched_submit_work() Sebastian Andrzej Siewior
  2020-09-07 12:58         ` [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together Will Deacon
  0 siblings, 2 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-08-19 19:55 UTC (permalink / raw)
  To: peterz
  Cc: linux-kernel, io-uring, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Jens Axboe, Thomas Gleixner, Catalin Marinas, Will Deacon

The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in
sched_submit_work(), which is considered to be a hot path.
If the two bits cross the 8 or 16 bit boundary then most architectures
require multiple load instructions in order to create the constant
value. Also, such a value cannot be encoded within the compare opcode.

By moving the bit definition within the same block, the compiler can
create/use one immediate value.

For some reason gcc-10 on ARM64 requires both bits to be next to each
other in order to issue "tst reg, val; bne label". Otherwise the result
is "mov reg1, val; tst reg, reg1; bne label".

Move PF_VCPU out of the way so that PF_IO_WORKER can be next to
PF_WQ_WORKER.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

Could someone from the ARM64 camp please verify if this is a gcc "bug" or
an opcode/arch limitation? With PF_IO_WORKER as 1 (without the PF_VCPU
swap) I get for ARM:

| tst     r2, #33 @ task_flags,
| beq     .L998           @,

however ARM64 does here:
| mov     w0, 33  // tmp117,
| tst     w19, w0 // task_flags, tmp117
| bne     .L453           //,

the extra mov operation. Moving PF_IO_WORKER next to PF_WQ_WORKER, as this
patch does, gives me:
| tst     w19, 48 // task_flags,
| bne     .L453           //,
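
A minimal stand-alone reproducer sketch for the above (the flag values are
taken from this patch, not from the kernel headers), which should show the
same difference when compiled with -O2 for the architectures above:

	#define PF_WQ_WORKER		0x00000020
	#define PF_IO_WORKER_OLD	0x20000000	/* current value */
	#define PF_IO_WORKER_NEW	0x00000010	/* value used by this patch */

	int test_old(unsigned int flags)
	{
		return !!(flags & (PF_WQ_WORKER | PF_IO_WORKER_OLD));
	}

	int test_new(unsigned int flags)
	{
		return !!(flags & (PF_WQ_WORKER | PF_IO_WORKER_NEW));
	}

test_old() should need the extra instructions to build/test the mask as
shown above, while test_new() should compile to a single tst against an
immediate.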


 include/linux/sched.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 93ecd930efd31..2bf0af19a62a4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1489,9 +1489,10 @@ extern struct pid *cad_pid;
 /*
  * Per process flags
  */
+#define PF_VCPU			0x00000001	/* I'm a virtual CPU */
 #define PF_IDLE			0x00000002	/* I am an IDLE thread */
 #define PF_EXITING		0x00000004	/* Getting shut down */
-#define PF_VCPU			0x00000010	/* I'm a virtual CPU */
+#define PF_IO_WORKER		0x00000010	/* Task is an IO worker */
 #define PF_WQ_WORKER		0x00000020	/* I'm a workqueue worker */
 #define PF_FORKNOEXEC		0x00000040	/* Forked but didn't exec */
 #define PF_MCE_PROCESS		0x00000080      /* Process policy on mce errors */
@@ -1515,7 +1516,6 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation request will have _GFP_MOVABLE cleared */
-#define PF_IO_WORKER		0x20000000	/* Task is an IO worker */
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
-- 
2.28.0



* [PATCH 2/2] sched: Cache task_struct::flags in sched_submit_work()
  2020-08-19 19:55       ` [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together Sebastian Andrzej Siewior
@ 2020-08-19 20:00         ` Sebastian Andrzej Siewior
  2020-08-19 20:11           ` Peter Zijlstra
  2020-09-07 12:58         ` [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together Will Deacon
  1 sibling, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-08-19 20:00 UTC (permalink / raw)
  To: peterz
  Cc: linux-kernel, io-uring, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Jens Axboe, Thomas Gleixner

sched_submit_work() is considered to be a hot path. The preempt_disable()
call acts as a compiler barrier and forces the compiler to reload
task_struct::flags for the second comparison.
By using a local variable, the compiler can load the value once and keep it in
a register for the second comparison.

Verified on x86-64 with gcc-10.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

Optimisation at the molecule level, part two. Drop this in case this branch
isn't considered *that* hot and the cache-hot value can simply be loaded
again. But then the value is already around and can be speculated on early :)
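
A small userspace illustration of the effect (barrier() stands in for
preempt_disable(), the flag values are the ones from the previous patch;
this is just a sketch, not the kernel code):

	#define barrier()	asm volatile("" ::: "memory")
	#define PF_IO_WORKER	0x00000010
	#define PF_WQ_WORKER	0x00000020

	void wq_worker_sleeping(void);
	void io_wq_worker_sleeping(void);

	void submit_reload(unsigned int *flags)
	{
		if (*flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
			barrier();		/* *flags has to be re-loaded below */
			if (*flags & PF_WQ_WORKER)
				wq_worker_sleeping();
			else
				io_wq_worker_sleeping();
		}
	}

	void submit_cached(unsigned int *flags)
	{
		unsigned int task_flags = *flags;	/* loaded once, kept in a register */

		if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
			barrier();
			if (task_flags & PF_WQ_WORKER)
				wq_worker_sleeping();
			else
				io_wq_worker_sleeping();
		}
	}

With -O2, submit_reload() loads the flags twice while submit_cached() keeps
the value in a register across the barrier.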

 kernel/sched/core.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8471a0f7eb322..c36dc1ae58beb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4551,9 +4551,12 @@ void __noreturn do_task_dead(void)
 
 static inline void sched_submit_work(struct task_struct *tsk)
 {
+	unsigned int task_flags;
+
 	if (!tsk->state)
 		return;
 
+	task_flags = tsk->flags;
 	/*
 	 * If a worker went to sleep, notify and ask workqueue whether
 	 * it wants to wake up a task to maintain concurrency.
@@ -4562,9 +4565,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * in the possible wakeup of a kworker and because wq_worker_sleeping()
 	 * requires it.
 	 */
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		preempt_disable();
-		if (tsk->flags & PF_WQ_WORKER)
+		if (task_flags & PF_WQ_WORKER)
 			wq_worker_sleeping(tsk);
 		else
 			io_wq_worker_sleeping(tsk);
-- 
2.28.0



* Re: [PATCH 2/2] sched: Cache task_struct::flags in sched_submit_work()
  2020-08-19 20:00         ` [PATCH 2/2] sched: Cache task_struct::flags in sched_submit_work() Sebastian Andrzej Siewior
@ 2020-08-19 20:11           ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2020-08-19 20:11 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, io-uring, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Jens Axboe, Thomas Gleixner

On Wed, Aug 19, 2020 at 10:00:25PM +0200, Sebastian Andrzej Siewior wrote:
> sched_submit_work() is considered to be a hot path. The preempt_disable()
> instruction is a compiler barrier and forces the compiler to load
> task_struct::flags for the second comparison.
> By using a local variable, the compiler can load the value once and keep it in
> a register for the second comparison.
> 
> Verified on x86-64 with gcc-10.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> 
> Optimisation at molecule level, part two. Drop this in case this branch
> isn't consider *that* hot and the cache hot value can be loaded again.
> But then the value is around and be speculated early on :)

That's fine, task->flags can only be changed by current.

Patches look good to me, I'll stuff them in tomorrow. Thanks!


* [PATCH v2] io_wq: Make io_wqe::lock a raw_spinlock_t
  2020-08-19 19:44     ` [PATCH] io_wq: Make io_wqe::lock a raw_spinlock_t Sebastian Andrzej Siewior
@ 2020-09-01  8:41       ` Sebastian Andrzej Siewior
  2020-09-01 14:17         ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-09-01  8:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: peterz, linux-kernel, io-uring, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Thomas Gleixner

During a context switch the scheduler invokes wq_worker_sleeping() with
disabled preemption. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of a possible wake up (hence
the preempt_enable_no_resched() afterwards).

The io-wq has been added to the mix in the same section with disabled
preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also, within schedule() the spinlock_t must be
acquired after the tsk_is_pi_blocked() check, otherwise the task will block
on the sleeping lock again while scheduling out.

While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.

In order to keep the spinlock_t it would have to be moved after the
tsk_is_pi_blocked() check, which would introduce a branch instruction
into the hot path.

The lock is used to maintain the `work_list' and wakes up at most one
task.
Should io_wqe_cancel_pending_work() cause latency spikes while searching
for a specific item, then it would need to drop the lock during the
iteration.
revert_creds() is also invoked under the lock. According to debugging,
cred::non_rcu is 0. Otherwise it would have to be moved outside of the
locked section because put_cred_rcu()->free_uid() acquires a sleeping lock.

Convert io_wqe::lock to a raw_spinlock_t.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2: Rebase to -rc3

 fs/io-wq.c |   52 ++++++++++++++++++++++++++--------------------------
 1 file changed, 26 insertions(+), 26 deletions(-)

--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -87,7 +87,7 @@ enum {
  */
 struct io_wqe {
 	struct {
-		spinlock_t lock;
+		raw_spinlock_t lock;
 		struct io_wq_work_list work_list;
 		unsigned long hash_map;
 		unsigned flags;
@@ -148,7 +148,7 @@ static bool __io_worker_unuse(struct io_
 
 	if (current->files != worker->restore_files) {
 		__acquire(&wqe->lock);
-		spin_unlock_irq(&wqe->lock);
+		raw_spin_unlock_irq(&wqe->lock);
 		dropped_lock = true;
 
 		task_lock(current);
@@ -166,7 +166,7 @@ static bool __io_worker_unuse(struct io_
 	if (worker->mm) {
 		if (!dropped_lock) {
 			__acquire(&wqe->lock);
-			spin_unlock_irq(&wqe->lock);
+			raw_spin_unlock_irq(&wqe->lock);
 			dropped_lock = true;
 		}
 		__set_current_state(TASK_RUNNING);
@@ -220,17 +220,17 @@ static void io_worker_exit(struct io_wor
 	worker->flags = 0;
 	preempt_enable();
 
-	spin_lock_irq(&wqe->lock);
+	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_del_rcu(&worker->nulls_node);
 	list_del_rcu(&worker->all_list);
 	if (__io_worker_unuse(wqe, worker)) {
 		__release(&wqe->lock);
-		spin_lock_irq(&wqe->lock);
+		raw_spin_lock_irq(&wqe->lock);
 	}
 	acct->nr_workers--;
 	nr_workers = wqe->acct[IO_WQ_ACCT_BOUND].nr_workers +
 			wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers;
-	spin_unlock_irq(&wqe->lock);
+	raw_spin_unlock_irq(&wqe->lock);
 
 	/* all workers gone, wq exit can proceed */
 	if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
@@ -504,7 +504,7 @@ static void io_worker_handle_work(struct
 		else if (!wq_list_empty(&wqe->work_list))
 			wqe->flags |= IO_WQE_FLAG_STALLED;
 
-		spin_unlock_irq(&wqe->lock);
+		raw_spin_unlock_irq(&wqe->lock);
 		if (!work)
 			break;
 		io_assign_current_work(worker, work);
@@ -538,17 +538,17 @@ static void io_worker_handle_work(struct
 				io_wqe_enqueue(wqe, linked);
 
 			if (hash != -1U && !next_hashed) {
-				spin_lock_irq(&wqe->lock);
+				raw_spin_lock_irq(&wqe->lock);
 				wqe->hash_map &= ~BIT_ULL(hash);
 				wqe->flags &= ~IO_WQE_FLAG_STALLED;
 				/* skip unnecessary unlock-lock wqe->lock */
 				if (!work)
 					goto get_next;
-				spin_unlock_irq(&wqe->lock);
+				raw_spin_unlock_irq(&wqe->lock);
 			}
 		} while (work);
 
-		spin_lock_irq(&wqe->lock);
+		raw_spin_lock_irq(&wqe->lock);
 	} while (1);
 }
 
@@ -563,7 +563,7 @@ static int io_wqe_worker(void *data)
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		set_current_state(TASK_INTERRUPTIBLE);
 loop:
-		spin_lock_irq(&wqe->lock);
+		raw_spin_lock_irq(&wqe->lock);
 		if (io_wqe_run_queue(wqe)) {
 			__set_current_state(TASK_RUNNING);
 			io_worker_handle_work(worker);
@@ -574,7 +574,7 @@ static int io_wqe_worker(void *data)
 			__release(&wqe->lock);
 			goto loop;
 		}
-		spin_unlock_irq(&wqe->lock);
+		raw_spin_unlock_irq(&wqe->lock);
 		if (signal_pending(current))
 			flush_signals(current);
 		if (schedule_timeout(WORKER_IDLE_TIMEOUT))
@@ -586,11 +586,11 @@ static int io_wqe_worker(void *data)
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
-		spin_lock_irq(&wqe->lock);
+		raw_spin_lock_irq(&wqe->lock);
 		if (!wq_list_empty(&wqe->work_list))
 			io_worker_handle_work(worker);
 		else
-			spin_unlock_irq(&wqe->lock);
+			raw_spin_unlock_irq(&wqe->lock);
 	}
 
 	io_worker_exit(worker);
@@ -630,9 +630,9 @@ void io_wq_worker_sleeping(struct task_s
 
 	worker->flags &= ~IO_WORKER_F_RUNNING;
 
-	spin_lock_irq(&wqe->lock);
+	raw_spin_lock_irq(&wqe->lock);
 	io_wqe_dec_running(wqe, worker);
-	spin_unlock_irq(&wqe->lock);
+	raw_spin_unlock_irq(&wqe->lock);
 }
 
 static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
@@ -656,7 +656,7 @@ static bool create_io_worker(struct io_w
 		return false;
 	}
 
-	spin_lock_irq(&wqe->lock);
+	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
 	list_add_tail_rcu(&worker->all_list, &wqe->all_list);
 	worker->flags |= IO_WORKER_F_FREE;
@@ -665,7 +665,7 @@ static bool create_io_worker(struct io_w
 	if (!acct->nr_workers && (worker->flags & IO_WORKER_F_BOUND))
 		worker->flags |= IO_WORKER_F_FIXED;
 	acct->nr_workers++;
-	spin_unlock_irq(&wqe->lock);
+	raw_spin_unlock_irq(&wqe->lock);
 
 	if (index == IO_WQ_ACCT_UNBOUND)
 		atomic_inc(&wq->user->processes);
@@ -720,12 +720,12 @@ static int io_wq_manager(void *data)
 			if (!node_online(node))
 				continue;
 
-			spin_lock_irq(&wqe->lock);
+			raw_spin_lock_irq(&wqe->lock);
 			if (io_wqe_need_worker(wqe, IO_WQ_ACCT_BOUND))
 				fork_worker[IO_WQ_ACCT_BOUND] = true;
 			if (io_wqe_need_worker(wqe, IO_WQ_ACCT_UNBOUND))
 				fork_worker[IO_WQ_ACCT_UNBOUND] = true;
-			spin_unlock_irq(&wqe->lock);
+			raw_spin_unlock_irq(&wqe->lock);
 			if (fork_worker[IO_WQ_ACCT_BOUND])
 				create_io_worker(wq, wqe, IO_WQ_ACCT_BOUND);
 			if (fork_worker[IO_WQ_ACCT_UNBOUND])
@@ -821,10 +821,10 @@ static void io_wqe_enqueue(struct io_wqe
 	}
 
 	work_flags = work->flags;
-	spin_lock_irqsave(&wqe->lock, flags);
+	raw_spin_lock_irqsave(&wqe->lock, flags);
 	io_wqe_insert_work(wqe, work);
 	wqe->flags &= ~IO_WQE_FLAG_STALLED;
-	spin_unlock_irqrestore(&wqe->lock, flags);
+	raw_spin_unlock_irqrestore(&wqe->lock, flags);
 
 	if ((work_flags & IO_WQ_WORK_CONCURRENT) ||
 	    !atomic_read(&acct->nr_running))
@@ -951,13 +951,13 @@ static void io_wqe_cancel_pending_work(s
 	unsigned long flags;
 
 retry:
-	spin_lock_irqsave(&wqe->lock, flags);
+	raw_spin_lock_irqsave(&wqe->lock, flags);
 	wq_list_for_each(node, prev, &wqe->work_list) {
 		work = container_of(node, struct io_wq_work, list);
 		if (!match->fn(work, match->data))
 			continue;
 		io_wqe_remove_pending(wqe, work, prev);
-		spin_unlock_irqrestore(&wqe->lock, flags);
+		raw_spin_unlock_irqrestore(&wqe->lock, flags);
 		io_run_cancel(work, wqe);
 		match->nr_pending++;
 		if (!match->cancel_all)
@@ -966,7 +966,7 @@ static void io_wqe_cancel_pending_work(s
 		/* not safe to continue after unlock */
 		goto retry;
 	}
-	spin_unlock_irqrestore(&wqe->lock, flags);
+	raw_spin_unlock_irqrestore(&wqe->lock, flags);
 }
 
 static void io_wqe_cancel_running_work(struct io_wqe *wqe,
@@ -1074,7 +1074,7 @@ struct io_wq *io_wq_create(unsigned boun
 		}
 		atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0);
 		wqe->wq = wq;
-		spin_lock_init(&wqe->lock);
+		raw_spin_lock_init(&wqe->lock);
 		INIT_WQ_LIST(&wqe->work_list);
 		INIT_HLIST_NULLS_HEAD(&wqe->free_list, 0);
 		INIT_LIST_HEAD(&wqe->all_list);


* Re: [PATCH v2] io_wq: Make io_wqe::lock a raw_spinlock_t
  2020-09-01  8:41       ` [PATCH v2] " Sebastian Andrzej Siewior
@ 2020-09-01 14:17         ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2020-09-01 14:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: peterz, linux-kernel, io-uring, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Thomas Gleixner

On 9/1/20 2:41 AM, Sebastian Andrzej Siewior wrote:
> During a context switch the scheduler invokes wq_worker_sleeping() with
> disabled preemption. Disabling preemption is needed because it protects
> access to `worker->sleeping'. As an optimisation it avoids invoking
> schedule() within the schedule path as part of possible wake up (thus
> preempt_enable_no_resched() afterwards).
> 
> The io-wq has been added to the mix in the same section with disabled
> preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
> acquires a spinlock_t. Also within the schedule() the spinlock_t must be
> acquired after tsk_is_pi_blocked() otherwise it will block on the
> sleeping lock again while scheduling out.
> 
> While playing with `io_uring-bench' I didn't notice a significant
> latency spike after converting io_wqe::lock to a raw_spinlock_t. The
> latency was more or less the same.
> 
> In order to keep the spinlock_t it would have to be moved after the
> tsk_is_pi_blocked() check which would introduce a branch instruction
> into the hot path.
> 
> The lock is used to maintain the `work_list' and wakes one task up at
> most.
> Should io_wqe_cancel_pending_work() cause latency spikes, while
> searching for a specific item, then it would need to drop the lock
> during iterations.
> revert_creds() is also invoked under the lock. According to debug
> cred::non_rcu is 0. Otherwise it should be moved outside of the locked
> section because put_cred_rcu()->free_uid() acquires a sleeping lock.
> 
> Convert io_wqe::lock to a raw_spinlock_t.c

Thanks, I've applied this for 5.10.

-- 
Jens Axboe



* Re: [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together
  2020-08-19 19:55       ` [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together Sebastian Andrzej Siewior
  2020-08-19 20:00         ` [PATCH 2/2] sched: Cache task_struct::flags in sched_submit_work() Sebastian Andrzej Siewior
@ 2020-09-07 12:58         ` Will Deacon
  1 sibling, 0 replies; 12+ messages in thread
From: Will Deacon @ 2020-09-07 12:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: peterz, linux-kernel, io-uring, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Jens Axboe, Thomas Gleixner, Catalin Marinas

On Wed, Aug 19, 2020 at 09:55:05PM +0200, Sebastian Andrzej Siewior wrote:
> The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in
> sched_submit_work() which is considered to be a hot path.
> If the two bits cross the 8 or 16 bit boundary then most architecture
> require multiple load instructions in order to create the constant
> value. Also, such a value can not be encoded within the compare opcode.
> 
> By moving the bit definition within the same block, the compiler can
> create/use one immediate value.
> 
> For some reason gcc-10 on ARM64 requires both bits to be next to each
> other in order to issue "tst reg, val; bne label". Otherwise the result
> is "mov reg1, val; tst reg, reg1; bne label".
> 
> Move PF_VCPU out of the way so that PF_IO_WORKER can be next to
> PF_WQ_WORKER.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> 
> Could someone from the ARM64 camp please verify if this a gcc "bug" or
> opcode/arch limitation? With PF_IO_WORKER as 1 (without the PF_VCPU
> swap) I get for ARM:
> 
> | tst     r2, #33 @ task_flags,
> | beq     .L998           @,
> 
> however ARM64 does here:
> | mov     w0, 33  // tmp117,
> | tst     w19, w0 // task_flags, tmp117
> | bne     .L453           //,
> 
> the extra mov operation. Moving PF_IO_WORKER next to PF_WQ_WORKER as
> this patch gives me:
> | tst     w19, 48 // task_flags,
> | bne     .L453           //,

Moving an immediate into a register really shouldn't be a performance
issue, so I don't think this is a problem. However, the reason GCC does
this is because of the slightly weird way in which immediates are encoded,
meaning that '33' can't be packed into the 'tst' alias. You can try to
decipher the "DecodeBitMasks()" pseudocode in the Arm ARM if you're
interested.
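
A heavily simplified sketch of that constraint (it only considers the 32-bit
element size and ignores the replicated sub-patterns DecodeBitMasks() also
accepts): such an immediate has to be a rotation of a single contiguous run
of ones. 48 (0b110000) is one, 33 (0b100001) is not, since its two set bits
are never adjacent under rotation:

	#include <stdbool.h>
	#include <stdint.h>

	static uint32_t ror32(uint32_t v, unsigned int r)
	{
		return (v >> r) | (v << ((32 - r) & 31));
	}

	/* simplified check: 32-bit element size only */
	static bool logical_imm32(uint32_t v)
	{
		unsigned int r;

		if (v == 0 || v == ~0u)
			return false;
		for (r = 0; r < 32; r++) {
			uint32_t x = ror32(v, r);

			if ((x & (x + 1)) == 0)	/* x is a run of ones from bit 0 */
				return true;
		}
		return false;
	}

	/* logical_imm32(48) -> true, logical_imm32(33) -> false */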

Will


end of thread, other threads:[~2020-09-07 12:59 UTC | newest]

Thread overview: 12+ messages
2020-08-19 12:37 [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption Sebastian Andrzej Siewior
2020-08-19 13:15 ` peterz
2020-08-19 13:18   ` Jens Axboe
2020-08-19 19:44     ` [PATCH] io_wq: Make io_wqe::lock a raw_spinlock_t Sebastian Andrzej Siewior
2020-09-01  8:41       ` [PATCH v2] " Sebastian Andrzej Siewior
2020-09-01 14:17         ` Jens Axboe
2020-08-19 13:33   ` [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption Sebastian Andrzej Siewior
2020-08-19 14:21     ` peterz
2020-08-19 19:55       ` [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together Sebastian Andrzej Siewior
2020-08-19 20:00         ` [PATCH 2/2] sched: Cache task_struct::flags in sched_submit_work() Sebastian Andrzej Siewior
2020-08-19 20:11           ` Peter Zijlstra
2020-09-07 12:58         ` [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together Will Deacon
