* [PATCHSET v2 0/2] Split iowait into two states @ 2024-02-27 21:06 Jens Axboe 2024-02-27 21:06 ` [PATCH 1/2] sched/core: switch struct rq->nr_iowait to an unsigned int Jens Axboe ` (2 more replies) 0 siblings, 3 replies; 9+ messages in thread From: Jens Axboe @ 2024-02-27 21:06 UTC (permalink / raw) To: linux-kernel; +Cc: peterz, mingo Hi, This is v2 of the patch posted yesterday, where the current in_iowait state is split into two parts: 1) The "task is sleeping waiting on IO", and would like cpufreq goodness in terms of sleep and wakeup latencies. 2) The above, and also accounted as such in the iowait stats. The current ->in_iowait covers both; with this series we have ->in_iowait for step 1 above, and ->in_iowait_acct for step 2. You cannot be ->in_iowait_acct without also having ->in_iowait set. Patch 1 is a prep patch that turns rq->nr_iowait into an unsigned int rather than an atomic_t. Reasons given in that patch. Patch 2 adds the ->in_iowait_acct stage inside the current ->in_iowait setting. I haven't been able to properly benchmark patch 1, as the atomics are noise in any workloads that approximate normality. I can certainly concoct a synthetic test case if folks are interested. My gut says that we're trading 3 fast path atomics for none, with the 4th case _probably_ being far less likely; there we grab the rq lock. Comments welcome! Peter, CC'ing you since I did on the previous posting, feel free to ignore. 
Since v1: - Add prep patch 1, switching nr_iowait to an unsigned int - Modify patch 2 to not use atomic_t as well, no changes otherwise arch/s390/appldata/appldata_base.c | 2 +- arch/s390/appldata/appldata_os.c | 2 +- fs/proc/stat.c | 2 +- include/linux/sched.h | 6 ++++ include/linux/sched/stat.h | 10 ++++-- kernel/sched/core.c | 55 +++++++++++++++++++++++------- kernel/sched/cputime.c | 2 +- kernel/sched/sched.h | 3 +- kernel/time/tick-sched.c | 6 ++-- 9 files changed, 66 insertions(+), 22 deletions(-) -- Jens Axboe ^ permalink raw reply [flat|nested] 9+ messages in thread
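The nesting relationship between the two task flags described in the cover letter can be modeled in a few lines of C. This is a toy sketch with illustrative names; it is not the kernel's actual task_struct or scheduler code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the split iowait state: two single-bit fields, where
 * the accounting bit can only be set while the boost bit is also set.
 */
struct task {
	unsigned in_iowait:1;      /* wants cpufreq/idle "iowait boost" */
	unsigned in_iowait_acct:1; /* also counted in iowait statistics */
};

/* Sleep in iowait; accounting nests inside the boost-only state. */
static void iowait_sleep(struct task *t, bool account)
{
	t->in_iowait = 1;
	if (account)
		t->in_iowait_acct = 1;
}

static void iowait_wake(struct task *t)
{
	t->in_iowait_acct = 0;
	t->in_iowait = 0;
}

/* The invariant from the series: ->in_iowait_acct implies ->in_iowait. */
static bool state_valid(const struct task *t)
{
	return !t->in_iowait_acct || t->in_iowait;
}
```

A task can therefore be in three states (neither bit, boost only, boost plus accounting), never in the accounting-only state.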
* [PATCH 1/2] sched/core: switch struct rq->nr_iowait to an unsigned int 2024-02-27 21:06 [PATCHSET v2 0/2] Split iowait into two states Jens Axboe @ 2024-02-27 21:06 ` Jens Axboe 2024-02-27 23:05 ` David Wei 2024-02-27 21:06 ` [PATCH 2/2] sched/core: split iowait state into two states Jens Axboe 2024-02-28 2:21 ` [PATCHSET v2 0/2] Split iowait " Jens Axboe 2 siblings, 1 reply; 9+ messages in thread From: Jens Axboe @ 2024-02-27 21:06 UTC (permalink / raw) To: linux-kernel; +Cc: peterz, mingo, Jens Axboe In 3 of the 4 spots where we modify rq->nr_iowait we already hold the rq lock, and in the 4th one we can just grab it. This avoids an atomic in the scheduler fast path if we're in iowait, with the tradeoff being that we'll grab the rq lock for the case where a task is scheduled out in iowait mode on one CPU, and scheduled in on another CPU. This obviously leaves the reading side as potentially racy, but that should be OK. iowait states change all of the time, and can change while they're being read or summed up as well. 
Signed-off-by: Jens Axboe <axboe@kernel.dk> --- kernel/sched/core.c | 15 ++++++++++----- kernel/sched/cputime.c | 2 +- kernel/sched/sched.h | 2 +- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9116bcc90346..ecc6c26096e5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3789,7 +3789,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, #endif if (p->in_iowait) { delayacct_blkio_end(p); - atomic_dec(&task_rq(p)->nr_iowait); + task_rq(p)->nr_iowait--; } activate_task(rq, p, en_flags); @@ -4354,8 +4354,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU); if (task_cpu(p) != cpu) { if (p->in_iowait) { + struct rq *rq = task_rq(p); + struct rq_flags rf; + + rq_lock(rq, &rf); + task_rq(p)->nr_iowait--; + rq_unlock(rq, &rf); delayacct_blkio_end(p); - atomic_dec(&task_rq(p)->nr_iowait); } wake_flags |= WF_MIGRATED; @@ -5463,7 +5468,7 @@ unsigned long long nr_context_switches(void) unsigned int nr_iowait_cpu(int cpu) { - return atomic_read(&cpu_rq(cpu)->nr_iowait); + return cpu_rq(cpu)->nr_iowait; } /* @@ -6681,7 +6686,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); if (prev->in_iowait) { - atomic_inc(&rq->nr_iowait); + rq->nr_iowait++; delayacct_blkio_start(); } } @@ -10029,7 +10034,7 @@ void __init sched_init(void) #endif #endif /* CONFIG_SMP */ hrtick_rq_init(rq); - atomic_set(&rq->nr_iowait, 0); + rq->nr_iowait = 0; #ifdef CONFIG_SCHED_CORE rq->core = rq; diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index af7952f12e6c..b970b6c6151e 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -224,7 +224,7 @@ void account_idle_time(u64 cputime) u64 *cpustat = kcpustat_this_cpu->cpustat; struct rq *rq = this_rq(); - if (atomic_read(&rq->nr_iowait) > 0) + if (rq->nr_iowait) 
cpustat[CPUTIME_IOWAIT] += cputime; else cpustat[CPUTIME_IDLE] += cputime; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 001fe047bd5d..a1222a4bdc7b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1049,7 +1049,7 @@ struct rq { u64 clock_idle_copy; #endif - atomic_t nr_iowait; + unsigned int nr_iowait; #ifdef CONFIG_SCHED_DEBUG u64 last_seen_need_resched_ns; -- 2.43.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
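The locking rule patch 1 relies on — nr_iowait is a plain integer that is only ever modified with the rq lock held, while readers are lockless and tolerate races — can be sketched as a single-threaded toy model. The lock flag and all names here are illustrative stand-ins, not kernel code:

```c
#include <assert.h>

/* Minimal model of patch 1's bookkeeping; the rq lock is a plain flag. */
struct rq {
	int locked;             /* stand-in for the rq spinlock */
	unsigned int nr_iowait; /* plain int: only modified with 'locked' set */
};

static void rq_lock(struct rq *rq)   { assert(!rq->locked); rq->locked = 1; }
static void rq_unlock(struct rq *rq) { assert(rq->locked);  rq->locked = 0; }

/* __schedule(): task sleeps in iowait; caller already holds the rq lock. */
static void sleep_iowait(struct rq *rq)
{
	assert(rq->locked);
	rq->nr_iowait++;
}

/* ttwu_do_activate(): wakeup on the same rq, lock already held. */
static void wake_local(struct rq *rq)
{
	assert(rq->locked);
	rq->nr_iowait--;
}

/* try_to_wake_up() from another CPU: the one site that takes the lock. */
static void wake_remote(struct rq *rq)
{
	rq_lock(rq);
	rq->nr_iowait--;
	rq_unlock(rq);
}

/* nr_iowait_cpu(): lockless read that tolerates racing with writers. */
static unsigned int read_iowait(const struct rq *rq)
{
	return rq->nr_iowait;
}
```

In the real kernel the writers run on different CPUs serialized by the rq lock; the asserts here simply encode which call sites must already hold it.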
* Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to an unsigned int 2024-02-27 21:06 ` [PATCH 1/2] sched/core: switch struct rq->nr_iowait to an unsigned int Jens Axboe @ 2024-02-27 23:05 ` David Wei 2024-02-27 23:17 ` Jens Axboe 0 siblings, 1 reply; 9+ messages in thread From: David Wei @ 2024-02-27 23:05 UTC (permalink / raw) To: Jens Axboe, linux-kernel; +Cc: peterz, mingo On 2024-02-27 21:06, Jens Axboe wrote: > In 3 of the 4 spots where we modify rq->nr_iowait we already hold the > rq lock, and in the 4th one we can just grab it. This avoids an atomic > in the scheduler fast path if we're in iowait, with the tradeoff being > that we'll grab the rq lock for the case where a task is scheduled out > in iowait mode on one CPU, and scheduled in on another CPU. > > This obviously leaves the reading side as potentially racy, but that > should be OK. iowait states change all of the time, and can change while > it's being read as well, or summed up. > > Signed-off-by: Jens Axboe <axboe@kernel.dk> > --- > kernel/sched/core.c | 15 ++++++++++----- > kernel/sched/cputime.c | 2 +- > kernel/sched/sched.h | 2 +- > 3 files changed, 12 insertions(+), 7 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 9116bcc90346..ecc6c26096e5 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3789,7 +3789,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, > #endif > if (p->in_iowait) { > delayacct_blkio_end(p); > - atomic_dec(&task_rq(p)->nr_iowait); > + task_rq(p)->nr_iowait--; > } > > activate_task(rq, p, en_flags); > @@ -4354,8 +4354,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) > cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU); > if (task_cpu(p) != cpu) { > if (p->in_iowait) { > + struct rq *rq = task_rq(p); > + struct rq_flags rf; > + > + rq_lock(rq, &rf); > + task_rq(p)->nr_iowait--; Could this use rq directly, or does it not matter? 
> + rq_unlock(rq, &rf); > delayacct_blkio_end(p); > - atomic_dec(&task_rq(p)->nr_iowait); > } > > wake_flags |= WF_MIGRATED; > @@ -5463,7 +5468,7 @@ unsigned long long nr_context_switches(void) > > unsigned int nr_iowait_cpu(int cpu) > { > - return atomic_read(&cpu_rq(cpu)->nr_iowait); > + return cpu_rq(cpu)->nr_iowait; > } > > /* > @@ -6681,7 +6686,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) > deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); > > if (prev->in_iowait) { > - atomic_inc(&rq->nr_iowait); > + rq->nr_iowait++; > delayacct_blkio_start(); > } > } > @@ -10029,7 +10034,7 @@ void __init sched_init(void) > #endif > #endif /* CONFIG_SMP */ > hrtick_rq_init(rq); > - atomic_set(&rq->nr_iowait, 0); > + rq->nr_iowait = 0; I checked that both ttwu_do_activate() and __schedule() have the rq lock held, but I couldn't find it for this. Is it under the assumption that the rq is in a pre-init state (maybe because scheduler_running = 0?) so no lock is needed? > > #ifdef CONFIG_SCHED_CORE > rq->core = rq; > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index af7952f12e6c..b970b6c6151e 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -224,7 +224,7 @@ void account_idle_time(u64 cputime) > u64 *cpustat = kcpustat_this_cpu->cpustat; > struct rq *rq = this_rq(); > > - if (atomic_read(&rq->nr_iowait) > 0) > + if (rq->nr_iowait) > cpustat[CPUTIME_IOWAIT] += cputime; > else > cpustat[CPUTIME_IDLE] += cputime; > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 001fe047bd5d..a1222a4bdc7b 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1049,7 +1049,7 @@ struct rq { > u64 clock_idle_copy; > #endif > > - atomic_t nr_iowait; > + unsigned int nr_iowait; > > #ifdef CONFIG_SCHED_DEBUG > u64 last_seen_need_resched_ns; ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to an unsigned int 2024-02-27 23:05 ` David Wei @ 2024-02-27 23:17 ` Jens Axboe 0 siblings, 0 replies; 9+ messages in thread From: Jens Axboe @ 2024-02-27 23:17 UTC (permalink / raw) To: David Wei, linux-kernel; +Cc: peterz, mingo On 2/27/24 4:05 PM, David Wei wrote: >> @@ -4354,8 +4354,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) >> cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU); >> if (task_cpu(p) != cpu) { >> if (p->in_iowait) { >> + struct rq *rq = task_rq(p); >> + struct rq_flags rf; >> + >> + rq_lock(rq, &rf); >> + task_rq(p)->nr_iowait--; > > Could this use rq directly, or does it not matter? It certainly could, I'll make that edit. Same thing, but may as well use the variable as defined. Also makes it clear we're modifying the one we've locked. >> @@ -10029,7 +10034,7 @@ void __init sched_init(void) >> #endif >> #endif /* CONFIG_SMP */ >> hrtick_rq_init(rq); >> - atomic_set(&rq->nr_iowait, 0); >> + rq->nr_iowait = 0; > > I checked that both ttwu_do_activate() and __schedule() have the rq lock > held, but I couldn't find it for this. Is it under the assumption that > the rq is in a pre-init state (maybe because scheduler_running = 0?) so > no lock is needed? This is run at boot time (it's __init), so it's before anything is running. -- Jens Axboe ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/2] sched/core: split iowait state into two states 2024-02-27 21:06 [PATCHSET v2 0/2] Split iowait into two states Jens Axboe 2024-02-27 21:06 ` [PATCH 1/2] sched/core: switch struct rq->nr_iowait to an unsigned int Jens Axboe @ 2024-02-27 21:06 ` Jens Axboe 2024-02-28 2:21 ` [PATCHSET v2 0/2] Split iowait " Jens Axboe 2 siblings, 0 replies; 9+ messages in thread From: Jens Axboe @ 2024-02-27 21:06 UTC (permalink / raw) To: linux-kernel; +Cc: peterz, mingo, Jens Axboe iowait is a bogus metric, but it's helpful in the sense that it allows short waits to not enter sleep states that have a higher exit latency than we would've picked for iowait'ing tasks. However, it's harmful in that lots of applications and monitoring assume that iowait is busy time, or otherwise use it as a health metric. Particularly for async IO it's entirely nonsensical. Split the iowait part into two parts - one that tracks whether we need boosting for short waits, and one that says we need to account the task as such. ->in_iowait_acct nests inside of ->in_iowait, both for efficiency reasons, but also so that the relationship between the two is clear. A waiter may set ->in_iowait alone and not care about the accounting. Existing users of nr_iowait() for accounting purposes are switched to use nr_iowait_acct(), which leaves the governor using nr_iowait() as it only cares about iowaiters, not the accounting side. io_schedule_prepare() and io_schedule_finish() are changed to return a simple mask of two state bits, as we now have more than one state to manage. Outside of that, no further changes are needed to support this generically. 
Signed-off-by: Jens Axboe <axboe@kernel.dk> --- arch/s390/appldata/appldata_base.c | 2 +- arch/s390/appldata/appldata_os.c | 2 +- fs/proc/stat.c | 2 +- include/linux/sched.h | 6 +++++ include/linux/sched/stat.h | 10 +++++-- kernel/sched/core.c | 42 ++++++++++++++++++++++++------ kernel/sched/sched.h | 1 + kernel/time/tick-sched.c | 6 ++--- 8 files changed, 55 insertions(+), 16 deletions(-) diff --git a/arch/s390/appldata/appldata_base.c b/arch/s390/appldata/appldata_base.c index c2978cb03b36..6844b5294a8b 100644 --- a/arch/s390/appldata/appldata_base.c +++ b/arch/s390/appldata/appldata_base.c @@ -423,4 +423,4 @@ EXPORT_SYMBOL_GPL(si_swapinfo); #endif EXPORT_SYMBOL_GPL(nr_threads); EXPORT_SYMBOL_GPL(nr_running); -EXPORT_SYMBOL_GPL(nr_iowait); +EXPORT_SYMBOL_GPL(nr_iowait_acct); diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c index a363d30ce739..fa4b278aca6c 100644 --- a/arch/s390/appldata/appldata_os.c +++ b/arch/s390/appldata/appldata_os.c @@ -100,7 +100,7 @@ static void appldata_get_os_data(void *data) os_data->nr_threads = nr_threads; os_data->nr_running = nr_running(); - os_data->nr_iowait = nr_iowait(); + os_data->nr_iowait = nr_iowait_acct(); os_data->avenrun[0] = avenrun[0] + (FIXED_1/200); os_data->avenrun[1] = avenrun[1] + (FIXED_1/200); os_data->avenrun[2] = avenrun[2] + (FIXED_1/200); diff --git a/fs/proc/stat.c b/fs/proc/stat.c index da60956b2915..149be7a884fb 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -180,7 +180,7 @@ static int show_stat(struct seq_file *p, void *v) (unsigned long long)boottime.tv_sec, total_forks, nr_running(), - nr_iowait()); + nr_iowait_acct()); seq_put_decimal_ull(p, "softirq ", (unsigned long long)sum_softirq); diff --git a/include/linux/sched.h b/include/linux/sched.h index ffe8f618ab86..1e198e268df1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -922,7 +922,13 @@ struct task_struct { /* Bit to tell TOMOYO we're in execve(): */ unsigned in_execve:1; + /* task is in 
iowait */ unsigned in_iowait:1; + /* + * task is in iowait and should be accounted as such. can only be set + * if ->in_iowait is also set. + */ + unsigned in_iowait_acct:1; #ifndef TIF_RESTORE_SIGMASK unsigned restore_sigmask:1; #endif diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h index 0108a38bb64d..7c48a35f98ee 100644 --- a/include/linux/sched/stat.h +++ b/include/linux/sched/stat.h @@ -19,8 +19,14 @@ DECLARE_PER_CPU(unsigned long, process_counts); extern int nr_processes(void); extern unsigned int nr_running(void); extern bool single_task_running(void); -extern unsigned int nr_iowait(void); -extern unsigned int nr_iowait_cpu(int cpu); +extern unsigned int nr_iowait_acct(void); +extern unsigned int nr_iowait_acct_cpu(int cpu); +unsigned int nr_iowait_cpu(int cpu); + +enum { + TASK_IOWAIT = 1, + TASK_IOWAIT_ACCT = 2, +}; static inline int sched_info_on(void) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ecc6c26096e5..003208c758e4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3790,6 +3790,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, if (p->in_iowait) { delayacct_blkio_end(p); task_rq(p)->nr_iowait--; + if (p->in_iowait_acct) + task_rq(p)->nr_iowait_acct--; } activate_task(rq, p, en_flags); @@ -4359,6 +4361,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) rq_lock(rq, &rf); task_rq(p)->nr_iowait--; + if (p->in_iowait_acct) + task_rq(p)->nr_iowait_acct--; rq_unlock(rq, &rf); delayacct_blkio_end(p); } @@ -5466,9 +5470,9 @@ unsigned long long nr_context_switches(void) * it does become runnable. */ -unsigned int nr_iowait_cpu(int cpu) +unsigned int nr_iowait_acct_cpu(int cpu) { - return cpu_rq(cpu)->nr_iowait; + return cpu_rq(cpu)->nr_iowait_acct; } /* @@ -5501,16 +5505,21 @@ unsigned int nr_iowait_cpu(int cpu) * Task CPU affinities can make all that even more 'interesting'. 
*/ -unsigned int nr_iowait(void) +unsigned int nr_iowait_acct(void) { unsigned int i, sum = 0; for_each_possible_cpu(i) - sum += nr_iowait_cpu(i); + sum += nr_iowait_acct_cpu(i); return sum; } +unsigned int nr_iowait_cpu(int cpu) +{ + return cpu_rq(cpu)->nr_iowait; +} + #ifdef CONFIG_SMP /* @@ -6687,6 +6696,8 @@ static void __sched notrace __schedule(unsigned int sched_mode) if (prev->in_iowait) { rq->nr_iowait++; + if (prev->in_iowait_acct) + rq->nr_iowait_acct++; delayacct_blkio_start(); } } @@ -8989,18 +9000,32 @@ int __sched yield_to(struct task_struct *p, bool preempt) } EXPORT_SYMBOL_GPL(yield_to); +/* + * Returns a token which is comprised of the two bits of iowait wait state - + * one is whether we're marking ourselves as in iowait for cpufreq reasons, + * and the other is if the task should be accounted as such. + */ int io_schedule_prepare(void) { - int old_iowait = current->in_iowait; + int old_wait_flags = 0; + + if (current->in_iowait) + old_wait_flags |= TASK_IOWAIT; + if (current->in_iowait_acct) + old_wait_flags |= TASK_IOWAIT_ACCT; + current->in_iowait = 1; + current->in_iowait_acct = 1; blk_flush_plug(current->plug, true); - return old_iowait; + return old_wait_flags; } -void io_schedule_finish(int token) +void io_schedule_finish(int old_wait_flags) { - current->in_iowait = token; + if (!(old_wait_flags & TASK_IOWAIT)) + current->in_iowait = 0; + if (!(old_wait_flags & TASK_IOWAIT_ACCT)) + current->in_iowait_acct = 0; } /* @@ -10034,6 +10059,7 @@ void __init sched_init(void) #endif #endif /* CONFIG_SMP */ hrtick_rq_init(rq); + rq->nr_iowait_acct = 0; rq->nr_iowait = 0; #ifdef CONFIG_SCHED_CORE diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a1222a4bdc7b..ee390fc664ae 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1049,6 +1049,7 @@ struct rq { u64 clock_idle_copy; #endif + unsigned int nr_iowait_acct; unsigned int nr_iowait; #ifdef CONFIG_SCHED_DEBUG diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c 
index 01fb50c1b17e..f6709d543dac 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -669,7 +669,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now) delta = ktime_sub(now, ts->idle_entrytime); write_seqcount_begin(&ts->idle_sleeptime_seq); - if (nr_iowait_cpu(smp_processor_id()) > 0) + if (nr_iowait_acct_cpu(smp_processor_id()) > 0) ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta); else ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta); @@ -742,7 +742,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time) struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu); return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime, - !nr_iowait_cpu(cpu), last_update_time); + !nr_iowait_acct_cpu(cpu), last_update_time); } EXPORT_SYMBOL_GPL(get_cpu_idle_time_us); @@ -768,7 +768,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time) struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu); return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime, - nr_iowait_cpu(cpu), last_update_time); + nr_iowait_acct_cpu(cpu), last_update_time); } EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us); -- 2.43.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
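The save/restore token pattern used by io_schedule_prepare()/io_schedule_finish() in the patch above can be exercised in isolation with a userspace sketch. The global `cur` struct stands in for the kernel's `current`; this is a hedged model of the pattern, not the kernel implementation (it omits the blk_flush_plug() call, for instance):

```c
#include <assert.h>

enum {
	TASK_IOWAIT      = 1,
	TASK_IOWAIT_ACCT = 2,
};

/* Illustrative stand-in for the current task's two iowait bits. */
static struct {
	unsigned in_iowait:1;
	unsigned in_iowait_acct:1;
} cur;

/* Modeled after the patch: save both bits as a mask, then set both. */
static int io_schedule_prepare(void)
{
	int old = 0;

	if (cur.in_iowait)
		old |= TASK_IOWAIT;
	if (cur.in_iowait_acct)
		old |= TASK_IOWAIT_ACCT;
	cur.in_iowait = 1;
	cur.in_iowait_acct = 1;
	return old;
}

/* Clear only the bits that were not already set before prepare(). */
static void io_schedule_finish(int old)
{
	if (!(old & TASK_IOWAIT))
		cur.in_iowait = 0;
	if (!(old & TASK_IOWAIT_ACCT))
		cur.in_iowait_acct = 0;
}
```

The point of returning a mask rather than a single bit is that prepare/finish pairs can nest: an inner pair restores exactly the state the outer pair established.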
* Re: [PATCHSET v2 0/2] Split iowait into two states 2024-02-27 21:06 [PATCHSET v2 0/2] Split iowait into two states Jens Axboe 2024-02-27 21:06 ` [PATCH 1/2] sched/core: switch struct rq->nr_iowait to an unsigned int Jens Axboe 2024-02-27 21:06 ` [PATCH 2/2] sched/core: split iowait state into two states Jens Axboe @ 2024-02-28 2:21 ` Jens Axboe 2 siblings, 0 replies; 9+ messages in thread From: Jens Axboe @ 2024-02-28 2:21 UTC (permalink / raw) To: linux-kernel; +Cc: peterz, mingo On 2/27/24 2:06 PM, Jens Axboe wrote: > I haven't been able to properly benchmark patch 1, as the atomics are > noise in any workloads that approximate normality. I can certainly > concoct a synthetic test case if folks are interested. My gut says that > we're trading 3 fast path atomics for none, and with the 4th case > _probably_ being way less likely. There we grab the rq lock. OK, so on Chris's suggestion, I tried his schbench to exercise the scheduling side. It's very futex intensive, so I hacked up futex to set iowait state when sleeping. I also added simple accounting to that path so I knew how many times it ran. A run of: ./schbench -m 60 -t 10 -p 8 on a 2 socket Intel(R) Xeon(R) Platinum 8458P with 176 threads, there's no regression in performance and try_to_wake_up() locking the rq of the task being scheduled in from another CPU doesn't seem to register much. On the previous run, I saw 2.21% there and now it's 2.36%. But it was also a better performing run, which may have lead to the increase. Each run takes 30 seconds, and during that time I see around 290-310M hits of that path, or about ~10M/sec. Without modifying futex to use iowait, we obviously rarely hit it. About 200 times for a run, which makes sense as we're not really doing IO. Anyway, just some data on this. If I leave the futex/pipe iowait in and run the same test, I see no discernable difference in profiles. In fact, the highest cost across the tests is bringing in the task->in_iowait cacheline. 
-- Jens Axboe ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCHSET v3 0/2] Split iowait into two states @ 2024-02-28 19:16 Jens Axboe 2024-02-28 19:16 ` [PATCH 2/2] sched/core: split iowait state " Jens Axboe 0 siblings, 1 reply; 9+ messages in thread From: Jens Axboe @ 2024-02-28 19:16 UTC (permalink / raw) To: linux-kernel; +Cc: peterz, mingo Hi, This is v3 of the patchset where the current in_iowait state is split into two parts: 1) The "task is sleeping waiting on IO", and would like cpufreq goodness in terms of sleep and wakeup latencies. 2) The above, and also accounted as such in the iowait stats. The current ->in_iowait covers both; with this series we have ->in_iowait for step 1 above, and ->in_iowait_acct for step 2. You cannot be ->in_iowait_acct without also having ->in_iowait set. Patch 1 is a prep patch that turns rq->nr_iowait into an unsigned int rather than an atomic_t. Reasons given in that patch. This patch can stand alone, as it should not have any functional changes, outside of improving the handling of iowait a bit. Patch 2 adds the ->in_iowait_acct stage inside the current ->in_iowait setting. I wasn't super happy with the need to grab the rq lock for the one case where we still need it, and woke up with a better idea of how to do this. It's all in patch 1. The nice part here is that we get rid of atomics for 3 out of the 4 iowait inc/dec. Patch 2 just follows the same principle. Comments welcome! Peter, CC'ing you since I did on the previous posting, feel free to ignore. Since v2: - Drop need for rq lock for the remote case by turning both nr_iowait and nr_iowait_acct into the difference between a remote atomic count and the local non-atomic one. -- Jens Axboe ^ permalink raw reply [flat|nested] 9+ messages in thread
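The "difference between a remote atomic count and the local non-atomic one" trick from the v3 changelog can be illustrated with a small single-threaded model. A sleep or same-CPU wakeup modifies the local counter (under the rq lock in the real kernel), while a remote wakeup only ever increments its own counter, and readers compute the difference. Plain unsigned ints stand in for the kernel's atomic_t here, and the names are illustrative:

```c
#include <assert.h>

/*
 * Sketch of the v3 split counter: the current number of iowaiters is
 * nr_iowait (sleeps minus local wakeups) minus nr_iowait_remote
 * (wakeups issued from other CPUs). Neither writer ever touches the
 * other's counter, so the remote path needs no rq lock.
 */
struct rq {
	unsigned int nr_iowait;        /* modified only under the rq lock */
	unsigned int nr_iowait_remote; /* atomic_t in the kernel; no lock */
};

static void sleep_iowait(struct rq *rq) { rq->nr_iowait++; }        /* __schedule() */
static void wake_local(struct rq *rq)   { rq->nr_iowait--; }        /* ttwu, same rq */
static void wake_remote(struct rq *rq)  { rq->nr_iowait_remote++; } /* ttwu, other CPU */

/* nr_iowait_cpu(): unsigned subtraction of the two monotonic-ish counts. */
static unsigned int read_iowait(const struct rq *rq)
{
	return rq->nr_iowait - rq->nr_iowait_remote;
}
```

Every sleep adds one to the local count and every wakeup subtracts one from the difference (either by decrementing the local count or incrementing the remote one), so the difference always equals the number of tasks currently in iowait on that rq.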
* [PATCH 2/2] sched/core: split iowait state into two states 2024-02-28 19:16 [PATCHSET v3 " Jens Axboe @ 2024-02-28 19:16 ` Jens Axboe 2024-02-29 17:31 ` Thomas Gleixner 0 siblings, 1 reply; 9+ messages in thread From: Jens Axboe @ 2024-02-28 19:16 UTC (permalink / raw) To: linux-kernel; +Cc: peterz, mingo, Jens Axboe iowait is a bogus metric, but it's helpful in the sense that it allows short waits to not enter sleep states that have a higher exit latency than we would've picked for iowait'ing tasks. However, it's harmful in that lots of applications and monitoring assume that iowait is busy time, or otherwise use it as a health metric. Particularly for async IO it's entirely nonsensical. Split the iowait part into two parts - one that tracks whether we need boosting for short waits, and one that says we need to account the task as such. ->in_iowait_acct nests inside of ->in_iowait, both for efficiency reasons, but also so that the relationship between the two is clear. A waiter may set ->in_iowait alone and not care about the accounting. Existing users of nr_iowait() for accounting purposes are switched to use nr_iowait_acct(), which leaves the governor using nr_iowait() as it only cares about iowaiters, not the accounting side. io_schedule_prepare() and io_schedule_finish() are changed to return a simple mask of two state bits, as we now have more than one state to manage. Outside of that, no further changes are needed to support this generically. 
Signed-off-by: Jens Axboe <axboe@kernel.dk> --- arch/s390/appldata/appldata_base.c | 2 +- arch/s390/appldata/appldata_os.c | 2 +- fs/proc/stat.c | 2 +- include/linux/sched.h | 6 ++++ include/linux/sched/stat.h | 10 +++++-- kernel/sched/core.c | 45 ++++++++++++++++++++++++------ kernel/sched/sched.h | 2 ++ kernel/time/tick-sched.c | 6 ++-- 8 files changed, 59 insertions(+), 16 deletions(-) diff --git a/arch/s390/appldata/appldata_base.c b/arch/s390/appldata/appldata_base.c index c2978cb03b36..6844b5294a8b 100644 --- a/arch/s390/appldata/appldata_base.c +++ b/arch/s390/appldata/appldata_base.c @@ -423,4 +423,4 @@ EXPORT_SYMBOL_GPL(si_swapinfo); #endif EXPORT_SYMBOL_GPL(nr_threads); EXPORT_SYMBOL_GPL(nr_running); -EXPORT_SYMBOL_GPL(nr_iowait); +EXPORT_SYMBOL_GPL(nr_iowait_acct); diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c index a363d30ce739..fa4b278aca6c 100644 --- a/arch/s390/appldata/appldata_os.c +++ b/arch/s390/appldata/appldata_os.c @@ -100,7 +100,7 @@ static void appldata_get_os_data(void *data) os_data->nr_threads = nr_threads; os_data->nr_running = nr_running(); - os_data->nr_iowait = nr_iowait(); + os_data->nr_iowait = nr_iowait_acct(); os_data->avenrun[0] = avenrun[0] + (FIXED_1/200); os_data->avenrun[1] = avenrun[1] + (FIXED_1/200); os_data->avenrun[2] = avenrun[2] + (FIXED_1/200); diff --git a/fs/proc/stat.c b/fs/proc/stat.c index da60956b2915..149be7a884fb 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -180,7 +180,7 @@ static int show_stat(struct seq_file *p, void *v) (unsigned long long)boottime.tv_sec, total_forks, nr_running(), - nr_iowait()); + nr_iowait_acct()); seq_put_decimal_ull(p, "softirq ", (unsigned long long)sum_softirq); diff --git a/include/linux/sched.h b/include/linux/sched.h index ffe8f618ab86..1e198e268df1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -922,7 +922,13 @@ struct task_struct { /* Bit to tell TOMOYO we're in execve(): */ unsigned in_execve:1; + /* task is in 
iowait */ unsigned in_iowait:1; + /* + * task is in iowait and should be accounted as such. can only be set + * if ->in_iowait is also set. + */ + unsigned in_iowait_acct:1; #ifndef TIF_RESTORE_SIGMASK unsigned restore_sigmask:1; #endif diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h index 0108a38bb64d..7c48a35f98ee 100644 --- a/include/linux/sched/stat.h +++ b/include/linux/sched/stat.h @@ -19,8 +19,14 @@ DECLARE_PER_CPU(unsigned long, process_counts); extern int nr_processes(void); extern unsigned int nr_running(void); extern bool single_task_running(void); -extern unsigned int nr_iowait(void); -extern unsigned int nr_iowait_cpu(int cpu); +extern unsigned int nr_iowait_acct(void); +extern unsigned int nr_iowait_acct_cpu(int cpu); +unsigned int nr_iowait_cpu(int cpu); + +enum { + TASK_IOWAIT = 1, + TASK_IOWAIT_ACCT = 2, +}; static inline int sched_info_on(void) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 48d15529a777..66a3654aab5d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3790,6 +3790,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, if (p->in_iowait) { delayacct_blkio_end(p); task_rq(p)->nr_iowait--; + if (p->in_iowait_acct) + task_rq(p)->nr_iowait_acct--; } activate_task(rq, p, en_flags); @@ -4358,6 +4360,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) delayacct_blkio_end(p); atomic_inc(&__rq->nr_iowait_remote); + if (p->in_iowait_acct) + atomic_inc(&__rq->nr_iowait_acct_remote); } wake_flags |= WF_MIGRATED; @@ -5463,11 +5467,11 @@ unsigned long long nr_context_switches(void) * it does become runnable. 
*/ -unsigned int nr_iowait_cpu(int cpu) +unsigned int nr_iowait_acct_cpu(int cpu) { struct rq *rq = cpu_rq(cpu); - return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote); + return rq->nr_iowait_acct - atomic_read(&rq->nr_iowait_acct_remote); } /* @@ -5500,16 +5504,23 @@ unsigned int nr_iowait_cpu(int cpu) * Task CPU affinities can make all that even more 'interesting'. */ -unsigned int nr_iowait(void) +unsigned int nr_iowait_acct(void) { unsigned int i, sum = 0; for_each_possible_cpu(i) - sum += nr_iowait_cpu(i); + sum += nr_iowait_acct_cpu(i); return sum; } +unsigned int nr_iowait_cpu(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote); +} + #ifdef CONFIG_SMP /* @@ -6686,6 +6697,8 @@ static void __sched notrace __schedule(unsigned int sched_mode) if (prev->in_iowait) { rq->nr_iowait++; + if (prev->in_iowait_acct) + rq->nr_iowait_acct++; delayacct_blkio_start(); } } @@ -8988,18 +9001,32 @@ int __sched yield_to(struct task_struct *p, bool preempt) } EXPORT_SYMBOL_GPL(yield_to); +/* + * Returns a token which is comprised of the two bits of iowait wait state - + * one is whether we're marking ourselves as in iowait for cpufreq reasons, + * and the other is if the task should be accounted as such. 
+ */ int io_schedule_prepare(void) { - int old_iowait = current->in_iowait; + int old_wait_flags = 0; + + if (current->in_iowait) + old_wait_flags |= TASK_IOWAIT; + if (current->in_iowait_acct) + old_wait_flags |= TASK_IOWAIT_ACCT; current->in_iowait = 1; + current->in_iowait_acct = 1; blk_flush_plug(current->plug, true); - return old_iowait; + return old_wait_flags; } -void io_schedule_finish(int token) +void io_schedule_finish(int old_wait_flags) { - current->in_iowait = token; + if (!(old_wait_flags & TASK_IOWAIT)) + current->in_iowait = 0; + if (!(old_wait_flags & TASK_IOWAIT_ACCT)) + current->in_iowait_acct = 0; } /* @@ -10033,6 +10060,8 @@ void __init sched_init(void) #endif #endif /* CONFIG_SMP */ hrtick_rq_init(rq); + rq->nr_iowait_acct = 0; + atomic_set(&rq->nr_iowait_acct_remote, 0); rq->nr_iowait = 0; atomic_set(&rq->nr_iowait_remote, 0); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 91fa5b4d45ed..abd7a938bc99 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1054,6 +1054,8 @@ struct rq { * modified under the rq lock (nr_iowait), and if we don't have the rq * lock, then nr_iowait_remote is used. 
*/ + unsigned int nr_iowait_acct; + atomic_t nr_iowait_acct_remote; unsigned int nr_iowait; atomic_t nr_iowait_remote; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 01fb50c1b17e..f6709d543dac 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -669,7 +669,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now) delta = ktime_sub(now, ts->idle_entrytime); write_seqcount_begin(&ts->idle_sleeptime_seq); - if (nr_iowait_cpu(smp_processor_id()) > 0) + if (nr_iowait_acct_cpu(smp_processor_id()) > 0) ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta); else ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta); @@ -742,7 +742,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time) struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu); return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime, - !nr_iowait_cpu(cpu), last_update_time); + !nr_iowait_acct_cpu(cpu), last_update_time); } EXPORT_SYMBOL_GPL(get_cpu_idle_time_us); @@ -768,7 +768,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time) struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu); return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime, - nr_iowait_cpu(cpu), last_update_time); + nr_iowait_acct_cpu(cpu), last_update_time); } EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us); -- 2.43.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] sched/core: split iowait state into two states
  2024-02-28 19:16 ` [PATCH 2/2] sched/core: split iowait state " Jens Axboe
@ 2024-02-29 17:31   ` Thomas Gleixner
  2024-02-29 17:45     ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread

From: Thomas Gleixner @ 2024-02-29 17:31 UTC (permalink / raw)
To: Jens Axboe, linux-kernel; +Cc: peterz, mingo, Jens Axboe

On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
> iowait is a bogus metric, but it's helpful in the sense that it allows
> short waits to not enter sleep states that have a higher exit latency
> than we would've picked for iowait'ing tasks. However, it's harmless in
> that lots of applications and monitoring assumes that iowait is busy
> time, or otherwise use it as a health metric. Particularly for async
> IO it's entirely nonsensical.
>
> Split the iowait part into two parts - one that tracks whether we need
> boosting for short waits, and one that says we need to account the
> task

We :)

> as such. ->in_iowait_acct nests inside of ->in_iowait, both for
> efficiency reasons, but also so that the relationship between the two
> is clear. A waiter may set ->in_wait alone and not care about the
> accounting.

> +/*
> + * Returns a token which is comprised of the two bits of iowait wait state -
> + * one is whether we're making ourselves as in iowait for cpufreq reasons,
> + * and the other is if the task should be accounted as such.
> + */
> int io_schedule_prepare(void)
> {
> -	int old_iowait = current->in_iowait;
> +	int old_wait_flags = 0;
> +
> +	if (current->in_iowait)
> +		old_wait_flags |= TASK_IOWAIT;
> +	if (current->in_iowait_acct)
> +		old_wait_flags |= TASK_IOWAIT_ACCT;
>
>  	current->in_iowait = 1;
> +	current->in_iowait_acct = 1;
>  	blk_flush_plug(current->plug, true);
> -	return old_iowait;
> +	return old_wait_flags;
> }
>
> -void io_schedule_finish(int token)
> +void io_schedule_finish(int old_wait_flags)
>  {
> -	current->in_iowait = token;
> +	if (!(old_wait_flags & TASK_IOWAIT))
> +		current->in_iowait = 0;
> +	if (!(old_wait_flags & TASK_IOWAIT_ACCT))
> +		current->in_iowait_acct = 0;

Why? TASK_IOWAIT_ACCT requires TASK_IOWAIT, right? So if TASK_IOWAIT was
not set then TASK_IOWAIT_ACCT must have been clear too, no?

Thanks,

        tglx
* Re: [PATCH 2/2] sched/core: split iowait state into two states
  2024-02-29 17:31 ` Thomas Gleixner
@ 2024-02-29 17:45   ` Jens Axboe
  0 siblings, 0 replies; 9+ messages in thread

From: Jens Axboe @ 2024-02-29 17:45 UTC (permalink / raw)
To: Thomas Gleixner, linux-kernel; +Cc: peterz, mingo

On 2/29/24 10:31 AM, Thomas Gleixner wrote:
> On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
>> iowait is a bogus metric, but it's helpful in the sense that it allows
>> short waits to not enter sleep states that have a higher exit latency
>> than we would've picked for iowait'ing tasks. However, it's harmless in
>> that lots of applications and monitoring assumes that iowait is busy
>> time, or otherwise use it as a health metric. Particularly for async
>> IO it's entirely nonsensical.
>>
>> Split the iowait part into two parts - one that tracks whether we need
>> boosting for short waits, and one that says we need to account the
>> task
>
> We :)

I appreciate the commit message police :-) I'll rewrite it.

>> +/*
>> + * Returns a token which is comprised of the two bits of iowait wait state -
>> + * one is whether we're making ourselves as in iowait for cpufreq reasons,
>> + * and the other is if the task should be accounted as such.
>> + */
>> int io_schedule_prepare(void)
>> {
>> -	int old_iowait = current->in_iowait;
>> +	int old_wait_flags = 0;
>> +
>> +	if (current->in_iowait)
>> +		old_wait_flags |= TASK_IOWAIT;
>> +	if (current->in_iowait_acct)
>> +		old_wait_flags |= TASK_IOWAIT_ACCT;
>>
>>  	current->in_iowait = 1;
>> +	current->in_iowait_acct = 1;
>>  	blk_flush_plug(current->plug, true);
>> -	return old_iowait;
>> +	return old_wait_flags;
>> }
>>
>> -void io_schedule_finish(int token)
>> +void io_schedule_finish(int old_wait_flags)
>>  {
>> -	current->in_iowait = token;
>> +	if (!(old_wait_flags & TASK_IOWAIT))
>> +		current->in_iowait = 0;
>> +	if (!(old_wait_flags & TASK_IOWAIT_ACCT))
>> +		current->in_iowait_acct = 0;
>
> Why? TASK_IOWAIT_ACCT requires TASK_IOWAIT, right?
> So if TASK_IOWAIT was
> not set then TASK_IOWAIT_ACCT must have been clear too, no?

It does, IOWAIT_ACCT always nests inside IOWAIT. I guess it would be
more explanatory as:

	/*
	 * If TASK_IOWAIT isn't set, then TASK_IOWAIT_ACCT cannot have
	 * been set either as it nests inside TASK_IOWAIT.
	 */
	if (!(old_wait_flags & TASK_IOWAIT))
		current->in_iowait = 0;
	else if (!(old_wait_flags & TASK_IOWAIT_ACCT))
		current->in_iowait_acct = 0;

?

-- 
Jens Axboe
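The save/restore semantics under discussion can be sketched as a small user-space model (the `struct task` and helper signatures here are toy stand-ins, not the kernel code). Since ACCT nests inside IOWAIT, a restore that sees IOWAIT clear in the saved token must clear both flags, which the sketch makes explicit:

```c
#include <assert.h>
#include <stdbool.h>

/* Token bits; names modeled on the patch, values hypothetical. */
#define TASK_IOWAIT      (1 << 0)
#define TASK_IOWAIT_ACCT (1 << 1)

/* Stand-in for the two task bitfields. */
struct task {
	bool in_iowait;
	bool in_iowait_acct;
};

/* Encode the old state as token bits, then mark the task as in
 * (accounted) iowait, mirroring io_schedule_prepare(). */
static int io_schedule_prepare(struct task *t)
{
	int old_wait_flags = 0;

	if (t->in_iowait)
		old_wait_flags |= TASK_IOWAIT;
	if (t->in_iowait_acct)
		old_wait_flags |= TASK_IOWAIT_ACCT;
	t->in_iowait = true;
	t->in_iowait_acct = true;
	return old_wait_flags;
}

/* Restore from the token. If IOWAIT wasn't set before, the nesting
 * invariant means ACCT wasn't either, so both flags go back to 0. */
static void io_schedule_finish(struct task *t, int old_wait_flags)
{
	if (!(old_wait_flags & TASK_IOWAIT))
		t->in_iowait = t->in_iowait_acct = false;
	else if (!(old_wait_flags & TASK_IOWAIT_ACCT))
		t->in_iowait_acct = false;
}
```

A prepare/finish pair is therefore a no-op for a task already in accounted iowait, and fully unwinds for a task that wasn't in iowait at all.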
end of thread, other threads:[~2024-02-29 17:45 UTC | newest]

Thread overview: 9+ messages -- links below jump to the message on this page --
2024-02-27 21:06 [PATCHSET v2 0/2] Split iowait into two states Jens Axboe
2024-02-27 21:06 ` [PATCH 1/2] sched/core: switch struct rq->nr_iowait to an unsigned int Jens Axboe
2024-02-27 23:05   ` David Wei
2024-02-27 23:17     ` Jens Axboe
2024-02-27 21:06 ` [PATCH 2/2] sched/core: split iowait state into two states Jens Axboe
2024-02-28  2:21 ` [PATCHSET v2 0/2] Split iowait " Jens Axboe
2024-02-28 19:16 [PATCHSET v3 " Jens Axboe
2024-02-28 19:16 ` [PATCH 2/2] sched/core: split iowait state " Jens Axboe
2024-02-29 17:31   ` Thomas Gleixner
2024-02-29 17:45     ` Jens Axboe