* [PATCH v5 0/7] psi: pressure stall monitors v5
From: Suren Baghdasaryan @ 2019-03-08 18:43 UTC
To: gregkh
Cc: tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan

This is a respin of:
  https://lwn.net/ml/linux-kernel/20190206023446.177362-1-surenb%40google.com/

Android is adopting psi to detect and remedy memory pressure that
results in stuttering and decreased responsiveness on mobile devices.

Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible. Psi also
doesn't aggregate its averages at a high-enough frequency right now.

This patch series extends the psi interface such that users can
configure sensitive latency thresholds and use poll() and friends to
be notified when these are breached.

As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaging, and makes
the aggregation frequency adaptive, such that high-frequency updates
only happen while monitored stall events are actively occurring.

With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user.
For example, using memory stall monitors in the userspace low memory
killer daemon (lmkd), we can detect mounting pressure and kill less
important processes before the device becomes visibly sluggish. In our
memory stress testing, psi memory monitors produce roughly 10x fewer
false positives compared to vmpressure signals.
Having the ability to specify multiple triggers for the same psi metric
allows other parts of the Android framework to monitor the memory state
of the device and act accordingly.

The new interface is straightforward. The user opens one of the
pressure files for writing and writes a trigger description into the
file descriptor. The description defines the stall state to monitor
(some or full) and the maximum stall time over a given window of time,
both given in microseconds. E.g.:

	/* Signal when stall time exceeds 100ms of a 1s window */
	char trigger[] = "full 100000 1000000";
	fd = open("/proc/pressure/memory", O_RDWR);
	write(fd, trigger, sizeof(trigger));
	while (poll() >= 0) {
		...
	}
	close(fd);

When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in
order to emit event signals in a timely fashion. Once the stalling
subsides, aggregation reverts to normal.

The trigger is associated with the open file descriptor. To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.

Patches 1-6 prepare the psi code for polling support. Patch 7
implements the adaptive polling logic, the pressure growth detection
optimized for short intervals, and hooks up write() and poll() on the
pressure files.

The patches were developed in collaboration with Johannes Weiner.

The patches are based on 5.0-rc8 (Merge tag 'drm-next-2019-03-06').

Suren Baghdasaryan (7):
  psi: introduce state_mask to represent stalled psi states
  psi: make psi_enable static
  psi: rename psi fields in preparation for psi trigger addition
  psi: split update_stats into parts
  psi: track changed states
  refactor header includes to allow kthread.h inclusion in psi_types.h
  psi: introduce psi monitor

 Documentation/accounting/psi.txt | 107 ++++++
 include/linux/kthread.h          |   3 +-
 include/linux/psi.h              |   8 +
 include/linux/psi_types.h        | 105 +++++-
 include/linux/sched.h            |   1 -
 kernel/cgroup/cgroup.c           |  71 +++-
 kernel/kthread.c                 |   1 +
 kernel/sched/psi.c               | 613 ++++++++++++++++++++++++++++---
 8 files changed, 833 insertions(+), 76 deletions(-)

Changes in v5:
- Fixed sparse: error: incompatible types in comparison expression, as per
  Andrew
- Changed psi_enable to static, as per Andrew
- Refactored headers to be able to include kthread.h into psi_types.h
  without creating a circular inclusion, as per Johannes
- Split psi monitor from aggregator, used RT worker for psi monitoring to
  prevent it being starved by other RT threads and memory pressure events
  being delayed or lost, as per Minchan and Android Performance Team
- Fixed blockable memory allocation under rcu_read_lock inside
  psi_trigger_poll by using refcounting, as per Eva Huang and Minchan
- Misc cleanup and improvements, as per Johannes

Notes:
0001-psi-introduce-state_mask-to-represent-stalled-psi-st.patch is unchanged
from the previous version and provided for completeness.

-- 
2.21.0.360.g471c308f928-goog
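
The trigger interface described in the cover letter can be sketched in user space roughly as follows. This is a minimal illustration, not code from the series: the helper names (psi_format_trigger, psi_register_trigger, psi_wait_event) are hypothetical, and the use of O_NONBLOCK and of POLLPRI as the event to wait for are assumptions, since the cover letter leaves the open() flags and poll() arguments elided.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Format a trigger description such as "full 100000 1000000".
 * Returns the string length, or -1 if it does not fit in buf.
 */
static int psi_format_trigger(char *buf, size_t size, const char *state,
			      unsigned int stall_us, unsigned int window_us)
{
	int len = snprintf(buf, size, "%s %u %u", state, stall_us, window_us);

	return (len < 0 || (size_t)len >= size) ? -1 : len;
}

/*
 * Open a pressure file and register one trigger on it.
 * Returns the file descriptor, or -1 on failure.
 */
static int psi_register_trigger(const char *path, const char *trigger)
{
	int fd = open(path, O_RDWR | O_NONBLOCK); /* flags are an assumption */

	if (fd < 0)
		return -1;
	/* Write the trailing NUL too, like sizeof(trigger) in the cover letter */
	if (write(fd, trigger, strlen(trigger) + 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Block until the trigger fires. Returns 1 on an event, 0 if the wait
 * was interrupted or nothing relevant was reported, -1 on error.
 * Waiting for POLLPRI is an assumption of this sketch.
 */
static int psi_wait_event(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, -1);

	if (n < 0)
		return errno == EINTR ? 0 : -1;
	if (pfd.revents & POLLERR)
		return -1;
	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A monitor would call psi_register_trigger("/proc/pressure/memory", ...) once, loop on psi_wait_event(), and close the descriptor to discard the trigger.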
* [PATCH v5 1/7] psi: introduce state_mask to represent stalled psi states
From: Suren Baghdasaryan @ 2019-03-08 18:43 UTC
To: gregkh
Cc: tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan, Stephen Rothwell

The psi monitoring patches will need to determine the same states as
record_times(). To avoid calculating them twice, maintain a state mask
that can be consulted cheaply. Do this in a separate patch to keep the
churn in the main feature patch at a minimum.

This adds a 4-byte state_mask member to the psi_group_cpu struct, which
makes its first cacheline-aligned part 52 bytes long. Add explicit
values to the enumeration element counters that affect the
psi_group_cpu struct size.

Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> --- include/linux/psi_types.h | 9 ++++++--- kernel/sched/psi.c | 29 +++++++++++++++++++---------- 2 files changed, 25 insertions(+), 13 deletions(-) diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 2cf422db5d18..762c6bb16f3c 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -11,7 +11,7 @@ enum psi_task_count { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, - NR_PSI_TASK_COUNTS, + NR_PSI_TASK_COUNTS = 3, }; /* Task state bitmasks */ @@ -24,7 +24,7 @@ enum psi_res { PSI_IO, PSI_MEM, PSI_CPU, - NR_PSI_RESOURCES, + NR_PSI_RESOURCES = 3, }; /* @@ -41,7 +41,7 @@ enum psi_states { PSI_CPU_SOME, /* Only per-CPU, to weigh the CPU in the global average: */ PSI_NONIDLE, - NR_PSI_STATES, + NR_PSI_STATES = 6, }; struct psi_group_cpu { @@ -53,6 +53,9 @@ struct psi_group_cpu { /* States of the tasks belonging to this group */ unsigned int tasks[NR_PSI_TASK_COUNTS]; + /* Aggregate pressure state derived from the tasks */ + u32 state_mask; + /* Period time sampling buckets for each state of interest (ns) */ u32 times[NR_PSI_STATES]; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 0e97ca9306ef..22c1505ad290 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -213,17 +213,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state) static void get_recent_times(struct psi_group *group, int cpu, u32 *times) { struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); - unsigned int tasks[NR_PSI_TASK_COUNTS]; u64 now, state_start; 
+ enum psi_states s; unsigned int seq; - int s; + u32 state_mask; /* Snapshot a coherent view of the CPU state */ do { seq = read_seqcount_begin(&groupc->seq); now = cpu_clock(cpu); memcpy(times, groupc->times, sizeof(groupc->times)); - memcpy(tasks, groupc->tasks, sizeof(groupc->tasks)); + state_mask = groupc->state_mask; state_start = groupc->state_start; } while (read_seqcount_retry(&groupc->seq, seq)); @@ -239,7 +239,7 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times) * (u32) and our reported pressure close to what's * actually happening. */ - if (test_state(tasks, s)) + if (state_mask & (1 << s)) times[s] += now - state_start; delta = times[s] - groupc->times_prev[s]; @@ -407,15 +407,15 @@ static void record_times(struct psi_group_cpu *groupc, int cpu, delta = now - groupc->state_start; groupc->state_start = now; - if (test_state(groupc->tasks, PSI_IO_SOME)) { + if (groupc->state_mask & (1 << PSI_IO_SOME)) { groupc->times[PSI_IO_SOME] += delta; - if (test_state(groupc->tasks, PSI_IO_FULL)) + if (groupc->state_mask & (1 << PSI_IO_FULL)) groupc->times[PSI_IO_FULL] += delta; } - if (test_state(groupc->tasks, PSI_MEM_SOME)) { + if (groupc->state_mask & (1 << PSI_MEM_SOME)) { groupc->times[PSI_MEM_SOME] += delta; - if (test_state(groupc->tasks, PSI_MEM_FULL)) + if (groupc->state_mask & (1 << PSI_MEM_FULL)) groupc->times[PSI_MEM_FULL] += delta; else if (memstall_tick) { u32 sample; @@ -436,10 +436,10 @@ static void record_times(struct psi_group_cpu *groupc, int cpu, } } - if (test_state(groupc->tasks, PSI_CPU_SOME)) + if (groupc->state_mask & (1 << PSI_CPU_SOME)) groupc->times[PSI_CPU_SOME] += delta; - if (test_state(groupc->tasks, PSI_NONIDLE)) + if (groupc->state_mask & (1 << PSI_NONIDLE)) groupc->times[PSI_NONIDLE] += delta; } @@ -448,6 +448,8 @@ static void psi_group_change(struct psi_group *group, int cpu, { struct psi_group_cpu *groupc; unsigned int t, m; + enum psi_states s; + u32 state_mask = 0; groupc = per_cpu_ptr(group->pcpu, 
cpu); @@ -480,6 +482,13 @@ static void psi_group_change(struct psi_group *group, int cpu, if (set & (1 << t)) groupc->tasks[t]++; + /* Calculate state mask representing active states */ + for (s = 0; s < NR_PSI_STATES; s++) { + if (test_state(groupc->tasks, s)) + state_mask |= (1 << s); + } + groupc->state_mask = state_mask; + write_seqcount_end(&groupc->seq); } -- 2.21.0.360.g471c308f928-goog ^ permalink raw reply related [flat|nested] 12+ messages in thread
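
The state mask calculation added to psi_group_change() can be modelled in user space as below. This is only a sketch: the body of test_state() is not part of this patch, so the stall rules used here (and the order of the first four psi_states values) are assumptions made for illustration, and compute_state_mask is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

enum psi_task_count {
	NR_IOWAIT,
	NR_MEMSTALL,
	NR_RUNNING,
	NR_PSI_TASK_COUNTS = 3,
};

enum psi_states {
	PSI_IO_SOME,
	PSI_IO_FULL,
	PSI_MEM_SOME,
	PSI_MEM_FULL,
	PSI_CPU_SOME,
	PSI_NONIDLE,
	NR_PSI_STATES = 6,
};

/* Assumed stall rules; the kernel's test_state() is not shown in the patch */
static bool test_state(const unsigned int *tasks, enum psi_states state)
{
	switch (state) {
	case PSI_IO_SOME:
		return tasks[NR_IOWAIT];
	case PSI_IO_FULL:
		return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
	case PSI_MEM_SOME:
		return tasks[NR_MEMSTALL];
	case PSI_MEM_FULL:
		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
	case PSI_CPU_SOME:
		return tasks[NR_RUNNING] > 1;
	case PSI_NONIDLE:
		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
		       tasks[NR_RUNNING];
	default:
		return false;
	}
}

/* Fold every active state into one bit per state, as the patch does */
static uint32_t compute_state_mask(const unsigned int *tasks)
{
	uint32_t mask = 0;
	int s;

	for (s = 0; s < NR_PSI_STATES; s++) {
		if (test_state(tasks, (enum psi_states)s))
			mask |= 1u << s;
	}
	return mask;
}
```

With the mask cached, a reader only needs a bit test such as `mask & (1 << PSI_MEM_FULL)` instead of re-evaluating the task counts.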
* [PATCH v5 2/7] psi: make psi_enable static
From: Suren Baghdasaryan @ 2019-03-08 18:43 UTC
To: gregkh
Cc: tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan

psi_enable is not used outside of psi.c; make it static.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 kernel/sched/psi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 22c1505ad290..281702de9772 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -140,9 +140,9 @@ static int psi_bug __read_mostly;
 DEFINE_STATIC_KEY_FALSE(psi_disabled);
 
 #ifdef CONFIG_PSI_DEFAULT_DISABLED
-bool psi_enable;
+static bool psi_enable;
 #else
-bool psi_enable = true;
+static bool psi_enable = true;
 #endif
 static int __init setup_psi(char *str)
 {
-- 
2.21.0.360.g471c308f928-goog
* [PATCH v5 3/7] psi: rename psi fields in preparation for psi trigger addition
From: Suren Baghdasaryan @ 2019-03-08 18:43 UTC
To: gregkh
Cc: tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan

Rename the psi_group structure member fields used for calculating psi
totals and averages, to clearly distinguish them from the
trigger-related fields that will be added next.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/psi_types.h | 14 ++++++-------
 kernel/sched/psi.c        | 41 ++++++++++++++++++++-------------------
 2 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 762c6bb16f3c..4d1c1f67be18 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -69,17 +69,17 @@ struct psi_group_cpu {
 };
 
 struct psi_group {
-	/* Protects data updated during an aggregation */
-	struct mutex stat_lock;
+	/* Protects data used by the aggregator */
+	struct mutex avgs_lock;
 
 	/* Per-cpu task state & time tracking */
 	struct psi_group_cpu __percpu *pcpu;
 
-	/* Periodic aggregation state */
-	u64 total_prev[NR_PSI_STATES - 1];
-	u64 last_update;
-	u64 next_update;
-	struct delayed_work clock_work;
+	/* Running pressure averages */
+	u64 avg_total[NR_PSI_STATES - 1];
+	u64 avg_last_update;
+	u64 avg_next_update;
+	struct delayed_work avgs_work;
 
 	/* Total stall times and sampled pressure averages */
 	u64
total[NR_PSI_STATES - 1]; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 281702de9772..4fb4d9913bc8 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -165,7 +165,7 @@ static struct psi_group psi_system = { .pcpu = &system_group_pcpu, }; -static void psi_update_work(struct work_struct *work); +static void psi_avgs_work(struct work_struct *work); static void group_init(struct psi_group *group) { @@ -173,9 +173,9 @@ static void group_init(struct psi_group *group) for_each_possible_cpu(cpu) seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq); - group->next_update = sched_clock() + psi_period; - INIT_DELAYED_WORK(&group->clock_work, psi_update_work); - mutex_init(&group->stat_lock); + group->avg_next_update = sched_clock() + psi_period; + INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work); + mutex_init(&group->avgs_lock); } void __init psi_init(void) @@ -278,7 +278,7 @@ static bool update_stats(struct psi_group *group) int cpu; int s; - mutex_lock(&group->stat_lock); + mutex_lock(&group->avgs_lock); /* * Collect the per-cpu time buckets and average them into a @@ -319,7 +319,7 @@ static bool update_stats(struct psi_group *group) /* avgX= */ now = sched_clock(); - expires = group->next_update; + expires = group->avg_next_update; if (now < expires) goto out; if (now - expires >= psi_period) @@ -332,14 +332,14 @@ static bool update_stats(struct psi_group *group) * But the deltas we sample out of the per-cpu buckets above * are based on the actual time elapsing between clock ticks. 
*/ - group->next_update = expires + ((1 + missed_periods) * psi_period); - period = now - (group->last_update + (missed_periods * psi_period)); - group->last_update = now; + group->avg_next_update = expires + ((1 + missed_periods) * psi_period); + period = now - (group->avg_last_update + (missed_periods * psi_period)); + group->avg_last_update = now; for (s = 0; s < NR_PSI_STATES - 1; s++) { u32 sample; - sample = group->total[s] - group->total_prev[s]; + sample = group->total[s] - group->avg_total[s]; /* * Due to the lockless sampling of the time buckets, * recorded time deltas can slip into the next period, @@ -359,22 +359,22 @@ static bool update_stats(struct psi_group *group) */ if (sample > period) sample = period; - group->total_prev[s] += sample; + group->avg_total[s] += sample; calc_avgs(group->avg[s], missed_periods, sample, period); } out: - mutex_unlock(&group->stat_lock); + mutex_unlock(&group->avgs_lock); return nonidle_total; } -static void psi_update_work(struct work_struct *work) +static void psi_avgs_work(struct work_struct *work) { struct delayed_work *dwork; struct psi_group *group; bool nonidle; dwork = to_delayed_work(work); - group = container_of(dwork, struct psi_group, clock_work); + group = container_of(dwork, struct psi_group, avgs_work); /* * If there is task activity, periodically fold the per-cpu @@ -391,8 +391,9 @@ static void psi_update_work(struct work_struct *work) u64 now; now = sched_clock(); - if (group->next_update > now) - delay = nsecs_to_jiffies(group->next_update - now) + 1; + if (group->avg_next_update > now) + delay = nsecs_to_jiffies( + group->avg_next_update - now) + 1; schedule_delayed_work(dwork, delay); } } @@ -546,13 +547,13 @@ void psi_task_change(struct task_struct *task, int clear, int set) */ if (unlikely((clear & TSK_RUNNING) && (task->flags & PF_WQ_WORKER) && - wq_worker_last_func(task) == psi_update_work)) + wq_worker_last_func(task) == psi_avgs_work)) wake_clock = false; while ((group = iterate_groups(task, 
&iter))) { psi_group_change(group, cpu, clear, set); - if (wake_clock && !delayed_work_pending(&group->clock_work)) - schedule_delayed_work(&group->clock_work, PSI_FREQ); + if (wake_clock && !delayed_work_pending(&group->avgs_work)) + schedule_delayed_work(&group->avgs_work, PSI_FREQ); } } @@ -649,7 +650,7 @@ void psi_cgroup_free(struct cgroup *cgroup) if (static_branch_likely(&psi_disabled)) return; - cancel_delayed_work_sync(&cgroup->psi.clock_work); + cancel_delayed_work_sync(&cgroup->psi.avgs_work); free_percpu(cgroup->psi.pcpu); } -- 2.21.0.360.g471c308f928-goog ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v5 4/7] psi: split update_stats into parts
From: Suren Baghdasaryan @ 2019-03-08 18:43 UTC
To: gregkh
Cc: tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan

Split update_stats into collect_percpu_times and update_averages so
that collect_percpu_times can be reused later inside the psi monitor.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 kernel/sched/psi.c | 55 +++++++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4fb4d9913bc8..337a445aefa3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -269,17 +269,13 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
 	avg[2] = calc_load(avg[2], EXP_300s, pct);
 }
 
-static bool update_stats(struct psi_group *group)
+static bool collect_percpu_times(struct psi_group *group)
 {
 	u64 deltas[NR_PSI_STATES - 1] = { 0, };
-	unsigned long missed_periods = 0;
 	unsigned long nonidle_total = 0;
-	u64 now, expires, period;
 	int cpu;
 	int s;
 
-	mutex_lock(&group->avgs_lock);
-
 	/*
 	 * Collect the per-cpu time buckets and average them into a
 	 * single time sample that is normalized to wallclock time.
@@ -317,11 +313,18 @@ static bool update_stats(struct psi_group *group) for (s = 0; s < NR_PSI_STATES - 1; s++) group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL)); + return nonidle_total; +} + +static u64 update_averages(struct psi_group *group, u64 now) +{ + unsigned long missed_periods = 0; + u64 expires, period; + u64 avg_next_update; + int s; + /* avgX= */ - now = sched_clock(); expires = group->avg_next_update; - if (now < expires) - goto out; if (now - expires >= psi_period) missed_periods = div_u64(now - expires, psi_period); @@ -332,7 +335,7 @@ static bool update_stats(struct psi_group *group) * But the deltas we sample out of the per-cpu buckets above * are based on the actual time elapsing between clock ticks. */ - group->avg_next_update = expires + ((1 + missed_periods) * psi_period); + avg_next_update = expires + ((1 + missed_periods) * psi_period); period = now - (group->avg_last_update + (missed_periods * psi_period)); group->avg_last_update = now; @@ -362,9 +365,8 @@ static bool update_stats(struct psi_group *group) group->avg_total[s] += sample; calc_avgs(group->avg[s], missed_periods, sample, period); } -out: - mutex_unlock(&group->avgs_lock); - return nonidle_total; + + return avg_next_update; } static void psi_avgs_work(struct work_struct *work) @@ -372,10 +374,16 @@ static void psi_avgs_work(struct work_struct *work) struct delayed_work *dwork; struct psi_group *group; bool nonidle; + u64 now; dwork = to_delayed_work(work); group = container_of(dwork, struct psi_group, avgs_work); + mutex_lock(&group->avgs_lock); + + now = sched_clock(); + + nonidle = collect_percpu_times(group); /* * If there is task activity, periodically fold the per-cpu * times and feed samples into the running averages. If things @@ -384,18 +392,15 @@ static void psi_avgs_work(struct work_struct *work) * go - see calc_avgs() and missed_periods. 
*/ - nonidle = update_stats(group); - if (nonidle) { - unsigned long delay = 0; - u64 now; - - now = sched_clock(); - if (group->avg_next_update > now) - delay = nsecs_to_jiffies( - group->avg_next_update - now) + 1; - schedule_delayed_work(dwork, delay); + if (now >= group->avg_next_update) + group->avg_next_update = update_averages(group, now); + + schedule_delayed_work(dwork, nsecs_to_jiffies( + group->avg_next_update - now) + 1); } + + mutex_unlock(&group->avgs_lock); } static void record_times(struct psi_group_cpu *groupc, int cpu, @@ -711,7 +716,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) if (static_branch_likely(&psi_disabled)) return -EOPNOTSUPP; - update_stats(group); + /* Update averages before reporting them */ + mutex_lock(&group->avgs_lock); + collect_percpu_times(group); + update_averages(group, sched_clock()); + mutex_unlock(&group->avgs_lock); for (full = 0; full < 2 - (res == PSI_CPU); full++) { unsigned long avg[3]; -- 2.21.0.360.g471c308f928-goog ^ permalink raw reply related [flat|nested] 12+ messages in thread
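
The clock arithmetic that update_averages() performs on avg_last_update and avg_next_update can be illustrated with a small user-space model. This is a sketch under stated assumptions: struct avg_clock and advance_avg_clock are hypothetical names, plain integer division stands in for div_u64(), and the caller is assumed to have checked now >= next_update first, as psi_avgs_work() does.

```c
#include <stdint.h>

/* Hypothetical stand-in for the avg_* clock fields of struct psi_group */
struct avg_clock {
	uint64_t last_update;
	uint64_t next_update;
};

/*
 * Advance the averaging clock to 'now'. Stores the number of whole
 * periods that were skipped in *missed_out and returns the length of
 * the period that the current sample covers.
 * The caller must ensure now >= c->next_update.
 */
static uint64_t advance_avg_clock(struct avg_clock *c, uint64_t now,
				  uint64_t psi_period, uint64_t *missed_out)
{
	uint64_t expires = c->next_update;
	uint64_t missed = 0;
	uint64_t period;

	if (now - expires >= psi_period)
		missed = (now - expires) / psi_period;

	/* Keep the expiry on the fixed period grid, skipping missed slots */
	c->next_update = expires + (1 + missed) * psi_period;
	/* The sample covers the time since the last update, minus the gaps */
	period = now - (c->last_update + missed * psi_period);
	c->last_update = now;

	*missed_out = missed;
	return period;
}
```

For instance, with a period of 2000, a last update at 0 and an expiry at 2000, waking up at 6500 skips two whole periods, moves the next expiry to 8000, and leaves a sample period of 2500.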
* [PATCH v5 5/7] psi: track changed states
From: Suren Baghdasaryan @ 2019-03-08 18:43 UTC
To: gregkh
Cc: tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan

Introduce a changed_states parameter to collect_percpu_times to track
the states that have changed since the last update.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 kernel/sched/psi.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 337a445aefa3..59e4e1f8bc02 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -210,7 +210,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 	}
 }
 
-static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
+static void get_recent_times(struct psi_group *group, int cpu, u32 *times,
+			     u32 *pchanged_states)
 {
 	struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
 	u64 now, state_start;
@@ -218,6 +219,8 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
 	unsigned int seq;
 	u32 state_mask;
 
+	*pchanged_states = 0;
+
 	/* Snapshot a coherent view of the CPU state */
 	do {
 		seq = read_seqcount_begin(&groupc->seq);
@@ -246,6 +249,8 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
 		groupc->times_prev[s] = times[s];
 
 		times[s] = delta;
+		if (delta)
+			*pchanged_states |= (1 << s);
 	}
 }
 
@@ -269,10 +274,11 @@ static void calc_avgs(unsigned long avg[3], int
missed_periods, avg[2] = calc_load(avg[2], EXP_300s, pct); } -static bool collect_percpu_times(struct psi_group *group) +static void collect_percpu_times(struct psi_group *group, u32 *pchanged_states) { u64 deltas[NR_PSI_STATES - 1] = { 0, }; unsigned long nonidle_total = 0; + u32 changed_states = 0; int cpu; int s; @@ -287,8 +293,11 @@ static bool collect_percpu_times(struct psi_group *group) for_each_possible_cpu(cpu) { u32 times[NR_PSI_STATES]; u32 nonidle; + u32 cpu_changed_states; - get_recent_times(group, cpu, times); + get_recent_times(group, cpu, times, + &cpu_changed_states); + changed_states |= cpu_changed_states; nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]); nonidle_total += nonidle; @@ -313,7 +322,8 @@ static bool collect_percpu_times(struct psi_group *group) for (s = 0; s < NR_PSI_STATES - 1; s++) group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL)); - return nonidle_total; + if (pchanged_states) + *pchanged_states = changed_states; } static u64 update_averages(struct psi_group *group, u64 now) @@ -373,6 +383,7 @@ static void psi_avgs_work(struct work_struct *work) { struct delayed_work *dwork; struct psi_group *group; + u32 changed_states; bool nonidle; u64 now; @@ -383,7 +394,8 @@ static void psi_avgs_work(struct work_struct *work) now = sched_clock(); - nonidle = collect_percpu_times(group); + collect_percpu_times(group, &changed_states); + nonidle = changed_states & (1 << PSI_NONIDLE); /* * If there is task activity, periodically fold the per-cpu * times and feed samples into the running averages. If things @@ -718,7 +730,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) /* Update averages before reporting them */ mutex_lock(&group->avgs_lock); - collect_percpu_times(group); + collect_percpu_times(group, NULL); update_averages(group, sched_clock()); mutex_unlock(&group->avgs_lock); -- 2.21.0.360.g471c308f928-goog ^ permalink raw reply related [flat|nested] 12+ messages in thread
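
The changed-state tracking added to get_recent_times() boils down to setting one bit for every state whose time bucket grew since the previous sample. A simplified user-space sketch follows; collect_deltas and NR_STATES are hypothetical names, and the seqcount snapshot and per-CPU handling of the real code are left out.

```c
#include <stdint.h>

#define NR_STATES 6	/* mirrors NR_PSI_STATES in the series */

/*
 * Turn cumulative per-state times into deltas since the previous call
 * and return a bitmask of the states whose time advanced.
 */
static uint32_t collect_deltas(const uint32_t *times, uint32_t *times_prev,
			       uint32_t *deltas)
{
	uint32_t changed = 0;
	int s;

	for (s = 0; s < NR_STATES; s++) {
		/* Unsigned subtraction stays correct across u32 wraparound */
		uint32_t delta = times[s] - times_prev[s];

		times_prev[s] = times[s];
		deltas[s] = delta;
		if (delta)
			changed |= 1u << s;
	}
	return changed;
}
```

A caller can then derive the old nonidle result from the mask, as psi_avgs_work() does with `changed_states & (1 << PSI_NONIDLE)`.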
* [PATCH v5 6/7] refactor header includes to allow kthread.h inclusion in psi_types.h
From: Suren Baghdasaryan @ 2019-03-08 18:43 UTC
To: gregkh
Cc: tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan

kthread.h can't be included in psi_types.h because doing so creates a
circular inclusion: kthread.h eventually includes psi_types.h, which
then fails to compile because the kthread structures it needs are
defined further down in kthread.h. Resolve this by removing the
psi_types.h inclusion from the headers included from kthread.h.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/kthread.h | 3 ++-
 include/linux/sched.h   | 1 -
 kernel/kthread.c        | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 2c89e60bc752..0f9da966934e 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -4,7 +4,6 @@
 /* Simple interface for creating and stopping kernel threads without mess.
*/ #include <linux/err.h> #include <linux/sched.h> -#include <linux/cgroup.h> __printf(4, 5) struct task_struct *kthread_create_on_node(int (*threadfn)(void *data), @@ -198,6 +197,8 @@ bool kthread_cancel_delayed_work_sync(struct kthread_delayed_work *work); void kthread_destroy_worker(struct kthread_worker *worker); +struct cgroup_subsys_state; + #ifdef CONFIG_BLK_CGROUP void kthread_associate_blkcg(struct cgroup_subsys_state *css); struct cgroup_subsys_state *kthread_blkcg(void); diff --git a/include/linux/sched.h b/include/linux/sched.h index 1549584a1538..20b9f03399a7 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -26,7 +26,6 @@ #include <linux/latencytop.h> #include <linux/sched/prio.h> #include <linux/signal_types.h> -#include <linux/psi_types.h> #include <linux/mm_types_task.h> #include <linux/task_io_accounting.h> #include <linux/rseq.h> diff --git a/kernel/kthread.c b/kernel/kthread.c index 5942eeafb9ac..be4e8795561a 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -11,6 +11,7 @@ #include <linux/kthread.h> #include <linux/completion.h> #include <linux/err.h> +#include <linux/cgroup.h> #include <linux/cpuset.h> #include <linux/unistd.h> #include <linux/file.h> -- 2.21.0.360.g471c308f928-goog ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v5 6/7] refactor header includes to allow kthread.h inclusion in psi_types.h 2019-03-08 18:43 ` [PATCH v5 6/7] refactor header includes to allow kthread.h inclusion in psi_types.h Suren Baghdasaryan @ 2019-03-09 20:49 ` kbuild test robot 2019-03-09 23:12 ` kbuild test robot 1 sibling, 0 replies; 12+ messages in thread From: kbuild test robot @ 2019-03-09 20:49 UTC (permalink / raw) To: Suren Baghdasaryan Cc: kbuild-all, gregkh, tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan [-- Attachment #1: Type: text/plain, Size: 14046 bytes --] Hi Suren, Thank you for the patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v5.0] [cannot apply to next-20190306] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Suren-Baghdasaryan/psi-pressure-stall-monitors-v5/20190310-024018 config: ia64-allmodconfig (attached as .config) compiler: ia64-linux-gcc (GCC) 8.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=8.2.0 make.cross ARCH=ia64 All errors (new ones prefixed by >>): drivers/spi/spi-rockchip.c: In function 'rockchip_spi_probe': >> drivers/spi/spi-rockchip.c:649:8: error: implicit declaration of function 'devm_request_threaded_irq'; did you mean 'devm_request_region'? [-Werror=implicit-function-declaration] ret = devm_request_threaded_irq(&pdev->dev, ret, rockchip_spi_isr, NULL, ^~~~~~~~~~~~~~~~~~~~~~~~~ devm_request_region >> drivers/spi/spi-rockchip.c:650:4: error: 'IRQF_ONESHOT' undeclared (first use in this function); did you mean 'SA_ONESHOT'? 
IRQF_ONESHOT, dev_name(&pdev->dev), master); ^~~~~~~~~~~~ SA_ONESHOT drivers/spi/spi-rockchip.c:650:4: note: each undeclared identifier is reported only once for each function it appears in cc1: some warnings being treated as errors vim +649 drivers/spi/spi-rockchip.c 64e36824b addy ke 2014-07-01 592 64e36824b addy ke 2014-07-01 593 static int rockchip_spi_probe(struct platform_device *pdev) 64e36824b addy ke 2014-07-01 594 { 43de979dd Jeffy Chen 2017-08-07 595 int ret; 64e36824b addy ke 2014-07-01 596 struct rockchip_spi *rs; 64e36824b addy ke 2014-07-01 597 struct spi_master *master; 64e36824b addy ke 2014-07-01 598 struct resource *mem; 76b17e6e4 Julius Werner 2015-03-26 599 u32 rsd_nsecs; 64e36824b addy ke 2014-07-01 600 64e36824b addy ke 2014-07-01 601 master = spi_alloc_master(&pdev->dev, sizeof(struct rockchip_spi)); 5dcc44ed9 Addy Ke 2014-07-11 602 if (!master) 64e36824b addy ke 2014-07-01 603 return -ENOMEM; 5dcc44ed9 Addy Ke 2014-07-11 604 64e36824b addy ke 2014-07-01 605 platform_set_drvdata(pdev, master); 64e36824b addy ke 2014-07-01 606 64e36824b addy ke 2014-07-01 607 rs = spi_master_get_devdata(master); 64e36824b addy ke 2014-07-01 608 64e36824b addy ke 2014-07-01 609 /* Get basic io resource and map it */ 64e36824b addy ke 2014-07-01 610 mem = platform_get_resource(pdev, IORESOURCE_MEM, 0); 64e36824b addy ke 2014-07-01 611 rs->regs = devm_ioremap_resource(&pdev->dev, mem); 64e36824b addy ke 2014-07-01 612 if (IS_ERR(rs->regs)) { 64e36824b addy ke 2014-07-01 613 ret = PTR_ERR(rs->regs); c351587e2 Jeffy Chen 2017-06-13 614 goto err_put_master; 64e36824b addy ke 2014-07-01 615 } 64e36824b addy ke 2014-07-01 616 64e36824b addy ke 2014-07-01 617 rs->apb_pclk = devm_clk_get(&pdev->dev, "apb_pclk"); 64e36824b addy ke 2014-07-01 618 if (IS_ERR(rs->apb_pclk)) { 64e36824b addy ke 2014-07-01 619 dev_err(&pdev->dev, "Failed to get apb_pclk\n"); 64e36824b addy ke 2014-07-01 620 ret = PTR_ERR(rs->apb_pclk); c351587e2 Jeffy Chen 2017-06-13 621 goto err_put_master; 
64e36824b addy ke 2014-07-01 622 } 64e36824b addy ke 2014-07-01 623 64e36824b addy ke 2014-07-01 624 rs->spiclk = devm_clk_get(&pdev->dev, "spiclk"); 64e36824b addy ke 2014-07-01 625 if (IS_ERR(rs->spiclk)) { 64e36824b addy ke 2014-07-01 626 dev_err(&pdev->dev, "Failed to get spi_pclk\n"); 64e36824b addy ke 2014-07-01 627 ret = PTR_ERR(rs->spiclk); c351587e2 Jeffy Chen 2017-06-13 628 goto err_put_master; 64e36824b addy ke 2014-07-01 629 } 64e36824b addy ke 2014-07-01 630 64e36824b addy ke 2014-07-01 631 ret = clk_prepare_enable(rs->apb_pclk); 43de979dd Jeffy Chen 2017-08-07 632 if (ret < 0) { 64e36824b addy ke 2014-07-01 633 dev_err(&pdev->dev, "Failed to enable apb_pclk\n"); c351587e2 Jeffy Chen 2017-06-13 634 goto err_put_master; 64e36824b addy ke 2014-07-01 635 } 64e36824b addy ke 2014-07-01 636 64e36824b addy ke 2014-07-01 637 ret = clk_prepare_enable(rs->spiclk); 43de979dd Jeffy Chen 2017-08-07 638 if (ret < 0) { 64e36824b addy ke 2014-07-01 639 dev_err(&pdev->dev, "Failed to enable spi_clk\n"); c351587e2 Jeffy Chen 2017-06-13 640 goto err_disable_apbclk; 64e36824b addy ke 2014-07-01 641 } 64e36824b addy ke 2014-07-01 642 30688e4e6 Emil Renner Berthing 2018-10-31 643 spi_enable_chip(rs, false); 64e36824b addy ke 2014-07-01 644 01b59ce5d Emil Renner Berthing 2018-10-31 645 ret = platform_get_irq(pdev, 0); 01b59ce5d Emil Renner Berthing 2018-10-31 646 if (ret < 0) 01b59ce5d Emil Renner Berthing 2018-10-31 647 goto err_disable_spiclk; 01b59ce5d Emil Renner Berthing 2018-10-31 648 01b59ce5d Emil Renner Berthing 2018-10-31 @649 ret = devm_request_threaded_irq(&pdev->dev, ret, rockchip_spi_isr, NULL, 01b59ce5d Emil Renner Berthing 2018-10-31 @650 IRQF_ONESHOT, dev_name(&pdev->dev), master); 01b59ce5d Emil Renner Berthing 2018-10-31 651 if (ret) 01b59ce5d Emil Renner Berthing 2018-10-31 652 goto err_disable_spiclk; 01b59ce5d Emil Renner Berthing 2018-10-31 653 64e36824b addy ke 2014-07-01 654 rs->dev = &pdev->dev; 420b82f84 Emil Renner Berthing 2018-10-31 655 
rs->freq = clk_get_rate(rs->spiclk); 64e36824b addy ke 2014-07-01 656 76b17e6e4 Julius Werner 2015-03-26 657 if (!of_property_read_u32(pdev->dev.of_node, "rx-sample-delay-ns", 74b7efa82 Emil Renner Berthing 2018-10-31 658 &rsd_nsecs)) { 74b7efa82 Emil Renner Berthing 2018-10-31 659 /* rx sample delay is expressed in parent clock cycles (max 3) */ 74b7efa82 Emil Renner Berthing 2018-10-31 660 u32 rsd = DIV_ROUND_CLOSEST(rsd_nsecs * (rs->freq >> 8), 74b7efa82 Emil Renner Berthing 2018-10-31 661 1000000000 >> 8); 74b7efa82 Emil Renner Berthing 2018-10-31 662 if (!rsd) { 74b7efa82 Emil Renner Berthing 2018-10-31 663 dev_warn(rs->dev, "%u Hz are too slow to express %u ns delay\n", 74b7efa82 Emil Renner Berthing 2018-10-31 664 rs->freq, rsd_nsecs); 74b7efa82 Emil Renner Berthing 2018-10-31 665 } else if (rsd > CR0_RSD_MAX) { 74b7efa82 Emil Renner Berthing 2018-10-31 666 rsd = CR0_RSD_MAX; 74b7efa82 Emil Renner Berthing 2018-10-31 667 dev_warn(rs->dev, "%u Hz are too fast to express %u ns delay, clamping at %u ns\n", 74b7efa82 Emil Renner Berthing 2018-10-31 668 rs->freq, rsd_nsecs, 74b7efa82 Emil Renner Berthing 2018-10-31 669 CR0_RSD_MAX * 1000000000U / rs->freq); 74b7efa82 Emil Renner Berthing 2018-10-31 670 } 74b7efa82 Emil Renner Berthing 2018-10-31 671 rs->rsd = rsd; 74b7efa82 Emil Renner Berthing 2018-10-31 672 } 76b17e6e4 Julius Werner 2015-03-26 673 64e36824b addy ke 2014-07-01 674 rs->fifo_len = get_fifo_len(rs); 64e36824b addy ke 2014-07-01 675 if (!rs->fifo_len) { 64e36824b addy ke 2014-07-01 676 dev_err(&pdev->dev, "Failed to get fifo length\n"); db7e8d90c Wei Yongjun 2014-07-20 677 ret = -EINVAL; c351587e2 Jeffy Chen 2017-06-13 678 goto err_disable_spiclk; 64e36824b addy ke 2014-07-01 679 } 64e36824b addy ke 2014-07-01 680 64e36824b addy ke 2014-07-01 681 pm_runtime_set_active(&pdev->dev); 64e36824b addy ke 2014-07-01 682 pm_runtime_enable(&pdev->dev); 64e36824b addy ke 2014-07-01 683 64e36824b addy ke 2014-07-01 684 master->auto_runtime_pm = true; 64e36824b 
addy ke 2014-07-01 685 master->bus_num = pdev->id; 04290192f Emil Renner Berthing 2018-10-31 686 master->mode_bits = SPI_CPOL | SPI_CPHA | SPI_LOOP | SPI_LSB_FIRST; aa099382a Jeffy Chen 2017-06-28 687 master->num_chipselect = ROCKCHIP_SPI_MAX_CS_NUM; 64e36824b addy ke 2014-07-01 688 master->dev.of_node = pdev->dev.of_node; 65498c6ae Emil Renner Berthing 2018-10-31 689 master->bits_per_word_mask = SPI_BPW_MASK(16) | SPI_BPW_MASK(8) | SPI_BPW_MASK(4); 420b82f84 Emil Renner Berthing 2018-10-31 690 master->min_speed_hz = rs->freq / BAUDR_SCKDV_MAX; 420b82f84 Emil Renner Berthing 2018-10-31 691 master->max_speed_hz = min(rs->freq / BAUDR_SCKDV_MIN, MAX_SCLK_OUT); 64e36824b addy ke 2014-07-01 692 64e36824b addy ke 2014-07-01 693 master->set_cs = rockchip_spi_set_cs; 64e36824b addy ke 2014-07-01 694 master->transfer_one = rockchip_spi_transfer_one; 5185a81c0 Brian Norris 2016-07-14 695 master->max_transfer_size = rockchip_spi_max_transfer_size; 2291793cc Andy Shevchenko 2015-02-27 696 master->handle_err = rockchip_spi_handle_err; c863795c4 Jeffy Chen 2017-06-28 697 master->flags = SPI_MASTER_GPIO_SS; 64e36824b addy ke 2014-07-01 698 eee06a9ee Emil Renner Berthing 2018-10-31 699 master->dma_tx = dma_request_chan(rs->dev, "tx"); eee06a9ee Emil Renner Berthing 2018-10-31 700 if (IS_ERR(master->dma_tx)) { 61cadcf46 Shawn Lin 2016-03-09 701 /* Check tx to see if we need defer probing driver */ eee06a9ee Emil Renner Berthing 2018-10-31 702 if (PTR_ERR(master->dma_tx) == -EPROBE_DEFER) { 61cadcf46 Shawn Lin 2016-03-09 703 ret = -EPROBE_DEFER; c351587e2 Jeffy Chen 2017-06-13 704 goto err_disable_pm_runtime; 61cadcf46 Shawn Lin 2016-03-09 705 } 64e36824b addy ke 2014-07-01 706 dev_warn(rs->dev, "Failed to request TX DMA channel\n"); eee06a9ee Emil Renner Berthing 2018-10-31 707 master->dma_tx = NULL; 61cadcf46 Shawn Lin 2016-03-09 708 } 64e36824b addy ke 2014-07-01 709 eee06a9ee Emil Renner Berthing 2018-10-31 710 master->dma_rx = dma_request_chan(rs->dev, "rx"); eee06a9ee Emil 
Renner Berthing 2018-10-31 711 if (IS_ERR(master->dma_rx)) { eee06a9ee Emil Renner Berthing 2018-10-31 712 if (PTR_ERR(master->dma_rx) == -EPROBE_DEFER) { e4c0e06f9 Shawn Lin 2016-03-31 713 ret = -EPROBE_DEFER; 5de7ed0c9 Dan Carpenter 2016-05-04 714 goto err_free_dma_tx; 64e36824b addy ke 2014-07-01 715 } 64e36824b addy ke 2014-07-01 716 dev_warn(rs->dev, "Failed to request RX DMA channel\n"); eee06a9ee Emil Renner Berthing 2018-10-31 717 master->dma_rx = NULL; 64e36824b addy ke 2014-07-01 718 } 64e36824b addy ke 2014-07-01 719 eee06a9ee Emil Renner Berthing 2018-10-31 720 if (master->dma_tx && master->dma_rx) { eee06a9ee Emil Renner Berthing 2018-10-31 721 rs->dma_addr_tx = mem->start + ROCKCHIP_SPI_TXDR; eee06a9ee Emil Renner Berthing 2018-10-31 722 rs->dma_addr_rx = mem->start + ROCKCHIP_SPI_RXDR; 64e36824b addy ke 2014-07-01 723 master->can_dma = rockchip_spi_can_dma; 64e36824b addy ke 2014-07-01 724 } 64e36824b addy ke 2014-07-01 725 64e36824b addy ke 2014-07-01 726 ret = devm_spi_register_master(&pdev->dev, master); 43de979dd Jeffy Chen 2017-08-07 727 if (ret < 0) { 64e36824b addy ke 2014-07-01 728 dev_err(&pdev->dev, "Failed to register master\n"); c351587e2 Jeffy Chen 2017-06-13 729 goto err_free_dma_rx; 64e36824b addy ke 2014-07-01 730 } 64e36824b addy ke 2014-07-01 731 64e36824b addy ke 2014-07-01 732 return 0; 64e36824b addy ke 2014-07-01 733 c351587e2 Jeffy Chen 2017-06-13 734 err_free_dma_rx: eee06a9ee Emil Renner Berthing 2018-10-31 735 if (master->dma_rx) eee06a9ee Emil Renner Berthing 2018-10-31 736 dma_release_channel(master->dma_rx); 5de7ed0c9 Dan Carpenter 2016-05-04 737 err_free_dma_tx: eee06a9ee Emil Renner Berthing 2018-10-31 738 if (master->dma_tx) eee06a9ee Emil Renner Berthing 2018-10-31 739 dma_release_channel(master->dma_tx); c351587e2 Jeffy Chen 2017-06-13 740 err_disable_pm_runtime: c351587e2 Jeffy Chen 2017-06-13 741 pm_runtime_disable(&pdev->dev); c351587e2 Jeffy Chen 2017-06-13 742 err_disable_spiclk: 64e36824b addy ke 2014-07-01 743 
clk_disable_unprepare(rs->spiclk); c351587e2 Jeffy Chen 2017-06-13 744 err_disable_apbclk: 64e36824b addy ke 2014-07-01 745 clk_disable_unprepare(rs->apb_pclk); c351587e2 Jeffy Chen 2017-06-13 746 err_put_master: 64e36824b addy ke 2014-07-01 747 spi_master_put(master); 64e36824b addy ke 2014-07-01 748 64e36824b addy ke 2014-07-01 749 return ret; 64e36824b addy ke 2014-07-01 750 } 64e36824b addy ke 2014-07-01 751 :::::: The code at line 649 was first introduced by commit :::::: 01b59ce5dac856323a0c13c1d51d99a819f32efe spi: rockchip: use irq rather than polling :::::: TO: Emil Renner Berthing <kernel@esmil.dk> :::::: CC: Mark Brown <broonie@kernel.org> --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 53136 bytes --]
* Re: [PATCH v5 6/7] refactor header includes to allow kthread.h inclusion in psi_types.h 2019-03-08 18:43 ` [PATCH v5 6/7] refactor header includes to allow kthread.h inclusion in psi_types.h Suren Baghdasaryan 2019-03-09 20:49 ` kbuild test robot @ 2019-03-09 23:12 ` kbuild test robot 1 sibling, 0 replies; 12+ messages in thread From: kbuild test robot @ 2019-03-09 23:12 UTC (permalink / raw) To: Suren Baghdasaryan Cc: kbuild-all, gregkh, tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan [-- Attachment #1: Type: text/plain, Size: 14119 bytes --] Hi Suren, Thank you for the patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v5.0] [cannot apply to next-20190306] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Suren-Baghdasaryan/psi-pressure-stall-monitors-v5/20190310-024018 config: i386-randconfig-a0-201910 (attached as .config) compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): drivers/spi/spi-rockchip.c:328:8: error: unknown type name 'irqreturn_t' static irqreturn_t rockchip_spi_isr(int irq, void *dev_id) ^ drivers/spi/spi-rockchip.c: In function 'rockchip_spi_isr': drivers/spi/spi-rockchip.c:343:9: error: 'IRQ_HANDLED' undeclared (first use in this function) return IRQ_HANDLED; ^ drivers/spi/spi-rockchip.c:343:9: note: each undeclared identifier is reported only once for each function it appears in drivers/spi/spi-rockchip.c: In function 'rockchip_spi_probe': >> drivers/spi/spi-rockchip.c:649:2: error: implicit declaration of function 'devm_request_threaded_irq' [-Werror=implicit-function-declaration] ret = devm_request_threaded_irq(&pdev->dev, ret, rockchip_spi_isr, NULL, ^ drivers/spi/spi-rockchip.c:650:4: error: 
'IRQF_ONESHOT' undeclared (first use in this function) IRQF_ONESHOT, dev_name(&pdev->dev), master); ^ cc1: some warnings being treated as errors vim +/devm_request_threaded_irq +649 drivers/spi/spi-rockchip.c 64e36824b addy ke 2014-07-01 592 64e36824b addy ke 2014-07-01 593 static int rockchip_spi_probe(struct platform_device *pdev) 64e36824b addy ke 2014-07-01 594 { 43de979dd Jeffy Chen 2017-08-07 595 int ret; 64e36824b addy ke 2014-07-01 596 struct rockchip_spi *rs; 64e36824b addy ke 2014-07-01 597 struct spi_master *master; 64e36824b addy ke 2014-07-01 598 struct resource *mem; 76b17e6e4 Julius Werner 2015-03-26 599 u32 rsd_nsecs; 64e36824b addy ke 2014-07-01 600 64e36824b addy ke 2014-07-01 601 master = spi_alloc_master(&pdev->dev, sizeof(struct rockchip_spi)); 5dcc44ed9 Addy Ke 2014-07-11 602 if (!master) 64e36824b addy ke 2014-07-01 603 return -ENOMEM; 5dcc44ed9 Addy Ke 2014-07-11 604 64e36824b addy ke 2014-07-01 605 platform_set_drvdata(pdev, master); 64e36824b addy ke 2014-07-01 606 64e36824b addy ke 2014-07-01 607 rs = spi_master_get_devdata(master); 64e36824b addy ke 2014-07-01 608 64e36824b addy ke 2014-07-01 609 /* Get basic io resource and map it */ 64e36824b addy ke 2014-07-01 610 mem = platform_get_resource(pdev, IORESOURCE_MEM, 0); 64e36824b addy ke 2014-07-01 611 rs->regs = devm_ioremap_resource(&pdev->dev, mem); 64e36824b addy ke 2014-07-01 612 if (IS_ERR(rs->regs)) { 64e36824b addy ke 2014-07-01 613 ret = PTR_ERR(rs->regs); c351587e2 Jeffy Chen 2017-06-13 614 goto err_put_master; 64e36824b addy ke 2014-07-01 615 } 64e36824b addy ke 2014-07-01 616 64e36824b addy ke 2014-07-01 617 rs->apb_pclk = devm_clk_get(&pdev->dev, "apb_pclk"); 64e36824b addy ke 2014-07-01 618 if (IS_ERR(rs->apb_pclk)) { 64e36824b addy ke 2014-07-01 619 dev_err(&pdev->dev, "Failed to get apb_pclk\n"); 64e36824b addy ke 2014-07-01 620 ret = PTR_ERR(rs->apb_pclk); c351587e2 Jeffy Chen 2017-06-13 621 goto err_put_master; 64e36824b addy ke 2014-07-01 622 } 64e36824b addy ke 
2014-07-01 623 64e36824b addy ke 2014-07-01 624 rs->spiclk = devm_clk_get(&pdev->dev, "spiclk"); 64e36824b addy ke 2014-07-01 625 if (IS_ERR(rs->spiclk)) { 64e36824b addy ke 2014-07-01 626 dev_err(&pdev->dev, "Failed to get spi_pclk\n"); 64e36824b addy ke 2014-07-01 627 ret = PTR_ERR(rs->spiclk); c351587e2 Jeffy Chen 2017-06-13 628 goto err_put_master; 64e36824b addy ke 2014-07-01 629 } 64e36824b addy ke 2014-07-01 630 64e36824b addy ke 2014-07-01 631 ret = clk_prepare_enable(rs->apb_pclk); 43de979dd Jeffy Chen 2017-08-07 632 if (ret < 0) { 64e36824b addy ke 2014-07-01 633 dev_err(&pdev->dev, "Failed to enable apb_pclk\n"); c351587e2 Jeffy Chen 2017-06-13 634 goto err_put_master; 64e36824b addy ke 2014-07-01 635 } 64e36824b addy ke 2014-07-01 636 64e36824b addy ke 2014-07-01 637 ret = clk_prepare_enable(rs->spiclk); 43de979dd Jeffy Chen 2017-08-07 638 if (ret < 0) { 64e36824b addy ke 2014-07-01 639 dev_err(&pdev->dev, "Failed to enable spi_clk\n"); c351587e2 Jeffy Chen 2017-06-13 640 goto err_disable_apbclk; 64e36824b addy ke 2014-07-01 641 } 64e36824b addy ke 2014-07-01 642 30688e4e6 Emil Renner Berthing 2018-10-31 643 spi_enable_chip(rs, false); 64e36824b addy ke 2014-07-01 644 01b59ce5d Emil Renner Berthing 2018-10-31 645 ret = platform_get_irq(pdev, 0); 01b59ce5d Emil Renner Berthing 2018-10-31 646 if (ret < 0) 01b59ce5d Emil Renner Berthing 2018-10-31 647 goto err_disable_spiclk; 01b59ce5d Emil Renner Berthing 2018-10-31 648 01b59ce5d Emil Renner Berthing 2018-10-31 @649 ret = devm_request_threaded_irq(&pdev->dev, ret, rockchip_spi_isr, NULL, 01b59ce5d Emil Renner Berthing 2018-10-31 650 IRQF_ONESHOT, dev_name(&pdev->dev), master); 01b59ce5d Emil Renner Berthing 2018-10-31 651 if (ret) 01b59ce5d Emil Renner Berthing 2018-10-31 652 goto err_disable_spiclk; 01b59ce5d Emil Renner Berthing 2018-10-31 653 64e36824b addy ke 2014-07-01 654 rs->dev = &pdev->dev; 420b82f84 Emil Renner Berthing 2018-10-31 655 rs->freq = clk_get_rate(rs->spiclk); 64e36824b addy ke 
2014-07-01 656 76b17e6e4 Julius Werner 2015-03-26 657 if (!of_property_read_u32(pdev->dev.of_node, "rx-sample-delay-ns", 74b7efa82 Emil Renner Berthing 2018-10-31 658 &rsd_nsecs)) { 74b7efa82 Emil Renner Berthing 2018-10-31 659 /* rx sample delay is expressed in parent clock cycles (max 3) */ 74b7efa82 Emil Renner Berthing 2018-10-31 660 u32 rsd = DIV_ROUND_CLOSEST(rsd_nsecs * (rs->freq >> 8), 74b7efa82 Emil Renner Berthing 2018-10-31 661 1000000000 >> 8); 74b7efa82 Emil Renner Berthing 2018-10-31 662 if (!rsd) { 74b7efa82 Emil Renner Berthing 2018-10-31 663 dev_warn(rs->dev, "%u Hz are too slow to express %u ns delay\n", 74b7efa82 Emil Renner Berthing 2018-10-31 664 rs->freq, rsd_nsecs); 74b7efa82 Emil Renner Berthing 2018-10-31 665 } else if (rsd > CR0_RSD_MAX) { 74b7efa82 Emil Renner Berthing 2018-10-31 666 rsd = CR0_RSD_MAX; 74b7efa82 Emil Renner Berthing 2018-10-31 667 dev_warn(rs->dev, "%u Hz are too fast to express %u ns delay, clamping at %u ns\n", 74b7efa82 Emil Renner Berthing 2018-10-31 668 rs->freq, rsd_nsecs, 74b7efa82 Emil Renner Berthing 2018-10-31 669 CR0_RSD_MAX * 1000000000U / rs->freq); 74b7efa82 Emil Renner Berthing 2018-10-31 670 } 74b7efa82 Emil Renner Berthing 2018-10-31 671 rs->rsd = rsd; 74b7efa82 Emil Renner Berthing 2018-10-31 672 } 76b17e6e4 Julius Werner 2015-03-26 673 64e36824b addy ke 2014-07-01 674 rs->fifo_len = get_fifo_len(rs); 64e36824b addy ke 2014-07-01 675 if (!rs->fifo_len) { 64e36824b addy ke 2014-07-01 676 dev_err(&pdev->dev, "Failed to get fifo length\n"); db7e8d90c Wei Yongjun 2014-07-20 677 ret = -EINVAL; c351587e2 Jeffy Chen 2017-06-13 678 goto err_disable_spiclk; 64e36824b addy ke 2014-07-01 679 } 64e36824b addy ke 2014-07-01 680 64e36824b addy ke 2014-07-01 681 pm_runtime_set_active(&pdev->dev); 64e36824b addy ke 2014-07-01 682 pm_runtime_enable(&pdev->dev); 64e36824b addy ke 2014-07-01 683 64e36824b addy ke 2014-07-01 684 master->auto_runtime_pm = true; 64e36824b addy ke 2014-07-01 685 master->bus_num = pdev->id; 
04290192f Emil Renner Berthing 2018-10-31 686 master->mode_bits = SPI_CPOL | SPI_CPHA | SPI_LOOP | SPI_LSB_FIRST; aa099382a Jeffy Chen 2017-06-28 687 master->num_chipselect = ROCKCHIP_SPI_MAX_CS_NUM; 64e36824b addy ke 2014-07-01 688 master->dev.of_node = pdev->dev.of_node; 65498c6ae Emil Renner Berthing 2018-10-31 689 master->bits_per_word_mask = SPI_BPW_MASK(16) | SPI_BPW_MASK(8) | SPI_BPW_MASK(4); 420b82f84 Emil Renner Berthing 2018-10-31 690 master->min_speed_hz = rs->freq / BAUDR_SCKDV_MAX; 420b82f84 Emil Renner Berthing 2018-10-31 691 master->max_speed_hz = min(rs->freq / BAUDR_SCKDV_MIN, MAX_SCLK_OUT); 64e36824b addy ke 2014-07-01 692 64e36824b addy ke 2014-07-01 693 master->set_cs = rockchip_spi_set_cs; 64e36824b addy ke 2014-07-01 694 master->transfer_one = rockchip_spi_transfer_one; 5185a81c0 Brian Norris 2016-07-14 695 master->max_transfer_size = rockchip_spi_max_transfer_size; 2291793cc Andy Shevchenko 2015-02-27 696 master->handle_err = rockchip_spi_handle_err; c863795c4 Jeffy Chen 2017-06-28 697 master->flags = SPI_MASTER_GPIO_SS; 64e36824b addy ke 2014-07-01 698 eee06a9ee Emil Renner Berthing 2018-10-31 699 master->dma_tx = dma_request_chan(rs->dev, "tx"); eee06a9ee Emil Renner Berthing 2018-10-31 700 if (IS_ERR(master->dma_tx)) { 61cadcf46 Shawn Lin 2016-03-09 701 /* Check tx to see if we need defer probing driver */ eee06a9ee Emil Renner Berthing 2018-10-31 702 if (PTR_ERR(master->dma_tx) == -EPROBE_DEFER) { 61cadcf46 Shawn Lin 2016-03-09 703 ret = -EPROBE_DEFER; c351587e2 Jeffy Chen 2017-06-13 704 goto err_disable_pm_runtime; 61cadcf46 Shawn Lin 2016-03-09 705 } 64e36824b addy ke 2014-07-01 706 dev_warn(rs->dev, "Failed to request TX DMA channel\n"); eee06a9ee Emil Renner Berthing 2018-10-31 707 master->dma_tx = NULL; 61cadcf46 Shawn Lin 2016-03-09 708 } 64e36824b addy ke 2014-07-01 709 eee06a9ee Emil Renner Berthing 2018-10-31 710 master->dma_rx = dma_request_chan(rs->dev, "rx"); eee06a9ee Emil Renner Berthing 2018-10-31 711 if 
(IS_ERR(master->dma_rx)) { eee06a9ee Emil Renner Berthing 2018-10-31 712 if (PTR_ERR(master->dma_rx) == -EPROBE_DEFER) { e4c0e06f9 Shawn Lin 2016-03-31 713 ret = -EPROBE_DEFER; 5de7ed0c9 Dan Carpenter 2016-05-04 714 goto err_free_dma_tx; 64e36824b addy ke 2014-07-01 715 } 64e36824b addy ke 2014-07-01 716 dev_warn(rs->dev, "Failed to request RX DMA channel\n"); eee06a9ee Emil Renner Berthing 2018-10-31 717 master->dma_rx = NULL; 64e36824b addy ke 2014-07-01 718 } 64e36824b addy ke 2014-07-01 719 eee06a9ee Emil Renner Berthing 2018-10-31 720 if (master->dma_tx && master->dma_rx) { eee06a9ee Emil Renner Berthing 2018-10-31 721 rs->dma_addr_tx = mem->start + ROCKCHIP_SPI_TXDR; eee06a9ee Emil Renner Berthing 2018-10-31 722 rs->dma_addr_rx = mem->start + ROCKCHIP_SPI_RXDR; 64e36824b addy ke 2014-07-01 723 master->can_dma = rockchip_spi_can_dma; 64e36824b addy ke 2014-07-01 724 } 64e36824b addy ke 2014-07-01 725 64e36824b addy ke 2014-07-01 726 ret = devm_spi_register_master(&pdev->dev, master); 43de979dd Jeffy Chen 2017-08-07 727 if (ret < 0) { 64e36824b addy ke 2014-07-01 728 dev_err(&pdev->dev, "Failed to register master\n"); c351587e2 Jeffy Chen 2017-06-13 729 goto err_free_dma_rx; 64e36824b addy ke 2014-07-01 730 } 64e36824b addy ke 2014-07-01 731 64e36824b addy ke 2014-07-01 732 return 0; 64e36824b addy ke 2014-07-01 733 c351587e2 Jeffy Chen 2017-06-13 734 err_free_dma_rx: eee06a9ee Emil Renner Berthing 2018-10-31 735 if (master->dma_rx) eee06a9ee Emil Renner Berthing 2018-10-31 736 dma_release_channel(master->dma_rx); 5de7ed0c9 Dan Carpenter 2016-05-04 737 err_free_dma_tx: eee06a9ee Emil Renner Berthing 2018-10-31 738 if (master->dma_tx) eee06a9ee Emil Renner Berthing 2018-10-31 739 dma_release_channel(master->dma_tx); c351587e2 Jeffy Chen 2017-06-13 740 err_disable_pm_runtime: c351587e2 Jeffy Chen 2017-06-13 741 pm_runtime_disable(&pdev->dev); c351587e2 Jeffy Chen 2017-06-13 742 err_disable_spiclk: 64e36824b addy ke 2014-07-01 743 
clk_disable_unprepare(rs->spiclk); c351587e2 Jeffy Chen 2017-06-13 744 err_disable_apbclk: 64e36824b addy ke 2014-07-01 745 clk_disable_unprepare(rs->apb_pclk); c351587e2 Jeffy Chen 2017-06-13 746 err_put_master: 64e36824b addy ke 2014-07-01 747 spi_master_put(master); 64e36824b addy ke 2014-07-01 748 64e36824b addy ke 2014-07-01 749 return ret; 64e36824b addy ke 2014-07-01 750 } 64e36824b addy ke 2014-07-01 751 :::::: The code at line 649 was first introduced by commit :::::: 01b59ce5dac856323a0c13c1d51d99a819f32efe spi: rockchip: use irq rather than polling :::::: TO: Emil Renner Berthing <kernel@esmil.dk> :::::: CC: Mark Brown <broonie@kernel.org> --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 32300 bytes --]
* [PATCH v5 7/7] psi: introduce psi monitor
  2019-03-08 18:43 [PATCH v5 0/7] psi: pressure stall monitors v5 Suren Baghdasaryan
  ` (5 preceding siblings ...)
  2019-03-08 18:43 ` [PATCH v5 6/7] refactor header includes to allow kthread.h inclusion in psi_types.h Suren Baghdasaryan
@ 2019-03-08 18:43 ` Suren Baghdasaryan
  2019-03-19 22:51 ` [PATCH v5 0/7] psi: pressure stall monitors v5 Minchan Kim
  7 siblings, 0 replies; 12+ messages in thread
From: Suren Baghdasaryan @ 2019-03-08 18:43 UTC (permalink / raw)
  To: gregkh
  Cc: tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team, Suren Baghdasaryan

Psi monitor aims to provide a low-latency, short-term pressure detection mechanism configurable by users. It allows users to monitor psi metric growth and trigger events whenever a metric rises above a user-defined threshold within a user-defined time window.

Time window and threshold are both expressed in usecs. Multiple psi resources with different thresholds and window sizes can be monitored concurrently.

Psi monitors activate when the system enters a stall state for the monitored psi metric and deactivate upon exit from the stall state. While the system is in the stall state, psi signal growth is monitored at a rate of 10 times per tracking window. The min window size is 500ms, therefore the min monitoring interval is 50ms. The max window size is 10s, with a monitoring interval of 1s.

When activated, a psi monitor stays active for at least the duration of one tracking window to avoid repeated activations/deactivations when the psi signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.
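The window and interval arithmetic described above can be sketched in userspace C. This is only an illustration under stated assumptions, not code from the patch: the macro and function names (PSI_WINDOW_MIN_US, psi_update_interval_us, psi_format_trigger) are hypothetical. Only the limits (500ms to 10s windows, 10 updates per window) and the "<some|full> <stall us> <window us>" trigger format are taken from this series.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Limits quoted in the patch description, in microseconds. */
#define PSI_WINDOW_MIN_US        500000ULL	/* 500ms */
#define PSI_WINDOW_MAX_US      10000000ULL	/* 10s */
#define PSI_UPDATES_PER_WINDOW       10ULL

/*
 * Hypothetical helper: monitoring interval in microseconds for a
 * given tracking window, or 0 if the window is outside the range
 * the description says is accepted.
 */
static uint64_t psi_update_interval_us(uint64_t window_us)
{
	if (window_us < PSI_WINDOW_MIN_US || window_us > PSI_WINDOW_MAX_US)
		return 0;
	return window_us / PSI_UPDATES_PER_WINDOW;
}

/*
 * Hypothetical helper: build a trigger string such as
 * "some 150000 1000000" in buf. Returns the string length, or -1 if
 * the window is out of range or buf is too small.
 */
static int psi_format_trigger(char *buf, size_t len, const char *state,
			      uint64_t threshold_us, uint64_t window_us)
{
	int n;

	assert(buf != NULL && state != NULL);

	if (psi_update_interval_us(window_us) == 0)
		return -1;

	n = snprintf(buf, len, "%s %" PRIu64 " %" PRIu64,
		     state, threshold_us, window_us);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

For a 1s window, psi_update_interval_us(1000000) yields 100000, i.e. one update every 100ms. The string built by psi_format_trigger() is what a caller would then write() to the opened pressure file.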
Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- Documentation/accounting/psi.txt | 107 +++++++ include/linux/psi.h | 8 + include/linux/psi_types.h | 82 ++++- kernel/cgroup/cgroup.c | 71 ++++- kernel/sched/psi.c | 494 ++++++++++++++++++++++++++++++- 5 files changed, 742 insertions(+), 20 deletions(-) diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt index b8ca28b60215..4fb40fe94828 100644 --- a/Documentation/accounting/psi.txt +++ b/Documentation/accounting/psi.txt @@ -63,6 +63,110 @@ tracked and exported as well, to allow detection of latency spikes which wouldn't necessarily make a dent in the time averages, or to average trends over custom time frames. +Monitoring for pressure thresholds +================================== + +Users can register triggers and use poll() to be woken up when resource +pressure exceeds certain thresholds. + +A trigger describes the maximum cumulative stall time over a specific +time window, e.g. 100ms of total stall time within any 500ms window to +generate a wakeup event. + +To register a trigger user has to open psi interface file under +/proc/pressure/ representing the resource to be monitored and write the +desired threshold and time window. The open file descriptor should be +used to wait for trigger events using select(), poll() or epoll(). +The following format is used: + +<some|full> <stall amount in us> <time window in us> + +For example writing "some 150000 1000000" into /proc/pressure/memory +would add 150ms threshold for partial memory stall measured within +1sec time window. Writing "full 50000 1000000" into /proc/pressure/io +would add 50ms threshold for full io stall measured within 1sec time window. + +Triggers can be set on more than one psi metric and more than one trigger +for the same psi metric can be specified. 
However for each trigger a separate +file descriptor is required to be able to poll it separately from others, +therefore for each trigger a separate open() syscall should be made even +when opening the same psi interface file. + +Monitors activate only when system enters stall state for the monitored +psi metric and deactivate upon exit from the stall state. While system is +in the stall state psi signal growth is monitored at a rate of 10 times per +tracking window. + +The kernel accepts window sizes ranging from 500ms to 10s, therefore min +monitoring update interval is 50ms and max is 1s. Min limit is set to +prevent overly frequent polling. Max limit is chosen as a high enough number +after which monitors are most likely not needed and psi averages can be used +instead. + +When activated, psi monitor stays active for at least the duration of one +tracking window to avoid repeated activations/deactivations when system is +bouncing in and out of the stall state. + +Notifications to userspace are rate-limited to one per tracking window. + +The trigger will de-register when the file descriptor used to define the +trigger is closed. + +Userspace monitor usage example +=============================== + +#include <errno.h> +#include <fcntl.h> +#include <stdio.h> +#include <poll.h> +#include <string.h> +#include <unistd.h> + +/* + * Monitor memory partial stall with 1s tracking window size + * and 150ms threshold. 
+ */ +int main() { + const char trig[] = "some 150000 1000000"; + struct pollfd fds; + int n; + + fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK); + if (fds.fd < 0) { + printf("/proc/pressure/memory open error: %s\n", + strerror(errno)); + return 1; + } + fds.events = POLLPRI; + + if (write(fds.fd, trig, strlen(trig) + 1) < 0) { + printf("/proc/pressure/memory write error: %s\n", + strerror(errno)); + return 1; + } + + printf("waiting for events...\n"); + while (1) { + n = poll(&fds, 1, -1); + if (n < 0) { + printf("poll error: %s\n", strerror(errno)); + return 1; + } + if (fds.revents & POLLERR) { + printf("got POLLERR, event source is gone\n"); + return 0; + } + if (fds.revents & POLLPRI) { + printf("event triggered!\n"); + } else { + printf("unknown event received: 0x%x\n", fds.revents); + return 1; + } + } + + return 0; +} + Cgroup2 interface ================= @@ -71,3 +175,6 @@ mounted, pressure stall information is also tracked for tasks grouped into cgroups. Each subdirectory in the cgroupfs mountpoint contains cpu.pressure, memory.pressure, and io.pressure files; the format is the same as the /proc/pressure/ files. + +Per-cgroup psi monitors can be specified and used the same way as +system-wide ones. 
diff --git a/include/linux/psi.h b/include/linux/psi.h index 7006008d5b72..af892c290116 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -4,6 +4,7 @@ #include <linux/jump_label.h> #include <linux/psi_types.h> #include <linux/sched.h> +#include <linux/poll.h> struct seq_file; struct css_set; @@ -26,6 +27,13 @@ int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res); int psi_cgroup_alloc(struct cgroup *cgrp); void psi_cgroup_free(struct cgroup *cgrp); void cgroup_move_task(struct task_struct *p, struct css_set *to); + +struct psi_trigger *psi_trigger_create(struct psi_group *group, + char *buf, size_t nbytes, enum psi_res res); +void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *t); + +__poll_t psi_trigger_poll(void **trigger_ptr, struct file *file, + poll_table *wait); #endif #else /* CONFIG_PSI */ diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 4d1c1f67be18..07aaf9b82241 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -1,8 +1,11 @@ #ifndef _LINUX_PSI_TYPES_H #define _LINUX_PSI_TYPES_H +#include <linux/kthread.h> #include <linux/seqlock.h> #include <linux/types.h> +#include <linux/kref.h> +#include <linux/wait.h> #ifdef CONFIG_PSI @@ -44,6 +47,12 @@ enum psi_states { NR_PSI_STATES = 6, }; +enum psi_aggregators { + PSI_AVGS = 0, + PSI_POLL, + NR_PSI_AGGREGATORS, +}; + struct psi_group_cpu { /* 1st cacheline updated by the scheduler */ @@ -65,7 +74,55 @@ struct psi_group_cpu { /* 2nd cacheline updated by the aggregator */ /* Delta detection against the sampling buckets */ - u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp; + u32 times_prev[NR_PSI_AGGREGATORS][NR_PSI_STATES] + ____cacheline_aligned_in_smp; +}; + +/* PSI growth tracking window */ +struct psi_window { + /* Window size in ns */ + u64 size; + + /* Start time of the current window in ns */ + u64 start_time; + + /* Value at the start of the window */ + u64 start_value; + + /* Value growth in the 
previous window */ + u64 prev_growth; +}; + +struct psi_trigger { + /* PSI state being monitored by the trigger */ + enum psi_states state; + + /* User-specified threshold in ns */ + u64 threshold; + + /* List node inside triggers list */ + struct list_head node; + + /* Backpointer needed during trigger destruction */ + struct psi_group *group; + + /* Wait queue for polling */ + wait_queue_head_t event_wait; + + /* Pending event flag */ + int event; + + /* Tracking window */ + struct psi_window win; + + /* + * Time last event was generated. Used for rate-limiting + * events to one per window + */ + u64 last_event_time; + + /* Refcounting to prevent premature destruction */ + struct kref refcount; }; struct psi_group { @@ -79,11 +136,32 @@ struct psi_group { u64 avg_total[NR_PSI_STATES - 1]; u64 avg_last_update; u64 avg_next_update; + + /* Aggregator work control */ struct delayed_work avgs_work; /* Total stall times and sampled pressure averages */ - u64 total[NR_PSI_STATES - 1]; + u64 total[NR_PSI_AGGREGATORS][NR_PSI_STATES - 1]; unsigned long avg[NR_PSI_STATES - 1][3]; + + /* Monitor work control */ + atomic_t poll_scheduled; + struct kthread_worker __rcu *poll_kworker; + struct kthread_delayed_work poll_work; + + /* Protects data used by the monitor */ + struct mutex trigger_lock; + + /* Configured polling triggers */ + struct list_head triggers; + u32 nr_triggers[NR_PSI_STATES - 1]; + u32 poll_states; + u64 poll_min_period; + + /* Total stall times at the start of monitor activation */ + u64 polling_total[NR_PSI_STATES - 1]; + u64 polling_next_update; + u64 polling_until; }; #else /* CONFIG_PSI */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index eef24a25bda7..005d7b453fce 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3465,7 +3465,65 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v) { return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU); } -#endif + +static ssize_t cgroup_pressure_write(struct 
kernfs_open_file *of, char *buf, + size_t nbytes, enum psi_res res) +{ + struct psi_trigger *new; + struct cgroup *cgrp; + + cgrp = cgroup_kn_lock_live(of->kn, false); + if (!cgrp) + return -ENODEV; + + cgroup_get(cgrp); + cgroup_kn_unlock(of->kn); + + new = psi_trigger_create(&cgrp->psi, buf, nbytes, res); + if (IS_ERR(new)) { + cgroup_put(cgrp); + return PTR_ERR(new); + } + + psi_trigger_replace(&of->priv, new); + + cgroup_put(cgrp); + + return nbytes; +} + +static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + return cgroup_pressure_write(of, buf, nbytes, PSI_IO); +} + +static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + return cgroup_pressure_write(of, buf, nbytes, PSI_MEM); +} + +static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + return cgroup_pressure_write(of, buf, nbytes, PSI_CPU); +} + +static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, + poll_table *pt) +{ + return psi_trigger_poll(&of->priv, of->file, pt); +} + +static void cgroup_pressure_release(struct kernfs_open_file *of) +{ + psi_trigger_replace(&of->priv, NULL); +} +#endif /* CONFIG_PSI */ static int cgroup_file_open(struct kernfs_open_file *of) { @@ -4620,18 +4678,27 @@ static struct cftype cgroup_base_files[] = { .name = "io.pressure", .flags = CFTYPE_NOT_ON_ROOT, .seq_show = cgroup_io_pressure_show, + .write = cgroup_io_pressure_write, + .poll = cgroup_pressure_poll, + .release = cgroup_pressure_release, }, { .name = "memory.pressure", .flags = CFTYPE_NOT_ON_ROOT, .seq_show = cgroup_memory_pressure_show, + .write = cgroup_memory_pressure_write, + .poll = cgroup_pressure_poll, + .release = cgroup_pressure_release, }, { .name = "cpu.pressure", .flags = CFTYPE_NOT_ON_ROOT, .seq_show = cgroup_cpu_pressure_show, + .write = cgroup_cpu_pressure_write, + .poll = cgroup_pressure_poll, + .release = 
cgroup_pressure_release, }, -#endif +#endif /* CONFIG_PSI */ { } /* terminate */ }; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 59e4e1f8bc02..f0dfa9190b4d 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -4,6 +4,9 @@ * Copyright (c) 2018 Facebook, Inc. * Author: Johannes Weiner <hannes@cmpxchg.org> * + * Polling support by Suren Baghdasaryan <surenb@google.com> + * Copyright (c) 2018 Google, Inc. + * * When CPU, memory and IO are contended, tasks experience delays that * reduce throughput and introduce latencies into the workload. Memory * and IO contention, in addition, can cause a full loss of forward @@ -129,9 +132,13 @@ #include <linux/seq_file.h> #include <linux/proc_fs.h> #include <linux/seqlock.h> +#include <linux/uaccess.h> #include <linux/cgroup.h> #include <linux/module.h> #include <linux/sched.h> +#include <linux/ctype.h> +#include <linux/file.h> +#include <linux/poll.h> #include <linux/psi.h> #include "sched.h" @@ -156,6 +163,11 @@ __setup("psi=", setup_psi); #define EXP_60s 1981 /* 1/exp(2s/60s) */ #define EXP_300s 2034 /* 1/exp(2s/300s) */ +/* PSI trigger definitions */ +#define WINDOW_MIN_US 500000 /* Min window size is 500ms */ +#define WINDOW_MAX_US 10000000 /* Max window size is 10s */ +#define UPDATES_PER_WINDOW 10 /* 10 updates per window */ + /* Sampling frequency in nanoseconds */ static u64 psi_period __read_mostly; @@ -176,6 +188,17 @@ static void group_init(struct psi_group *group) group->avg_next_update = sched_clock() + psi_period; INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work); mutex_init(&group->avgs_lock); + /* Init trigger-related members */ + atomic_set(&group->poll_scheduled, 0); + mutex_init(&group->trigger_lock); + INIT_LIST_HEAD(&group->triggers); + memset(group->nr_triggers, 0, sizeof(group->nr_triggers)); + group->poll_states = 0; + group->poll_min_period = U32_MAX; + memset(group->polling_total, 0, sizeof(group->polling_total)); + group->polling_next_update = ULLONG_MAX; + group->polling_until = 
0; + rcu_assign_pointer(group->poll_kworker, NULL); } void __init psi_init(void) @@ -210,7 +233,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state) } } -static void get_recent_times(struct psi_group *group, int cpu, u32 *times, +static void get_recent_times(struct psi_group *group, int cpu, + enum psi_aggregators aggregator, u32 *times, u32 *pchanged_states) { struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); @@ -245,8 +269,8 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times, if (state_mask & (1 << s)) times[s] += now - state_start; - delta = times[s] - groupc->times_prev[s]; - groupc->times_prev[s] = times[s]; + delta = times[s] - groupc->times_prev[aggregator][s]; + groupc->times_prev[aggregator][s] = times[s]; times[s] = delta; if (delta) @@ -274,7 +298,9 @@ static void calc_avgs(unsigned long avg[3], int missed_periods, avg[2] = calc_load(avg[2], EXP_300s, pct); } -static void collect_percpu_times(struct psi_group *group, u32 *pchanged_states) +static void collect_percpu_times(struct psi_group *group, + enum psi_aggregators aggregator, + u32 *pchanged_states) { u64 deltas[NR_PSI_STATES - 1] = { 0, }; unsigned long nonidle_total = 0; @@ -295,7 +321,7 @@ static void collect_percpu_times(struct psi_group *group, u32 *pchanged_states) u32 nonidle; u32 cpu_changed_states; - get_recent_times(group, cpu, times, + get_recent_times(group, cpu, aggregator, times, &cpu_changed_states); changed_states |= cpu_changed_states; @@ -320,7 +346,8 @@ static void collect_percpu_times(struct psi_group *group, u32 *pchanged_states) /* total= */ for (s = 0; s < NR_PSI_STATES - 1; s++) - group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL)); + group->total[aggregator][s] += + div_u64(deltas[s], max(nonidle_total, 1UL)); if (pchanged_states) *pchanged_states = changed_states; @@ -352,7 +379,7 @@ static u64 update_averages(struct psi_group *group, u64 now) for (s = 0; s < NR_PSI_STATES - 1; s++) { u32 sample; - sample = 
group->total[s] - group->avg_total[s]; + sample = group->total[PSI_AVGS][s] - group->avg_total[s]; /* * Due to the lockless sampling of the time buckets, * recorded time deltas can slip into the next period, @@ -394,7 +421,7 @@ static void psi_avgs_work(struct work_struct *work) now = sched_clock(); - collect_percpu_times(group, &changed_states); + collect_percpu_times(group, PSI_AVGS, &changed_states); nonidle = changed_states & (1 << PSI_NONIDLE); /* * If there is task activity, periodically fold the per-cpu @@ -415,6 +442,187 @@ static void psi_avgs_work(struct work_struct *work) mutex_unlock(&group->avgs_lock); } +/* Trigger tracking window manipulations */ +static void window_reset(struct psi_window *win, u64 now, u64 value, + u64 prev_growth) +{ + win->start_time = now; + win->start_value = value; + win->prev_growth = prev_growth; +} + +/* + * PSI growth tracking window update and growth calculation routine. + * + * This approximates a sliding tracking window by interpolating + * partially elapsed windows using historical growth data from the + * previous intervals. This minimizes memory requirements (by not storing + * all the intermediate values in the previous window) and simplifies + * the calculations. It works well because PSI signal changes only in + * positive direction and over relatively small window sizes the growth + * is close to linear. + */ +static u64 window_update(struct psi_window *win, u64 now, u64 value) +{ + u64 elapsed; + u64 growth; + + elapsed = now - win->start_time; + growth = value - win->start_value; + /* + * After each tracking window passes win->start_value and + * win->start_time get reset and win->prev_growth stores + * the average per-window growth of the previous window. + * win->prev_growth is then used to interpolate additional + * growth from the previous window assuming it was linear. 
+ */ + if (elapsed > win->size) + window_reset(win, now, value, growth); + else { + u32 remaining; + + remaining = win->size - elapsed; + growth += div_u64(win->prev_growth * remaining, win->size); + } + + return growth; +} + +static void init_triggers(struct psi_group *group, u64 now) +{ + struct psi_trigger *t; + + list_for_each_entry(t, &group->triggers, node) + window_reset(&t->win, now, + group->total[PSI_POLL][t->state], 0); + memcpy(group->polling_total, group->total[PSI_POLL], + sizeof(group->polling_total)); + group->polling_next_update = now + group->poll_min_period; +} + +static u64 update_triggers(struct psi_group *group, u64 now) +{ + struct psi_trigger *t; + bool new_stall = false; + u64 *total = group->total[PSI_POLL]; + + /* + * On subsequent updates, calculate growth deltas and let + * watchers know when their specified thresholds are exceeded. + */ + list_for_each_entry(t, &group->triggers, node) { + u64 growth; + + /* Check for stall activity */ + if (group->polling_total[t->state] == total[t->state]) + continue; + + /* + * Multiple triggers might be looking at the same state, + * remember to update group->polling_total[] once we've + * been through all of them. Also remember to extend the + * polling time if we see new stall activity. + */ + new_stall = true; + + /* Calculate growth since last update */ + growth = window_update(&t->win, now, total[t->state]); + if (growth < t->threshold) + continue; + + /* Limit event signaling to once per window */ + if (now < t->last_event_time + t->win.size) + continue; + + /* Generate an event */ + if (cmpxchg(&t->event, 0, 1) == 0) + wake_up_interruptible(&t->event_wait); + t->last_event_time = now; + } + + if (new_stall) + memcpy(group->polling_total, total, + sizeof(group->polling_total)); + + return now + group->poll_min_period; +} + +/* + * Schedule polling if it's not already scheduled. 
It's safe to call even from + * hotpath because even though kthread_queue_delayed_work takes worker->lock + * spinlock that spinlock is never contended due to poll_scheduled atomic + * preventing such competition. + */ +static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay) +{ + struct kthread_worker *kworker; + + /* Do not reschedule if already scheduled */ + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) != 0) + return; + + rcu_read_lock(); + + kworker = rcu_dereference(group->poll_kworker); + /* + * kworker might be NULL in case psi_trigger_destroy races with + * psi_task_change (hotpath) which can't use locks + */ + if (likely(kworker)) + kthread_queue_delayed_work(kworker, &group->poll_work, delay); + else + atomic_set(&group->poll_scheduled, 0); + + rcu_read_unlock(); +} + +static void psi_poll_work(struct kthread_work *work) +{ + struct kthread_delayed_work *dwork; + struct psi_group *group; + u32 changed_states; + u64 now; + + dwork = container_of(work, struct kthread_delayed_work, work); + group = container_of(dwork, struct psi_group, poll_work); + + atomic_set(&group->poll_scheduled, 0); + + mutex_lock(&group->trigger_lock); + + now = sched_clock(); + + collect_percpu_times(group, PSI_POLL, &changed_states); + + if (changed_states & group->poll_states) { + /* Initialize trigger windows when entering polling mode */ + if (now > group->polling_until) + init_triggers(group, now); + + /* + * Keep the monitor active for at least the duration of the + * minimum tracking window as long as monitor states are + * changing. 
+ */ + group->polling_until = now + + group->poll_min_period * UPDATES_PER_WINDOW; + } + + if (now > group->polling_until) { + group->polling_next_update = ULLONG_MAX; + goto out; + } + + if (now >= group->polling_next_update) + group->polling_next_update = update_triggers(group, now); + + psi_schedule_poll_work(group, + nsecs_to_jiffies(group->polling_next_update - now) + 1); + +out: + mutex_unlock(&group->trigger_lock); +} + static void record_times(struct psi_group_cpu *groupc, int cpu, bool memstall_tick) { @@ -461,8 +669,8 @@ static void record_times(struct psi_group_cpu *groupc, int cpu, groupc->times[PSI_NONIDLE] += delta; } -static void psi_group_change(struct psi_group *group, int cpu, - unsigned int clear, unsigned int set) +static u32 psi_group_change(struct psi_group *group, int cpu, + unsigned int clear, unsigned int set) { struct psi_group_cpu *groupc; unsigned int t, m; @@ -508,6 +716,8 @@ static void psi_group_change(struct psi_group *group, int cpu, groupc->state_mask = state_mask; write_seqcount_end(&groupc->seq); + + return state_mask; } static struct psi_group *iterate_groups(struct task_struct *task, void **iter) @@ -568,7 +778,11 @@ void psi_task_change(struct task_struct *task, int clear, int set) wake_clock = false; while ((group = iterate_groups(task, &iter))) { - psi_group_change(group, cpu, clear, set); + u32 state_mask = psi_group_change(group, cpu, clear, set); + + if (state_mask & group->poll_states) + psi_schedule_poll_work(group, 1); + if (wake_clock && !delayed_work_pending(&group->avgs_work)) schedule_delayed_work(&group->avgs_work, PSI_FREQ); } @@ -669,6 +883,8 @@ void psi_cgroup_free(struct cgroup *cgroup) cancel_delayed_work_sync(&cgroup->psi.avgs_work); free_percpu(cgroup->psi.pcpu); + /* All triggers must be removed by now */ + WARN_ONCE(cgroup->psi.poll_states, "psi: trigger leak\n"); } /** @@ -730,7 +946,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) /* Update averages before reporting them 
*/ mutex_lock(&group->avgs_lock); - collect_percpu_times(group, NULL); + collect_percpu_times(group, PSI_AVGS, NULL); update_averages(group, sched_clock()); mutex_unlock(&group->avgs_lock); @@ -741,7 +957,8 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) for (w = 0; w < 3; w++) avg[w] = group->avg[res * 2 + full][w]; - total = div_u64(group->total[res * 2 + full], NSEC_PER_USEC); + total = div_u64(group->total[PSI_AVGS][res * 2 + full], + NSEC_PER_USEC); seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n", full ? "full" : "some", @@ -784,25 +1001,270 @@ static int psi_cpu_open(struct inode *inode, struct file *file) return single_open(file, psi_cpu_show, NULL); } +struct psi_trigger *psi_trigger_create(struct psi_group *group, + char *buf, size_t nbytes, enum psi_res res) +{ + struct psi_trigger *t; + enum psi_states state; + u32 threshold_us; + u32 window_us; + + if (static_branch_likely(&psi_disabled)) + return ERR_PTR(-EOPNOTSUPP); + + if (sscanf(buf, "some %u %u", &threshold_us, &window_us) == 2) + state = PSI_IO_SOME + res * 2; + else if (sscanf(buf, "full %u %u", &threshold_us, &window_us) == 2) + state = PSI_IO_FULL + res * 2; + else + return ERR_PTR(-EINVAL); + + if (state >= PSI_NONIDLE) + return ERR_PTR(-EINVAL); + + if (window_us < WINDOW_MIN_US || + window_us > WINDOW_MAX_US) + return ERR_PTR(-EINVAL); + + /* Check threshold */ + if (threshold_us == 0 || threshold_us > window_us) + return ERR_PTR(-EINVAL); + + t = kmalloc(sizeof(*t), GFP_KERNEL); + if (!t) + return ERR_PTR(-ENOMEM); + + t->group = group; + t->state = state; + t->threshold = threshold_us * NSEC_PER_USEC; + t->win.size = window_us * NSEC_PER_USEC; + window_reset(&t->win, 0, 0, 0); + + t->event = 0; + t->last_event_time = 0; + init_waitqueue_head(&t->event_wait); + kref_init(&t->refcount); + + mutex_lock(&group->trigger_lock); + + if (!rcu_access_pointer(group->poll_kworker)) { + struct sched_param param = { + .sched_priority = 
MAX_RT_PRIO - 1, + }; + struct kthread_worker *kworker; + + kworker = kthread_create_worker(0, "psimon"); + if (IS_ERR(kworker)) { + kfree(t); + mutex_unlock(&group->trigger_lock); + return ERR_CAST(kworker); + } + sched_setscheduler(kworker->task, SCHED_FIFO, &param); + kthread_init_delayed_work(&group->poll_work, + psi_poll_work); + rcu_assign_pointer(group->poll_kworker, kworker); + } + + list_add(&t->node, &group->triggers); + group->poll_min_period = min(group->poll_min_period, + div_u64(t->win.size, UPDATES_PER_WINDOW)); + group->nr_triggers[t->state]++; + group->poll_states |= (1 << t->state); + + mutex_unlock(&group->trigger_lock); + + return t; +} + +static void psi_trigger_destroy(struct kref *ref) +{ + struct psi_trigger *t = container_of(ref, struct psi_trigger, refcount); + struct psi_group *group = t->group; + struct kthread_worker *kworker_to_destroy = NULL; + + if (static_branch_likely(&psi_disabled)) + return; + + /* + * Wakeup waiters to stop polling. Can happen if cgroup is deleted + * from under a polling process. 
+ */ + wake_up_interruptible(&t->event_wait); + + mutex_lock(&group->trigger_lock); + + if (!list_empty(&t->node)) { + struct psi_trigger *tmp; + u64 period = ULLONG_MAX; + + list_del(&t->node); + group->nr_triggers[t->state]--; + if (!group->nr_triggers[t->state]) + group->poll_states &= ~(1 << t->state); + /* reset min update period for the remaining triggers */ + list_for_each_entry(tmp, &group->triggers, node) + period = min(period, div_u64(tmp->win.size, + UPDATES_PER_WINDOW)); + group->poll_min_period = period; + /* Destroy poll_kworker when the last trigger is destroyed */ + if (group->poll_states == 0) { + group->polling_until = 0; + kworker_to_destroy = rcu_dereference_protected( + group->poll_kworker, + lockdep_is_held(&group->trigger_lock)); + rcu_assign_pointer(group->poll_kworker, NULL); + } + } + + mutex_unlock(&group->trigger_lock); + + /* + * Wait for both *trigger_ptr from psi_trigger_replace and + * poll_kworker RCUs to complete their read-side critical sections + * before destroying the trigger and optionally the poll_kworker + */ + synchronize_rcu(); + /* + * Destroy the kworker after releasing trigger_lock to prevent a + * deadlock while waiting for psi_poll_work to acquire trigger_lock + */ + if (kworker_to_destroy) { + kthread_cancel_delayed_work_sync(&group->poll_work); + kthread_destroy_worker(kworker_to_destroy); + } + kfree(t); +} + +void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new) +{ + struct psi_trigger *old = *trigger_ptr; + + if (static_branch_likely(&psi_disabled)) + return; + + rcu_assign_pointer(*trigger_ptr, new); + if (old) + kref_put(&old->refcount, psi_trigger_destroy); +} + +__poll_t psi_trigger_poll(void **trigger_ptr, + struct file *file, poll_table *wait) +{ + __poll_t ret = DEFAULT_POLLMASK; + struct psi_trigger *t; + + if (static_branch_likely(&psi_disabled)) + return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; + + rcu_read_lock(); + + t = rcu_dereference(*(void __rcu __force **)trigger_ptr); + if (!t) { 
+ rcu_read_unlock(); + return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; + } + kref_get(&t->refcount); + + rcu_read_unlock(); + + poll_wait(file, &t->event_wait, wait); + + if (cmpxchg(&t->event, 1, 0) == 1) + ret |= EPOLLPRI; + + kref_put(&t->refcount, psi_trigger_destroy); + + return ret; +} + +static ssize_t psi_write(struct file *file, const char __user *user_buf, + size_t nbytes, enum psi_res res) +{ + char buf[32]; + size_t buf_size; + struct seq_file *seq; + struct psi_trigger *new; + + if (static_branch_likely(&psi_disabled)) + return -EOPNOTSUPP; + + buf_size = min(nbytes, (sizeof(buf) - 1)); + if (copy_from_user(buf, user_buf, buf_size)) + return -EFAULT; + + buf[buf_size - 1] = '\0'; + + new = psi_trigger_create(&psi_system, buf, nbytes, res); + if (IS_ERR(new)) + return PTR_ERR(new); + + seq = file->private_data; + /* Take seq->lock to protect seq->private from concurrent writes */ + mutex_lock(&seq->lock); + psi_trigger_replace(&seq->private, new); + mutex_unlock(&seq->lock); + + return nbytes; +} + +static ssize_t psi_io_write(struct file *file, const char __user *user_buf, + size_t nbytes, loff_t *ppos) +{ + return psi_write(file, user_buf, nbytes, PSI_IO); +} + +static ssize_t psi_memory_write(struct file *file, const char __user *user_buf, + size_t nbytes, loff_t *ppos) +{ + return psi_write(file, user_buf, nbytes, PSI_MEM); +} + +static ssize_t psi_cpu_write(struct file *file, const char __user *user_buf, + size_t nbytes, loff_t *ppos) +{ + return psi_write(file, user_buf, nbytes, PSI_CPU); +} + +static __poll_t psi_fop_poll(struct file *file, poll_table *wait) +{ + struct seq_file *seq = file->private_data; + + return psi_trigger_poll(&seq->private, file, wait); +} + +static int psi_fop_release(struct inode *inode, struct file *file) +{ + struct seq_file *seq = file->private_data; + + psi_trigger_replace(&seq->private, NULL); + return single_release(inode, file); +} + static const struct file_operations psi_io_fops = { .open = psi_io_open, .read = 
seq_read, .llseek = seq_lseek, - .release = single_release, + .write = psi_io_write, + .poll = psi_fop_poll, + .release = psi_fop_release, }; static const struct file_operations psi_memory_fops = { .open = psi_memory_open, .read = seq_read, .llseek = seq_lseek, - .release = single_release, + .write = psi_memory_write, + .poll = psi_fop_poll, + .release = psi_fop_release, }; static const struct file_operations psi_cpu_fops = { .open = psi_cpu_open, .read = seq_read, .llseek = seq_lseek, - .release = single_release, + .write = psi_cpu_write, + .poll = psi_fop_poll, + .release = psi_fop_release, }; static int __init psi_proc_init(void) -- 2.21.0.360.g471c308f928-goog
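The window_update() interpolation in the patch above can be tried out in userspace. The following is a minimal sketch, not the kernel code: sketch_window and sketch_window_update are hypothetical names, plain 64-bit division stands in for div_u64(), and remaining is kept as a 64-bit value rather than the u32 used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct psi_window from the patch. */
struct sketch_window {
	uint64_t size;        /* window size in ns */
	uint64_t start_time;  /* start of the current window in ns */
	uint64_t start_value; /* stall total at the start of the window */
	uint64_t prev_growth; /* growth recorded for the previous window */
};

/* Returns the estimated growth over the last win->size nanoseconds. */
uint64_t sketch_window_update(struct sketch_window *win,
			      uint64_t now, uint64_t value)
{
	uint64_t elapsed = now - win->start_time;
	uint64_t growth = value - win->start_value;

	if (elapsed > win->size) {
		/* Window expired: restart it and remember its growth. */
		win->start_time = now;
		win->start_value = value;
		win->prev_growth = growth;
	} else {
		/*
		 * Window still open: add the share of the previous
		 * window's growth that the sliding window still covers,
		 * assuming that growth was linear.
		 */
		uint64_t remaining = win->size - elapsed;

		growth += win->prev_growth * remaining / win->size;
	}

	return growth;
}
```

With a 1000 ns window, a prev_growth of 400 and 250 ns elapsed, a current-window growth of 100 is reported as 100 + 400 * 750 / 1000 = 400.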
* Re: [PATCH v5 0/7] psi: pressure stall monitors v5 2019-03-08 18:43 [PATCH v5 0/7] psi: pressure stall monitors v5 Suren Baghdasaryan ` (6 preceding siblings ...) 2019-03-08 18:43 ` [PATCH v5 7/7] psi: introduce psi monitor Suren Baghdasaryan @ 2019-03-19 22:51 ` Minchan Kim 2019-03-20 0:03 ` Suren Baghdasaryan 7 siblings, 1 reply; 12+ messages in thread From: Minchan Kim @ 2019-03-19 22:51 UTC (permalink / raw) To: Suren Baghdasaryan Cc: gregkh, tj, lizefan, hannes, axboe, dennis, dennisszhou, mingo, peterz, akpm, corbet, cgroups, linux-mm, linux-doc, linux-kernel, kernel-team On Fri, Mar 08, 2019 at 10:43:04AM -0800, Suren Baghdasaryan wrote: > This is respin of: > https://lwn.net/ml/linux-kernel/20190206023446.177362-1-surenb%40google.com/ > > Android is adopting psi to detect and remedy memory pressure that > results in stuttering and decreased responsiveness on mobile devices. > > Psi gives us the stall information, but because we're dealing with > latencies in the millisecond range, periodically reading the pressure > files to detect stalls in a timely fashion is not feasible. Psi also > doesn't aggregate its averages at a high-enough frequency right now. > > This patch series extends the psi interface such that users can > configure sensitive latency thresholds and use poll() and friends to > be notified when these are breached. > > As high-frequency aggregation is costly, it implements an aggregation > method that is optimized for fast, short-interval averaging, and makes > the aggregation frequency adaptive, such that high-frequency updates > only happen while monitored stall events are actively occurring. > > With these patches applied, Android can monitor for, and ward off, > mounting memory shortages before they cause problems for the user. > For example, using memory stall monitors in userspace low memory > killer daemon (lmkd) we can detect mounting pressure and kill less > important processes before device becomes visibly sluggish. 
In our > memory stress testing psi memory monitors produce roughly 10x less > false positives compared to vmpressure signals. Having ability to > specify multiple triggers for the same psi metric allows other parts > of Android framework to monitor memory state of the device and act > accordingly. > > The new interface is straight-forward. The user opens one of the > pressure files for writing and writes a trigger description into the > file descriptor that defines the stall state - some or full, and the > maximum stall time over a given window of time. E.g.: > > /* Signal when stall time exceeds 100ms of a 1s window */ > char trigger[] = "full 100000 1000000" > fd = open("/proc/pressure/memory") > write(fd, trigger, sizeof(trigger)) > while (poll() >= 0) { > ... > }; > close(fd); > > When the monitored stall state is entered, psi adapts its aggregation > frequency according to what the configured time window requires in > order to emit event signals in a timely fashion. Once the stalling > subsides, aggregation reverts back to normal. > > The trigger is associated with the open file descriptor. To stop > monitoring, the user only needs to close the file descriptor and the > trigger is discarded. > > Patches 1-6 prepare the psi code for polling support. Patch 7 implements > the adaptive polling logic, the pressure growth detection optimized for > short intervals, and hooks up write() and poll() on the pressure files. > > The patches were developed in collaboration with Johannes Weiner. > > The patches are based on 5.0-rc8 (Merge tag 'drm-next-2019-03-06'). 
>
> Suren Baghdasaryan (7):
>   psi: introduce state_mask to represent stalled psi states
>   psi: make psi_enable static
>   psi: rename psi fields in preparation for psi trigger addition
>   psi: split update_stats into parts
>   psi: track changed states
>   refactor header includes to allow kthread.h inclusion in psi_types.h
>   psi: introduce psi monitor
>
>  Documentation/accounting/psi.txt | 107 ++++++
>  include/linux/kthread.h          |   3 +-
>  include/linux/psi.h              |   8 +
>  include/linux/psi_types.h        | 105 +++++-
>  include/linux/sched.h            |   1 -
>  kernel/cgroup/cgroup.c           |  71 +++-
>  kernel/kthread.c                 |   1 +
>  kernel/sched/psi.c               | 613 ++++++++++++++++++++++++++++---
>  8 files changed, 833 insertions(+), 76 deletions(-)
>
> Changes in v5:
> - Fixed sparse: error: incompatible types in comparison expression, as per
>   Andrew
> - Changed psi_enable to static, as per Andrew
> - Refactored headers to be able to include kthread.h into psi_types.h
>   without creating a circular inclusion, as per Johannes
> - Split psi monitor from aggregator, used RT worker for psi monitoring to
>   prevent it being starved by other RT threads and memory pressure events
>   being delayed or lost, as per Minchan and Android Performance Team
> - Fixed blockable memory allocation under rcu_read_lock inside
>   psi_trigger_poll by using refcounting, as per Eva Huang and Minchan
> - Misc cleanup and improvements, as per Johannes
>
> Notes:
> 0001-psi-introduce-state_mask-to-represent-stalled-psi-st.patch is unchanged
> from the previous version and provided for completeness.

Please fix kbuild test bot's warning in 6/7
Other than that, for all patches,
Acked-by: Minchan Kim <minchan@kernel.org>
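The interface example quoted above is pseudo-code. A compilable userspace sketch of the same flow follows, assuming the pressure file accepts the trigger exactly as the cover letter describes. psi_format_trigger, psi_register_trigger and psi_wait_event are hypothetical helper names, not part of any library, and the reliance on POLLPRI and POLLERR is based on what psi_trigger_poll() in patch 7/7 returns.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build "some|full <threshold_us> <window_us>"; returns length or -1. */
int psi_format_trigger(char *buf, size_t len, int full,
		       unsigned int threshold_us, unsigned int window_us)
{
	int n = snprintf(buf, len, "%s %u %u", full ? "full" : "some",
			 threshold_us, window_us);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Open a pressure file and write the trigger; returns the fd or -1. */
int psi_register_trigger(const char *path, const char *trigger)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;

	/* Write the terminating NUL too, like sizeof(trigger) above. */
	if (write(fd, trigger, strlen(trigger) + 1) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}

/* Wait for one event: 1 on event, 0 on timeout, -1 on error. */
int psi_wait_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	if (n == 0)
		return 0;
	/* POLLERR is reported when no trigger is attached to the fd. */
	if (pfd.revents & POLLERR)
		return -1;

	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A caller would format "full 100000 1000000", register it on /proc/pressure/memory, loop on psi_wait_event(fd, -1), and close(fd) to discard the trigger.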
* Re: [PATCH v5 0/7] psi: pressure stall monitors v5 2019-03-19 22:51 ` [PATCH v5 0/7] psi: pressure stall monitors v5 Minchan Kim @ 2019-03-20 0:03 ` Suren Baghdasaryan 0 siblings, 0 replies; 12+ messages in thread From: Suren Baghdasaryan @ 2019-03-20 0:03 UTC (permalink / raw) To: Minchan Kim Cc: Greg Kroah-Hartman, Tejun Heo, lizefan, Johannes Weiner, axboe, dennis, Dennis Zhou, Ingo Molnar, Peter Zijlstra, Andrew Morton, Jonathan Corbet, cgroups, linux-mm, linux-doc, LKML, kernel-team On Tue, Mar 19, 2019 at 3:51 PM Minchan Kim <minchan@kernel.org> wrote: > > On Fri, Mar 08, 2019 at 10:43:04AM -0800, Suren Baghdasaryan wrote: > > This is respin of: > > https://lwn.net/ml/linux-kernel/20190206023446.177362-1-surenb%40google.com/ > > > > Android is adopting psi to detect and remedy memory pressure that > > results in stuttering and decreased responsiveness on mobile devices. > > > > Psi gives us the stall information, but because we're dealing with > > latencies in the millisecond range, periodically reading the pressure > > files to detect stalls in a timely fashion is not feasible. Psi also > > doesn't aggregate its averages at a high-enough frequency right now. > > > > This patch series extends the psi interface such that users can > > configure sensitive latency thresholds and use poll() and friends to > > be notified when these are breached. > > > > As high-frequency aggregation is costly, it implements an aggregation > > method that is optimized for fast, short-interval averaging, and makes > > the aggregation frequency adaptive, such that high-frequency updates > > only happen while monitored stall events are actively occurring. > > > > With these patches applied, Android can monitor for, and ward off, > > mounting memory shortages before they cause problems for the user. 
> > For example, using memory stall monitors in userspace low memory > > killer daemon (lmkd) we can detect mounting pressure and kill less > > important processes before device becomes visibly sluggish. In our > > memory stress testing psi memory monitors produce roughly 10x less > > false positives compared to vmpressure signals. Having ability to > > specify multiple triggers for the same psi metric allows other parts > > of Android framework to monitor memory state of the device and act > > accordingly. > > > > The new interface is straight-forward. The user opens one of the > > pressure files for writing and writes a trigger description into the > > file descriptor that defines the stall state - some or full, and the > > maximum stall time over a given window of time. E.g.: > > > > /* Signal when stall time exceeds 100ms of a 1s window */ > > char trigger[] = "full 100000 1000000" > > fd = open("/proc/pressure/memory") > > write(fd, trigger, sizeof(trigger)) > > while (poll() >= 0) { > > ... > > }; > > close(fd); > > > > When the monitored stall state is entered, psi adapts its aggregation > > frequency according to what the configured time window requires in > > order to emit event signals in a timely fashion. Once the stalling > > subsides, aggregation reverts back to normal. > > > > The trigger is associated with the open file descriptor. To stop > > monitoring, the user only needs to close the file descriptor and the > > trigger is discarded. > > > > Patches 1-6 prepare the psi code for polling support. Patch 7 implements > > the adaptive polling logic, the pressure growth detection optimized for > > short intervals, and hooks up write() and poll() on the pressure files. > > > > The patches were developed in collaboration with Johannes Weiner. > > > > The patches are based on 5.0-rc8 (Merge tag 'drm-next-2019-03-06'). 
> >
> > Suren Baghdasaryan (7):
> >   psi: introduce state_mask to represent stalled psi states
> >   psi: make psi_enable static
> >   psi: rename psi fields in preparation for psi trigger addition
> >   psi: split update_stats into parts
> >   psi: track changed states
> >   refactor header includes to allow kthread.h inclusion in psi_types.h
> >   psi: introduce psi monitor
> >
> >  Documentation/accounting/psi.txt | 107 ++++++
> >  include/linux/kthread.h          |   3 +-
> >  include/linux/psi.h              |   8 +
> >  include/linux/psi_types.h        | 105 +++++-
> >  include/linux/sched.h            |   1 -
> >  kernel/cgroup/cgroup.c           |  71 +++-
> >  kernel/kthread.c                 |   1 +
> >  kernel/sched/psi.c               | 613 ++++++++++++++++++++++++++++---
> >  8 files changed, 833 insertions(+), 76 deletions(-)
> >
> > Changes in v5:
> > - Fixed sparse: error: incompatible types in comparison expression, as per
> >   Andrew
> > - Changed psi_enable to static, as per Andrew
> > - Refactored headers to be able to include kthread.h into psi_types.h
> >   without creating a circular inclusion, as per Johannes
> > - Split psi monitor from aggregator, used RT worker for psi monitoring to
> >   prevent it being starved by other RT threads and memory pressure events
> >   being delayed or lost, as per Minchan and Android Performance Team
> > - Fixed blockable memory allocation under rcu_read_lock inside
> >   psi_trigger_poll by using refcounting, as per Eva Huang and Minchan
> > - Misc cleanup and improvements, as per Johannes
> >
> > Notes:
> > 0001-psi-introduce-state_mask-to-represent-stalled-psi-st.patch is unchanged
> > from the previous version and provided for completeness.
>
> Please fix kbuild test bot's warning in 6/7
> Other than that, for all patches,

Thanks for the review! Pushed v6 with the fix for the warning:
https://lkml.org/lkml/2019/3/19/987
Also fixed a bug introduced in https://lkml.org/lkml/2019/3/8/686 which I
discovered while testing (description in the changelog of the new patchset).

> > Acked-by: Minchan Kim <minchan@kernel.org>
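For readers checking trigger strings against this series: the limits enforced by psi_trigger_create() in patch 7/7 can be mirrored in a few lines of userspace C. This is a sketch under that assumption only. sketch_trigger_valid is a hypothetical name, the two constants are copied from the patch, and the kernel-side state and allocation handling is left out.

```c
#include <assert.h>
#include <stdio.h>

#define WINDOW_MIN_US 500000u   /* min window size is 500ms, as in the patch */
#define WINDOW_MAX_US 10000000u /* max window size is 10s, as in the patch */

/*
 * Parse "some|full <threshold_us> <window_us>" and apply the same range
 * checks as the patch. Returns 0 if the trigger would be accepted and -1
 * if it would be rejected (-EINVAL in the kernel).
 */
int sketch_trigger_valid(const char *buf, int *full,
			 unsigned int *threshold_us, unsigned int *window_us)
{
	if (sscanf(buf, "some %u %u", threshold_us, window_us) == 2)
		*full = 0;
	else if (sscanf(buf, "full %u %u", threshold_us, window_us) == 2)
		*full = 1;
	else
		return -1;

	/* The window must lie between 500ms and 10s. */
	if (*window_us < WINDOW_MIN_US || *window_us > WINDOW_MAX_US)
		return -1;

	/* The threshold must be non-zero and fit inside the window. */
	if (*threshold_us == 0 || *threshold_us > *window_us)
		return -1;

	return 0;
}
```

The cover letter's "full 100000 1000000" passes these checks. With UPDATES_PER_WINDOW set to 10 in the patch, its 1s window corresponds to a polling period of 100ms while the monitor is active.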
Thread overview: 12+ messages
2019-03-08 18:43 [PATCH v5 0/7] psi: pressure stall monitors v5 Suren Baghdasaryan
2019-03-08 18:43 ` [PATCH v5 1/7] psi: introduce state_mask to represent stalled psi states Suren Baghdasaryan
2019-03-08 18:43 ` [PATCH v5 2/7] psi: make psi_enable static Suren Baghdasaryan
2019-03-08 18:43 ` [PATCH v5 3/7] psi: rename psi fields in preparation for psi trigger addition Suren Baghdasaryan
2019-03-08 18:43 ` [PATCH v5 4/7] psi: split update_stats into parts Suren Baghdasaryan
2019-03-08 18:43 ` [PATCH v5 5/7] psi: track changed states Suren Baghdasaryan
2019-03-08 18:43 ` [PATCH v5 6/7] refactor header includes to allow kthread.h inclusion in psi_types.h Suren Baghdasaryan
2019-03-09 20:49   ` kbuild test robot
2019-03-09 23:12   ` kbuild test robot
2019-03-08 18:43 ` [PATCH v5 7/7] psi: introduce psi monitor Suren Baghdasaryan
2019-03-19 22:51 ` [PATCH v5 0/7] psi: pressure stall monitors v5 Minchan Kim
2019-03-20  0:03   ` Suren Baghdasaryan