* [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted @ 2020-05-11 22:55 Jakub Kicinski 2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Jakub Kicinski @ 2020-05-11 22:55 UTC (permalink / raw) To: akpm Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb, mhocko, Jakub Kicinski Tejun describes the problem as follows: When swap runs out, there's an abrupt change in system behavior - the anonymous memory suddenly becomes unmanageable which readily breaks any sort of memory isolation and can bring down the whole system. To avoid that, oomd [1] monitors free swap space and triggers kills when it drops below the specific threshold (e.g. 15%). While this works, it's far from ideal: - Depending on IO performance and total swap size, a given headroom might not be enough or too much. - oomd has to monitor swap depletion in addition to the usual pressure metrics and it currently doesn't consider memory.swap.max. Solve this by adapting parts of the approach that memory.high uses - slow down allocation as the resource gets depleted turning the depletion behavior from abrupt cliff one to gradual degradation observable through memory pressure metric. [1] https://github.com/facebookincubator/oomd v1: https://lore.kernel.org/linux-mm/20200417010617.927266-1-kuba@kernel.org/ Jakub Kicinski (3): mm: prepare for swap over-high accounting and penalty calculation mm: move penalty delay clamping out of calculate_high_delay() mm: automatically penalize tasks with high swap use Documentation/admin-guide/cgroup-v2.rst | 16 +++ include/linux/memcontrol.h | 4 + mm/memcontrol.c | 166 ++++++++++++++++++------ 3 files changed, 147 insertions(+), 39 deletions(-) -- 2.25.4 ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation 2020-05-11 22:55 [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski @ 2020-05-11 22:55 ` Jakub Kicinski 2020-05-12 7:08 ` Michal Hocko 2020-05-11 22:55 ` [PATCH mm v2 2/3] mm: move penalty delay clamping out of calculate_high_delay() Jakub Kicinski 2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski 2 siblings, 1 reply; 15+ messages in thread From: Jakub Kicinski @ 2020-05-11 22:55 UTC (permalink / raw) To: akpm Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb, mhocko, Jakub Kicinski Slice the memory overage calculation logic a little bit so we can reuse it to apply a similar penalty to the swap. The logic which accesses the memory-specific fields (use and high values) has to be taken out of calculate_high_delay(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- mm/memcontrol.c | 62 ++++++++++++++++++++++++++++--------------------- 1 file changed, 35 insertions(+), 27 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 05dcb72314b5..8a9b671c3249 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2321,41 +2321,48 @@ static void high_work_func(struct work_struct *work) #define MEMCG_DELAY_PRECISION_SHIFT 20 #define MEMCG_DELAY_SCALING_SHIFT 14 -/* - * Get the number of jiffies that we should penalise a mischievous cgroup which - * is exceeding its memory.high by checking both it and its ancestors. - */ -static unsigned long calculate_high_delay(struct mem_cgroup *memcg, - unsigned int nr_pages) +static u64 calculate_overage(unsigned long usage, unsigned long high) { - unsigned long penalty_jiffies; - u64 max_overage = 0; - - do { - unsigned long usage, high; - u64 overage; + u64 overage; - usage = page_counter_read(&memcg->memory); - high = READ_ONCE(memcg->high); + if (usage <= high) + return 0; - if (usage <= high) - continue; + /* + * Prevent division by 0 in overage calculation by acting as if + * it was a threshold of 1 page + */ + high = max(high, 1UL); - /* - * Prevent division by 0 in overage calculation by acting as if - * it was a threshold of 1 page - */ - high = max(high, 1UL); + overage = usage - high; + overage <<= MEMCG_DELAY_PRECISION_SHIFT; + return div64_u64(overage, high); +} - overage = usage - high; - overage <<= MEMCG_DELAY_PRECISION_SHIFT; - overage = div64_u64(overage, high); +static u64 mem_find_max_overage(struct mem_cgroup *memcg) +{ + u64 overage, max_overage = 0; - if (overage > max_overage) - max_overage = overage; + do { + overage = calculate_overage(page_counter_read(&memcg->memory), + READ_ONCE(memcg->high)); + max_overage = max(overage, max_overage); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); + return max_overage; +} + +/* + * Get the number of jiffies that we should penalise a mischievous cgroup which + * is exceeding its memory.high by checking both it and its ancestors. + */ +static unsigned long calculate_high_delay(struct mem_cgroup *memcg, + unsigned int nr_pages, + u64 max_overage) +{ + unsigned long penalty_jiffies; + if (!max_overage) return 0; @@ -2411,7 +2418,8 @@ void mem_cgroup_handle_over_high(void) * memory.high is breached and reclaim is unable to keep up. Throttle * allocators proactively to slow down excessive growth. */ - penalty_jiffies = calculate_high_delay(memcg, nr_pages); + penalty_jiffies = calculate_high_delay(memcg, nr_pages, + mem_find_max_overage(memcg)); /* * Don't sleep if the amount of jiffies this memcg owes us is so low -- 2.25.4 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation 2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski @ 2020-05-12 7:08 ` Michal Hocko 2020-05-12 17:28 ` Jakub Kicinski 0 siblings, 1 reply; 15+ messages in thread From: Michal Hocko @ 2020-05-12 7:08 UTC (permalink / raw) To: Jakub Kicinski Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb On Mon 11-05-20 15:55:14, Jakub Kicinski wrote: > Slice the memory overage calculation logic a little bit so we can > reuse it to apply a similar penalty to the swap. The logic which > accesses the memory-specific fields (use and high values) has to > be taken out of calculate_high_delay(). > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> some recommendations below. > --- > mm/memcontrol.c | 62 ++++++++++++++++++++++++++++--------------------- > 1 file changed, 35 insertions(+), 27 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 05dcb72314b5..8a9b671c3249 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2321,41 +2321,48 @@ static void high_work_func(struct work_struct *work) > #define MEMCG_DELAY_PRECISION_SHIFT 20 > #define MEMCG_DELAY_SCALING_SHIFT 14 > > -/* > - * Get the number of jiffies that we should penalise a mischievous cgroup which > - * is exceeding its memory.high by checking both it and its ancestors. > - */ > -static unsigned long calculate_high_delay(struct mem_cgroup *memcg, > - unsigned int nr_pages) > +static u64 calculate_overage(unsigned long usage, unsigned long high) the naming is slightly confusing. I would concider the return value to be in memory units rather than time because I would read it as overrage of high. calculate_throttle_penalty would be more clear to me. > { > - unsigned long penalty_jiffies; > - u64 max_overage = 0; > - > - do { > - unsigned long usage, high; > - u64 overage; > + u64 overage; > > - usage = page_counter_read(&memcg->memory); > - high = READ_ONCE(memcg->high); > + if (usage <= high) > + return 0; > > - if (usage <= high) > - continue; > + /* > + * Prevent division by 0 in overage calculation by acting as if > + * it was a threshold of 1 page > + */ > + high = max(high, 1UL); > > - /* > - * Prevent division by 0 in overage calculation by acting as if > - * it was a threshold of 1 page > - */ > - high = max(high, 1UL); > + overage = usage - high; > + overage <<= MEMCG_DELAY_PRECISION_SHIFT; > + return div64_u64(overage, high); > +} > > - overage = usage - high; > - overage <<= MEMCG_DELAY_PRECISION_SHIFT; > - overage = div64_u64(overage, high); > +static u64 mem_find_max_overage(struct mem_cgroup *memcg) This would then become find_high_throttle_penalty > +{ > + u64 overage, max_overage = 0; > > - if (overage > max_overage) > - max_overage = overage; > + do { > + overage = calculate_overage(page_counter_read(&memcg->memory), > + READ_ONCE(memcg->high)); > + max_overage = max(overage, max_overage); > } while ((memcg = parent_mem_cgroup(memcg)) && > !mem_cgroup_is_root(memcg)); > > + return max_overage; > +} > + > +/* > + * Get the number of jiffies that we should penalise a mischievous cgroup which > + * is exceeding its memory.high by checking both it and its ancestors. > + */ > +static unsigned long calculate_high_delay(struct mem_cgroup *memcg, > + unsigned int nr_pages, > + u64 max_overage) > +{ > + unsigned long penalty_jiffies; > + > if (!max_overage) > return 0; > > @@ -2411,7 +2418,8 @@ void mem_cgroup_handle_over_high(void) > * memory.high is breached and reclaim is unable to keep up. Throttle > * allocators proactively to slow down excessive growth. > */ > - penalty_jiffies = calculate_high_delay(memcg, nr_pages); > + penalty_jiffies = calculate_high_delay(memcg, nr_pages, > + mem_find_max_overage(memcg)); > > /* > * Don't sleep if the amount of jiffies this memcg owes us is so low > -- > 2.25.4 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation 2020-05-12 7:08 ` Michal Hocko @ 2020-05-12 17:28 ` Jakub Kicinski 2020-05-13 8:06 ` Michal Hocko 0 siblings, 1 reply; 15+ messages in thread From: Jakub Kicinski @ 2020-05-12 17:28 UTC (permalink / raw) To: Michal Hocko Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb On Tue, 12 May 2020 09:08:58 +0200 Michal Hocko wrote: > On Mon 11-05-20 15:55:14, Jakub Kicinski wrote: > > Slice the memory overage calculation logic a little bit so we can > > reuse it to apply a similar penalty to the swap. The logic which > > accesses the memory-specific fields (use and high values) has to > > be taken out of calculate_high_delay(). > > > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > Acked-by: Michal Hocko <mhocko@suse.com> > > some recommendations below. Thank you! > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 05dcb72314b5..8a9b671c3249 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2321,41 +2321,48 @@ static void high_work_func(struct work_struct *work) > > #define MEMCG_DELAY_PRECISION_SHIFT 20 > > #define MEMCG_DELAY_SCALING_SHIFT 14 > > > > -/* > > - * Get the number of jiffies that we should penalise a mischievous cgroup which > > - * is exceeding its memory.high by checking both it and its ancestors. > > - */ > > -static unsigned long calculate_high_delay(struct mem_cgroup *memcg, > > - unsigned int nr_pages) > > +static u64 calculate_overage(unsigned long usage, unsigned long high) > > the naming is slightly confusing. I would concider the return value > to be in memory units rather than time because I would read it as > overrage of high. calculate_throttle_penalty would be more clear to me. Hm. The unit is the fraction of high. Here is the code, it's quite hard to read in diff form (I should have used --histogram, sorry): static u64 calculate_overage(unsigned long usage, unsigned long high) { u64 overage; if (usage <= high) return 0; /* * Prevent division by 0 in overage calculation by acting as if * it was a threshold of 1 page */ high = max(high, 1UL); overage = usage - high; overage <<= MEMCG_DELAY_PRECISION_SHIFT; return div64_u64(overage, high); } calculate_throttle_penalty() sounds like it returns time. How about something like calc_overage_frac() ? Or calc_overage_perc()? (abbreviating to "calc" so the caller fits on a line) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation 2020-05-12 17:28 ` Jakub Kicinski @ 2020-05-13 8:06 ` Michal Hocko 0 siblings, 0 replies; 15+ messages in thread From: Michal Hocko @ 2020-05-13 8:06 UTC (permalink / raw) To: Jakub Kicinski Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb On Tue 12-05-20 10:28:19, Jakub Kicinski wrote: > On Tue, 12 May 2020 09:08:58 +0200 Michal Hocko wrote: > > On Mon 11-05-20 15:55:14, Jakub Kicinski wrote: > > > Slice the memory overage calculation logic a little bit so we can > > > reuse it to apply a similar penalty to the swap. The logic which > > > accesses the memory-specific fields (use and high values) has to > > > be taken out of calculate_high_delay(). > > > > > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > > > Acked-by: Michal Hocko <mhocko@suse.com> > > > > some recommendations below. > > Thank you! > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index 05dcb72314b5..8a9b671c3249 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -2321,41 +2321,48 @@ static void high_work_func(struct work_struct *work) > > > #define MEMCG_DELAY_PRECISION_SHIFT 20 > > > #define MEMCG_DELAY_SCALING_SHIFT 14 > > > > > > -/* > > > - * Get the number of jiffies that we should penalise a mischievous cgroup which > > > - * is exceeding its memory.high by checking both it and its ancestors. > > > - */ > > > -static unsigned long calculate_high_delay(struct mem_cgroup *memcg, > > > - unsigned int nr_pages) > > > +static u64 calculate_overage(unsigned long usage, unsigned long high) > > > > the naming is slightly confusing. I would concider the return value > > to be in memory units rather than time because I would read it as > > overrage of high. calculate_throttle_penalty would be more clear to me. > > Hm. The unit is the fraction of high. Here is the code, it's quite hard > to read in diff form (I should have used --histogram, sorry): Yeah, I have checked the resulting code. > static u64 calculate_overage(unsigned long usage, unsigned long high) > { > u64 overage; > > if (usage <= high) > return 0; > > /* > * Prevent division by 0 in overage calculation by acting as if > * it was a threshold of 1 page > */ > high = max(high, 1UL); > > overage = usage - high; > overage <<= MEMCG_DELAY_PRECISION_SHIFT; > return div64_u64(overage, high); > } > > calculate_throttle_penalty() sounds like it returns time. How about > something like calc_overage_frac() ? Or calc_overage_perc()? > (abbreviating to "calc" so the caller fits on a line) heh, naming is hard and not the most important thing in the world. So if _penalty doesn't really sound good to you then let's just stick with what you've had. I do not really like the _perc/_frac much more because this is more about the implementation of the function than the intention. We shouldn't really care whether the throttling is based on overage scaled linearly (aka fraction) or by other means. The implementation might change in the future. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH mm v2 2/3] mm: move penalty delay clamping out of calculate_high_delay() 2020-05-11 22:55 [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski 2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski @ 2020-05-11 22:55 ` Jakub Kicinski 2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski 2 siblings, 0 replies; 15+ messages in thread From: Jakub Kicinski @ 2020-05-11 22:55 UTC (permalink / raw) To: akpm Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb, mhocko, Jakub Kicinski We will want to call calculate_high_delay() twice - once for memory and once for swap, and we should apply the clamp value to sum of the penalties. Clamping has to be applied outside of calculate_high_delay(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- mm/memcontrol.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8a9b671c3249..66dd87bb9e0f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2386,14 +2386,7 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg, * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or * larger the current charge patch is than that. */ - penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH; - - /* - * Clamp the max delay per usermode return so as to still keep the - * application moving forwards and also permit diagnostics, albeit - * extremely slowly. - */ - return min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES); + return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH; } /* @@ -2421,6 +2414,13 @@ void mem_cgroup_handle_over_high(void) penalty_jiffies = calculate_high_delay(memcg, nr_pages, mem_find_max_overage(memcg)); + /* + * Clamp the max delay per usermode return so as to still keep the + * application moving forwards and also permit diagnostics, albeit + * extremely slowly. + */ + penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES); + /* * Don't sleep if the amount of jiffies this memcg owes us is so low * that it's not even worth doing, in an attempt to be nice to those who -- 2.25.4 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use 2020-05-11 22:55 [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski 2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski 2020-05-11 22:55 ` [PATCH mm v2 2/3] mm: move penalty delay clamping out of calculate_high_delay() Jakub Kicinski @ 2020-05-11 22:55 ` Jakub Kicinski 2020-05-12 7:26 ` Michal Hocko 2020-05-13 8:38 ` Michal Hocko 2 siblings, 2 replies; 15+ messages in thread From: Jakub Kicinski @ 2020-05-11 22:55 UTC (permalink / raw) To: akpm Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb, mhocko, Jakub Kicinski Add a memory.swap.high knob, which can be used to protect the system from SWAP exhaustion. The mechanism used for penelizing is similar to memory.high penalty (sleep on return to user space), but with a less steep slope. That is not to say that the knob itself is equivalent to memory.high. The objective is more to protect the system from potentially buggy tasks consuming a lot of swap and impacting other tasks, or even bringing the whole system to stand still with complete SWAP exhaustion. Hopefully without the need to find per-task hard limits. Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react. oomd and earlyoom already do killing based on swap exhaustion, and memory.swap.high protection will help implement such userspace oom policies more reliably. Use one counter for number of pages allocated under pressure to save struct task space and avoid two separate hierarchy walks on the hot path. Use swap.high when deciding if swap is full. Perform reclaim and count memory over high events. Signed-off-by: Jakub Kicinski <kuba@kernel.org> -- v2: - add docs, - improve commit message. --- Documentation/admin-guide/cgroup-v2.rst | 16 +++++ include/linux/memcontrol.h | 4 ++ mm/memcontrol.c | 94 +++++++++++++++++++++++-- 3 files changed, 107 insertions(+), 7 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 5f12f203822e..c60226daa193 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back. The total amount of swap currently being used by the cgroup and its descendants. + memory.swap.high + A read-write single value file which exists on non-root + cgroups. The default is "max". + + Swap usage throttle limit. If a cgroup's swap usage exceeds + this limit, all its further allocations will be throttled to + allow userspace to implement custom out-of-memory procedures. + + This limit marks a point of no return for the cgroup. It is NOT + designed to manage the amount of swapping a workload does + during regular operation. Compare to memory.swap.max, which + prohibits swapping past a set amount, but lets the cgroup + continue unimpeded as long as other memory can be reclaimed. + + Healthy workloads are not expected to reach this limit. + memory.swap.max A read-write single value file which exists on non-root cgroups. The default is "max". diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b478a4e83297..882bda952a5c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -45,6 +45,7 @@ enum memcg_memory_event { MEMCG_MAX, MEMCG_OOM, MEMCG_OOM_KILL, + MEMCG_SWAP_HIGH, MEMCG_SWAP_MAX, MEMCG_SWAP_FAIL, MEMCG_NR_MEMORY_EVENTS, @@ -212,6 +213,9 @@ struct mem_cgroup { /* Upper bound of normal memory consumption range */ unsigned long high; + /* Upper bound of swap consumption range */ + unsigned long swap_high; + /* Range enforcement for interrupt charges */ struct work_struct high_work; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 66dd87bb9e0f..a3d13b30e3d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2353,12 +2353,34 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg) return max_overage; } +static u64 swap_find_max_overage(struct mem_cgroup *memcg) +{ + u64 overage, max_overage = 0; + struct mem_cgroup *max_cg; + + do { + overage = calculate_overage(page_counter_read(&memcg->swap), + READ_ONCE(memcg->swap_high)); + if (overage > max_overage) { + max_overage = overage; + max_cg = memcg; + } + } while ((memcg = parent_mem_cgroup(memcg)) && + !mem_cgroup_is_root(memcg)); + + if (max_overage) + memcg_memory_event(max_cg, MEMCG_SWAP_HIGH); + + return max_overage; +} + /* * Get the number of jiffies that we should penalise a mischievous cgroup which * is exceeding its memory.high by checking both it and its ancestors. */ static unsigned long calculate_high_delay(struct mem_cgroup *memcg, unsigned int nr_pages, + unsigned char cost_shift, u64 max_overage) { unsigned long penalty_jiffies; @@ -2366,6 +2388,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg, if (!max_overage) return 0; + if (cost_shift) + max_overage >>= cost_shift; + /* * We use overage compared to memory.high to calculate the number of * jiffies to sleep (penalty_jiffies). Ideally this value should be @@ -2411,9 +2436,16 @@ void mem_cgroup_handle_over_high(void) * memory.high is breached and reclaim is unable to keep up. Throttle * allocators proactively to slow down excessive growth. */ - penalty_jiffies = calculate_high_delay(memcg, nr_pages, + penalty_jiffies = calculate_high_delay(memcg, nr_pages, 0, mem_find_max_overage(memcg)); + /* + * Make the swap curve more gradual, swap can be considered "cheaper", + * and is allocated in larger chunks. We want the delays to be gradual. + */ + penalty_jiffies += calculate_high_delay(memcg, nr_pages, 2, + swap_find_max_overage(memcg)); + /* * Clamp the max delay per usermode return so as to still keep the * application moving forwards and also permit diagnostics, albeit @@ -2604,12 +2636,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, * reclaim, the cost of mismatch is negligible. */ do { - if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) { - /* Don't bother a random interrupted task */ - if (in_interrupt()) { + bool mem_high, swap_high; + + mem_high = page_counter_read(&memcg->memory) > + READ_ONCE(memcg->high); + swap_high = page_counter_read(&memcg->swap) > + READ_ONCE(memcg->swap_high); + + /* Don't bother a random interrupted task */ + if (in_interrupt()) { + if (mem_high) { schedule_work(&memcg->high_work); break; } + continue; + } + + if (mem_high || swap_high) { current->memcg_nr_pages_over_high += batch; set_notify_resume(current); break; @@ -5076,6 +5119,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); memcg->soft_limit = PAGE_COUNTER_MAX; + WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX); if (parent) { memcg->swappiness = mem_cgroup_swappiness(parent); memcg->oom_kill_disable = parent->oom_kill_disable; @@ -5229,6 +5273,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) page_counter_set_low(&memcg->memory, 0); WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); memcg->soft_limit = PAGE_COUNTER_MAX; + WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX); memcg_wb_domain_size_changed(memcg); } @@ -7136,10 +7181,13 @@ bool mem_cgroup_swap_full(struct page *page) if (!memcg) return false; - for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) - if (page_counter_read(&memcg->swap) * 2 >= - READ_ONCE(memcg->swap.max)) + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) { + unsigned long usage = page_counter_read(&memcg->swap); + + if (usage * 2 >= READ_ONCE(memcg->swap_high) || + usage * 2 >= READ_ONCE(memcg->swap.max)) return true; + } return false; } @@ -7169,6 +7217,30 @@ static u64 swap_current_read(struct cgroup_subsys_state *css, return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE; } +static int swap_high_show(struct seq_file *m, void *v) +{ + unsigned long high = READ_ONCE(mem_cgroup_from_seq(m)->swap_high); + + return seq_puts_memcg_tunable(m, high); +} + +static ssize_t swap_high_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + unsigned long high; + int err; + + buf = strstrip(buf); + err = page_counter_memparse(buf, "max", &high); + if (err) + return err; + + WRITE_ONCE(memcg->swap_high, high); + + return nbytes; +} + static int swap_max_show(struct seq_file *m, void *v) { return seq_puts_memcg_tunable(m, @@ -7196,6 +7268,8 @@ static int swap_events_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + seq_printf(m, "high %lu\n", + atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH])); seq_printf(m, "max %lu\n", atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX])); seq_printf(m, "fail %lu\n", @@ -7210,6 +7284,12 @@ static struct cftype swap_files[] = { .flags = CFTYPE_NOT_ON_ROOT, .read_u64 = swap_current_read, }, + { + .name = "swap.high", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = swap_high_show, + .write = swap_high_write, + }, { .name = "swap.max", .flags = CFTYPE_NOT_ON_ROOT, -- 2.25.4 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use 2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski @ 2020-05-12 7:26 ` Michal Hocko 2020-05-12 17:55 ` Jakub Kicinski 2020-05-13 8:38 ` Michal Hocko 1 sibling, 1 reply; 15+ messages in thread From: Michal Hocko @ 2020-05-12 7:26 UTC (permalink / raw) To: Jakub Kicinski Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > Add a memory.swap.high knob, which can be used to protect the system > from SWAP exhaustion. The mechanism used for penelizing is similar > to memory.high penalty (sleep on return to user space), but with > a less steep slope. > > That is not to say that the knob itself is equivalent to memory.high. > The objective is more to protect the system from potentially buggy > tasks consuming a lot of swap and impacting other tasks, or even > bringing the whole system to stand still with complete SWAP > exhaustion. Hopefully without the need to find per-task hard > limits. > > Slowing misbehaving tasks down gradually allows user space oom > killers or other protection mechanisms to react. oomd and earlyoom > already do killing based on swap exhaustion, and memory.swap.high > protection will help implement such userspace oom policies more > reliably. Thanks for adding more information about the usecase and motivation. > > Use one counter for number of pages allocated under pressure > to save struct task space and avoid two separate hierarchy > walks on the hot path. > > Use swap.high when deciding if swap is full. Please be more specific why. > Perform reclaim and count memory over high events. Please expand on this and explain how this is working and why the semantic is subtly different from MEMCG_HIGH. I suspect the reason is that there is no reclaim for the swap so you are only emitting an event on the memcg which is actually throttled. This is in line with memory.high but the difference is that we do reclaim each memcg subtree in the high limit excess. That means that the counter tells us how many times the specific memcg was in excess which would be impossible with your implementation. I would also suggest to explain or ideally even separate the swap penalty scaling logic to a seprate patch. What kind of data it is based on? > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > -- > v2: > - add docs, > - improve commit message. > --- > Documentation/admin-guide/cgroup-v2.rst | 16 +++++ > include/linux/memcontrol.h | 4 ++ > mm/memcontrol.c | 94 +++++++++++++++++++++++-- > 3 files changed, 107 insertions(+), 7 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 5f12f203822e..c60226daa193 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back. > The total amount of swap currently being used by the cgroup > and its descendants. > > + memory.swap.high > + A read-write single value file which exists on non-root > + cgroups. The default is "max". > + > + Swap usage throttle limit. If a cgroup's swap usage exceeds > + this limit, all its further allocations will be throttled to > + allow userspace to implement custom out-of-memory procedures. > + > + This limit marks a point of no return for the cgroup. It is NOT > + designed to manage the amount of swapping a workload does > + during regular operation. Compare to memory.swap.max, which > + prohibits swapping past a set amount, but lets the cgroup > + continue unimpeded as long as other memory can be reclaimed. > + > + Healthy workloads are not expected to reach this limit. > + > memory.swap.max > A read-write single value file which exists on non-root > cgroups. The default is "max". > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index b478a4e83297..882bda952a5c 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -45,6 +45,7 @@ enum memcg_memory_event { > MEMCG_MAX, > MEMCG_OOM, > MEMCG_OOM_KILL, > + MEMCG_SWAP_HIGH, > MEMCG_SWAP_MAX, > MEMCG_SWAP_FAIL, > MEMCG_NR_MEMORY_EVENTS, > @@ -212,6 +213,9 @@ struct mem_cgroup { > /* Upper bound of normal memory consumption range */ > unsigned long high; > > + /* Upper bound of swap consumption range */ > + unsigned long swap_high; > + > /* Range enforcement for interrupt charges */ > struct work_struct high_work; > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 66dd87bb9e0f..a3d13b30e3d6 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2353,12 +2353,34 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg) > return max_overage; > } > > +static u64 swap_find_max_overage(struct mem_cgroup *memcg) > +{ > + u64 overage, max_overage = 0; > + struct mem_cgroup *max_cg; > + > + do { > + overage = calculate_overage(page_counter_read(&memcg->swap), > + READ_ONCE(memcg->swap_high)); > + if (overage > max_overage) { > + max_overage = overage; > + max_cg = memcg; > + } > + } while ((memcg = parent_mem_cgroup(memcg)) && > + !mem_cgroup_is_root(memcg)); > + > + if (max_overage) > + memcg_memory_event(max_cg, MEMCG_SWAP_HIGH); > + > + return max_overage; > +} > + > /* > * Get the number of jiffies that we should penalise a mischievous cgroup which > * is exceeding its memory.high by checking both it and its ancestors. > */ > static unsigned long calculate_high_delay(struct mem_cgroup *memcg, > unsigned int nr_pages, > + unsigned char cost_shift, > u64 max_overage) > { > unsigned long penalty_jiffies; > @@ -2366,6 +2388,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg, > if (!max_overage) > return 0; > > + if (cost_shift) > + max_overage >>= cost_shift; > + > /* > * We use overage compared to memory.high to calculate the number of > * jiffies to sleep (penalty_jiffies). Ideally this value should be > @@ -2411,9 +2436,16 @@ void mem_cgroup_handle_over_high(void) > * memory.high is breached and reclaim is unable to keep up. Throttle > * allocators proactively to slow down excessive growth. > */ > - penalty_jiffies = calculate_high_delay(memcg, nr_pages, > + penalty_jiffies = calculate_high_delay(memcg, nr_pages, 0, > mem_find_max_overage(memcg)); > > + /* > + * Make the swap curve more gradual, swap can be considered "cheaper", > + * and is allocated in larger chunks. We want the delays to be gradual. > + */ > + penalty_jiffies += calculate_high_delay(memcg, nr_pages, 2, > + swap_find_max_overage(memcg)); > + > /* > * Clamp the max delay per usermode return so as to still keep the > * application moving forwards and also permit diagnostics, albeit > @@ -2604,12 +2636,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > * reclaim, the cost of mismatch is negligible. > */ > do { > - if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) { > - /* Don't bother a random interrupted task */ > - if (in_interrupt()) { > + bool mem_high, swap_high; > + > + mem_high = page_counter_read(&memcg->memory) > > + READ_ONCE(memcg->high); > + swap_high = page_counter_read(&memcg->swap) > > + READ_ONCE(memcg->swap_high); > + > + /* Don't bother a random interrupted task */ > + if (in_interrupt()) { > + if (mem_high) { > schedule_work(&memcg->high_work); > break; > } > + continue; > + } > + > + if (mem_high || swap_high) { > current->memcg_nr_pages_over_high += batch; > set_notify_resume(current); > break; > @@ -5076,6 +5119,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > > WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); > memcg->soft_limit = PAGE_COUNTER_MAX; > + WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX); > if (parent) { > memcg->swappiness = mem_cgroup_swappiness(parent); > memcg->oom_kill_disable = parent->oom_kill_disable; > @@ -5229,6 +5273,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) > page_counter_set_low(&memcg->memory, 0); > WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); > memcg->soft_limit = PAGE_COUNTER_MAX; > + WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX); > memcg_wb_domain_size_changed(memcg); > } > > @@ -7136,10 +7181,13 @@ bool mem_cgroup_swap_full(struct page *page) > if (!memcg) > return false; > > - for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) > - if (page_counter_read(&memcg->swap) * 2 >= > - READ_ONCE(memcg->swap.max)) > + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) { > + unsigned long usage = page_counter_read(&memcg->swap); > + > + if (usage * 2 >= READ_ONCE(memcg->swap_high) || > + usage * 2 >= READ_ONCE(memcg->swap.max)) > return true; > + } > > return false; > } > @@ -7169,6 +7217,30 @@ static u64 swap_current_read(struct cgroup_subsys_state *css, > return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE; > } > > +static int swap_high_show(struct seq_file *m, void *v) > +{ > + unsigned long high = READ_ONCE(mem_cgroup_from_seq(m)->swap_high); > + > + return seq_puts_memcg_tunable(m, high); > +} > + > +static ssize_t swap_high_write(struct kernfs_open_file *of, > + char *buf, size_t nbytes, loff_t off) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > + unsigned long high; > + int err; > + > + buf = strstrip(buf); > + err = page_counter_memparse(buf, "max", &high); > + if (err) > + return err; > + > + WRITE_ONCE(memcg->swap_high, high); > + > + return nbytes; > +} > + > static int swap_max_show(struct seq_file *m, void *v) > { > return seq_puts_memcg_tunable(m, > @@ -7196,6 +7268,8 @@ static int swap_events_show(struct seq_file *m, void *v) > { > struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > > + seq_printf(m, "high %lu\n", > + atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH])); > seq_printf(m, "max %lu\n", > atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX])); > seq_printf(m, "fail %lu\n", > @@ -7210,6 +7284,12 @@ static struct cftype swap_files[] = { > .flags = CFTYPE_NOT_ON_ROOT, > .read_u64 = swap_current_read, > }, > + { > + .name = "swap.high", > + .flags = CFTYPE_NOT_ON_ROOT, > + .seq_show = swap_high_show, > + .write = swap_high_write, > + }, > { > .name = "swap.max", > .flags = CFTYPE_NOT_ON_ROOT, > -- > 2.25.4 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use 2020-05-12 7:26 ` Michal Hocko @ 2020-05-12 17:55 ` Jakub Kicinski 2020-05-13 8:32 ` Michal Hocko 0 siblings, 1 reply; 15+ messages in thread From: Jakub Kicinski @ 2020-05-12 17:55 UTC (permalink / raw) To: Michal Hocko Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > Use swap.high when deciding if swap is full. > > Please be more specific why. How about: Use swap.high when deciding if swap is full to influence ongoing swap reclaim in a best effort manner. > > Perform reclaim and count memory over high events. > > Please expand on this and explain how this is working and why the > semantic is subtly different from MEMCG_HIGH. I suspect the reason > is that there is no reclaim for the swap so you are only emitting an > event on the memcg which is actually throttled. This is in line with > memory.high but the difference is that we do reclaim each memcg subtree > in the high limit excess. That means that the counter tells us how many > times the specific memcg was in excess which would be impossible with > your implementation. Right, with memory all cgroups over high get penalized with the extra reclaim work. For swap we just have the delay, so the event is associated with the worst offender, anything lower didn't really matter. But it's easy enough to change if you prefer. Otherwise I'll just add this to the commit message: Count swap over high events. Note that unlike memory over high events we only count them for the worst offender. This is because the delay penalties for both swap and memory over high are not cumulative, i.e. we use the max delay. > I would also suggest to explain or ideally even separate the swap > penalty scaling logic to a seprate patch. What kind of data it is based > on? It's a hard thing to get production data for since, as we mentioned we don't expect the limit to be hit. It was more of a process of experimentation and finding a gradual slope that "felt right"... Is there a more scientific process we can follow here? We want the delay to be small at first for a first few pages and then grow to make sure we stop the task from going too much over high. The square function works pretty well IMHO. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use 2020-05-12 17:55 ` Jakub Kicinski @ 2020-05-13 8:32 ` Michal Hocko 2020-05-13 18:36 ` Jakub Kicinski 0 siblings, 1 reply; 15+ messages in thread From: Michal Hocko @ 2020-05-13 8:32 UTC (permalink / raw) To: Jakub Kicinski Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > Use swap.high when deciding if swap is full. > > > > Please be more specific why. > > How about: > > Use swap.high when deciding if swap is full to influence ongoing > swap reclaim in a best effort manner. This is still way too vague. The crux is why should we treat hard and high swap limit the same for mem_cgroup_swap_full purpose. Please note that I am not saying this is wrong. I am asking for a more detailed explanation mostly because I would bet that somebody stumbles over this sooner or later. mem_cgroup_swap_full is an odd predicate. It doesn't really want to tell that the swap is really full. I haven't studied the original intention but it is more in line of mem_cgroup_swap_under_pressure based on the current usage to (attempt) scale swap cache size. > > > Perform reclaim and count memory over high events. > > > > Please expand on this and explain how this is working and why the > > semantic is subtly different from MEMCG_HIGH. I suspect the reason > > is that there is no reclaim for the swap so you are only emitting an > > event on the memcg which is actually throttled. This is in line with > > memory.high but the difference is that we do reclaim each memcg subtree > > in the high limit excess. That means that the counter tells us how many > > times the specific memcg was in excess which would be impossible with > > your implementation. > > Right, with memory all cgroups over high get penalized with the extra > reclaim work. For swap we just have the delay, so the event is > associated with the worst offender, anything lower didn't really matter. > > But it's easy enough to change if you prefer. Otherwise I'll just add > this to the commit message: > > Count swap over high events. Note that unlike memory over high events > we only count them for the worst offender. This is because the > delay penalties for both swap and memory over high are not cumulative, > i.e. we use the max delay. Well, memory high penalty is in fact cumulative, because the reclaim would happen for each memcg subtree up the hierarchy. Sure the additional throttling is not cumulative but that is not really that important because the exact amount of throttling is an implementation detail. The swap high is an odd one here because we do not reclaim swap so the cumulative effect of that is 0 and there is only the additional throttling happening. I suspect that your current implementation is exposing an internal implementation to the userspace but considering how the current memory high event is documented high The number of times processes of the cgroup are throttled and routed to perform direct memory reclaim because the high memory boundary was exceeded. For a cgroup whose memory usage is capped by the high limit rather than global memory pressure, this event's occurrences are expected. it talks about throttling rather than excess (like max) so I am not really sure. I believe that it would be much better if both events were more explicit about counting an excess and a throttling is just a side effect of that situation. I do not expect that we will have any form of the swap reclaim anytime soon (if ever) but I fail to see why to creat a small little trap like this now. > > I would also suggest to explain or ideally even separate the swap > > penalty scaling logic to a seprate patch. What kind of data it is based > > on? > > It's a hard thing to get production data for since, as we mentioned we > don't expect the limit to be hit. It was more of a process of > experimentation and finding a gradual slope that "felt right"... > > Is there a more scientific process we can follow here? We want the > delay to be small at first for a first few pages and then grow to make > sure we stop the task from going too much over high. The square > function works pretty well IMHO. If there is no data to showing this to be an improvement then I would just not add an additional scaling factor. Why? Mostly because once we have it there it would be extremely hard to change. MM is full of these little heuristics that are copied over because nobody dares to touch them. If a different scaling is really needed it can always be added later with some data to back that. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use 2020-05-13 8:32 ` Michal Hocko @ 2020-05-13 18:36 ` Jakub Kicinski 2020-05-14 7:42 ` Michal Hocko 0 siblings, 1 reply; 15+ messages in thread From: Jakub Kicinski @ 2020-05-13 18:36 UTC (permalink / raw) To: Michal Hocko Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote: > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > > Use swap.high when deciding if swap is full. > > > > > > Please be more specific why. > > > > How about: > > > > Use swap.high when deciding if swap is full to influence ongoing > > swap reclaim in a best effort manner. > > This is still way too vague. The crux is why should we treat hard and > high swap limit the same for mem_cgroup_swap_full purpose. Please > note that I am not saying this is wrong. I am asking for a more > detailed explanation mostly because I would bet that somebody > stumbles over this sooner or later. Stumbles in what way? Isn't it expected for the kernel to take reasonable precautions to avoid hitting limits? We expect the application which breaches swap.high to get terminated by user space OOM, kernel best be careful about approaching that limit, no? > mem_cgroup_swap_full is an odd predicate. It doesn't really want to > tell that the swap is really full. I haven't studied the original > intention but it is more in line of mem_cgroup_swap_under_pressure > based on the current usage to (attempt) scale swap cache size. Perhaps Johannes has some experience here? The 50% means full heuristic predates git :( > > > > Perform reclaim and count memory over high events. > > > > > > Please expand on this and explain how this is working and why the > > > semantic is subtly different from MEMCG_HIGH. I suspect the reason > > > is that there is no reclaim for the swap so you are only emitting > > > an event on the memcg which is actually throttled. This is in > > > line with memory.high but the difference is that we do reclaim > > > each memcg subtree in the high limit excess. That means that the > > > counter tells us how many times the specific memcg was in excess > > > which would be impossible with your implementation. > > > > Right, with memory all cgroups over high get penalized with the > > extra reclaim work. For swap we just have the delay, so the event is > > associated with the worst offender, anything lower didn't really > > matter. > > > > But it's easy enough to change if you prefer. Otherwise I'll just > > add this to the commit message: > > > > Count swap over high events. Note that unlike memory over high > > events we only count them for the worst offender. This is because > > the delay penalties for both swap and memory over high are not > > cumulative, i.e. we use the max delay. > > Well, memory high penalty is in fact cumulative, because the reclaim > would happen for each memcg subtree up the hierarchy. Sure the > additional throttling is not cumulative but that is not really that > important because the exact amount of throttling is an implementation > detail. The swap high is an odd one here because we do not reclaim > swap so the cumulative effect of that is 0 and there is only the > additional throttling happening. I suspect that your current > implementation is exposing an internal implementation to the > userspace but considering how the current memory high event is > documented high > The number of times processes of the cgroup are > throttled and routed to perform direct memory reclaim > because the high memory boundary was exceeded. For a > cgroup whose memory usage is capped by the high limit > rather than global memory pressure, this event's > occurrences are expected. > > it talks about throttling rather than excess (like max) so I am not > really sure. I believe that it would be much better if both events > were more explicit about counting an excess and a throttling is just > a side effect of that situation. > > I do not expect that we will have any form of the swap reclaim anytime > soon (if ever) but I fail to see why to creat a small little trap like > this now. Right, let me adjust then. > > > I would also suggest to explain or ideally even separate the swap > > > penalty scaling logic to a seprate patch. What kind of data it is > > > based on? > > > > It's a hard thing to get production data for since, as we mentioned > > we don't expect the limit to be hit. It was more of a process of > > experimentation and finding a gradual slope that "felt right"... > > > > Is there a more scientific process we can follow here? We want the > > delay to be small at first for a first few pages and then grow to > > make sure we stop the task from going too much over high. The square > > function works pretty well IMHO. > > If there is no data to showing this to be an improvement then I would > just not add an additional scaling factor. Why? Mostly because once we > have it there it would be extremely hard to change. MM is full of > these little heuristics that are copied over because nobody dares to > touch them. If a different scaling is really needed it can always be > added later with some data to back that. Oh, I misunderstood the question, you were asking about the scaling factor.. The allocation of swap is in larger batches, according to my tests, example below (AR - after reclaim, swap overage changes after memory reclaim). mem overage AR swap pages over_high AR | swap overage AR swap pages over at call. \ | | . mem sleep mem pages over_high. \ \ | | / . swap sleep v v v v v v v [ 73.360533] sleep (32/10->67) [-35|13379] 0+253 [ 73.631291] sleep (32/ 3->54) [-18|13430] 0+205 [ 73.851629] sleep (32/22->35) [-20|13443] 0+133 [ 74.021396] sleep (32/ 3->60) [-29|13500] 0+230 [ 74.263355] sleep (32/28->79) [-44|13551] 0+306 [ 74.585689] sleep (32/29->91) [-17|13627] 0+355 [ 74.958675] sleep (32/27->79) [-31|13679] 0+311 [ 75.293021] sleep (32/29->86) [ -9|13750] 0+344 [ 75.654218] sleep (32/22->72) [-24|13800] 0+290 [ 75.962467] sleep (32/22->73) [-39|13865] 0+296 That's for a process slowly leaking memory. Swap gets over the high by about 2.5x MEMCG_CHARGE_BATCH on average. Hence to keep the same slope I was trying to scale it back. But you make a fair point, someone more knowledgeable can add the heuristic later if it's really needed. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use 2020-05-13 18:36 ` Jakub Kicinski @ 2020-05-14 7:42 ` Michal Hocko 2020-05-14 20:21 ` Johannes Weiner 0 siblings, 1 reply; 15+ messages in thread From: Michal Hocko @ 2020-05-14 7:42 UTC (permalink / raw) To: Jakub Kicinski Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb On Wed 13-05-20 11:36:23, Jakub Kicinski wrote: > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote: > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > > > Use swap.high when deciding if swap is full. > > > > > > > > Please be more specific why. > > > > > > How about: > > > > > > Use swap.high when deciding if swap is full to influence ongoing > > > swap reclaim in a best effort manner. > > > > This is still way too vague. The crux is why should we treat hard and > > high swap limit the same for mem_cgroup_swap_full purpose. Please > > note that I am not saying this is wrong. I am asking for a more > > detailed explanation mostly because I would bet that somebody > > stumbles over this sooner or later. > > Stumbles in what way? Reading the code and trying to understand why this particular decision has been made. Because it might be surprising that the hard and high limits are treated same here. > Isn't it expected for the kernel to take reasonable precautions to > avoid hitting limits? Isn't the throttling itself the precautious? How does the swap cache and its control via mem_cgroup_swap_full interact here. See? This is what I am asking to have explained in the changelog. [...] > > > > I would also suggest to explain or ideally even separate the swap > > > > penalty scaling logic to a seprate patch. What kind of data it is > > > > based on? > > > > > > It's a hard thing to get production data for since, as we mentioned > > > we don't expect the limit to be hit. It was more of a process of > > > experimentation and finding a gradual slope that "felt right"... > > > > > > Is there a more scientific process we can follow here? We want the > > > delay to be small at first for a first few pages and then grow to > > > make sure we stop the task from going too much over high. The square > > > function works pretty well IMHO. > > > > If there is no data to showing this to be an improvement then I would > > just not add an additional scaling factor. Why? Mostly because once we > > have it there it would be extremely hard to change. MM is full of > > these little heuristics that are copied over because nobody dares to > > touch them. If a different scaling is really needed it can always be > > added later with some data to back that. > > Oh, I misunderstood the question, you were asking about the scaling > factor.. The allocation of swap is in larger batches, according to > my tests, example below (AR - after reclaim, swap overage changes > after memory reclaim). > mem overage AR > swap pages over_high AR | swap overage AR > swap pages over at call. \ | | . mem sleep > mem pages over_high. \ \ | | / . swap sleep > v v v v v v v > [ 73.360533] sleep (32/10->67) [-35|13379] 0+253 > [ 73.631291] sleep (32/ 3->54) [-18|13430] 0+205 > [ 73.851629] sleep (32/22->35) [-20|13443] 0+133 > [ 74.021396] sleep (32/ 3->60) [-29|13500] 0+230 > [ 74.263355] sleep (32/28->79) [-44|13551] 0+306 > [ 74.585689] sleep (32/29->91) [-17|13627] 0+355 > [ 74.958675] sleep (32/27->79) [-31|13679] 0+311 > [ 75.293021] sleep (32/29->86) [ -9|13750] 0+344 > [ 75.654218] sleep (32/22->72) [-24|13800] 0+290 > [ 75.962467] sleep (32/22->73) [-39|13865] 0+296 > > That's for a process slowly leaking memory. Swap gets over the high by > about 2.5x MEMCG_CHARGE_BATCH on average. Hence to keep the same slope > I was trying to scale it back. > > But you make a fair point, someone more knowledgeable can add the > heuristic later if it's really needed. Or just make it a separate patch with all that information. This would allow anybody touching that code in the future to understand the initial motivation. I am still not sure this scaling is a good fit in general (e.g. how does it work with THP swapping?) though but this can be discussed separately at least. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use 2020-05-14 7:42 ` Michal Hocko @ 2020-05-14 20:21 ` Johannes Weiner 2020-05-15 7:14 ` Michal Hocko 0 siblings, 1 reply; 15+ messages in thread From: Johannes Weiner @ 2020-05-14 20:21 UTC (permalink / raw) To: Michal Hocko Cc: Jakub Kicinski, akpm, linux-mm, kernel-team, tj, chris, cgroups, shakeelb On Thu, May 14, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > On Wed 13-05-20 11:36:23, Jakub Kicinski wrote: > > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote: > > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > > > > Use swap.high when deciding if swap is full. > > > > > > > > > > Please be more specific why. > > > > > > > > How about: > > > > > > > > Use swap.high when deciding if swap is full to influence ongoing > > > > swap reclaim in a best effort manner. > > > > > > This is still way too vague. The crux is why should we treat hard and > > > high swap limit the same for mem_cgroup_swap_full purpose. Please > > > note that I am not saying this is wrong. I am asking for a more > > > detailed explanation mostly because I would bet that somebody > > > stumbles over this sooner or later. > > > > Stumbles in what way? > > Reading the code and trying to understand why this particular decision > has been made. Because it might be surprising that the hard and high > limits are treated same here. I don't quite understand the controversy. The idea behind "swap full" is that as long as the workload has plenty of swap space available and it's not changing its memory contents, it makes sense to generously hold on to copies of data in the swap device, even after the swapin. A later reclaim cycle can drop the page without any IO. Trading disk space for IO. But the only two ways to reclaim a swap slot is when they're faulted in and the references go away, or by scanning the virtual address space like swapoff does - which is very expensive (one could argue it's too expensive even for swapoff, it's often more practical to just reboot). So at some point in the fill level, we have to start freeing up swap slots on fault/swapin. Otherwise we could eventually run out of swap slots while they're filled with copies of data that is also in RAM. We don't want to OOM a workload because its available swap space is filled with redundant cache. That applies to physical swap limits, swap.max, and naturally also to swap.high which is a limit to implement userspace OOM for swap space exhaustion. > > Isn't it expected for the kernel to take reasonable precautions to > > avoid hitting limits? > > Isn't the throttling itself the precautious? How does the swap cache > and its control via mem_cgroup_swap_full interact here. See? This is > what I am asking to have explained in the changelog. It sounds like we need better documentation of what vm_swap_full() and friends are there for. It should have been obvious why swap.high - a limit on available swap space - hooks into it. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use 2020-05-14 20:21 ` Johannes Weiner @ 2020-05-15 7:14 ` Michal Hocko 0 siblings, 0 replies; 15+ messages in thread From: Michal Hocko @ 2020-05-15 7:14 UTC (permalink / raw) To: Johannes Weiner Cc: Jakub Kicinski, akpm, linux-mm, kernel-team, tj, chris, cgroups, shakeelb On Thu 14-05-20 16:21:30, Johannes Weiner wrote: > On Thu, May 14, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > > On Wed 13-05-20 11:36:23, Jakub Kicinski wrote: > > > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote: > > > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > > > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > > > > > Use swap.high when deciding if swap is full. > > > > > > > > > > > > Please be more specific why. > > > > > > > > > > How about: > > > > > > > > > > Use swap.high when deciding if swap is full to influence ongoing > > > > > swap reclaim in a best effort manner. > > > > > > > > This is still way too vague. The crux is why should we treat hard and > > > > high swap limit the same for mem_cgroup_swap_full purpose. Please > > > > note that I am not saying this is wrong. I am asking for a more > > > > detailed explanation mostly because I would bet that somebody > > > > stumbles over this sooner or later. > > > > > > Stumbles in what way? > > > > Reading the code and trying to understand why this particular decision > > has been made. Because it might be surprising that the hard and high > > limits are treated same here. > > I don't quite understand the controversy. I do not think there is any controversy. All I am asking for is a clarification because this is non-intuitive. > The idea behind "swap full" is that as long as the workload has plenty > of swap space available and it's not changing its memory contents, it > makes sense to generously hold on to copies of data in the swap > device, even after the swapin. A later reclaim cycle can drop the page > without any IO. Trading disk space for IO. > > But the only two ways to reclaim a swap slot is when they're faulted > in and the references go away, or by scanning the virtual address space > like swapoff does - which is very expensive (one could argue it's too > expensive even for swapoff, it's often more practical to just reboot). > > So at some point in the fill level, we have to start freeing up swap > slots on fault/swapin. Otherwise we could eventually run out of swap > slots while they're filled with copies of data that is also in RAM. > > We don't want to OOM a workload because its available swap space is > filled with redundant cache. Thanks this is a useful summary. > That applies to physical swap limits, swap.max, and naturally also to > swap.high which is a limit to implement userspace OOM for swap space > exhaustion. > > > > Isn't it expected for the kernel to take reasonable precautions to > > > avoid hitting limits? > > > > Isn't the throttling itself the precautious? How does the swap cache > > and its control via mem_cgroup_swap_full interact here. See? This is > > what I am asking to have explained in the changelog. > > It sounds like we need better documentation of what vm_swap_full() and > friends are there for. It should have been obvious why swap.high - a > limit on available swap space - hooks into it. Agreed. The primary source for a confusion is the naming here. Because vm_swap_full doesn't really try to tell that the swap is full. It merely tries to tell that it is getting full and so duplicated data should be dropped. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use 2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski 2020-05-12 7:26 ` Michal Hocko @ 2020-05-13 8:38 ` Michal Hocko 1 sibling, 0 replies; 15+ messages in thread From: Michal Hocko @ 2020-05-13 8:38 UTC (permalink / raw) To: Jakub Kicinski Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 5f12f203822e..c60226daa193 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back. > The total amount of swap currently being used by the cgroup > and its descendants. > > + memory.swap.high > + A read-write single value file which exists on non-root > + cgroups. The default is "max". > + > + Swap usage throttle limit. If a cgroup's swap usage exceeds > + this limit, all its further allocations will be throttled to > + allow userspace to implement custom out-of-memory procedures. > + > + This limit marks a point of no return for the cgroup. It is NOT > + designed to manage the amount of swapping a workload does > + during regular operation. Compare to memory.swap.max, which > + prohibits swapping past a set amount, but lets the cgroup > + continue unimpeded as long as other memory can be reclaimed. > + > + Healthy workloads are not expected to reach this limit. > + Btw. I forgot to mention that before but you should also add a documentation for the swap high event to this file. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2020-05-15 7:15 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-05-11 22:55 [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski 2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski 2020-05-12 7:08 ` Michal Hocko 2020-05-12 17:28 ` Jakub Kicinski 2020-05-13 8:06 ` Michal Hocko 2020-05-11 22:55 ` [PATCH mm v2 2/3] mm: move penalty delay clamping out of calculate_high_delay() Jakub Kicinski 2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski 2020-05-12 7:26 ` Michal Hocko 2020-05-12 17:55 ` Jakub Kicinski 2020-05-13 8:32 ` Michal Hocko 2020-05-13 18:36 ` Jakub Kicinski 2020-05-14 7:42 ` Michal Hocko 2020-05-14 20:21 ` Johannes Weiner 2020-05-15 7:14 ` Michal Hocko 2020-05-13 8:38 ` Michal Hocko
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).