Message ID: 20190201011352.GA14370@chrisdown.name
State: Accepted
On Thu 31-01-19 20:13:52, Chris Down wrote:
[...]
> The current situation goes against both the expectations of users of
> memory.high, and our intentions as cgroup v2 developers. In
> cgroup-v2.txt, we claim that we will throttle and only under "extreme
> conditions" will memory.high protection be breached. Likewise, cgroup v2
> users generally also expect that memory.high should throttle workloads
> as they exceed their high threshold. However, as seen above, this isn't
> always how it works in practice -- even on banal setups like those with
> no swap, or where swap has become exhausted, we can end up with
> memory.high being breached and us having no weapons left in our arsenal
> to combat runaway growth with, since reclaim is futile.
>
> It's also hard for system monitoring software or users to tell how bad
> the situation is, as "high" events for the memcg may in some cases be
> benign, and in others be catastrophic. The current status quo is that we
> fail containment in a way that doesn't provide any advance warning that
> things are about to go horribly wrong (for example, we are about to
> invoke the kernel OOM killer).
>
> This patch introduces explicit throttling when reclaim is failing to
> keep memcg size contained at the memory.high setting. It does so by
> applying an exponential delay curve derived from the memcg's overage
> compared to memory.high. In the normal case where the memcg is either
> below or only marginally over its memory.high setting, no throttling
> will be performed.

How does this play with the actual OOM when the user expects oom to
resolve the situation because the reclaim is futile and there is nothing
reclaimable except for killing a process?
On Fri, Feb 01, 2019 at 08:17:57AM +0100, Michal Hocko wrote:
> On Thu 31-01-19 20:13:52, Chris Down wrote:
> [...]
> > This patch introduces explicit throttling when reclaim is failing to
> > keep memcg size contained at the memory.high setting. It does so by
> > applying an exponential delay curve derived from the memcg's overage
> > compared to memory.high. In the normal case where the memcg is either
> > below or only marginally over its memory.high setting, no throttling
> > will be performed.
>
> How does this play with the actual OOM when the user expects oom to
> resolve the situation because the reclaim is futile and there is nothing
> reclaimable except for killing a process?

Hm, can you elaborate on your question a bit?
The idea behind memory.high is to throttle allocations long enough for
the admin or a management daemon to intervene, but not to trigger the
kernel oom killer. It was designed as a replacement for the cgroup1
oom_control, but without the deadlock potential, ptrace problems etc.

What we specifically do is to set memory.high and have a daemon (oomd)
watch memory.pressure, io.pressure etc. in the group. If pressure
exceeds a certain threshold, the daemon kills something.

As you know, the kernel OOM killer does not kick in reliably when e.g.
page cache is thrashing heavily, since from a kernel POV it's still
successfully allocating and reclaiming - meanwhile the workload is
spending most of its time in page faults. And when the kernel OOM killer
does kick in, its selection policy is not very workload-aware.

This daemon on the other hand can be configured to 1) kick in reliably
when the workload-specific tolerances for slowdowns and latencies are
violated (which tends to be way earlier than the kernel oom killer
usually kicks in) and 2) know about the workload and all its components
to make an informed kill decision.

Right now, that throttling mechanism works okay with swap enabled, but
we cannot enable swap everywhere, or sometimes run out of swap, and then
it breaks down and we run into system OOMs.

This patch makes sure memory.high *always* implements the throttling
semantics described in cgroup-v2.txt, not just most of the time.
Michal Hocko writes:
> How does this play with the actual OOM when the user expects oom to
> resolve the situation because the reclaim is futile and there is
> nothing reclaimable except for killing a process?

In addition to what Johannes said, this doesn't impede OOM in the case
of global system starvation (eg. in the case that all major consumers of
memory are in allocator throttling). In that case nothing unusual will
happen, since the task's state is TASK_KILLABLE rather than
TASK_UNINTERRUPTIBLE, and we will exit out of
mem_cgroup_handle_over_high as quickly as possible.
[Sorry for a late reply]

On Fri 01-02-19 11:12:33, Johannes Weiner wrote:
> On Fri, Feb 01, 2019 at 08:17:57AM +0100, Michal Hocko wrote:
> > On Thu 31-01-19 20:13:52, Chris Down wrote:
> > [...]
> > > This patch introduces explicit throttling when reclaim is failing to
> > > keep memcg size contained at the memory.high setting. It does so by
> > > applying an exponential delay curve derived from the memcg's overage
> > > compared to memory.high. In the normal case where the memcg is either
> > > below or only marginally over its memory.high setting, no throttling
> > > will be performed.
> >
> > How does this play with the actual OOM when the user expects oom to
> > resolve the situation because the reclaim is futile and there is
> > nothing reclaimable except for killing a process?
>
> Hm, can you elaborate on your question a bit?
>
> The idea behind memory.high is to throttle allocations long enough for
> the admin or a management daemon to intervene, but not to trigger the
> kernel oom killer. It was designed as a replacement for the cgroup1
> oom_control, but without the deadlock potential, ptrace problems etc.

Yes, this makes sense. The high limit reclaim is also documented as a
best effort resource guarantee. My understanding is that if the workload
cannot be contained within the high limit then the system cannot do much
and eventually gives up. Having the full memory unreclaimable is such an
example. And there is either the global OOM killer or the hard limit OOM
killer to trigger to resolve such a situation.

[...]

Thanks for describing the usecase.

> Right now, that throttling mechanism works okay with swap enabled, but
> we cannot enable swap everywhere, or sometimes run out of swap, and
> then it breaks down and we run into system OOMs.
>
> This patch makes sure memory.high *always* implements the throttling
> semantics described in cgroup-v2.txt, not just most of the time.

I am not really opposed to throttling in the absence of reclaimable
memory. We do that for the regular allocation paths already
(should_reclaim_retry). A swapless system with anon memory is very
likely to oom too quickly, and this sounds like a real problem. But I do
not think that we should throttle the allocation to freeze it
completely. We should eventually OOM. And that was essentially what my
question was about: how much can/should we throttle to give a high limit
events consumer enough time to intervene? I am sorry to still not have
had time to study the patch more closely, but this should be explained
in the changelog. Are we talking about seconds/minutes, or do we simply
freeze each allocator to death?
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18f4aefbe0bf..1844a88f1f68 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,6 +65,7 @@
 #include <linux/lockdep.h>
 #include <linux/file.h>
 #include <linux/tracehook.h>
+#include <linux/psi.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -2161,12 +2162,68 @@ static void high_work_func(struct work_struct *work)
 	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
 }
 
+/*
+ * Clamp the maximum sleep time per allocation batch to 2 seconds. This is
+ * enough to still cause a significant slowdown in most cases, while still
+ * allowing diagnostics and tracing to proceed without becoming stuck.
+ */
+#define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL*HZ)
+
+/*
+ * When calculating the delay, we use these either side of the exponentiation to
+ * maintain precision and scale to a reasonable number of jiffies (see the table
+ * below).
+ *
+ * - MEMCG_DELAY_PRECISION_SHIFT: Extra precision bits while translating the
+ *   overage ratio to a delay.
+ * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down the
+ *   proposed penalty in order to reduce to a reasonable number of jiffies, and
+ *   to produce a reasonable delay curve.
+ *
+ * MEMCG_DELAY_SCALING_SHIFT just happens to be a number that produces a
+ * reasonable delay curve compared to precision-adjusted overage, not
+ * penalising heavily at first, but still making sure that growth beyond the
+ * limit penalises misbehaving cgroups by slowing them down exponentially. For
+ * example, with a high of 100 megabytes:
+ *
+ *  +-------+------------------------+
+ *  | usage | time to allocate in ms |
+ *  +-------+------------------------+
+ *  | 100M  |                      0 |
+ *  | 101M  |                      6 |
+ *  | 102M  |                     25 |
+ *  | 103M  |                     57 |
+ *  | 104M  |                    102 |
+ *  | 105M  |                    159 |
+ *  | 106M  |                    230 |
+ *  | 107M  |                    313 |
+ *  | 108M  |                    409 |
+ *  | 109M  |                    518 |
+ *  | 110M  |                    639 |
+ *  | 111M  |                    774 |
+ *  | 112M  |                    921 |
+ *  | 113M  |                   1081 |
+ *  | 114M  |                   1254 |
+ *  | 115M  |                   1439 |
+ *  | 116M  |                   1638 |
+ *  | 117M  |                   1849 |
+ *  | 118M  |                   2000 |
+ *  | 119M  |                   2000 |
+ *  | 120M  |                   2000 |
+ *  +-------+------------------------+
+ */
+#define MEMCG_DELAY_PRECISION_SHIFT 20
+#define MEMCG_DELAY_SCALING_SHIFT 14
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
  */
 void mem_cgroup_handle_over_high(void)
 {
+	unsigned long usage, high;
+	unsigned long pflags;
+	unsigned long penalty_jiffies, overage;
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
 	struct mem_cgroup *memcg = current->memcg_high_reclaim;
 
@@ -2177,9 +2234,68 @@ void mem_cgroup_handle_over_high(void)
 	memcg = get_mem_cgroup_from_mm(current->mm);
 
 	reclaim_high(memcg, nr_pages, GFP_KERNEL);
-	css_put(&memcg->css);
 	current->memcg_high_reclaim = NULL;
 	current->memcg_nr_pages_over_high = 0;
+
+	/*
+	 * memory.high is breached and reclaim is unable to keep up. Throttle
+	 * allocators proactively to slow down excessive growth.
+	 *
+	 * We use overage compared to memory.high to calculate the number of
+	 * jiffies to sleep (penalty_jiffies). Ideally this value should be
+	 * fairly lenient on small overages, and increasingly harsh when the
+	 * memcg in question makes it clear that it has no intention of stopping
+	 * its crazy behaviour, so we exponentially increase the delay based on
+	 * overage amount.
+	 */
+
+	usage = page_counter_read(&memcg->memory);
+	high = READ_ONCE(memcg->high);
+
+	if (usage <= high)
+		goto out;
+
+	overage = ((u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT) / high;
+	penalty_jiffies = ((u64)overage * overage * HZ)
+		>> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);
+
+	/*
+	 * Factor in the task's own contribution to the overage, such that four
+	 * N-sized allocations are throttled approximately the same as one
+	 * 4N-sized allocation.
+	 *
+	 * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or
+	 * larger the current charge batch is than that.
+	 */
+	penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
+
+	/*
+	 * Clamp the max delay per usermode return so as to still keep the
+	 * application moving forwards and also permit diagnostics, albeit
+	 * extremely slowly.
+	 */
+	penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+
+	/*
+	 * Don't sleep if the amount of jiffies this memcg owes us is so low
+	 * that it's not even worth doing, in an attempt to be nice to those who
+	 * go only a small amount over their memory.high value and maybe haven't
+	 * been aggressively reclaimed enough yet.
+	 */
+	if (penalty_jiffies <= HZ / 100)
+		goto out;
+
+	/*
+	 * If we exit early, we're guaranteed to die (since
+	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
+	 * need to account for any ill-begotten jiffies to pay them off later.
+	 */
+	psi_memstall_enter(&pflags);
+	schedule_timeout_killable(penalty_jiffies);
+	psi_memstall_leave(&pflags);
+
+out:
+	css_put(&memcg->css);
 }
 
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,