From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6E6B5C4727F for ; Mon, 28 Sep 2020 21:03:58 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 8E520208FE for ; Mon, 28 Sep 2020 21:03:57 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=cmpxchg-org.20150623.gappssmtp.com header.i=@cmpxchg-org.20150623.gappssmtp.com header.b="Vtkaz3mp" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8E520208FE Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=cmpxchg.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id B563B6B005C; Mon, 28 Sep 2020 17:03:56 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B07558E0001; Mon, 28 Sep 2020 17:03:56 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9F4F76B0062; Mon, 28 Sep 2020 17:03:56 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0144.hostedemail.com [216.40.44.144]) by kanga.kvack.org (Postfix) with ESMTP id 890666B005C for ; Mon, 28 Sep 2020 17:03:56 -0400 (EDT) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 470EC180AD804 for ; Mon, 28 Sep 2020 21:03:56 +0000 (UTC) X-FDA: 77313697272.29.spy59_3d1262b27184 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin29.hostedemail.com (Postfix) with ESMTP id 1E85018086CD9 for ; Mon, 28 Sep 2020 21:03:56 +0000 (UTC) X-HE-Tag: spy59_3d1262b27184 X-Filterd-Recvd-Size: 12051 Received: from mail-qk1-f193.google.com (mail-qk1-f193.google.com [209.85.222.193]) by imf42.hostedemail.com (Postfix) with ESMTP for ; Mon, 28 Sep 2020 21:03:55 +0000 (UTC) Received: by mail-qk1-f193.google.com with SMTP id f142so2392538qke.13 for ; Mon, 28 Sep 2020 14:03:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=SGBe8MpSknH0qUb+TQ90a6ftrB2Ede16bHQ7TXX42Tw=; b=Vtkaz3mp9a/NY/motnU4x+MiinMLrHa+BomCSZqvd/94lUeghj9zTyK0isXJQDPNcZ tIYJgFj8bxYSYVfrutfK7NslWH0FKdNxyQM2TUh7RbX/+k5Aq68v2co3gbvLBbnA+Y9b Q0tM24OtlAKlE2lkaxPXRaXVYKLiz/2HCaYmRDwKYzf1jL7irKZtZV9hEkmR8C6OyFBt tED7c83d22hWrtTlz3MY07Co8qUTaNZB9zUYrG+EE1nW8I1KqoxJQ15erhYZbNPmCJuO 0nVk/EQMhSqQksp50gMeUB2JNwP5iWRpAmi+E5fNP3Fn70jG1HKW6OdaSlS4CwxiJu6W Kmzg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=SGBe8MpSknH0qUb+TQ90a6ftrB2Ede16bHQ7TXX42Tw=; b=DeQ2E9d1e3fWYyIwW8dpoK+dbuPMG5fLuBD2OR4cFSz4vjxt3C/Rn7PV0Da5aFjD0g aoUeWcGgrQZmOBkP4fpchisIuijfIqyAIYEFy+B2RvO5Bq/H9enwLjORgWu889pKpjtM bFwhXnntLBiNMd9EaIo1yT0f65oXFxiI6tTWJO506e1Q+ApA7euSq13L0E9yJICUqqA8 76BEiTLfW2ZBEPiCqeb1ZBVRw1HqrzQh2w97H7/hXcAD7zvlAI3FmDzQzEKhRTiewDey Mjx9ZX9oaaZ+veCFSpWJI6rT2Z/SZy17U8cXBgQ4urUx80qlukW0PVgRYeBsrS/z1hY8 km+w== X-Gm-Message-State: AOAM532r96o/x/A6oxr1CstvyT6sJmYmyFw+2wTE3jXu5zv2PdsCI40g ZOcRlPEd/c+W9W26O5TUyCYPbw== X-Google-Smtp-Source: ABdhPJy9DN7eukM57oH76vW7yWC/hkgBmIqLB2xSOGXpfF8h03OXKgaYAK8Ds39zw0xS7+3LjVvUYQ== X-Received: by 2002:a05:620a:1583:: with SMTP id d3mr1317180qkk.495.1601327034521; Mon, 28 Sep 2020 14:03:54 -0700 (PDT) Received: from localhost ([2620:10d:c091:480::1:3928]) by smtp.gmail.com with ESMTPSA id y46sm2665377qtc.30.2020.09.28.14.03.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 28 Sep 2020 14:03:53 -0700 (PDT) Date: Mon, 28 Sep 2020 17:02:16 -0400 From: Johannes Weiner To: Shakeel Butt Cc: Roman Gushchin , Michal Hocko , Yang Shi , Greg Thelen , David Rientjes , Michal =?iso-8859-1?Q?Koutn=FD?= , Andrew Morton , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface Message-ID: <20200928210216.GA378894@cmpxchg.org> References: <20200909215752.1725525-1-shakeelb@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200909215752.1725525-1-shakeelb@google.com> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hello, I apologize for the late reply. The proposed interface has been an ongoing topic and area of experimentation within Facebook as well, which makes it a bit difficult to respond with certainty here. I agree with both your usecases. They apply to us as well. We currently make two small changes to our kernel to solve them. They work okay-ish in our production environment, but they aren't quite there yet, and not ready for upstream. Some thoughts and comments below. On Wed, Sep 09, 2020 at 02:57:52PM -0700, Shakeel Butt wrote: > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use cases: > ---------- > > 1) Per-memcg uswapd: > > Usually applications consists of combination of latency sensitive and > latency tolerant tasks. For example, tasks serving user requests vs > tasks doing data backup for a database application. At the moment the > kernel does not differentiate between such tasks when the application > hits the memcg limits. So, potentially a latency sensitive user facing > task can get stuck in high reclaim and be throttled by the kernel. > > Similarly there are cases of single process applications having two set > of thread pools where threads from one pool have high scheduling > priority and low latency requirement. One concrete example from our > production is the VMM which have high priority low latency thread pool > for the VCPUs while separate thread pool for stats reporting, I/O > emulation, health checks and other managerial operations. The kernel > memory reclaim does not differentiate between VCPU thread or a > non-latency sensitive thread and a VCPU thread can get stuck in high > reclaim. > > One way to resolve this issue is to preemptively trigger the memory > reclaim from a latency tolerant task (uswapd) when the application is > near the limits. Finding 'near the limits' situation is an orthogonal > problem. I don't think a userspace implementation is suitable for this purpose. Kswapd-style background reclaim is beneficial to probably 99% of all workloads. Because doing reclaim inside the execution stream of the workload itself is so unnecessary in a multi-CPU environment, whether the workload is particularly latency sensitive or only cares about overall throughput. In most cases, spare cores are available to do this work concurrently, and the buffer memory required between the workload and the async reclaimer tends to be negligible. Requiring non-trivial userspace participation for such a basic optimization does not seem like a good idea to me. We'd probably end up with four or five hyperscalers having four or five different implementations, and not much user coverage beyond that. I floated this patch before: https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ It's blocked on infrastructure work in the CPU controller that allows accounting CPU cycles spent on behalf of other cgroups. But we need this functionality in other places as well - network, async filesystem encryption, various other stuff bounced to workers. > 2) Proactive reclaim: > > This is a similar to the previous use-case, the difference is instead of > waiting for the application to be near its limit to trigger memory > reclaim, continuously pressuring the memcg to reclaim a small amount of > memory. This gives more accurate and uptodate workingset estimation as > the LRUs are continuously sorted and can potentially provide more > deterministic memory overcommit behavior. The memory overcommit > controller can provide more proactive response to the changing behavior > of the running applications instead of being reactive. This is an important usecase for us as well. And we use it not just to keep the LRUs warm, but to actively sample the workingset size - the true amount of memory required, trimmed of all its unused cache and cold pages that can be read back from disk on demand. For this purpose, we're essentially using memory.high right now. The only modification we make here is adding a memory.high.tmp variant that takes a timeout argument in addition to the limit. This ensures we don't leave an unsafe limit behind if the userspace daemon crashes. We have experienced some of the problems you describe below with it. > Benefit of user space solution: > ------------------------------- > > 1) More flexible on who should be charged for the cpu of the memory > reclaim. For proactive reclaim, it makes more sense to centralized the > overhead while for uswapd, it makes more sense for the application to > pay for the cpu of the memory reclaim. Agreed on both counts. > 2) More flexible on dedicating the resources (like cpu). The memory > overcommit controller can balance the cost between the cpu usage and > the memory reclaimed. This could use some elaboration, I think. > 3) Provides a way to the applications to keep their LRUs sorted, so, > under memory pressure better reclaim candidates are selected. This also > gives more accurate and uptodate notion of working set for an > application. That's a valid argument for proactive reclaim, and I agree with it. But not necessarily an argument for which part of the proactive reclaim logic should be in-kernel and which should be in userspace. > Questions: > ---------- > > 1) Why memory.high is not enough? > > memory.high can be used to trigger reclaim in a memcg and can > potentially be used for proactive reclaim as well as uswapd use cases. > However there is a big negative in using memory.high. It can potentially > introduce high reclaim stalls in the target application as the > allocations from the processes or the threads of the application can hit > the temporary memory.high limit. That's something we have run into as well. Async memory.high reclaim helps, but when proactive reclaim does bigger steps and lowers the limit below the async reclaim buffer, the workload can still enter direct reclaim. This is undesirable. > Another issue with memory.high is that it is not delegatable. To > actually use this interface for uswapd, the application has to introduce > another layer of cgroup on whose memory.high it has write access. Fair enough. I would generalize that and say that limiting the maximum container size and driving proactive reclaim are separate jobs, with separate goals, happening at different layers of the system. Sharing a single control knob for that can be a coordination nightmare. > 2) Why uswapd safe from self induced reclaim? > > This is very similar to the scenario of oomd under global memory > pressure. We can use the similar mechanisms to protect uswapd from self > induced reclaim i.e. memory.min and mlock. Agreed. > Interface options: > ------------------ > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > trigger reclaim in the target memory cgroup. This gets the assigning and attribution of targeted reclaim work right (although it doesn't solve kswapd cycle attribution yet). However, it also ditches the limit semantics, which themselves aren't actually a problem. And indeed I would argue have some upsides. As per above, the kernel knows best when and how much to reclaim to match the allocation rate, since it's in control of the allocation path. To do proactive reclaim with the memory.reclaim interface, you would need to monitor memory consumption closely. Workloads may not allocate anything for hours, and then suddenly allocate gigabytes within seconds. A sudden onset of streaming reads through the filesystem could destroy the workingset measurements, whereas a limit would catch it and do drop-behind (and thus workingset sampling) at the exact rate of allocations. Again I believe something that may be doable as a hyperscale operator, but likely too fragile to get wider applications beyond that. My take is that a proactive reclaim feature, whose goal is never to thrash or punish but to keep the LRUs warm and the workingset trimmed, would ideally have: - a pressure or size target specified by userspace but with enforcement driven inside the kernel from the allocation path - the enforcement work NOT be done synchronously by the workload (something I'd argue we want for *all* memory limits) - the enforcement work ACCOUNTED to the cgroup, though, since it's the cgroup's memory allocations causing the work (again something I'd argue we want in general) - a delegatable knob that is independent of setting the maximum size of a container, as that expresses a different type of policy - if size target, self-limiting (ha) enforcement on a pressure threshold or stop enforcement when the userspace component dies Thoughts?