Date: Tue, 16 Mar 2021 06:26:58 -0400
From: Johannes Weiner
To: Arjun Roy
Cc: akpm@linux-foundation.org, davem@davemloft.net, netdev@vger.kernel.org,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org,
 arjunroy@google.com, shakeelb@google.com, edumazet@google.com,
 soheil@google.com, kuba@kernel.org, mhocko@kernel.org, shy828301@gmail.com,
 guro@fb.com
Subject: Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy
References: <20210316041645.144249-1-arjunroy.kdev@gmail.com>
In-Reply-To: <20210316041645.144249-1-arjunroy.kdev@gmail.com>

Hello,

On Mon, Mar 15, 2021 at 09:16:45PM -0700, Arjun Roy wrote:
> From: Arjun Roy
>
> TCP zerocopy receive is used by high performance network applications
> to further scale. For RX zerocopy, the memory containing the network
> data filled by the network driver is directly mapped into the address
> space of high performance applications. To keep the TLB cost low,
> these applications unmap the network memory in big batches. So, this
> memory can remain mapped for long time. This can cause a memory
> isolation issue as this memory becomes unaccounted after getting
> mapped into the application address space. This patch adds the memcg
> accounting for such memory.
>
> Accounting the network memory comes with its own unique challenges.
> The high performance NIC drivers use page pooling to reuse the pages
> to eliminate/reduce expensive setup steps like IOMMU.
> These drivers keep an extra reference on the pages and thus we can not
> depend on the page reference for the uncharging. The page in the pool
> may keep a memcg pinned for arbitrary long time or may get used by
> other memcg.

The page pool knows when a page is unmapped again and becomes
available for recycling, right? Essentially the 'free' phase of that
private allocator. That's where the uncharge should be done.

For one, it's more aligned with the usual memcg charge lifetime rules.
But also it doesn't add what is essentially a private driver callback
to the generic file unmapping path.

Finally, this will eliminate the need for making up a new charge type
(MEMCG_DATA_SOCK) and allow using the standard kmem charging API.

> This patch decouples the uncharging of the page from the refcnt and
> associates it with the map count i.e. the page gets uncharged when the
> last address space unmaps it. Now the question is, what if the driver
> drops its reference while the page is still mapped? That is fine as
> the address space also holds a reference to the page i.e. the
> reference count can not drop to zero before the map count.
>
> Signed-off-by: Arjun Roy
> Co-developed-by: Shakeel Butt
> Signed-off-by: Shakeel Butt
> Signed-off-by: Eric Dumazet
> Signed-off-by: Soheil Hassas Yeganeh
> ---
>
> Changelog since v1:
> - Pages accounted for in this manner are now tracked via MEMCG_SOCK.
> - v1 allowed for a brief period of double-charging, now we have a
>   brief period of under-charging to avoid undue memory pressure.

I'm afraid we'll have to go back to v1.

Let's address the issues raised with it:

1. The NR_FILE_MAPPED accounting. It is longstanding Linux behavior
   that driver pages mapped into userspace are accounted as file
   pages, because userspace is actually doing mmap() against a driver
   file/fd (as opposed to an anon mmap). That is how they show up in
   vmstat, in meminfo, and in the per process stats. There is no
   reason to make memcg deviate from this.
   If we don't like it, it should be taken on by changing
   vm_insert_page() - not trick rmap into thinking these aren't memcg
   pages and then fixing it up with additional special-cased
   accounting callbacks.

   v1 did this right, it charged the pages the way we handle all other
   userspace pages: before rmap, and then let the generic VM code do
   the accounting for us with the cgroup-aware vmstat infrastructure.

2. The double charging. Could you elaborate how much we're talking
   about in any given batch? Is this a problem worth worrying about?

   The way I see it, any conflict here is caused by the pages being
   counted in the SOCK counter already, but not actually *tracked* on
   a per page basis. If it's worth addressing, we should look into
   fixing the root cause over there first if possible, before trying
   to work around it here.

   The newly-added GFP_NOFAIL is especially worrisome. The pages
   should be charged before we make promises to userspace, not be
   force-charged when it's too late.

   We have sk context when charging the inserted pages. Can we
   uncharge MEMCG_SOCK after each batch of inserts? That's only 32
   pages worth of overcharging, so not more than the regular charge
   batch memcg is using.

   An even better way would be to do charge stealing where we reuse
   the existing MEMCG_SOCK charges and don't have to get any new ones
   at all - just set up page->memcg and remove the charge from the sk.

   But yeah, it depends a bit if this is a practical concern.

Thanks,
Johannes