Date: Tue, 28 May 2019 19:06:57 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Andrea Arcangeli
Cc: Andrew Morton, Mel Gorman, Michal Hocko, Vlastimil Babka, Zi Yan,
    Stefan Priebe - Profihost AG, "Kirill A. Shutemov",
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] Revert "mm, thp: restore node-local hugepage allocations"
In-Reply-To: <20190524202931.GB11202@redhat.com>
References: <20190503223146.2312-1-aarcange@redhat.com>
 <20190503223146.2312-3-aarcange@redhat.com>
 <20190520153621.GL18914@techsingularity.net>
 <20190523175737.2fb5b997df85b5d117092b5b@linux-foundation.org>
 <20190524202931.GB11202@redhat.com>

On Fri, 24 May 2019, Andrea Arcangeli wrote:

> > > We are going in circles, *yes* there is a problem for potential swap
> > > storms today because of the poor interaction between memory compaction
> > > and directed reclaim, but this is a result of a poor API that does not
> > > allow userspace to specify that its workload really will span multiple
> > > sockets so faulting remotely is the best course of action.  The fix is
> > > not to cause regressions for others who have implemented a userspace
> > > stack that is based on the past 3+ years of long standing behavior, or
> > > for specialized workloads where it is known that they span multiple
> > > sockets so we want some kind of different behavior.  We need to provide
> > > a clear and stable API to define these terms for the page allocator
> > > that is independent of any global setting of thp enabled, defrag,
> > > zone_reclaim_mode, etc.  It's workload dependent.
> > 
> > um, who is going to do this work?
> 
> That's a good question.  It's not going to be a simple patch to backport
> to -stable: it'll be intrusive and it will affect mm/page_alloc.c
> significantly, so it'll reject heavily.  I wouldn't consider it -stable
> material, at least in the short term; it will require some testing.
> 

Hi Andrea,

I'm not sure which patch you're referring to, unfortunately.  The above
comment was referring to APIs made available to userspace to define when
to fault locally vs. remotely and what the preference should be for any
form of compaction or reclaim to achieve that.  Today we have global
enabling options, global defrag settings, enabling prctls, and madvise
options.  The point it makes is that whether a specific workload fits
into a single socket is workload dependent, and thus we are left with
prctls and madvise options.  The prctl either enables thp or it doesn't,
so it is not interesting here; the madvise is overloaded in four
different ways (enabling, stalling at fault, collapsibility, defrag), so
it's not surprising that continuing to overload it for existing users
will cause undesired results.  It makes the argument that we need a
clear and stable means of defining the behavior, not changing the 4+
year behavior and giving those who regress no workaround.

> This is why applying a simple fix that avoids the swap storms (and the
> swap-less pathological THP regression for vfio device assignment GUP
> pinning) is preferable before adding an alloc_pages_multi_order (or
> equivalent), so that it'll be the allocator that decides when exactly
> to fall back from 2M to 4k depending on the NUMA distance and memory
> availability during the zonelist walk.  The basic idea is to call
> alloc_pages just once (not first for 2M and then for 4k) and let
> alloc_pages decide which page "order" to return.
> 
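(For concreteness, a minimal sketch of how I read the single-call idea
above.  alloc_pages_multi_order() is a hypothetical name and signature;
a real implementation would make the order decision inside the zonelist
walk, weighing NUMA distance and watermarks, rather than retrying over
orders as this simplified version does.)

#include <linux/gfp.h>
#include <linux/huge_mm.h>

/*
 * Purely illustrative: the THP fault path asks once, and the allocator
 * decides whether to hand back an order-9 or an order-0 page.
 */
struct page *alloc_pages_multi_order(gfp_t gfp, int preferred_nid,
				     unsigned int *order)
{
	struct page *page;

	/* Lightweight local 2M attempt: no direct reclaim, no swap storm. */
	page = __alloc_pages_nodemask(gfp | __GFP_THISNODE | __GFP_NORETRY,
				      HPAGE_PMD_ORDER, preferred_nid, NULL);
	if (page) {
		*order = HPAGE_PMD_ORDER;
		return page;
	}

	/* Fall back to a base page, allowing remote nodes. */
	*order = 0;
	return __alloc_pages_nodemask(gfp, 0, preferred_nid, NULL);
}

Whether the high-order attempt keeps __GFP_THISNODE is exactly the
policy question being debated here; the sketch only illustrates the
single-call shape of the interface.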
The commit description doesn't mention the swap storms that you're
trying to fix; it's probably better to describe that again, and why it
is not beneficial to swap unless an entire pageblock can become free or
memory compaction has indicated that additional memory freeing would
allow migration to make an entire pageblock free.  I understand that's
an invasive code change, but merging this patch changes the 4+ year
behavior that started here:

commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
Author: Aneesh Kumar K.V
Date:   Wed Feb 11 15:27:12 2015 -0800

    mm/thp: allocate transparent hugepages on local node

And that commit's description describes quite well the regression that
we encounter if we remove __GFP_THISNODE here.  That's because the
access latency regression is much more substantial than what was
reported for Naples in your changelog.

In the interest of making forward progress, can we agree that swapping
from the local node *never* makes sense unless we can show that an
entire pageblock can become free, or that it enables memory compaction
to migrate memory that can make an entire pageblock free?  Are you
reporting swap storms for the local node when one of these is true?

> > Implementing a new API doesn't help existing userspace which is
> > hurting from the problem which this patch addresses.
> 
> Yes, we can't change all apps that may not fit in a single NUMA node.
> Currently it's unsafe to set "transparent_hugepage/defrag" to
> "always", or the bad behavior can then materialize also outside of
> MADV_HUGEPAGE.  Those apps that use MADV_HUGEPAGE on their long-lived
> allocations (i.e. guest physical memory), like qemu, are affected even
> with the default "defrag = madvise".  Those apps have been using
> MADV_HUGEPAGE for more than 3 years, and they are widely used and open
> source, of course.
> 

I continue to reiterate that the 4+ year long standing behavior of
MADV_HUGEPAGE is overloaded; you are anticipating a specific behavior
for workloads that do not fit in a single NUMA node, whereas other
users developed in the past four years are anticipating a different
behavior.  I'm trying to propose solutions that cannot cause
regressions for any user, such as the prctl() example that is inherited
across fork and can be used to define the behavior.  This could be a
very trivial extension to prctl(PR_SET_THP_DISABLE) or it could be a
more elaborate addition.  This would be set by any thread that forks
qemu and can define that the workload prefers remote hugepages because
it spans more than one node.

Certainly we should agree that the majority of Linux workloads do not
span more than one socket.  However, it *may* be possible to define
this as a global thp setting since most machines that run large guests
are only running large guests, so the default machine-level policy can
reflect that.
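To make the userspace side concrete, a minimal sketch follows.  The
madvise() call is the existing interface that qemu-like workloads
already use on guest RAM; the prctl() option and its value are purely
hypothetical placeholders for the kind of inherited, process-wide
policy described above:

#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/*
 * PR_SET_THP_POLICY and THP_PREFER_REMOTE_HUGEPAGES are hypothetical
 * names and values used only to illustrate the proposal; they do not
 * exist in any kernel.  The intent: the process that forks qemu sets
 * the policy once, the child inherits it, and the policy says "this
 * workload spans nodes, prefer remote hugepages over local reclaim".
 */
#define PR_SET_THP_POLICY		1000	/* hypothetical */
#define THP_PREFER_REMOTE_HUGEPAGES	1	/* hypothetical */

int main(void)
{
	size_t guest_size = 64UL << 30;		/* guest larger than one node */
	void *guest_ram;

	/* Hypothetical: declare the intent before faulting guest memory. */
	prctl(PR_SET_THP_POLICY, THP_PREFER_REMOTE_HUGEPAGES, 0, 0, 0);

	/* Existing interface: mark long-lived guest RAM as MADV_HUGEPAGE. */
	guest_ram = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (guest_ram == MAP_FAILED)
		return 1;
	madvise(guest_ram, guest_size, MADV_HUGEPAGE);

	/* ... fork()/exec() the guest; THP faults now follow the policy ... */
	return 0;
}

Whether this ends up as an extension of PR_SET_THP_DISABLE or a
separate knob matters less than the fact that it is inherited across
fork and does not further overload MADV_HUGEPAGE.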