Date: Thu, 14 Jul 2022 11:55:01 -0700
From: Zach O'Keefe
To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
    Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park,
    Song Liu, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm@kvack.org
Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
    Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
    Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
    "Kirill A. Shutemov", Matt Turner, Max Filippov, Miaohe Lin,
    Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
    Souptick Joarder
Subject: Re: [RFC] mm: userspace hugepage collapse: file/shmem semantics
References: <20220706235936.2197195-1-zokeefe@google.com>
In-Reply-To: <20220706235936.2197195-1-zokeefe@google.com>

Hey All,

There are still a couple of interface topics (capabilities for
process_madvise(2), errnos) to iron out, but for the most part the
behavior and semantics of MADV_COLLAPSE on anonymous memory seem to be
settled. Thanks to everyone for the time and effort contributed to
that.

Looking forward, I'd like to align on the semantics of file/shmem to
seal MADV_COLLAPSE behavior. This is what I'd propose for an initial
man-page-like description of MADV_COLLAPSE for madvise(2), to paint a
full-picture view:

---8<---
Perform a best-effort synchronous collapse of the native pages mapped
by the memory range into Transparent Hugepages (THPs). MADV_COLLAPSE
operates on the current state of memory for the specified process and
makes no persistent changes or guarantees on how pages will be mapped,
constructed, or faulted in the future. However, for file/shmem memory,
other mappings of this file extent may be queued and processed later by
khugepaged to attempt to update their pagetables to map the hugepage by
a PMD.

If the ranges provided span multiple VMAs, the semantics of the
collapse over each VMA is independent from the others. This implies a
hugepage cannot cross a VMA boundary. If collapse of a given
hugepage-aligned/sized region fails, the operation may continue to
attempt collapsing the remainder of the specified memory.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being copied onto a freshly allocated
hugepage. If the native pages compose the same PTE-mapped hugepage, and
are suitably aligned, the collapse may happen in-place. Unmapped pages
will have their data directly initialized to 0 in the new hugepage.
However, for every eligible hugepage-aligned/sized region to be
collapsed, at least one page must currently be backed by memory.

MADV_COLLAPSE is independent of any THP sysfs setting, both in terms of
determining THP eligibility, and allocation semantics. The VMA must not
be marked VM_NOHUGEPAGE, VM_HUGETLB**, VM_IO, VM_DONTEXPAND,
VM_MIXEDMAP, or VM_PFNMAP, nor can it be stack memory or DAX-backed.
The process must not have PR_SET_THP_DISABLE set. For file-backed
memory, either (1) the file must not be open for write and the mapping
must be executable, or (2) the backing filesystem must support large
pages.

Allocation for the new hugepage may enter direct reclaim and/or
compaction, regardless of VMA flags. When the system has multiple NUMA
nodes, the hugepage will be allocated from the node providing the most
native pages.

If all hugepage-sized/aligned regions covered by the provided range
were either successfully collapsed, or were already PMD-mapped THPs,
this operation will be deemed successful. On successful return, all
hugepage-aligned/sized memory regions provided will be mapped by PMDs.
Note that this doesn't guarantee anything about other possible mappings
of the memory. Note that if the operation fails, more than one failure
might have occurred, since the operation may continue to collapse in
the event collapse of a single hugepage-sized/aligned region fails.

MADV_COLLAPSE is only available if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE, and file/shmem support additionally
requires CONFIG_READ_ONLY_THP_FOR_FS and CONFIG_SHMEM.
---8<---

** Might change with HugeTLB high-granularity mappings[1].

There are a few new items of note here:

1) PMD-mapped on success

MADV_COLLAPSE ultimately wants memory mapped by PMDs, and so I propose
we should always try to actually do the page table updates.
For file/shmem, this means two things: (a) adding support to handle
compound pages (both pte-mapped hugepages and non-HPAGE_PMD_ORDER
compound pages), and (b) doing a final PMD install before returning,
and not relying on subsequent fault. This makes the semantics of
file/shmem the same as anonymous. I call out (a), since there was an
existing debate about this, and so I want to ensure we are aligned[1].
Note that (b), along with presenting a consistent interface to users,
also has real-world use cases where relying on fault is difficult (for
example, shmem + UFFDIO_REGISTER_MODE_MINOR-managed memory). Also note
that for (b), I'm proposing to only do the synchronous PMD install for
the memory range provided - the page table collapse of other mappings
of the memory can be deferred until later (by khugepaged).

2) folio timing && file non-writable, executable mapping

I just want to align on some timing due to ongoing folio work.
Currently, the requirement to be able to collapse file/shmem memory is
that the file not be opened for write anywhere, and that the mapping is
executable, but we'd eventually like to support filesystems that claim
mapping_large_folio_support()[2]. Is it acceptable that future
MADV_COLLAPSE works for either mapping_large_folio_support() or the old
conditions? Alternatively, should MADV_COLLAPSE only support
mapping_large_folio_support() filesystems from the outset? (I believe
shmem and xfs are the only current users.)

3) (shmem) sysfs settings and huge= tmpfs mount

Should we ignore /sys/kernel/mm/transparent_hugepage/shmem_enabled,
similar to how we ignore /sys/kernel/mm/transparent_hugepage/enabled
for anon/file? Does that include "deny"? This choice is (partially)
coupled with the tmpfs huge= mount option. I think today, things work
if we ignore this. However, I don't want to back us into a corner if we
ever want to allow MADV_COLLAPSE to work on writeable shmem mappings
one day (or any other incompatibility I'm unaware of).
One option: if in (2) we choose to allow the old conditions, we could
ignore shmem_enabled in the non-writable, executable case, and
otherwise defer to "if the filesystem supports it", where we would then
respect huge=.

I think those are the important points. Am I missing anything?

Thanks again everyone for taking the time to read and discuss,

Best,
Zach

[1] https://lore.kernel.org/linux-mm/20220624173656.2033256-23-jthoughton@google.com/
[2] https://lore.kernel.org/linux-mm/YpGbnbi44JqtRg+n@casper.infradead.org/
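
To make the per-VMA and per-region rules in the proposed description a
bit more concrete, here is a minimal sketch of the range arithmetic and
the success rule. It is only an illustration under stated assumptions,
not the kernel implementation: the function names, the result labels
("collapsed", "already_pmd_mapped", "failed"), the 2 MiB hugepage size,
and the choice to round the range inward to hugepage alignment are all
hypothetical.

```python
# Hypothetical sketch; names and the 2 MiB size are assumptions, not kernel API.
HPAGE_PMD_SIZE = 2 * 1024 * 1024  # assumed PMD hugepage size (x86-64, 4 KiB base pages)


def hugepage_regions(start, end, vmas, hpage_size=HPAGE_PMD_SIZE):
    """Return hugepage-aligned/sized regions of [start, end), one VMA at a time.

    vmas is a list of (vma_start, vma_end) half-open address ranges.
    """
    regions = []
    for vma_start, vma_end in vmas:
        # Clamp the request to this VMA so no region crosses a VMA boundary.
        lo = max(start, vma_start)
        hi = min(end, vma_end)
        # Assumption: round the start up and the end down to hugepage alignment.
        first = -(-lo // hpage_size) * hpage_size
        last = (hi // hpage_size) * hpage_size
        for addr in range(first, last, hpage_size):
            regions.append((addr, addr + hpage_size))
    return regions


def collapse_succeeded(results):
    """Success only if every region was collapsed or was already PMD-mapped."""
    return all(r in ("collapsed", "already_pmd_mapped") for r in results)


MiB = 1024 * 1024
vmas = [(1 * MiB, 5 * MiB), (5 * MiB, 9 * MiB)]
regions = hugepage_regions(0, 10 * MiB, vmas)
print([(a // MiB, b // MiB) for a, b in regions])  # prints [(2, 4), (6, 8)]
```

With two adjacent VMAs meeting at 5 MiB, the aligned region covering
4-6 MiB is skipped because it would cross the VMA boundary, so only the
2-4 MiB and 6-8 MiB regions are candidates. The overall call would then
be deemed successful only if collapse_succeeded() holds for the results
of those candidates.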