Date: Thu, 29 Sep 2022 21:35:45 -0400
From: Peter Xu <peterx@redhat.com>
To: Mike Kravetz
Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
 akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, llvm@lists.linux.dev, nathan@kernel.org,
 ndesaulniers@google.com, songmuchun@bytedance.com,
 syzkaller-bugs@googlegroups.com, trix@redhat.com
Subject: Re: [syzbot] general protection fault in PageHeadHuge
References: <7693a84-bdc2-27b5-2695-d0fe8566571f@google.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="6BJDNHv8LiVvXy5W"

--6BJDNHv8LiVvXy5W
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Thu, Sep 29, 2022 at 09:29:54PM -0400, Peter Xu wrote:
> Hi, Mike,
>
> On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> > I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not
> > make any difference.  Suggestions would be appreciated as
> > cache/tlb/??? flushing issues take me a while to figure out.
>
> It seems the UFFDIO_COPY for hugetlb is the major issue here, in that
> for private mappings we don't inject the new page into the page cache.
>
> I think that made some sense, in that e.g. we don't want to allow a
> private mapping to write to the page cache.  But afaict it's not
> correct, because UFFDIO_COPY resolves page faults exactly at the page
> cache layer for file-backed memories.  So what we should do is inject
> the page into the page cache but map it read-only, waiting for a CoW
> to happen later if needed.
>
> I'll attach one fix that starts to inject the page into the page cache
> for UFFDIO_COPY+hugetlb even if the mapping is private.  A second, test
> patch is needed on top, because otherwise the private hugetlb selftest
> won't work once the fix is applied - the selftest used to rely on
> MADV_DONTNEED to drop the private mapping, but IMHO that's not enough:
> after the fix we need to drop the page cache too.  That test patch is
> attached as well.
>
> Feel free to try with the two patches applied.  The current issue
> stopped reproducing for me with them.
>
> I didn't post them yet because after applying the two patches I found
> other issues - the reserved pages get messed up and leaked.  I'll keep
> looking into the leak tomorrow, but please also let me know if you spot
> anything suspicious, as you're definitely much more fluent in the
> reservation code than I am.
>
> And that's not the only issue I found - shmem can have other issues
> with private mappings; shmem does the page cache insertion right, but
> not the rest, I think..  I'll look into them one by one.  It's quite
> interesting to dig multiple things out of the write-check symptoms..
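
To make the reproducing sequence concrete, from the userspace side it
boils down to roughly the below (a minimal sketch rather than a full
reproducer - it assumes a couple of free 2MB huge pages and a kernel
with hugetlb uffd-wp support, and it skips the fault-handler thread and
most error handling):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)

int main(void)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	char *dst, *src;

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return 1;

	/* (1) map hugetlb privately, register with uffd missing+wp */
	dst = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)dst, .len = HPAGE_SIZE },
		.mode = UFFDIO_REGISTER_MODE_MISSING |
			UFFDIO_REGISTER_MODE_WP,
	};
	if (dst == MAP_FAILED || ioctl(uffd, UFFDIO_REGISTER, &reg))
		return 1;

	/* (2) a read of dst would now block, raising a MISSING event */

	/* (3) resolve the fault with UFFDIO_COPY, keep it wr-protected */
	src = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (src == MAP_FAILED)
		return 1;
	memset(src, 1, HPAGE_SIZE);
	struct uffdio_copy copy = {
		.dst = (unsigned long)dst,
		.src = (unsigned long)src,
		.len = HPAGE_SIZE,
		.mode = UFFDIO_COPY_MODE_WP,
	};
	if (ioctl(uffd, UFFDIO_COPY, &copy))
		return 1;

	/* (4) a write to dst should now raise a WP fault, but without
	 * the fix it raises MISSING again, since nothing was ever
	 * inserted into the page cache for the private mapping */
	return 0;
}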

The patches..

-- 
Peter Xu

--6BJDNHv8LiVvXy5W
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment;
	filename="0001-mm-hugetlb-Insert-page-cache-for-UFFDIO_COPY-even-if.patch"

From 1255fabde4f950bef4751ef337791b93f5373a1f Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 28 Sep 2022 17:44:00 -0400
Subject: [PATCH 1/2] mm/hugetlb: Insert page cache for UFFDIO_COPY even if
 private

UFFDIO_COPY resolves page faults in the page cache layer for file-backed
memories on shmem and hugetlbfs.  That means for each UFFDIO_COPY we
should inject the new page into the page cache, no matter whether the
mapping is private or shared.

We used to skip the insertion for private mappings, probably because the
page cache should not be writable by a privately mapped process.
However that can be guaranteed by simply removing the write bit (as this
patch does), so that a CoW will trigger properly on the page cache page.

Leaving the page cache empty can lead to the below sequence:

  (1) map hugetlb privately, register with uffd missing+wp
  (2) read page, trigger MISSING event with READ
  (3) UFFDIO_COPY(wp=1) resolves the page fault, keeps it wr-protected
  (4) write page, trigger MISSING event again with WRITE (because the
      page cache is still missing!)

This behavior has existed since the initial commit of hugetlb MISSING
mode support, commit 1c9e8def43a3 ("userfaultfd: hugetlbfs: add
UFFDIO_COPY support for shared mappings", 2017-02-22).

In most cases it is fine, as long as the hugetlb page/pte stays stable
(e.g., no wr-protect, no MADV_DONTNEED, ...).  However if a further page
fault is triggered for any reason, it can cause issues.  Recently, due
to the newly introduced uffd-wp on hugetlbfs and a recent locking rework
from Mike, we can easily fail the userfaultfd kselftest with hugetlb
private mappings.

A further step would be to do an early CoW when the private mapping is
writable, but let's leave that for later.

Cc: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Andrew Morton
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9679fe519b90..a43fc6852f27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5933,14 +5933,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
-	bool page_in_pagecache = false;
 
 	if (is_continue) {
 		ret = -EFAULT;
 		page = find_lock_page(mapping, idx);
 		if (!page)
 			goto out;
-		page_in_pagecache = true;
 	} else if (!*pagep) {
 		/* If a page already exists, then it's UFFDIO_COPY for
 		 * a non-missing case. Return -EEXIST.
@@ -6014,8 +6012,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 */
 	__SetPageUptodate(page);
 
-	/* Add shared, newly allocated pages to the page cache. */
-	if (vm_shared && !is_continue) {
+	/* Add newly allocated pages to the page cache for UFFDIO_COPY. */
+	if (!is_continue) {
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		ret = -EFAULT;
 		if (idx >= size)
@@ -6030,7 +6028,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		ret = hugetlb_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
-		page_in_pagecache = true;
 	}
 
 	ptl = huge_pte_lock(h, dst_mm, dst_pte);
@@ -6044,18 +6041,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
-	if (page_in_pagecache) {
-		page_dup_file_rmap(page, true);
-	} else {
-		ClearHPageRestoreReserve(page);
-		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
-	}
+	page_dup_file_rmap(page, true);
 
 	/*
-	 * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
-	 * with wp flag set, don't set pte write bit.
+	 * For either: (1) a non-shared VMA, or (2) UFFDIO_COPY with wp
+	 * flag set, don't set pte write bit.
 	 */
-	if (wp_copy || (is_continue && !vm_shared))
+	if (wp_copy || !vm_shared)
 		writable = 0;
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
@@ -6083,18 +6075,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	spin_unlock(ptl);
 	if (!is_continue)
 		SetHPageMigratable(page);
-	if (vm_shared || is_continue)
-		unlock_page(page);
+	unlock_page(page);
 	ret = 0;
 out:
 	return ret;
 out_release_unlock:
 	spin_unlock(ptl);
-	if (vm_shared || is_continue)
-		unlock_page(page);
+	unlock_page(page);
 out_release_nounlock:
-	if (!page_in_pagecache)
-		restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
 }
-- 
2.32.0
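
The second patch below exists because, once patch 1 injects pages into
the page cache even for private mappings, releasing a privately mapped
test area takes two steps rather than one.  Distilled, the release path
the selftest ends up with is roughly this (a sketch of the idea, not
the patch itself; fd/offset describe where the area was mmap()ed from):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Fully release a private, fd-backed hugetlb range */
static int release_private_range(int fd, char *addr, off_t offset,
				 size_t len)
{
	/* Zap pgtables and any private (CoW'ed) copies... */
	if (madvise(addr, len, MADV_DONTNEED))
		return -1;
	/* ...then evict the page cache backing [offset, offset+len) */
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}

MADV_REMOVE would combine the two for shared mappings, but it doesn't
help for private ones, hence the extra ptr<->offset bookkeeping in the
patch.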

--6BJDNHv8LiVvXy5W
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment;
	filename="0002-selftests-vm-Use-memfd-for-hugetlb-tests.patch"

From 046ba2d1f5a74d86c6546a4f1bc8081f7c0fd3f8 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 29 Sep 2022 11:55:26 -0400
Subject: [PATCH 2/2] selftests/vm: Use memfd for hugetlb tests

We already use a memfd for the shmem test; move hugetlb over to it too,
so that the user no longer needs to specify a hugetlbfs file path
explicitly when running the hugetlb shared tests.

More importantly, this patch starts to correctly release hugetlb pages
with fallocate(FALLOC_FL_PUNCH_HOLE) even for private mappings, because
for private mappings MADV_DONTNEED is not enough when testing
UFFDIO_COPY: it only zaps the pgtables without evicting the page cache.

Unfortunately we need to cache the ptr<->offset relationship for hugetlb
(by adding mem_fd_[src|dst]_offset) in order to use fallocate(), because
punching a hole is the only way to evict the page cache of a private
mapping; IOW, MADV_REMOVE doesn't work for it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 98 +++++++++++++-----------
 1 file changed, 53 insertions(+), 45 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 74babdbc02e5..4561e9d443e4 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -93,10 +93,11 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
-
 static bool map_shared;
-static int shm_fd;
-static int huge_fd;
+static int mem_fd;
+/* File offset for area_src/area_dst upon mem_fd. Needed for hole punching */
+static off_t mem_fd_src_offset;
+static off_t mem_fd_dst_offset;
 static unsigned long long *count_verify;
 static int uffd = -1;
 static int uffd_flags, finished, *pipefd;
@@ -247,48 +248,59 @@ static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset)
 {
 }
 
+static off_t ptr_to_offset(char *ptr)
+{
+	if (ptr == area_src)
+		return mem_fd_src_offset;
+	else
+		return mem_fd_dst_offset;
+}
+
 static void hugetlb_release_pages(char *rel_area)
 {
+	off_t size = nr_pages * page_size, offset = ptr_to_offset(rel_area);
+
 	if (!map_shared) {
 		if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED))
 			err("madvise(MADV_DONTNEED) failed");
-	} else {
-		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
-			err("madvise(MADV_REMOVE) failed");
-	}
+
+		/* Evict page cache explicitly for private mappings */
+		if (fallocate(mem_fd,
+			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+			      offset, size))
+			err("fallocate(FALLOC_FL_PUNCH_HOLE) failed");
+	} else {
+		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
+			err("madvise(MADV_REMOVE) failed");
+	}
 }
 
 static void hugetlb_allocate_area(void **alloc_area, bool is_src)
 {
+	off_t size = nr_pages * page_size;
+	off_t offset = is_src ? 0 : size;
 	void *area_alias = NULL;
 	char **alloc_area_alias;
 
-	if (!map_shared)
-		*alloc_area = mmap(NULL,
-			nr_pages * page_size,
-			PROT_READ | PROT_WRITE,
-			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
-				(is_src ? 0 : MAP_NORESERVE),
-			-1,
-			0);
-	else
-		*alloc_area = mmap(NULL,
-			nr_pages * page_size,
-			PROT_READ | PROT_WRITE,
-			MAP_SHARED |
-				(is_src ? 0 : MAP_NORESERVE),
-			huge_fd,
-			is_src ? 0 : nr_pages * page_size);
+	*alloc_area = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			   (map_shared ? MAP_SHARED : MAP_PRIVATE) |
+			   (is_src ? 0 : MAP_NORESERVE),
+			   mem_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of hugetlbfs file failed");
 
+	/*
+	 * Only hugetlb needs to cache ptr<->offset because it needs to
+	 * fallocate(FALLOC_FL_PUNCH_HOLE).
+	 */
+	if (is_src)
+		mem_fd_src_offset = offset;
+	else
+		mem_fd_dst_offset = offset;
+
 	if (map_shared) {
-		area_alias = mmap(NULL,
-				  nr_pages * page_size,
-				  PROT_READ | PROT_WRITE,
-				  MAP_SHARED,
-				  huge_fd,
-				  is_src ? 0 : nr_pages * page_size);
+		area_alias = mmap(NULL, size, PROT_READ | PROT_WRITE,
+				  MAP_SHARED, mem_fd, offset);
 		if (area_alias == MAP_FAILED)
 			err("mmap of hugetlb file alias failed");
 	}
@@ -334,14 +346,14 @@ static void shmem_allocate_area(void **alloc_area, bool is_src)
 	}
 
 	*alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
-			   shm_fd, offset);
+			   mem_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of memfd failed");
 	if (test_collapse && *alloc_area != p)
 		err("mmap of memfd failed at %p", p);
 
 	area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
-			  shm_fd, offset);
+			  mem_fd, offset);
 	if (area_alias == MAP_FAILED)
 		err("mmap of memfd alias failed");
 	if (test_collapse && area_alias != p_alias)
@@ -1629,7 +1641,7 @@ static int userfaultfd_stress(void)
 
 		/* prepare next bounce */
 		swap(area_src, area_dst);
-
+		swap(mem_fd_src_offset, mem_fd_dst_offset);
 		swap(area_src_alias, area_dst_alias);
 
 		uffd_stats_report(uffd_stats, nr_cpus);
@@ -1821,21 +1833,17 @@ int main(int argc, char **argv)
 	}
 	nr_pages = nr_pages_per_cpu * nr_cpus;
 
-	if (test_type == TEST_HUGETLB && map_shared) {
-		if (argc < 5)
-			usage();
-		huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
-		if (huge_fd < 0)
-			err("Open of %s failed", argv[4]);
-		if (ftruncate(huge_fd, 0))
-			err("ftruncate %s to size 0 failed", argv[4]);
-	} else if (test_type == TEST_SHMEM) {
-		shm_fd = memfd_create(argv[0], 0);
-		if (shm_fd < 0)
+	if (test_type == TEST_SHMEM || test_type == TEST_HUGETLB) {
+		unsigned int memfd_flags = 0;
+
+		if (test_type == TEST_HUGETLB)
+			memfd_flags = MFD_HUGETLB;
+		mem_fd = memfd_create(argv[0], memfd_flags);
+		if (mem_fd < 0)
 			err("memfd_create");
-		if (ftruncate(shm_fd, nr_pages * page_size * 2))
+		if (ftruncate(mem_fd, nr_pages * page_size * 2))
 			err("ftruncate");
-		if (fallocate(shm_fd,
+		if (fallocate(mem_fd,
 			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
 			      nr_pages * page_size * 2))
 			err("fallocate");
-- 
2.32.0

--6BJDNHv8LiVvXy5W--
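
With the second patch applied, both hugetlb modes of the selftest should
run directly on the memfd without the old hugetlbfs path argument, e.g.
(an illustrative invocation - the area size and bounce count are
arbitrary, and enough huge pages need to be reserved up front via
/proc/sys/vm/nr_hugepages):

  ./userfaultfd hugetlb 128 32
  ./userfaultfd hugetlb_shared 128 32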