From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 29 Sep 2022 21:35:45 -0400
From: Peter Xu
To: Mike Kravetz
Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, llvm@lists.linux.dev, nathan@kernel.org,
	ndesaulniers@google.com, songmuchun@bytedance.com,
	syzkaller-bugs@googlegroups.com, trix@redhat.com
Subject: Re: [syzbot] general protection fault in PageHeadHuge
References: <7693a84-bdc2-27b5-2695-d0fe8566571f@google.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="6BJDNHv8LiVvXy5W"
Content-Disposition: inline

--6BJDNHv8LiVvXy5W
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Thu, Sep 29, 2022 at 09:29:54PM -0400, Peter Xu wrote:
> Hi, Mike,
> 
> On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> > I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not
> > make any difference.  Suggestions would be appreciated, as
> > cache/tlb/??? flushing issues take me a while to figure out.
> 
> It seems UFFDIO_COPY for hugetlb is the major issue here, in that for
> private mappings we don't insert the page into the page cache.
> 
> That makes some sense, in that e.g. we don't want to allow a private
> mapping to write to the page cache.  But afaict it's not correct,
> because UFFDIO_COPY resolves page faults exactly at the page cache
> layer for file-backed memories.  So what we should do is insert the
> page into the page cache but map it read-only, waiting for a coming
> CoW if needed.
> 
> I'll attach one fix that starts to insert the page into the page cache
> for UFFDIO_COPY+hugetlb even if the mapping is private.  A second, test
> patch is also attached, because otherwise the private hugetlb selftest
> won't work after the fix is applied - in the selftest we used to use
> MADV_DONTNEED to drop the private mapping, but IMHO that's not enough;
> after the fix we need to drop the page cache too.
> 
> Feel free to try with the two patches applied.  The current issue
> stops reproducing for me with them.
> 
> I haven't posted them yet because after applying the two patches I
> found other issues - the reserved pages are messed up and leaked.  I'll
> keep looking at the leak tomorrow, but please also let me know if you
> spot anything suspicious, as you're definitely much more fluent with
> the reservation code.
> 
> And that's not the only issue I found - shmem can have other issues
> regarding private mappings; shmem gets the page cache insertion right
> but not the rest, I think..  I'll look into them one by one.  It's
> quite interesting to dig multiple things out of the write check
> symptoms..
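For reference, the problematic sequence in the 1st patch's commit log maps
onto the uffd API roughly as below.  This is a condensed, hypothetical
sketch rather than the selftest itself: error handling, feature/ioctl
checks and the separate fault-reading thread are all omitted, and a 2MB
hugepage size is assumed.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;	/* assumes one 2M hugepage */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	/* (1) map hugetlb privately... */
	void *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		/* ...and register with uffd missing+wp */
		.mode = UFFDIO_REGISTER_MODE_MISSING |
			UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_copy copy = {
		.dst = (unsigned long)area,
		.src = (unsigned long)src,
		.len = len,
		/* keep the mapping wr-protected after the copy */
		.mode = UFFDIO_COPY_MODE_WP,
	};

	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	memset(src, 1, len);

	/*
	 * (2) a read of *area (normally from another thread) raises a
	 * MISSING fault; (3) the handler resolves it with UFFDIO_COPY
	 * while keeping the page wr-protected:
	 */
	ioctl(uffd, UFFDIO_COPY, &copy);

	/*
	 * (4) a write to *area should now CoW from the page cache copy;
	 * without the fix, a private hugetlb VMA raised MISSING again
	 * here because UFFDIO_COPY left the page cache empty.
	 */
	return 0;
}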
The patches..

-- 
Peter Xu

--6BJDNHv8LiVvXy5W
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename="0001-mm-hugetlb-Insert-page-cache-for-UFFDIO_COPY-even-if.patch"

>From 1255fabde4f950bef4751ef337791b93f5373a1f Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Wed, 28 Sep 2022 17:44:00 -0400
Subject: [PATCH 1/2] mm/hugetlb: Insert page cache for UFFDIO_COPY even if
 private

UFFDIO_COPY resolves page faults at the page cache layer for file-backed
memories on shmem and hugetlbfs.  That means for each UFFDIO_COPY we
should insert the new page into the page cache, no matter whether the
mapping is private or shared.

We used to not do that, probably because for private mappings we should
not allow the page cache to be written by the privately mapped process.
However that can be achieved by clearing the write bit (as this patch
does) so that CoW will trigger properly upon the page cache.

Leaving the page cache empty could lead to the sequence below:

  (1) map hugetlb privately, register with uffd missing+wp
  (2) read page, trigger MISSING event with READ
  (3) UFFDIO_COPY(wp=1) resolves the page fault, keeps it wr-protected
  (4) write page, trigger MISSING event again (because the page cache
      is still missing!) with WRITE

This behavior has existed since the initial commit of hugetlb MISSING
mode support, which is commit 1c9e8def43a3 ("userfaultfd: hugetlbfs: add
UFFDIO_COPY support for shared mappings", 2017-02-22).

In most cases it should be fine as long as the hugetlb page/pte stays
stable (e.g., no wr-protect, no MADV_DONTNEED, ...).  However if for any
reason a further page fault is triggered, it could cause issues.
Recently, due to the newly introduced uffd-wp on hugetlbfs and also a
recent locking rework from Mike, we can easily fail the userfaultfd
kselftest with hugetlb private mappings.

One further step would be to do early CoW if the private mapping is
writable, but let's leave that for later.

Cc: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Andrew Morton
Signed-off-by: Peter Xu
---
 mm/hugetlb.c | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9679fe519b90..a43fc6852f27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5933,14 +5933,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
-	bool page_in_pagecache = false;
 
 	if (is_continue) {
 		ret = -EFAULT;
 		page = find_lock_page(mapping, idx);
 		if (!page)
 			goto out;
-		page_in_pagecache = true;
 	} else if (!*pagep) {
 		/* If a page already exists, then it's UFFDIO_COPY for
 		 * a non-missing case. Return -EEXIST.
@@ -6014,8 +6012,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 */
 	__SetPageUptodate(page);
 
-	/* Add shared, newly allocated pages to the page cache. */
-	if (vm_shared && !is_continue) {
+	/* Add newly allocated pages to the page cache for UFFDIO_COPY. */
+	if (!is_continue) {
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		ret = -EFAULT;
 		if (idx >= size)
@@ -6030,7 +6028,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		ret = hugetlb_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
-		page_in_pagecache = true;
 	}
 
 	ptl = huge_pte_lock(h, dst_mm, dst_pte);
@@ -6044,18 +6041,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
-	if (page_in_pagecache) {
-		page_dup_file_rmap(page, true);
-	} else {
-		ClearHPageRestoreReserve(page);
-		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
-	}
+	page_dup_file_rmap(page, true);
 
 	/*
-	 * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
-	 * with wp flag set, don't set pte write bit.
+	 * For either: (1) a non-shared VMA, or (2) UFFDIO_COPY with wp
+	 * flag set, don't set pte write bit.
 	 */
-	if (wp_copy || (is_continue && !vm_shared))
+	if (wp_copy || !vm_shared)
 		writable = 0;
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
@@ -6083,18 +6075,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	spin_unlock(ptl);
 	if (!is_continue)
 		SetHPageMigratable(page);
-	if (vm_shared || is_continue)
-		unlock_page(page);
+	unlock_page(page);
 	ret = 0;
 out:
 	return ret;
 out_release_unlock:
 	spin_unlock(ptl);
-	if (vm_shared || is_continue)
-		unlock_page(page);
+	unlock_page(page);
 out_release_nounlock:
-	if (!page_in_pagecache)
-		restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
 }
-- 
2.32.0

--6BJDNHv8LiVvXy5W
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename="0002-selftests-vm-Use-memfd-for-hugetlb-tests.patch"

>From 046ba2d1f5a74d86c6546a4f1bc8081f7c0fd3f8 Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Thu, 29 Sep 2022 11:55:26 -0400
Subject: [PATCH 2/2] selftests/vm: Use memfd for hugetlb tests

We already use memfd for the shmem test; move hugetlb over to it too, so
that the user no longer needs to specify a hugetlbfs file path explicitly
when running the hugetlb shared tests.

More importantly, this patch starts to correctly release hugetlb pages
using fallocate(FALLOC_FL_PUNCH_HOLE) even for private mappings, because
for private mappings MADV_DONTNEED may not be enough to test UFFDIO_COPY:
it only zaps the pgtables but does not evict the page cache.

Unfortunately we need to cache the ptr<->offset relationship for hugetlb
(by adding mem_fd_[src|dst]_offset) in order to use fallocate(), because
that's the only way to evict the page cache for private mappings; IOW,
MADV_REMOVE doesn't work.

Signed-off-by: Peter Xu
---
 tools/testing/selftests/vm/userfaultfd.c | 98 +++++++++++++-----------
 1 file changed, 53 insertions(+), 45 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 74babdbc02e5..4561e9d443e4 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -93,10 +93,11 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
-
 static bool map_shared;
-static int shm_fd;
-static int huge_fd;
+static int mem_fd;
+/* File offset for area_src/area_dst upon mem_fd.  Needed for hole punching */
+static off_t mem_fd_src_offset;
+static off_t mem_fd_dst_offset;
 static unsigned long long *count_verify;
 static int uffd = -1;
 static int uffd_flags, finished, *pipefd;
@@ -247,48 +248,59 @@ static void noop_alias_mapping(__u64 *start, size_t len,
 			       unsigned long offset)
 {
 }
 
+static off_t ptr_to_offset(char *ptr)
+{
+	if (ptr == area_src)
+		return mem_fd_src_offset;
+	else
+		return mem_fd_dst_offset;
+}
+
 static void hugetlb_release_pages(char *rel_area)
 {
+	off_t size = nr_pages * page_size, offset = ptr_to_offset(rel_area);
+
 	if (!map_shared) {
 		if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED))
 			err("madvise(MADV_DONTNEED) failed");
-	} else {
-		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
-			err("madvise(MADV_REMOVE) failed");
-	}
+
+		/* Evict page cache explicitly for private mappings */
+		if (fallocate(mem_fd,
+			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+			      offset, size))
+			err("fallocate(FALLOC_FL_PUNCH_HOLE) failed");
+	} else {
+		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
+			err("madvise(MADV_REMOVE) failed");
+	}
 }
 
 static void hugetlb_allocate_area(void **alloc_area, bool is_src)
 {
+	off_t size = nr_pages * page_size;
+	off_t offset = is_src ? 0 : size;
 	void *area_alias = NULL;
 	char **alloc_area_alias;
 
-	if (!map_shared)
-		*alloc_area = mmap(NULL,
-			nr_pages * page_size,
-			PROT_READ | PROT_WRITE,
-			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
-				(is_src ? 0 : MAP_NORESERVE),
-			-1,
-			0);
-	else
-		*alloc_area = mmap(NULL,
-			nr_pages * page_size,
-			PROT_READ | PROT_WRITE,
-			MAP_SHARED |
-				(is_src ? 0 : MAP_NORESERVE),
-			huge_fd,
-			is_src ? 0 : nr_pages * page_size);
+	*alloc_area = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			   (map_shared ? MAP_SHARED : MAP_PRIVATE) |
+			   (is_src ? 0 : MAP_NORESERVE),
+			   mem_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of hugetlbfs file failed");
 
+	/*
+	 * Only hugetlb needs to cache ptr<->offset because it needs to
+	 * fallocate(FALLOC_FL_PUNCH_HOLE).
+	 */
+	if (is_src)
+		mem_fd_src_offset = offset;
+	else
+		mem_fd_dst_offset = offset;
+
 	if (map_shared) {
-		area_alias = mmap(NULL,
-			nr_pages * page_size,
-			PROT_READ | PROT_WRITE,
-			MAP_SHARED,
-			huge_fd,
-			is_src ? 0 : nr_pages * page_size);
+		area_alias = mmap(NULL, size, PROT_READ | PROT_WRITE,
+				  MAP_SHARED, mem_fd, offset);
 		if (area_alias == MAP_FAILED)
 			err("mmap of hugetlb file alias failed");
 	}
@@ -334,14 +346,14 @@ static void shmem_allocate_area(void **alloc_area, bool is_src)
 	}
 
 	*alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
-			   shm_fd, offset);
+			   mem_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of memfd failed");
 	if (test_collapse && *alloc_area != p)
 		err("mmap of memfd failed at %p", p);
 
 	area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
-			  shm_fd, offset);
+			  mem_fd, offset);
 	if (area_alias == MAP_FAILED)
 		err("mmap of memfd alias failed");
 	if (test_collapse && area_alias != p_alias)
@@ -1629,7 +1641,7 @@ static int userfaultfd_stress(void)
 
 		/* prepare next bounce */
 		swap(area_src, area_dst);
-
+		swap(mem_fd_src_offset, mem_fd_dst_offset);
 		swap(area_src_alias, area_dst_alias);
 
 		uffd_stats_report(uffd_stats, nr_cpus);
@@ -1821,21 +1833,17 @@ int main(int argc, char **argv)
 	}
 	nr_pages = nr_pages_per_cpu * nr_cpus;
 
-	if (test_type == TEST_HUGETLB && map_shared) {
-		if (argc < 5)
-			usage();
-		huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
-		if (huge_fd < 0)
-			err("Open of %s failed", argv[4]);
-		if (ftruncate(huge_fd, 0))
-			err("ftruncate %s to size 0 failed", argv[4]);
-	} else if (test_type == TEST_SHMEM) {
-		shm_fd = memfd_create(argv[0], 0);
-		if (shm_fd < 0)
+	if (test_type == TEST_SHMEM || test_type == TEST_HUGETLB) {
+		unsigned int memfd_flags = 0;
+
+		if (test_type == TEST_HUGETLB)
+			memfd_flags = MFD_HUGETLB;
+		mem_fd = memfd_create(argv[0], memfd_flags);
+		if (mem_fd < 0)
 			err("memfd_create");
-		if (ftruncate(shm_fd, nr_pages * page_size * 2))
+		if (ftruncate(mem_fd, nr_pages * page_size * 2))
 			err("ftruncate");
-		if (fallocate(shm_fd,
+		if (fallocate(mem_fd,
 			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
 			      nr_pages * page_size * 2))
 			err("fallocate");
-- 
2.32.0

--6BJDNHv8LiVvXy5W--