* [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW @ 2022-08-08 7:32 David Hildenbrand 2022-08-08 16:02 ` David Hildenbrand ` (4 more replies) 0 siblings, 5 replies; 33+ messages in thread From: David Hildenbrand @ 2022-08-08 7:32 UTC (permalink / raw) To: linux-kernel Cc: linux-mm, David Hildenbrand, stable, Linus Torvalds, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know that FOLL_FORCE can be possibly dangerous, especially if there are races that can be exploited by user space. Right now, it would be sufficient to have some code that sets a PTE of a R/O-mapped shared page dirty, in order for it to erroneously become writable by FOLL_FORCE. The implications of setting a write-protected PTE dirty might not be immediately obvious to everyone. And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map a shmem page R/O while marking the pte dirty. This can be used by unprivileged user space to modify tmpfs/shmem file content even if the user does not have write permissions to the file -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590). To fix such security issues for good, the insight is that we really only need that fancy retry logic (FOLL_COW) for COW mappings that are not writable (!VM_WRITE). And in a COW mapping, we really only broke COW if we have an exclusive anonymous page mapped. If we have something else mapped, or the mapped anonymous page might be shared (!PageAnonExclusive), we have to trigger a write fault to break COW. If we don't find an exclusive anonymous page when we retry, we have to trigger COW breaking once again because something intervened. 
Let's move away from this mandatory-retry + dirty handling and rely on our PageAnonExclusive() flag for making a similar decision, to use the same COW logic as in other kernel parts here as well. In case we stumble over a PTE in a COW mapping that does not map an exclusive anonymous page, COW was not properly broken and we have to trigger a fake write-fault to break COW. Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection") and commit 76aefad628aa ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take care of softdirty and uffd-wp manually. For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler. Note that FOLL_FORCE is not only used for debug access, but also triggered by applications without debug intentions, for example, when pinning pages via RDMA. This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR. Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So let's just get rid of it. Note 1: We don't check for the PTE being dirty because it doesn't matter for making a "was COWed" decision anymore, and whoever modifies the page has to set the page dirty either way. Note 2: Kernels before extended uffd-wp support and before PageAnonExclusive (< 5.19) can simply revert the problematic commit instead and be safe regarding UFFDIO_CONTINUE. A backport to v5.19 requires minor adjustments due to lack of vma_soft_dirty_enabled(). 
Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte") Cc: <stable@vger.kernel.org> # 5.16+ Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- Against upstream from yesterday instead of v5.19 because I wanted to reference the mprotect commit IDs and can_change_pte_writable(), and I wanted to directly use vma_soft_dirty_enabled(). I have a working reproducer that I'll post to oss-security in one week. Of course, that reproducer no longer triggers with that commit and my ptrace testing indicated that FOLL_FORCE seems to continue working as expected. 
--- include/linux/mm.h | 1 - mm/gup.c | 62 +++++++++++++++++++++++++++++----------------- mm/huge_memory.c | 45 +++++++++++++++++---------------- 3 files changed, 63 insertions(+), 45 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 18e01474cf6b..2222ed598112 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2885,7 +2885,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */ #define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ #define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ -#define FOLL_COW 0x4000 /* internal GUP flag */ #define FOLL_ANON 0x8000 /* don't do file mappings */ #define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */ #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */ diff --git a/mm/gup.c b/mm/gup.c index 732825157430..7a0b207f566f 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -478,14 +478,34 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, return -EEXIST; } -/* - * FOLL_FORCE can write to even unwritable pte's, but only - * after we've gone through a COW cycle and they are dirty. - */ -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags) -{ - return pte_write(pte) || - ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte)); +/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */ +static inline bool can_follow_write_pte(pte_t pte, struct page *page, + struct vm_area_struct *vma, + unsigned int flags) +{ + if (pte_write(pte)) + return true; + if (!(flags & FOLL_FORCE)) + return false; + + /* + * See check_vma_flags(): only COW mappings need that special + * "force" handling when they lack VM_WRITE. 
+ */ + if (vma->vm_flags & VM_WRITE) + return false; + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); + + /* + * See can_change_pte_writable(): we broke COW and could map the page + * writable if we have an exclusive anonymous page and a write-fault + * isn't required for other reasons. + */ + if (!page || !PageAnon(page) || !PageAnonExclusive(page)) + return false; + if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) + return false; + return !userfaultfd_pte_wp(vma, pte); } static struct page *follow_page_pte(struct vm_area_struct *vma, @@ -528,12 +548,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } if ((flags & FOLL_NUMA) && pte_protnone(pte)) goto no_page; - if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) { - pte_unmap_unlock(ptep, ptl); - return NULL; - } page = vm_normal_page(vma, address, pte); + + /* + * We only care about anon pages in can_follow_write_pte() and don't + * have to worry about pte_devmap() because they are never anon. + */ + if ((flags & FOLL_WRITE) && + !can_follow_write_pte(pte, page, vma, flags)) { + page = NULL; + goto out; + } + if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) { /* * Only return device mapping pages in the FOLL_GET or FOLL_PIN @@ -986,17 +1013,6 @@ static int faultin_page(struct vm_area_struct *vma, return -EBUSY; } - /* - * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when - * necessary, even if maybe_mkwrite decided not to set pte_write. We - * can thus safely do subsequent page lookups as if they were reads. - * But only do so when looping for pte_write is futile: in some cases - * userspace may also be wanting to write to the gotten user page, - * which a read fault here might prevent (a readonly page might get - * reCOWed by userspace write).
- */ - if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE)) - *flags |= FOLL_COW; return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8a7c1b344abe..352b5220e95e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1040,12 +1040,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, assert_spin_locked(pmd_lockptr(mm, pmd)); - /* - * When we COW a devmap PMD entry, we split it into PTEs, so we should - * not be in this function with `flags & FOLL_COW` set. - */ - WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set"); - /* FOLL_GET and FOLL_PIN are mutually exclusive. */ if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == (FOLL_PIN | FOLL_GET))) @@ -1395,14 +1389,23 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) return VM_FAULT_FALLBACK; } -/* - * FOLL_FORCE can write to even unwritable pmd's, but only - * after we've gone through a COW cycle and they are dirty. - */ -static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags) +/* See can_follow_write_pte() on FOLL_FORCE details. 
*/ +static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page, + struct vm_area_struct *vma, + unsigned int flags) { - return pmd_write(pmd) || - ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd)); + if (pmd_write(pmd)) + return true; + if (!(flags & FOLL_FORCE)) + return false; + if (vma->vm_flags & VM_WRITE) + return false; + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); + if (!page || !PageAnon(page) || !PageAnonExclusive(page)) + return false; + if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd)) + return false; + return !userfaultfd_huge_pmd_wp(vma, pmd); } struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, @@ -1411,12 +1414,16 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, unsigned int flags) { struct mm_struct *mm = vma->vm_mm; - struct page *page = NULL; + struct page *page; assert_spin_locked(pmd_lockptr(mm, pmd)); - if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags)) - goto out; + page = pmd_page(*pmd); + VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page); + + if ((flags & FOLL_WRITE) && + !can_follow_write_pmd(*pmd, page, vma, flags)) + return NULL; /* Avoid dumping huge zero page */ if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd)) @@ -1424,10 +1431,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, /* Full NUMA hinting faults to serialise migration in fault paths */ if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) - goto out; - - page = pmd_page(*pmd); - VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page); + return NULL; if (!pmd_write(*pmd) && gup_must_unshare(flags, page)) return ERR_PTR(-EMLINK); @@ -1444,7 +1448,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page); -out: return page; } base-commit: 1612c382ffbdf1f673caec76502b1c00e6d35363 -- 2.35.3 ^ permalink raw reply related [flat|nested] 33+ 
messages in thread
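[Editor's sketch] The write-permission policy the patch above introduces can be modeled as a small userspace predicate. This is an illustrative sketch only: the struct fields below are plain booleans standing in for the real pte/vma/page helpers (pte_write(), PageAnonExclusive(), vma_soft_dirty_enabled(), userfaultfd_pte_wp()), not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for kernel state; names are hypothetical. */
struct pte  { bool write, soft_dirty, uffd_wp; };
struct vma  { bool vm_write, soft_dirty_enabled; };
struct page { bool anon, anon_exclusive; };

/*
 * Model of the new can_follow_write_pte() decision: FOLL_FORCE may
 * write through a read-only PTE only in a !VM_WRITE COW mapping, and
 * only if COW was already properly broken (an exclusive anonymous
 * page is mapped) and neither softdirty tracking nor uffd-wp would
 * be silently bypassed.
 */
static bool can_follow_write(struct pte pte, const struct page *page,
                             struct vma vma, bool foll_force)
{
    if (pte.write)
        return true;
    if (!foll_force)
        return false;
    if (vma.vm_write)
        return false;    /* writable mappings don't need "force" */
    if (!page || !page->anon || !page->anon_exclusive)
        return false;    /* COW not (or no longer) broken: fault again */
    if (vma.soft_dirty_enabled && !pte.soft_dirty)
        return false;    /* don't lose a softdirty notification */
    return !pte.uffd_wp; /* never bypass the userfaultfd-wp handler */
}
```

Note that pte_dirty() appears nowhere in this decision anymore, which is exactly what closes the UFFDIO_CONTINUE hole described in the commit message.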
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-08 7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand @ 2022-08-08 16:02 ` David Hildenbrand 2022-08-09 18:27 ` Linus Torvalds ` (3 subsequent siblings) 4 siblings, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2022-08-08 16:02 UTC (permalink / raw) To: linux-kernel Cc: linux-mm, stable, Linus Torvalds, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe, Nadav Amit On 08.08.22 09:32, David Hildenbrand wrote: > Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know > that FOLL_FORCE can be possibly dangerous, especially if there are races > that can be exploited by user space. > > Right now, it would be sufficient to have some code that sets a PTE of > a R/O-mapped shared page dirty, in order for it to erroneously become > writable by FOLL_FORCE. The implications of setting a write-protected PTE > dirty might not be immediately obvious to everyone. > > And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set > pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map > a shmem page R/O while marking the pte dirty. This can be used by > unprivileged user space to modify tmpfs/shmem file content even if the user > does not have write permissions to the file -- Dirty COW restricted to > tmpfs/shmem (CVE-2022-2590). > > To fix such security issues for good, the insight is that we really only > need that fancy retry logic (FOLL_COW) for COW mappings that are not > writable (!VM_WRITE). And in a COW mapping, we really only broke COW if > we have an exclusive anonymous page mapped. If we have something else > mapped, or the mapped anonymous page might be shared (!PageAnonExclusive), > we have to trigger a write fault to break COW. 
If we don't find an > exclusive anonymous page when we retry, we have to trigger COW breaking > once again because something intervened. > > Let's move away from this mandatory-retry + dirty handling and rely on > our PageAnonExclusive() flag for making a similar decision, to use the > same COW logic as in other kernel parts here as well. In case we stumble > over a PTE in a COW mapping that does not map an exclusive anonymous page, > COW was not properly broken and we have to trigger a fake write-fault to > break COW. > > Just like we do in can_change_pte_writable() added via > commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive > anonymous pages when changing protection") and commit 76aefad628aa > ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take > care of softdirty and uffd-wp manually. > > For example, a write() via /proc/self/mem to a uffd-wp-protected range has > to fail instead of silently granting write access and bypassing the > userspace fault handler. Note that FOLL_FORCE is not only used for debug > access, but also triggered by applications without debug intentions, for > example, when pinning pages via RDMA. > > This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are > affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR. > > Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So > let's just get rid of it. I have to add here: "Thanks to Nadav Amit for pointing out that the pte_dirty() check in FOLL_FORCE code is problematic and might be exploitable." -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-08 7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand 2022-08-08 16:02 ` David Hildenbrand @ 2022-08-09 18:27 ` Linus Torvalds 2022-08-09 18:45 ` David Hildenbrand 2022-08-09 18:40 ` [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW Linus Torvalds ` (2 subsequent siblings) 4 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 18:27 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe I'm still reading through this, but STOP DOING THIS On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > > + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); Using BUG_ON() for debugging is simply not ok. And saying "but it's just a VM_BUG_ON()" does not change *anything*. At least Fedora enables that unconditionally for normal people, it is not some kind of "only VM people do this". Really. BUG_ON() IS NOT FOR DEBUGGING. Stop it. Now. If you have a condition that must not happen, you either write that condition into the code, or - if you are convinced it cannot happen - you make it a WARN_ON_ONCE() so that people can report it to you. The BUG_ON() will just make the machine die. And for the facebooks and googles of the world, the WARN_ON() will be sufficient. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 18:27 ` Linus Torvalds @ 2022-08-09 18:45 ` David Hildenbrand 2022-08-09 18:59 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: David Hildenbrand @ 2022-08-09 18:45 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On 09.08.22 20:27, Linus Torvalds wrote: > I'm still reading through this, but > > STOP DOING THIS > > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >> >> + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); > > Using BUG_ON() for debugging is simply not ok. > > And saying "but it's just a VM_BUG_ON()" does not change *anything*. > At least Fedora enables that unconditionally for normal people, it is > not some kind of "only VM people do this". > > Really. BUG_ON() IS NOT FOR DEBUGGING. I totally agree with BUG_ON ... but if I get talked to in all-caps on a Thursday evening and feel like I just touched the forbidden fruit, I have to ask for details about VM_BUG_ON nowadays. VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some kind of debugging at least to me. I *know* that Fedora enables it and I *know* that this will make Fedora crash. I know why Fedora enables this debug option, but it somewhat destroys the whole purpose of VM_BUG_ON kind of nowadays? For this case, this condition will never trigger and I consider it much more a hint to the reader that we can rest assured that this condition holds. And on production systems, it will get optimized out. Should we forbid any new usage of VM_BUG_ON just like we mostly do with BUG_ON? > > Stop it. Now.
> > If you have a condition that must not happen, you either write that > condition into the code, or - if you are convinced it cannot happen - > you make it a WARN_ON_ONCE() so that people can report it to you. I can just turn that into a WARN_ON_ONCE() or even a VM_WARN_ON_ONCE(). -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 18:45 ` David Hildenbrand @ 2022-08-09 18:59 ` Linus Torvalds 2022-08-09 19:07 ` Jason Gunthorpe 2022-08-11 7:13 ` [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() Ingo Molnar 0 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 18:59 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On Tue, Aug 9, 2022 at 11:45 AM David Hildenbrand <david@redhat.com> wrote: > > I totally agree with BUG_ON ... but if I get talked to in all-caps on a > Thursday evening and feel like I just touched the forbidden fruit, I > have to ask for details about VM_BUG_ON nowadays. > > VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some > kind of debugging at least to me. I *know* that Fedora enables it and I > *know* that this will make Fedora crash. No. VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally no different, the only difference is "we can make the code smaller because these are less important". The only possible case where BUG_ON can validly be used is "I have some fundamental data corruption and cannot possibly return an error". This kind of "I don't think this can happen" is _never_ an excuse for it. Honestly, 99% of our existing BUG_ON() ones are completely bogus, and left-over debug code that wasn't removed because they never triggered. I've several times considered just using a coccinelle script to remove every single BUG_ON() (and VM_BUG_ON()) as simply bogus. Because they are pure noise. I just tried to find a valid BUG_ON() that would make me go "yeah, that's actually worth it", and couldn't really find one. 
Yeah, there are several ones in the scheduler that make me go "ok, if that triggers, the machine is dead anyway", so in that sense there are certainly BUG_ON()s that don't _hurt_. But as a very good approximation, the rule is "absolutely no new BUG_ON() calls _ever_". Because I really cannot see a single case where "proper error handling and WARN_ON_ONCE()" isn't the right thing. Now, that said, there is one very valid sub-form of BUG_ON(): BUILD_BUG_ON() is absolutely 100% fine. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
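[Editor's sketch] The pattern Linus asks for — warn once, then recover, instead of killing the machine — can be illustrated with a tiny userspace imitation of WARN_ON_ONCE(). This is a hedged sketch using a GCC/Clang statement expression, not the kernel's actual macro; `do_work()` is a made-up caller.

```c
#include <stdio.h>

/*
 * Minimal userspace imitation of the kernel's WARN_ON_ONCE():
 * evaluates the condition, prints a warning the first time it is
 * true, and hands the condition back so the caller can recover.
 */
#define WARN_ON_ONCE(cond) ({                           \
    static int __warned;                                \
    int __cond = !!(cond);                              \
    if (__cond && !__warned) {                          \
        __warned = 1;                                   \
        fprintf(stderr, "WARNING: %s:%d: %s\n",         \
                __FILE__, __LINE__, #cond);             \
    }                                                   \
    __cond;                                             \
})

/*
 * A "can't happen" assertion with proper error handling: instead of
 * BUG_ON(!p) taking the whole machine down, the caller gets -1 back
 * and a single report lands in the log for someone to act on.
 */
static int do_work(const int *p)
{
    if (WARN_ON_ONCE(p == NULL))
        return -1;
    return *p + 1;
}
```

The same shape is what Ingo's follow-up patch later in this thread applies to the scheduler: the WARN_ON() result gates an early return wherever the old BUG_ON() argument had side effects.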
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 18:59 ` Linus Torvalds @ 2022-08-09 19:07 ` Jason Gunthorpe 2022-08-09 19:21 ` Linus Torvalds 2022-08-09 21:16 ` David Laight 2022-08-11 7:13 ` [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() Ingo Molnar 1 sibling, 2 replies; 33+ messages in thread From: Jason Gunthorpe @ 2022-08-09 19:07 UTC (permalink / raw) To: Linus Torvalds Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote: > But as a very good approximation, the rule is "absolutely no new > BUG_ON() calls _ever_". Because I really cannot see a single case > where "proper error handling and WARN_ON_ONCE()" isn't the right > thing. Parallel to this discussion I've had ones where people more or less say Since BUG_ON crashes the machine and Linus says that crashing the machine is bad, WARN_ON will also crash the machine if you set the panic_on_warn parameter, so it is also bad, thus we shouldn't use anything. I've generally maintained that people who set the panic_on_warn *want* these crashes, because that is the entire point of it. So we should use WARN_ON with an error recovery for "can't happen" assertions like these. I think it is what you are saying here. Jason ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 19:07 ` Jason Gunthorpe @ 2022-08-09 19:21 ` Linus Torvalds 2022-08-09 21:16 ` David Laight 1 sibling, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 19:21 UTC (permalink / raw) To: Jason Gunthorpe Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard On Tue, Aug 9, 2022 at 12:09 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > Since BUG_ON crashes the machine and Linus says that crashing the > machine is bad, WARN_ON will also crash the machine if you set the > panic_on_warn parameter, so it is also bad, thus we shouldn't use > anything. If you set 'panic_on_warn' you get to keep both pieces when something breaks. The thing is, there are people who *do* want to stop immediately when something goes wrong in the kernel. Anybody doing large-scale virtualization presumably has all the infrastructure to get debug info out of the virtual environment. And people who run controlled loads in big server machine setups and have a MIS department to manage said machines typically also prefer for a machine to just crash over continuing. So in those situations, a dead machine is still a dead machine, but you get the information out, and panic_on_warn is fine, because panic and reboot is fine. And yes, that's actually a fairly common case. Things like syzkaller etc *wants* to abort on the first warning, because that's kind of the point. But while that kind of virtualized automation machinery is very very common, and is a big deal, it's by no means the only deal, and the most important thing to the point where nothing else matters. And if you are *not* in a farm, and if you are *not* using virtualization, a dead machine is literally a useless brick. Nobody has serial lines on individual machines any more. 
In most cases, the hardware literally doesn't even exist any more. So in that situation, you really cannot afford to take the approach of "just kill the machine". If you are on a laptop and are doing power management code, you generally cannot do that in a virtual environment, and you already have enough problems with suspend and resume being hard to debug, without people also going "oh, let's just BUG_ON() and kill the machine". Because the other side of that "we have a lot of machine farms doing automated testing" is that those machine farms do not generally find a lot of the exciting cases. Almost every single merge window, I end up having to bisect and report an oops or a WARN_ON(), because I actually run on real hardware. And said problem was never seen in linux-next. So we have two very different cases: the "virtual machine with good logging where a dead machine is fine" - use 'panic_on_warn'. And the actual real hardware with real drivers, running real loads by users. Both are valid. But the second case means that BUG_ON() is basically _never_ valid. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 19:07 ` Jason Gunthorpe 2022-08-09 19:21 ` Linus Torvalds @ 2022-08-09 21:16 ` David Laight 1 sibling, 0 replies; 33+ messages in thread From: David Laight @ 2022-08-09 21:16 UTC (permalink / raw) To: 'Jason Gunthorpe', Linus Torvalds Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard From: Jason Gunthorpe > Sent: 09 August 2022 20:08 > > On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote: > > > But as a very good approximation, the rule is "absolutely no new > > BUG_ON() calls _ever_". Because I really cannot see a single case > > where "proper error handling and WARN_ON_ONCE()" isn't the right > > thing. > > Parallel to this discussion I've had ones where people more or less > say > > Since BUG_ON crashes the machine and Linus says that crashing the > machine is bad, WARN_ON will also crash the machine if you set the > panic_on_warn parameter, so it is also bad, thus we shouldn't use > anything. > > I've generally maintained that people who set the panic_on_warn *want* > these crashes, because that is the entire point of it. So we should > use WARN_ON with an error recovery for "can't happen" assertions like > these. I think it is what you are saying here. They don't necessarily want the crashes, it is more the people who built the distribution think they want the crashes. I have had issues with a customer system (with our drivers) randomly locking up. Someone had decided that 'PANIC_ON_OOPS' was a good idea but hadn't enabled anything to actually take the dump. So instead of a diagnosable problem (and a 'doh' moment) you get several weeks of head scratching and a very annoyed user. 
David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() 2022-08-09 18:59 ` Linus Torvalds 2022-08-09 19:07 ` Jason Gunthorpe @ 2022-08-11 7:13 ` Ingo Molnar 2022-08-11 20:43 ` Linus Torvalds 1 sibling, 1 reply; 33+ messages in thread From: Ingo Molnar @ 2022-08-11 7:13 UTC (permalink / raw) To: Linus Torvalds Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman * Linus Torvalds <torvalds@linux-foundation.org> wrote: > I just tried to find a valid BUG_ON() that would make me go "yeah, that's > actually worth it", and couldn't really find one. Yeah, there are several > ones in the scheduler that make me go "ok, if that triggers, the machine > is dead anyway", so in that sense there are certainly BUG_ON()s that > don't _hurt_. That's a mostly accidental, historical accumulation of BUG_ON()s - I believe we can change all of them to WARN_ON() via the patch below. As far as the scheduler is concerned, we don't need any BUG_ON()s. [ This assumes that printk() itself is atomic and non-recursive wrt. the scheduler in these code paths ... ] Thanks, Ingo ===============> From: Ingo Molnar <mingo@kernel.org> Date: Thu, 11 Aug 2022 08:54:52 +0200 Subject: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either. By using a WARN_ON() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs. None of these WARN_ON()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON(). 
There's one exception: WARN_ON() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON(), such as in: - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON(!lock_task_sighand(p, &flags))) + return; Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/autogroup.c | 3 ++- kernel/sched/core.c | 2 +- kernel/sched/cpupri.c | 2 +- kernel/sched/deadline.c | 26 +++++++++++++------------- kernel/sched/fair.c | 10 +++++----- kernel/sched/rt.c | 2 +- kernel/sched/sched.h | 6 +++--- 7 files changed, 26 insertions(+), 25 deletions(-) diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c index 4ebaf97f7bd8..13f6b6da35a0 100644 --- a/kernel/sched/autogroup.c +++ b/kernel/sched/autogroup.c @@ -161,7 +161,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag) struct task_struct *t; unsigned long flags; - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON(!lock_task_sighand(p, &flags))) + return; prev = p->signal->autogroup; if (prev == ag) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d3d61cbb6b3c..f84206bf42cd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2328,7 +2328,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, rq = cpu_rq(new_cpu); rq_lock(rq, rf); - BUG_ON(task_cpu(p) != new_cpu); + WARN_ON(task_cpu(p) != new_cpu); activate_task(rq, p, 0); check_preempt_curr(rq, p, 0); diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index fa9ce9d83683..9f719e4ea081 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -147,7 +147,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, int task_pri = convert_prio(p->prio); int idx, cpu; - BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES); + WARN_ON(task_pri >= CPUPRI_NR_PRIORITIES); for (idx = 0; idx < task_pri; idx++) { diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 1d9c90958baa..fb234077c317 100644 --- a/kernel/sched/deadline.c +++ 
b/kernel/sched/deadline.c @@ -310,7 +310,7 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw) { struct rq *rq; - BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV); + WARN_ON(p->dl.flags & SCHED_FLAG_SUGOV); if (task_on_rq_queued(p)) return; @@ -607,7 +607,7 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p) { struct rb_node *leftmost; - BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks)); + WARN_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks)); leftmost = rb_add_cached(&p->pushable_dl_tasks, &rq->dl.pushable_dl_tasks_root, @@ -684,7 +684,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p * Failed to find any suitable CPU. * The task will never come back! */ - BUG_ON(dl_bandwidth_enabled()); + WARN_ON(dl_bandwidth_enabled()); /* * If admission control is disabled we @@ -830,7 +830,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se) struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq); - BUG_ON(pi_of(dl_se)->dl_runtime <= 0); + WARN_ON(pi_of(dl_se)->dl_runtime <= 0); /* * This could be the case for a !-dl task that is boosted. 
@@ -1616,7 +1616,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se) { struct dl_rq *dl_rq = dl_rq_of_se(dl_se); - BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node)); + WARN_ON(!RB_EMPTY_NODE(&dl_se->rb_node)); rb_add_cached(&dl_se->rb_node, &dl_rq->root, __dl_less); @@ -1640,7 +1640,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se) static void enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags) { - BUG_ON(on_dl_rq(dl_se)); + WARN_ON(on_dl_rq(dl_se)); update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags); @@ -2017,7 +2017,7 @@ static struct task_struct *pick_task_dl(struct rq *rq) return NULL; dl_se = pick_next_dl_entity(dl_rq); - BUG_ON(!dl_se); + WARN_ON(!dl_se); p = dl_task_of(dl_se); return p; @@ -2277,12 +2277,12 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq) p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root)); - BUG_ON(rq->cpu != task_cpu(p)); - BUG_ON(task_current(rq, p)); - BUG_ON(p->nr_cpus_allowed <= 1); + WARN_ON(rq->cpu != task_cpu(p)); + WARN_ON(task_current(rq, p)); + WARN_ON(p->nr_cpus_allowed <= 1); - BUG_ON(!task_on_rq_queued(p)); - BUG_ON(!dl_task(p)); + WARN_ON(!task_on_rq_queued(p)); + WARN_ON(!dl_task(p)); return p; } @@ -2492,7 +2492,7 @@ static void set_cpus_allowed_dl(struct task_struct *p, struct root_domain *src_rd; struct rq *rq; - BUG_ON(!dl_task(p)); + WARN_ON(!dl_task(p)); rq = task_rq(p); src_rd = rq->rd; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index da388657d5ac..00c01b3232b9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2600,7 +2600,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, if (!join) return; - BUG_ON(irqs_disabled()); + WARN_ON(irqs_disabled()); double_lock_irq(&my_grp->lock, &grp->lock); for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { @@ -7279,7 +7279,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ return; find_matching_se(&se, &pse); - 
BUG_ON(!pse); + WARN_ON(!pse); cse_is_idle = se_is_idle(se); pse_is_idle = se_is_idle(pse); @@ -8159,7 +8159,7 @@ static void attach_task(struct rq *rq, struct task_struct *p) { lockdep_assert_rq_held(rq); - BUG_ON(task_rq(p) != rq); + WARN_ON(task_rq(p) != rq); activate_task(rq, p, ENQUEUE_NOCLOCK); check_preempt_curr(rq, p, 0); } @@ -10134,7 +10134,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out_balanced; } - BUG_ON(busiest == env.dst_rq); + WARN_ON(busiest == env.dst_rq); schedstat_add(sd->lb_imbalance[idle], env.imbalance); @@ -10430,7 +10430,7 @@ static int active_load_balance_cpu_stop(void *data) * we need to fix it. Originally reported by * Bjorn Helgaas on a 128-CPU setup. */ - BUG_ON(busiest_rq == target_rq); + WARN_ON(busiest_rq == target_rq); /* Search for an sd spanning us and the target CPU. */ rcu_read_lock(); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 054b6711e961..acf9f5ce0c4a 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -843,7 +843,7 @@ static void __disable_runtime(struct rq *rq) * We cannot be left wanting - that would mean some runtime * leaked out of the system. 
*/ - BUG_ON(want); + WARN_ON(want); balanced: /* * Disable all the borrow logic by pretending we have inf diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b0bf2287dd9d..8e5df3bc3483 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2699,8 +2699,8 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2) __acquires(rq1->lock) __acquires(rq2->lock) { - BUG_ON(!irqs_disabled()); - BUG_ON(rq1 != rq2); + WARN_ON(!irqs_disabled()); + WARN_ON(rq1 != rq2); raw_spin_rq_lock(rq1); __acquire(rq2->lock); /* Fake it out ;) */ double_rq_clock_clear_update(rq1, rq2); @@ -2716,7 +2716,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2) __releases(rq1->lock) __releases(rq2->lock) { - BUG_ON(rq1 != rq2); + WARN_ON(rq1 != rq2); raw_spin_rq_unlock(rq1); __release(rq2->lock); } ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() 2022-08-11 7:13 ` [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() Ingo Molnar @ 2022-08-11 20:43 ` Linus Torvalds 2022-08-11 21:28 ` Matthew Wilcox 2022-08-12 9:29 ` [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() Ingo Molnar 0 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2022-08-11 20:43 UTC (permalink / raw) To: Ingo Molnar Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman On Thu, Aug 11, 2022 at 12:13 AM Ingo Molnar <mingo@kernel.org> wrote: > > By using a WARN_ON() we at least give the user a chance to report > any bugs triggered here - instead of getting silent hangs. > > None of these WARN_ON()s are supposed to trigger, ever - so we ignore > cases where a NULL check is done via a BUG_ON() and we let a NULL > pointer through after a WARN_ON(). May I suggest going one step further, and making these WARN_ON_ONCE() instead. From personal experience, once some scheduler bug (or task struct corruption) happens, it often *keeps* happening, and the logs just fill up with more and more data, to the point where you lose sight of the original report (and the machine can even get unusable just from the logging). WARN_ON_ONCE() can help that situation. Now, obviously (a) WARN_ON_ONCE *can* also result in less information, and maybe there are situations where having more - possibly different - cases of the same thing triggering could be useful. (b) WARN_ON_ONCE historically generated a bit bigger code than WARN_ON simply due to the extra "did this already trigger" check. I *think* (b) is no longer true, and it's just a flag these days, but I didn't actually check. 
so it's not like there aren't potential downsides, but in general I think the sanest and most natural thing is to have BUG_ON() translate to WARN_ON_ONCE(). For the "reboot-on-warn" people, it ends up being the same thing. And for the rest of us, the "give me *one* warning" can end up making the reporting a lot easier. Obviously, with the "this never actually happens", the whole "once or many times" is kind of moot. But if it never happens at all, to the point where it doesn't even add a chance of helping debugging, maybe the whole test should be removed entirely... Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() 2022-08-11 20:43 ` Linus Torvalds @ 2022-08-11 21:28 ` Matthew Wilcox 2022-08-11 23:22 ` Jason Gunthorpe 2022-08-12 9:29 ` [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() Ingo Molnar 1 sibling, 1 reply; 33+ messages in thread From: Matthew Wilcox @ 2022-08-11 21:28 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Vlastimil Babka, John Hubbard, Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote: > May I suggest going one step further, and making these WARN_ON_ONCE() instead. > > >From personal experience, once some scheduler bug (or task struct > corruption) happens, ti often *keeps* happening, and the logs just > fill up with more and more data, to the point where you lose sight of > the original report (and the machine can even get unusable just from > the logging). I've been thinking about magically turning all the WARN_ON_ONCE() into (effectively) WARN_ON_RATELIMIT(). I had some patches in that direction a while ago but never got round to tidying them up for submission. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() 2022-08-11 21:28 ` Matthew Wilcox @ 2022-08-11 23:22 ` Jason Gunthorpe 2022-08-14 1:10 ` John Hubbard 0 siblings, 1 reply; 33+ messages in thread From: Jason Gunthorpe @ 2022-08-11 23:22 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Ingo Molnar, David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Vlastimil Babka, John Hubbard, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman On Thu, Aug 11, 2022 at 10:28:27PM +0100, Matthew Wilcox wrote: > On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote: > > May I suggest going one step further, and making these WARN_ON_ONCE() instead. > > > > >From personal experience, once some scheduler bug (or task struct > > corruption) happens, ti often *keeps* happening, and the logs just > > fill up with more and more data, to the point where you lose sight of > > the original report (and the machine can even get unusable just from > > the logging). > > I've been thinking about magically turning all the WARN_ON_ONCE() into > (effectively) WARN_ON_RATELIMIT(). I had some patches in that direction > a while ago but never got round to tidying them up for submission. I often wonder if we have a justification for WARN_ON to even exist, I see a lot of pressure to make things into WARN_ON_ONCE based on the logic that spamming makes it useless.. Maybe a global limit of 10 warn ons per minute or something would be interesting? Jason ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() 2022-08-11 23:22 ` Jason Gunthorpe @ 2022-08-14 1:10 ` John Hubbard 0 siblings, 0 replies; 33+ messages in thread From: John Hubbard @ 2022-08-14 1:10 UTC (permalink / raw) To: Jason Gunthorpe, Matthew Wilcox Cc: Linus Torvalds, Ingo Molnar, David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Vlastimil Babka, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman On 8/11/22 16:22, Jason Gunthorpe wrote: > On Thu, Aug 11, 2022 at 10:28:27PM +0100, Matthew Wilcox wrote: >> On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote: >>> May I suggest going one step further, and making these WARN_ON_ONCE() instead. >>> >>> >From personal experience, once some scheduler bug (or task struct >>> corruption) happens, ti often *keeps* happening, and the logs just >>> fill up with more and more data, to the point where you lose sight of >>> the original report (and the machine can even get unusable just from >>> the logging). >> >> I've been thinking about magically turning all the WARN_ON_ONCE() into >> (effectively) WARN_ON_RATELIMIT(). I had some patches in that direction >> a while ago but never got round to tidying them up for submission. If you do that, I'd like to suggest that you avoid using magic here, but instead just rename at the call sites. Because: First and foremost, something named WARN_ON_ONCE() clearly has a solemn responsibility to warn exactly "once times"! :) Second, it's not yet clear (or is it?) that WARN_ON_ONCE() is always worse than rate limiting. It's a trade-off, rather than a clear win for either case, in my experience. The _ONCE variant can get overwritten if the kernel log wraps, but the _RATELIMIT on the other hand, may be excessive. 
And finally, if it *is* agreed on here that WARN_ON_RATELIMIT() is always better than WARN_ON_ONCE(), then there is still no harm in spending a patch or two (coccinelle...) to rename WARN_ON_ONCE() --> WARN_ON_RATELIMIT(), so that we end up with accurate names. > > I often wonder if we have a justification for WARN_ON to even exist, I > see a lot of pressure to make things into WARN_ON_ONCE based on the > logic that spamming makes it useless.. Agreed. WARN_ON_ONCE() or WARN_ON_RATELIMIT(), take your pick. But not WARN_ON_EVERY_TIME()--that usually causes serious problems in the logs. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() 2022-08-11 20:43 ` Linus Torvalds 2022-08-11 21:28 ` Matthew Wilcox @ 2022-08-12 9:29 ` Ingo Molnar [not found] ` <20220815144143.zjsiamw5y22bvgki@suse.de> 1 sibling, 1 reply; 33+ messages in thread From: Ingo Molnar @ 2022-08-12 9:29 UTC (permalink / raw) To: Linus Torvalds Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Aug 11, 2022 at 12:13 AM Ingo Molnar <mingo@kernel.org> wrote: > > > > By using a WARN_ON() we at least give the user a chance to report > > any bugs triggered here - instead of getting silent hangs. > > > > None of these WARN_ON()s are supposed to trigger, ever - so we ignore > > cases where a NULL check is done via a BUG_ON() and we let a NULL > > pointer through after a WARN_ON(). > > May I suggest going one step further, and making these WARN_ON_ONCE() instead. Ok, agreed. > From personal experience, once some scheduler bug (or task struct > corruption) happens, ti often *keeps* happening, and the logs just fill > up with more and more data, to the point where you lose sight of the > original report (and the machine can even get unusable just from the > logging). > > WARN_ON_ONCE() can help that situation. Yeah, true. > Now, obviously > > (a) WARN_ON_ONCE *can* also result in less information, and maybe there > are situations where having more - possibly different - cases of the same > thing triggering could be useful. 
None of these warnings are supposed to happen absent serious random data corruption, ever™, so if against expectations they do happen once, somewhere, and its brevity makes it hard to figure out, we can still reconsider on a case by case basis & increase verbosity. > (b) WARN_ON_ONCE historically generated a bit bigger code than > WARN_ON simply due to the extra "did this already trigger" check. > > I *think* (b) is no longer true, and it's just a flag these days, but I > didn't actually check. Yeah, so on architectures that implement a smart __WARN_FLAGS() primitive, such as x86, overhead is pretty good: #define __WARN_FLAGS(flags) \ do { \ __auto_type __flags = BUGFLAG_WARNING|(flags); \ instrumentation_begin(); \ _BUG_FLAGS(ASM_UD2, __flags, ASM_REACHABLE); \ instrumentation_end(); \ } while (0) For them the runtime overhead of a WARN_ON_ONCE() is basically just the check itself: # no checks: ffffffff810a3ae0 <check_preempt_curr>: ffffffff810a3ae0: 48 8b 87 20 09 00 00 mov 0x920(%rdi),%rax ffffffff810a3ae7: 48 8b 8e 90 02 00 00 mov 0x290(%rsi),%rcx ffffffff810a3aee: 53 push %rbx ffffffff810a3aef: 48 89 fb mov %rdi,%rbx ffffffff810a3af2: 48 3b 88 90 02 00 00 cmp 0x290(%rax),%rcx # Single-branch WARN_ON_ONCE(): # # void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) # { # + WARN_ON_ONCE(!rq); # + # if (p->sched_class == rq->curr->sched_class) # rq->curr->sched_class->check_preempt_curr(rq, p, flags); # else if (sched_class_above(p->sched_class, rq->curr->sched_class)) ffffffff810a3ae0 <check_preempt_curr>: ffffffff810a3ae0: 53 push %rbx ffffffff810a3ae1: 48 89 fb mov %rdi,%rbx ffffffff810a3ae4: | 48 85 ff test %rdi,%rdi ffffffff810a3ae7: | 74 69 je ffffffff810a3b52 <check_preempt_curr+0x72> ffffffff810a3ae9: 48 8b 83 20 09 00 00 mov 0x920(%rbx),%rax ffffffff810a3af0: 48 8b 8e 90 02 00 00 mov 0x290(%rsi),%rcx ffffffff810a3af7: 48 3b 88 90 02 00 00 cmp 0x290(%rax),%rcx ... 
ffffffff810a3b50: eb d1 jmp ffffffff810a3b23 <check_preempt_curr+0x43> # tail-call ffffffff810a3b52: | 0f 0b ud2 So it's a test instruction and an unlikely-forward-branch, plus a UD2 trap squeezed into the function epilogue, often hidden by alignment holes. As lightweight as it gets, and no in-line penalty from being a 'once' warning. > so it's not like there aren't potential downsides, but in general I think > the sanest and most natural thing is to have BUG_ON() translate to > WARN_ON_ONCE(). > > For the "reboot-on-warn" people, it ends up being the same thing. And for > the rest of us, the "give me *one* warning" can end up making the > reporting a lot easier. Agreed - updated v2 patch attached. > Obviously, with the "this never actually happens", the whole "once or > many times" is kind of moot. But if it never happens at all, to the point > where it doesn't even add a chance of helping debugging, maybe the whole > test should be removed entirely... Yeah. Thanks, Ingo ======================================== From: Ingo Molnar <mingo@kernel.org> Date: Thu, 11 Aug 2022 08:54:52 +0200 Subject: [PATCH] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either. By using a WARN_ON_ONCE() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs. None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON_ONCE(). 
There's one exception: WARN_ON_ONCE() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON_ONCE(), such as in: - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags))) + return; Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com --- kernel/sched/autogroup.c | 3 ++- kernel/sched/core.c | 2 +- kernel/sched/cpupri.c | 2 +- kernel/sched/deadline.c | 26 +++++++++++++------------- kernel/sched/fair.c | 10 +++++----- kernel/sched/rt.c | 2 +- kernel/sched/sched.h | 6 +++--- 7 files changed, 26 insertions(+), 25 deletions(-) diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c index 4ebaf97f7bd8..991fc9002535 100644 --- a/kernel/sched/autogroup.c +++ b/kernel/sched/autogroup.c @@ -161,7 +161,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag) struct task_struct *t; unsigned long flags; - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags))) + return; prev = p->signal->autogroup; if (prev == ag) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ee28253c9ac0..813687a5f5cf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2328,7 +2328,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, rq = cpu_rq(new_cpu); rq_lock(rq, rf); - BUG_ON(task_cpu(p) != new_cpu); + WARN_ON_ONCE(task_cpu(p) != new_cpu); activate_task(rq, p, 0); check_preempt_curr(rq, p, 0); diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index fa9ce9d83683..a286e726eb4b 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -147,7 +147,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, int task_pri = convert_prio(p->prio); int idx, cpu; - BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES); + WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES); for (idx = 0; idx < task_pri; idx++) { 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 0ab79d819a0d..962b169b05cf 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -310,7 +310,7 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw) { struct rq *rq; - BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV); + WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV); if (task_on_rq_queued(p)) return; @@ -607,7 +607,7 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p) { struct rb_node *leftmost; - BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks)); + WARN_ON_ONCE(!RB_EMPTY_NODE(&p->pushable_dl_tasks)); leftmost = rb_add_cached(&p->pushable_dl_tasks, &rq->dl.pushable_dl_tasks_root, @@ -684,7 +684,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p * Failed to find any suitable CPU. * The task will never come back! */ - BUG_ON(dl_bandwidth_enabled()); + WARN_ON_ONCE(dl_bandwidth_enabled()); /* * If admission control is disabled we @@ -830,7 +830,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se) struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq); - BUG_ON(pi_of(dl_se)->dl_runtime <= 0); + WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0); /* * This could be the case for a !-dl task that is boosted. 
@@ -1616,7 +1616,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se) { struct dl_rq *dl_rq = dl_rq_of_se(dl_se); - BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node)); + WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node)); rb_add_cached(&dl_se->rb_node, &dl_rq->root, __dl_less); @@ -1640,7 +1640,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se) static void enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags) { - BUG_ON(on_dl_rq(dl_se)); + WARN_ON_ONCE(on_dl_rq(dl_se)); update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags); @@ -2017,7 +2017,7 @@ static struct task_struct *pick_task_dl(struct rq *rq) return NULL; dl_se = pick_next_dl_entity(dl_rq); - BUG_ON(!dl_se); + WARN_ON_ONCE(!dl_se); p = dl_task_of(dl_se); return p; @@ -2277,12 +2277,12 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq) p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root)); - BUG_ON(rq->cpu != task_cpu(p)); - BUG_ON(task_current(rq, p)); - BUG_ON(p->nr_cpus_allowed <= 1); + WARN_ON_ONCE(rq->cpu != task_cpu(p)); + WARN_ON_ONCE(task_current(rq, p)); + WARN_ON_ONCE(p->nr_cpus_allowed <= 1); - BUG_ON(!task_on_rq_queued(p)); - BUG_ON(!dl_task(p)); + WARN_ON_ONCE(!task_on_rq_queued(p)); + WARN_ON_ONCE(!dl_task(p)); return p; } @@ -2492,7 +2492,7 @@ static void set_cpus_allowed_dl(struct task_struct *p, struct root_domain *src_rd; struct rq *rq; - BUG_ON(!dl_task(p)); + WARN_ON_ONCE(!dl_task(p)); rq = task_rq(p); src_rd = rq->rd; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 914096c5b1ae..28f10dccd194 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2600,7 +2600,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, if (!join) return; - BUG_ON(irqs_disabled()); + WARN_ON_ONCE(irqs_disabled()); double_lock_irq(&my_grp->lock, &grp->lock); for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { @@ -7279,7 +7279,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, 
int wake_ return; find_matching_se(&se, &pse); - BUG_ON(!pse); + WARN_ON_ONCE(!pse); cse_is_idle = se_is_idle(se); pse_is_idle = se_is_idle(pse); @@ -8159,7 +8159,7 @@ static void attach_task(struct rq *rq, struct task_struct *p) { lockdep_assert_rq_held(rq); - BUG_ON(task_rq(p) != rq); + WARN_ON_ONCE(task_rq(p) != rq); activate_task(rq, p, ENQUEUE_NOCLOCK); check_preempt_curr(rq, p, 0); } @@ -10134,7 +10134,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out_balanced; } - BUG_ON(busiest == env.dst_rq); + WARN_ON_ONCE(busiest == env.dst_rq); schedstat_add(sd->lb_imbalance[idle], env.imbalance); @@ -10430,7 +10430,7 @@ static int active_load_balance_cpu_stop(void *data) * we need to fix it. Originally reported by * Bjorn Helgaas on a 128-CPU setup. */ - BUG_ON(busiest_rq == target_rq); + WARN_ON_ONCE(busiest_rq == target_rq); /* Search for an sd spanning us and the target CPU. */ rcu_read_lock(); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 55f39c8f4203..2936fe55cef7 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -843,7 +843,7 @@ static void __disable_runtime(struct rq *rq) * We cannot be left wanting - that would mean some runtime * leaked out of the system. 
*/ - BUG_ON(want); + WARN_ON_ONCE(want); balanced: /* * Disable all the borrow logic by pretending we have inf diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e26688d387ae..7a44dceeb50a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2709,8 +2709,8 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2) __acquires(rq1->lock) __acquires(rq2->lock) { - BUG_ON(!irqs_disabled()); - BUG_ON(rq1 != rq2); + WARN_ON_ONCE(!irqs_disabled()); + WARN_ON_ONCE(rq1 != rq2); raw_spin_rq_lock(rq1); __acquire(rq2->lock); /* Fake it out ;) */ double_rq_clock_clear_update(rq1, rq2); @@ -2726,7 +2726,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2) __releases(rq1->lock) __releases(rq2->lock) { - BUG_ON(rq1 != rq2); + WARN_ON_ONCE(rq1 != rq2); raw_spin_rq_unlock(rq1); __release(rq2->lock); } ^ permalink raw reply related [flat|nested] 33+ messages in thread
[parent not found: <20220815144143.zjsiamw5y22bvgki@suse.de>]
* Re: [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() [not found] ` <20220815144143.zjsiamw5y22bvgki@suse.de> @ 2022-08-15 22:12 ` John Hubbard 2022-08-21 11:28 ` Ingo Molnar 1 sibling, 0 replies; 33+ messages in thread From: John Hubbard @ 2022-08-15 22:12 UTC (permalink / raw) To: Mel Gorman, Ingo Molnar Cc: Linus Torvalds, David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt On 8/15/22 07:41, Mel Gorman wrote: > For the rest, I didn't see obvious recovery paths that would allow the > system to run predictably. Any of them firing will have unpredictable > consequences (e.g. move_queued_task firing would be fun if it was a per-cpu > kthread). Depending on which warning triggers, the remaining life of the > system may be very short but maybe long enough to be logged even if system > locks up shortly afterwards. Hi Mel, Are you basically saying, "WARN_ON*() seems acceptable in mm/, because we can at least get the problem logged before it locks up, probably"? Or are you implying that we should instead introduce and use some new PANIC_ON() or VM_PANIC_ON() set of macros that would allow proper "log then reboot" behavior? thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() [not found] ` <20220815144143.zjsiamw5y22bvgki@suse.de> 2022-08-15 22:12 ` John Hubbard @ 2022-08-21 11:28 ` Ingo Molnar 1 sibling, 0 replies; 33+ messages in thread From: Ingo Molnar @ 2022-08-21 11:28 UTC (permalink / raw) To: Mel Gorman Cc: Linus Torvalds, David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt * Mel Gorman <mgorman@suse.de> wrote: > For the rest, I didn't see obvious recovery paths that would allow the > system to run predictably. Any of them firing will have unpredictable > consequences (e.g. move_queued_task firing would be fun if it was a > per-cpu kthread). Depending on which warning triggers, the remaining life > of the system may be very short but maybe long enough to be logged even > if system locks up shortly afterwards. Correct. I'd prefer to keep all these warnings 'simple' - i.e. no attempted recovery & control flow, unless we ever expect these to trigger. I.e. instead of adding a 'goto' I'd prefer if we removed most of the ones you highlighted. But wanted to keep this first patch simple. Thanks, Ingo ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-08 7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand 2022-08-08 16:02 ` David Hildenbrand 2022-08-09 18:27 ` Linus Torvalds @ 2022-08-09 18:40 ` Linus Torvalds 2022-08-09 18:48 ` Jason Gunthorpe 2022-08-09 18:48 ` Linus Torvalds 2022-08-09 20:00 ` Linus Torvalds 4 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 18:40 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > > For example, a write() via /proc/self/mem to a uffd-wp-protected range has > to fail instead of silently granting write access and bypassing the > userspace fault handler. Note that FOLL_FORCE is not only used for debug > access, but also triggered by applications without debug intentions, for > example, when pinning pages via RDMA. So this made me go "Whaa?" I didn't even realize that the media drivers and rdma used FOLL_FORCE. That's just completely bogus. Why do they do that? It seems to be completely bogus, and seems to have no actual valid reason for it. Looking through the history, it goes back to the original code submission in 2006, and doesn't have a mention of why. I think the original reason was that the code didn't have pinning, so it used "do a write" as a pin mechanism - even for reads. IOW, I think the non-ptrace use of FOLL_FORCE should just be removed. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 18:40 ` [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW Linus Torvalds @ 2022-08-09 18:48 ` Jason Gunthorpe 2022-08-09 18:53 ` David Hildenbrand 2022-08-09 19:07 ` Linus Torvalds 0 siblings, 2 replies; 33+ messages in thread From: Jason Gunthorpe @ 2022-08-09 18:48 UTC (permalink / raw) To: Linus Torvalds Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote: > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > > > > For example, a write() via /proc/self/mem to a uffd-wp-protected range has > > to fail instead of silently granting write access and bypassing the > > userspace fault handler. Note that FOLL_FORCE is not only used for debug > > access, but also triggered by applications without debug intentions, for > > example, when pinning pages via RDMA. > > So this made me go "Whaa?" > > I didn't even realize that the media drivers and rdma used FOLL_FORCE. > > That's just completely bogus. > > Why do they do that? > > It seems to be completely bogus, and seems to have no actual valid > reason for it. Looking through the history, it goes back to the > original code submission in 2006, and doesn't have a mention of why. It is because of all this madness with COW. Lets say an app does: buf = mmap(MAP_PRIVATE) rdma_pin_for_read(buf) buf[0] = 1 Then the store to buf[0] will COW the page and the pin will become decoherent. The purpose of the FORCE is to force COW to happen early so this is avoided. Sadly there are real apps that end up working this way, eg because they are using buffer in .data or something. 
I've hoped David's new work on this provided some kind of path to a proper solution, as the need is very similar to all the other places where we need to ensure there is no possibility of future COW. So, these usages can be interpreted as a FOLL flag we don't have - some kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive sort of idea. Jason ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 18:48 ` Jason Gunthorpe @ 2022-08-09 18:53 ` David Hildenbrand 2022-08-09 19:07 ` Linus Torvalds 1 sibling, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2022-08-09 18:53 UTC (permalink / raw) To: Jason Gunthorpe, Linus Torvalds Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard On 09.08.22 20:48, Jason Gunthorpe wrote: > On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote: >> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >>> >>> For example, a write() via /proc/self/mem to a uffd-wp-protected range has >>> to fail instead of silently granting write access and bypassing the >>> userspace fault handler. Note that FOLL_FORCE is not only used for debug >>> access, but also triggered by applications without debug intentions, for >>> example, when pinning pages via RDMA. >> >> So this made me go "Whaa?" >> >> I didn't even realize that the media drivers and rdma used FOLL_FORCE. >> >> That's just completely bogus. >> >> Why do they do that? >> >> It seems to be completely bogus, and seems to have no actual valid >> reason for it. Looking through the history, it goes back to the >> original code submission in 2006, and doesn't have a mention of why. > > It is because of all this madness with COW. > > Let's say an app does: > > buf = mmap(MAP_PRIVATE) > rdma_pin_for_read(buf) > buf[0] = 1 > > Then the store to buf[0] will COW the page and the pin will become > incoherent. > > The purpose of the FORCE is to force COW to happen early so this is > avoided. > > Sadly there are real apps that end up working this way, e.g. because > they are using a buffer in .data or something. 
> > I'd hoped David's new work on this would provide some kind of path to a > proper solution, as the need is very similar to all the other places > where we need to ensure there is no possibility of future COW. > > So, these usages can be interpreted as a FOLL flag we don't have - some > kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive > sort of idea. Thanks Jason for the explanation. I have patches in the works to no longer use FOLL_FORCE|FOLL_WRITE for taking a reliable long-term R/O pin in a MAP_PRIVATE mapping. The patches are mostly done (and comparatively simple); I merely deferred sending them out because I stumbled over this issue first. Some ugly corner cases of MAP_SHARED remain, but for most prominent use cases, my upcoming patches should allow us to just have long-term R/O pins working as expected. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 18:48 ` Jason Gunthorpe 2022-08-09 18:53 ` David Hildenbrand @ 2022-08-09 19:07 ` Linus Torvalds 2022-08-09 19:20 ` David Hildenbrand 1 sibling, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 19:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > It is because of all this madness with COW. Yes, yes, but we have the proper long-term pinning now with PG_anon_exclusive, and it actually gets the pinning right not just over COW, but even over a fork - which that early write never did. David, I thought all of that got properly merged? Is there something still missing? Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 19:07 ` Linus Torvalds @ 2022-08-09 19:20 ` David Hildenbrand 0 siblings, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2022-08-09 19:20 UTC (permalink / raw) To: Linus Torvalds, Jason Gunthorpe Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard On 09.08.22 21:07, Linus Torvalds wrote: > On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote: >> >> It is because of all this madness with COW. > > Yes, yes, but we have the proper long-term pinning now with > PG_anon_exclusive, and it actually gets the pinning right not just > over COW, but even over a fork - which that early write never did. > > David, I thought all of that got properly merged? Is there something > still missing? The only thing missing to get R/O long-term pins in MAP_PRIVATE correct is that we have to break COW when taking a R/O long-term pin when *not* finding an anon page inside a private mapping. Regarding anon pages I am not aware of issues (due to PG_anon_exclusive). If anybody here wants to stare at a webpage, the following commit explains the rough idea for MAP_PRIVATE: https://github.com/davidhildenbrand/linux/commit/cd7989fb76d2513c86f01e6f7a74415eee5d3150 Once we have that in place, we can mostly get rid of FOLL_FORCE|FOLL_WRITE for R/O long-term pins. There are some corner cases, though, that need some additional thought, which I am still working on. FS-handled COW in MAP_SHARED mappings is just nasty (hello DAX). (the wrong use of FOLL_GET instead of FOLL_PIN for O_DIRECT and friends still persists, but that's a different thing to handle and it's only problematic with concurrent fork() IIRC) -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-08 7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand ` (2 preceding siblings ...) 2022-08-09 18:40 ` [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW Linus Torvalds @ 2022-08-09 18:48 ` Linus Torvalds 2022-08-09 19:09 ` David Hildenbrand 2022-08-09 20:00 ` Linus Torvalds 4 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 18:48 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > > For example, a write() via /proc/self/mem to a uffd-wp-protected range has > to fail instead of silently granting write access and bypassing the > userspace fault handler. This, btw, just makes me go "uffd-wp is broken garbage" once more. It also makes me go "if uffd-wp can disallow ptrace writes, then why doesn't regular write protect do it"? IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that absolutely must go away), but I get the strong feeling that we instead should try to get rid of FOLL_FORCE entirely. If some other user action can stop FOLL_FORCE anyway, then why do we support it at all? Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 18:48 ` Linus Torvalds @ 2022-08-09 19:09 ` David Hildenbrand 0 siblings, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2022-08-09 19:09 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On 09.08.22 20:48, Linus Torvalds wrote: > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >> >> For example, a write() via /proc/self/mem to a uffd-wp-protected range has >> to fail instead of silently granting write access and bypassing the >> userspace fault handler. > > This, btw, just makes me go "uffd-wp is broken garbage" once more. > > It also makes me go "if uffd-wp can disallow ptrace writes, then why > doesn't regular write protect do it"? I remember that it's not just uffd-wp, it's also ordinary userfaultfd if we have no page mapped, because we'd have to drop the mmap lock in order to notify user space about the fault and wait for a resolution. IIUC, we cannot tell what exactly user-space will do as a response to a user write fault here (for example, QEMU VM snapshots have to copy page content away such that the VM snapshot remains consistent and we won't corrupt the snapshot), so we have to back off and fail the GUP. I'd say, for ptrace that's even the right thing to do because one might deadlock waiting on the user space thread that handles faults ... but that's a little off-topic to this fix here. I'm just trying to keep the semantics unchanged, as weird as they might be. > > IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that > absolutely must go away), but I get the strong feeling that we instead > should try to get rid of FOLL_FORCE entirely. 
I can resend v2 soonish, taking care of the VM_BUG_ON as you requested if there are no other comments. > > If some other user action can stop FOLL_FORCE anyway, then why do we > support it at all? My humble opinion is that debugging userfaultfd-managed memory is a corner case and that we can hopefully stop using FOLL_FORCE outside of debugging context soon. Having that said, I do enjoy having the uffd and uffd-wp feature available in user space (especially in QEMU). I don't always enjoy having to handle such corner cases in the kernel. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-08 7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand ` (3 preceding siblings ...) 2022-08-09 18:48 ` Linus Torvalds @ 2022-08-09 20:00 ` Linus Torvalds 2022-08-09 20:06 ` David Hildenbrand 2022-08-09 20:07 ` David Hildenbrand 4 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 20:00 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > So I've read through the patch several times, and it seems fine, but this function (and the pmd version of it) just read oddly to me. > +static inline bool can_follow_write_pte(pte_t pte, struct page *page, > + struct vm_area_struct *vma, > + unsigned int flags) > +{ > + if (pte_write(pte)) > + return true; > + if (!(flags & FOLL_FORCE)) > + return false; > + > + /* > + * See check_vma_flags(): only COW mappings need that special > + * "force" handling when they lack VM_WRITE. > + */ > + if (vma->vm_flags & VM_WRITE) > + return false; > + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); So apart from the VM_BUG_ON(), this code just looks really strange - even despite the comment. Just conceptually, the whole "if it's writable, return that you cannot follow it for a write" just looks so very very strange. That doesn't make the code _wrong_, but considering how many times this has had subtle bugs, let's not write code that looks strange. So I would suggest that to protect against future bugs, we try to make it be fairly clear and straightforward, and maybe even a bit overly protective. For example, let's kill the "shared mapping that you don't have write permissions to" very explicitly and without any subtle code at all. 
The vm_flags tests are cheap and easy, and we could very easily just add some core ones to make any mistakes much less critical. Now, making that 'is_cow_mapping()' check explicit at the very top of this would already go a long way: /* FOLL_FORCE for writability only affects COW mappings */ if (!is_cow_mapping(vma->vm_flags)) return false; but I'd actually go even further: in this case that "is_cow_mapping()" helper to some degree actually hides what is going on. So I'd actually prefer for that function to be written something like /* If the pte is writable, we can write to the page */ if (pte_write(pte)) return true; /* Maybe FOLL_FORCE is set to override it? */ if (!(flags & FOLL_FORCE)) return false; /* But FOLL_FORCE has no effect on shared mappings */ if (vma->vm_flags & VM_SHARED) return false; /* .. or read-only private ones */ if (!(vma->vm_flags & VM_MAYWRITE)) return false; /* .. or already writable ones that just need to take a write fault */ if (vma->vm_flags & VM_WRITE) return false; and the first two vm_flags tests above are basically doing that "is_cow_mapping()", and maybe we could even have a comment to that effect, but wouldn't it be nice to just write it out that way? And after you've written it out like the above, now that if (!page || !PageAnon(page) || !PageAnonExclusive(page)) return false; makes you pretty safe from a data sharing perspective: it's most definitely not a shared page at that point. So if you write it that way, the only remaining issues are the magic special soft-dirty and uffd ones, but at that point it's purely about the semantics of those features, no longer about any possible "oh, we fooled some shared page to be writable". And I think the above is fairly legible without any subtle cases, and the one-liner comments make it all fairly clear what it's testing. Is any of this in any _technical_ way different from what your patch did? No. 
It's literally just rewriting it to be a bit more explicit in what it is doing, I think, and it makes that odd "it's not writable if VM_WRITE is set" case a bit more explicit. Hmm? Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 20:00 ` Linus Torvalds @ 2022-08-09 20:06 ` David Hildenbrand 2022-08-09 20:07 ` David Hildenbrand 1 sibling, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2022-08-09 20:06 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On 09.08.22 22:00, Linus Torvalds wrote: > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >> > > So I've read through the patch several times, and it seems fine, but > this function (and the pmd version of it) just read oddly to me. > >> +static inline bool can_follow_write_pte(pte_t pte, struct page *page, >> + struct vm_area_struct *vma, >> + unsigned int flags) >> +{ >> + if (pte_write(pte)) >> + return true; >> + if (!(flags & FOLL_FORCE)) >> + return false; >> + >> + /* >> + * See check_vma_flags(): only COW mappings need that special >> + * "force" handling when they lack VM_WRITE. >> + */ >> + if (vma->vm_flags & VM_WRITE) >> + return false; >> + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); > > So apart from the VM_BUG_ON(), this code just looks really strange - > even despite the comment. Just conceptually, the whole "if it's > writable, return that you cannot follow it for a write" just looks so > very very strange. > > That doesn't make the code _wrong_, but considering how many times > this has had subtle bugs, let's not write code that looks strange. > > So I would suggest that to protect against future bugs, we try to make > it be fairly clear and straightforward, and maybe even a bit overly > protective. > > For example, let's kill the "shared mapping that you don't have write > permissions to" very explicitly and without any subtle code at all. 
> The vm_flags tests are cheap and easy, and we could very easily just > add some core ones to make any mistakes much less critical. > > Now, making that 'is_cow_mapping()' check explicit at the very top of > this would already go a long way: > > /* FOLL_FORCE for writability only affects COW mappings */ > if (!is_cow_mapping(vma->vm_flags)) > return false; I actually put the is_cow_mapping() check in there because check_vma_flags() should make sure that we cannot possibly end up here in that case. But we can spell it out with comments, doesn't hurt. > > but I'd actually go even further: in this case that "is_cow_mapping()" > helper to some degree actually hides what is going on. > > So I'd actually prefer for that function to be written something like > > /* If the pte is writable, we can write to the page */ > if (pte_write(pte)) > return true; > > /* Maybe FOLL_FORCE is set to override it? */ > if (!(flags & FOLL_FORCE)) > return false; > > /* But FOLL_FORCE has no effect on shared mappings */ > if (vma->vm_flags & VM_SHARED) > return false; > > /* .. or read-only private ones */ > if (!(vma->vm_flags & VM_MAYWRITE)) > return false; > > /* .. or already writable ones that just need to take a write fault */ > if (vma->vm_flags & VM_WRITE) > return false; > > and the first two vm_flags tests above are basically doing that > "is_cow_mapping()", and maybe we could even have a comment to that > effect, but wouldn't it be nice to just write it out that way? > > And after you've written it out like the above, now that > > if (!page || !PageAnon(page) || !PageAnonExclusive(page)) > return false; > > makes you pretty safe from a data sharing perspective: it's most > definitely not a shared page at that point. > > So if you write it that way, the only remaining issues are the magic > special soft-dirty and uffd ones, but at that point it's purely about > the semantics of those features, no longer about any possible "oh, we > fooled some shared page to be writable". 
> > And I think the above is fairly legible without any subtle cases, and > the one-liner comments make it all fairly clear that it's testing. > > Is any of this in any _technical_ way different from what your patch > did? No. It's literally just rewriting it to be a bit more explicit in > what it is doing, I think, and it makes that odd "it's not writable if > VM_WRITE is set" case a bit more explicit. > > Hmm? No strong opinion. I'm happy as long as it's fixed, and the fix is robust. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 20:00 ` Linus Torvalds 2022-08-09 20:06 ` David Hildenbrand @ 2022-08-09 20:07 ` David Hildenbrand 2022-08-09 20:14 ` Linus Torvalds 1 sibling, 1 reply; 33+ messages in thread From: David Hildenbrand @ 2022-08-09 20:07 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On 09.08.22 22:00, Linus Torvalds wrote: > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >> > > So I've read through the patch several times, and it seems fine, but > this function (and the pmd version of it) just read oddly to me. > >> +static inline bool can_follow_write_pte(pte_t pte, struct page *page, >> + struct vm_area_struct *vma, >> + unsigned int flags) >> +{ >> + if (pte_write(pte)) >> + return true; >> + if (!(flags & FOLL_FORCE)) >> + return false; >> + >> + /* >> + * See check_vma_flags(): only COW mappings need that special >> + * "force" handling when they lack VM_WRITE. >> + */ >> + if (vma->vm_flags & VM_WRITE) >> + return false; >> + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); > > So apart from the VM_BUG_ON(), this code just looks really strange - > even despite the comment. Just conceptually, the whole "if it's > writable, return that you cannot follow it for a write" just looks so > very very strange. > > That doesn't make the code _wrong_, but considering how many times > this has had subtle bugs, let's not write code that looks strange. > > So I would suggest that to protect against future bugs, we try to make > it be fairly clear and straightforward, and maybe even a bit overly > protective. > > For example, let's kill the "shared mapping that you don't have write > permissions to" very explicitly and without any subtle code at all. 
> The vm_flags tests are cheap and easy, and we could very easily just > add some core ones to make any mistakes much less critical. > > Now, making that 'is_cow_mapping()' check explicit at the very top of > this would already go a long way: > > /* FOLL_FORCE for writability only affects COW mappings */ > if (!is_cow_mapping(vma->vm_flags)) > return false; > > but I'd actually go even further: in this case that "is_cow_mapping()" > helper to some degree actually hides what is going on. > > So I'd actually prefer for that function to be written something like > > /* If the pte is writable, we can write to the page */ > if (pte_write(pte)) > return true; > > /* Maybe FOLL_FORCE is set to override it? */ > if (!(flags & FOLL_FORCE)) > return false; > > /* But FOLL_FORCE has no effect on shared mappings */ > if (vma->vm_flags & VM_SHARED) > return false; I'd actually rather check for VM_MAYSHARE here, which is even stronger. Thoughts? -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 20:07 ` David Hildenbrand @ 2022-08-09 20:14 ` Linus Torvalds 2022-08-09 20:20 ` David Hildenbrand 2022-08-09 20:20 ` Linus Torvalds 0 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 20:14 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand <david@redhat.com> wrote: > > > /* But FOLL_FORCE has no effect on shared mappings */ > > if (vma->vm_flags & VM_SHARED) > > return false; > > I'd actually rather check for VM_MAYSHARE here, which is even stronger. > Thoughts? Hmm. Adding the test for both is basically free (all those vm_flags checks end up being a bit mask and compare), so no objections. For some reason I thought VM_SHARED and VM_MAYSHARE end up always being the same bits, and it was a mistake to make them two bits in the first place (unlike the read/write/exec bits that are about mprotect). But as there are two bits, I'm sure somebody ends up touching one and not the other. So yeah, I don't see any downside to just checking both bits. [ That said, is_cow_mapping() only checks VM_SHARED, so if they are ever different, that's a potential source of confusion ] Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 20:14 ` Linus Torvalds @ 2022-08-09 20:20 ` David Hildenbrand 2022-08-09 20:30 ` Linus Torvalds 2022-08-09 20:20 ` Linus Torvalds 1 sibling, 1 reply; 33+ messages in thread From: David Hildenbrand @ 2022-08-09 20:20 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On 09.08.22 22:14, Linus Torvalds wrote: > On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand <david@redhat.com> wrote: >> >>> /* But FOLL_FORCE has no effect on shared mappings */ >>> if (vma->vm_flags & VM_SHARED) >>> return false; >> >> I'd actually rather check for VM_MAYSHARE here, which is even stronger. >> Thoughts? > > Hmm. Adding the test for both is basically free (all those vm_flags > checks end up being a bit mask and compare), so no objections. > > For some reason I thought VM_SHARED and VM_MAYSHARE end up always being > the same bits, and it was a mistake to make them two bits in the first > place (unlike the read/write/exec bits that are about mprotect). > > But as there are two bits, I'm sure somebody ends up touching one and > not the other. > > So yeah, I don't see any downside to just checking both bits. > > [ That said, is_cow_mapping() only checks VM_SHARED, so if they are > ever different, that's a potential source of confusion ] IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file mappings we only set VM_SHARED if the file allows for writes (and we can set VM_MAYWRITE or even VM_WRITE). [don't ask me why, it's a steady source of confusion] That's why is_cow_mapping() works even due to the weird MAP_SHARED vs. VM_MAYSHARE logic. I'd actually only check for VM_MAYSHARE here, which implies MAP_SHARED. If someone would ever mess that up, I guess we'd be in bigger trouble. But whatever you prefer. 
-- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 20:20 ` David Hildenbrand @ 2022-08-09 20:30 ` Linus Torvalds 2022-08-09 20:38 ` Linus Torvalds 2022-08-09 20:42 ` David Hildenbrand 0 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 20:30 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand <david@redhat.com> wrote: > > IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file > mappings we only set VM_SHARED if the file allows for writes Heh. This is a horrific hack, and probably should go away. Yeah, we have that if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); but I think that's _entirely_ historical. Long long ago, in a galaxy far away, we didn't handle shared mmap() very well. In fact, we used to not handle it at all. But nntpd would use write() to update the spool file, and then read it through a shared mmap. And since our mmap() *was* coherent with people doing write() system calls, but didn't handle actual dirty shared mmap, what Linux used to do was to just say "Oh, you want a read-only shared file mmap? I can do that - I'll just downgrade it to a read-only _private_ mapping, and it actually ends up with the same semantics". And here we are, 30 years later, and it still does that, but it leaves the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a shared mapping. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 20:30 ` Linus Torvalds @ 2022-08-09 20:38 ` Linus Torvalds 2022-08-09 20:42 ` David Hildenbrand 1 sibling, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 20:38 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On Tue, Aug 9, 2022 at 1:30 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And here we are, 30 years later, and it still does that, but it leaves > the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a > shared mapping. .. thinking about it, we end up still having some things that this helps. For example, because we clear the VM_SHARED flags for read-only shared mappings, they don't end up going through mapping_{un}map_writable(), and don't update i_mmap_writable, and don't cause issues with mapping_deny_writable() or mapping_writably_mapped(). So it ends up actually having random small semantic details due to those almost three decades of history. I'm sure there are other odd pieces like that. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 20:30 ` Linus Torvalds 2022-08-09 20:38 ` Linus Torvalds @ 2022-08-09 20:42 ` David Hildenbrand 1 sibling, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2022-08-09 20:42 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On 09.08.22 22:30, Linus Torvalds wrote: > On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand <david@redhat.com> wrote: >> >> IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file >> mappings we only set VM_SHARED if the file allows for writes > > Heh. > > This is a horrific hack, and probably should go away. > > Yeah, we have that > > if (!(file->f_mode & FMODE_WRITE)) > vm_flags &= ~(VM_MAYWRITE | VM_SHARED); > > > but I think that's _entirely_ historical. > > Long long ago, in a galaxy far away, we didn't handle shared mmap() > very well. In fact, we used to not handle it at all. > > But nntpd would use write() to update the spool file, and then read it > through a shared mmap. > > And since our mmap() *was* coherent with people doing write() system > calls, but didn't handle actual dirty shared mmap, what Linux used to > do was to just say "Oh, you want a read-only shared file mmap? I can > do that - I'll just downgrade it to a read-only _private_ mapping, and > it actually ends up with the same semantics". > > And here we are, 30 years later, and it still does that, but it leaves > the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a > shared mapping. I suspected that this code is full of legacy :) What would make sense to me is to just have VM_SHARED and make it correspond to MAP_SHARED, that would at least confuse me less. Once I have some spare cycles I'll see how easy that might be to achieve. 
-- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 20:14 ` Linus Torvalds 2022-08-09 20:20 ` David Hildenbrand @ 2022-08-09 20:20 ` Linus Torvalds 2022-08-09 20:23 ` David Hildenbrand 1 sibling, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-08-09 20:20 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But as there are two bits, I'm sure somebody ends up touching one and > not the other. Ugh. The nommu code certainly does odd things here. That just looks wrong. And the hugetlb code does something horrible too, but never *sets* it, so it looks like some odd subset of VM_SHARED. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-09 20:20 ` Linus Torvalds @ 2022-08-09 20:23 ` David Hildenbrand 0 siblings, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2022-08-09 20:23 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, linux-mm, stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard, Jason Gunthorpe On 09.08.22 22:20, Linus Torvalds wrote: > On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> But as there are two bits, I'm sure somebody ends up touching one and >> not the other. > > Ugh. The nommu code certainly does odd things here. That just looks wrong. > > And the hugetlb code does something horrible too, but never *sets* it, > so it looks like some odd subset of VM_SHARED. mm/mmap.c:do_mmap() contains the magic bit switch (flags & MAP_TYPE) { case MAP_SHARED: ... case MAP_SHARED_VALIDATE: ... vm_flags |= VM_SHARED | VM_MAYSHARE; if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); fallthrough; VM_SHARED semantics are confusing. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2022-08-21 11:29 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-08-08 7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand 2022-08-08 16:02 ` David Hildenbrand 2022-08-09 18:27 ` Linus Torvalds 2022-08-09 18:45 ` David Hildenbrand 2022-08-09 18:59 ` Linus Torvalds 2022-08-09 19:07 ` Jason Gunthorpe 2022-08-09 19:21 ` Linus Torvalds 2022-08-09 21:16 ` David Laight 2022-08-11 7:13 ` [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() Ingo Molnar 2022-08-11 20:43 ` Linus Torvalds 2022-08-11 21:28 ` Matthew Wilcox 2022-08-11 23:22 ` Jason Gunthorpe 2022-08-14 1:10 ` John Hubbard 2022-08-12 9:29 ` [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() Ingo Molnar [not found] ` <20220815144143.zjsiamw5y22bvgki@suse.de> 2022-08-15 22:12 ` John Hubbard 2022-08-21 11:28 ` Ingo Molnar 2022-08-09 18:40 ` [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW Linus Torvalds 2022-08-09 18:48 ` Jason Gunthorpe 2022-08-09 18:53 ` David Hildenbrand 2022-08-09 19:07 ` Linus Torvalds 2022-08-09 19:20 ` David Hildenbrand 2022-08-09 18:48 ` Linus Torvalds 2022-08-09 19:09 ` David Hildenbrand 2022-08-09 20:00 ` Linus Torvalds 2022-08-09 20:06 ` David Hildenbrand 2022-08-09 20:07 ` David Hildenbrand 2022-08-09 20:14 ` Linus Torvalds 2022-08-09 20:20 ` David Hildenbrand 2022-08-09 20:30 ` Linus Torvalds 2022-08-09 20:38 ` Linus Torvalds 2022-08-09 20:42 ` David Hildenbrand 2022-08-09 20:20 ` Linus Torvalds 2022-08-09 20:23 ` David Hildenbrand