linux-mm.kvack.org archive mirror
* [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
@ 2022-08-08  7:32 David Hildenbrand
  2022-08-08 16:02 ` David Hildenbrand
                   ` (4 more replies)
  0 siblings, 5 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-08-08  7:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, stable, Linus Torvalds,
	Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu,
	Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka,
	John Hubbard, Jason Gunthorpe

Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
that FOLL_FORCE can be possibly dangerous, especially if there are races
that can be exploited by user space.

Right now, it would be sufficient to have some code that sets a PTE of
a R/O-mapped shared page dirty, in order for it to erroneously become
writable by FOLL_FORCE. The implications of setting a write-protected PTE
dirty might not be immediately obvious to everyone.
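
For reference, this is the pre-patch check in mm/gup.c (removed by the
diff below) that turns "the PTE is dirty" into "COW already happened, so
FOLL_FORCE may write through the R/O PTE":

	static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
	{
		return pte_write(pte) ||
			((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
	}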

And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set
pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
a shmem page R/O while marking the pte dirty. This can be used by
unprivileged user space to modify tmpfs/shmem file content even if the user
does not have write permissions to the file -- Dirty COW restricted to
tmpfs/shmem (CVE-2022-2590).

To fix such security issues for good, the insight is that we really only
need that fancy retry logic (FOLL_COW) for COW mappings that are not
writable (!VM_WRITE). And in a COW mapping, we really only broke COW if
we have an exclusive anonymous page mapped. If we have something else
mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
we have to trigger a write fault to break COW. If we don't find an
exclusive anonymous page when we retry, we have to trigger COW breaking
once again because something intervened.

Let's move away from this mandatory-retry + dirty handling and rely on
our PageAnonExclusive() flag for making a similar decision, to use the
same COW logic as in other kernel parts here as well. In case we stumble
over a PTE in a COW mapping that does not map an exclusive anonymous page,
COW was not properly broken and we have to trigger a fake write-fault to
break COW.

Just like we do in can_change_pte_writable() added via
commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive
anonymous pages when changing protection") and commit 76aefad628aa
("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take
care of softdirty and uffd-wp manually.

For example, a write() via /proc/self/mem to a uffd-wp-protected range has
to fail instead of silently granting write access and bypassing the
userspace fault handler. Note that FOLL_FORCE is not only used for debug
access, but also triggered by applications without debug intentions, for
example, when pinning pages via RDMA.

This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.

Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
let's just get rid of it.

Note 1: We don't check for the PTE being dirty because it doesn't matter
	for making a "was COWed" decision anymore, and whoever modifies the
	page has to set the page dirty either way.

Note 2: Kernels before extended uffd-wp support and before
	PageAnonExclusive (< 5.19) can simply revert the problematic
	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
	v5.19 requires minor adjustments due to lack of
	vma_soft_dirty_enabled().

Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
Cc: <stable@vger.kernel.org> # 5.16+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

Against upstream from yesterday instead of v5.19 because I wanted to
reference the mprotect commit IDs and can_change_pte_writable(), and I
wanted to directly use vma_soft_dirty_enabled().

I have a working reproducer that I'll post to oss-security in one week. Of
course, that reproducer no longer triggers with that commit and my ptrace
testing indicated that FOLL_FORCE seems to continue working as expected.
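
To illustrate the kind of exercise meant here, a stripped-down ptrace
poke into a R/O private mapping (just a rough sketch, not the actual
test or the reproducer) could look like:

  #include <signal.h>
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* Lives in .rodata, i.e., a R/O private file mapping, so poking it via
   * ptrace has to go through the GUP FOLL_FORCE path and break COW. */
  static const long secret = 0x1234;

  int main(void)
  {
          pid_t pid = fork();

          if (pid == 0) {
                  ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                  raise(SIGSTOP);
                  /* volatile read so the compiler doesn't constant-fold it */
                  printf("child sees: %lx\n", *(volatile long *)&secret);
                  return 0;
          }

          waitpid(pid, NULL, 0);
          /* Write into the child's R/O mapping; the child should print 4321. */
          ptrace(PTRACE_POKEDATA, pid, (void *)&secret, (void *)0x4321L);
          ptrace(PTRACE_CONT, pid, NULL, NULL);
          waitpid(pid, NULL, 0);
          return 0;
  }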

---
 include/linux/mm.h |  1 -
 mm/gup.c           | 62 +++++++++++++++++++++++++++++-----------------
 mm/huge_memory.c   | 45 +++++++++++++++++----------------
 3 files changed, 63 insertions(+), 45 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 18e01474cf6b..2222ed598112 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2885,7 +2885,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
diff --git a/mm/gup.c b/mm/gup.c
index 732825157430..7a0b207f566f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -478,14 +478,34 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 	return -EEXIST;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
-{
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
+{
+	if (pte_write(pte))
+		return true;
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/*
+	 * See check_vma_flags(): only COW mappings need that special
+	 * "force" handling when they lack VM_WRITE.
+	 */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+	VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page and a write-fault
+	 * isn't required for other reasons.
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
+		return false;
+	return !userfaultfd_pte_wp(vma, pte);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -528,12 +548,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
-		pte_unmap_unlock(ptep, ptl);
-		return NULL;
-	}
 
 	page = vm_normal_page(vma, address, pte);
+
+	/*
+	 * We only care about anon pages in can_follow_write_pte() and don't
+	 * have to worry about pte_devmap() because they are never anon.
+	 */
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pte(pte, page, vma, flags)) {
+		page = NULL;
+		goto out;
+	}
+
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -986,17 +1013,6 @@ static int faultin_page(struct vm_area_struct *vma,
 		return -EBUSY;
 	}
 
-	/*
-	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
-	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
-	 * can thus safely do subsequent page lookups as if they were reads.
-	 * But only do so when looping for pte_write is futile: in some cases
-	 * userspace may also be wanting to write to the gotten user page,
-	 * which a read fault here might prevent (a readonly page might get
-	 * reCOWed by userspace write).
-	 */
-	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
-		*flags |= FOLL_COW;
 	return 0;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..352b5220e95e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1040,12 +1040,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	/*
-	 * When we COW a devmap PMD entry, we split it into PTEs, so we should
-	 * not be in this function with `flags & FOLL_COW` set.
-	 */
-	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
-
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
@@ -1395,14 +1389,23 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	return VM_FAULT_FALLBACK;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+/* See can_follow_write_pte() on FOLL_FORCE details. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	if (pmd_write(pmd))
+		return true;
+	if (!(flags & FOLL_FORCE))
+		return false;
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+	VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+	if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+		return false;
+	return !userfaultfd_huge_pmd_wp(vma, pmd);
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1411,12 +1414,16 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct page *page = NULL;
+	struct page *page;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
-		goto out;
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pmd(*pmd, page, vma, flags))
+		return NULL;
 
 	/* Avoid dumping huge zero page */
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
@@ -1424,10 +1431,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
-		goto out;
-
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+		return NULL;
 
 	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
 		return ERR_PTR(-EMLINK);
@@ -1444,7 +1448,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
 
-out:
 	return page;
 }
 

base-commit: 1612c382ffbdf1f673caec76502b1c00e6d35363
-- 
2.35.3




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-08  7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand
@ 2022-08-08 16:02 ` David Hildenbrand
  2022-08-09 18:27 ` Linus Torvalds
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-08-08 16:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, stable, Linus Torvalds, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe, Nadav Amit

On 08.08.22 09:32, David Hildenbrand wrote:
> Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
> that FOLL_FORCE can be possibly dangerous, especially if there are races
> that can be exploited by user space.
> 
> Right now, it would be sufficient to have some code that sets a PTE of
> a R/O-mapped shared page dirty, in order for it to erroneously become
> writable by FOLL_FORCE. The implications of setting a write-protected PTE
> dirty might not be immediately obvious to everyone.
> 
> And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set
> pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
> a shmem page R/O while marking the pte dirty. This can be used by
> unprivileged user space to modify tmpfs/shmem file content even if the user
> does not have write permissions to the file -- Dirty COW restricted to
> tmpfs/shmem (CVE-2022-2590).
> 
> To fix such security issues for good, the insight is that we really only
> need that fancy retry logic (FOLL_COW) for COW mappings that are not
> writable (!VM_WRITE). And in a COW mapping, we really only broke COW if
> we have an exclusive anonymous page mapped. If we have something else
> mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
> we have to trigger a write fault to break COW. If we don't find an
> exclusive anonymous page when we retry, we have to trigger COW breaking
> once again because something intervened.
> 
> Let's move away from this mandatory-retry + dirty handling and rely on
> our PageAnonExclusive() flag for making a similar decision, to use the
> same COW logic as in other kernel parts here as well. In case we stumble
> over a PTE in a COW mapping that does not map an exclusive anonymous page,
> COW was not properly broken and we have to trigger a fake write-fault to
> break COW.
> 
> Just like we do in can_change_pte_writable() added via
> commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive
> anonymous pages when changing protection") and commit 76aefad628aa
> ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take
> care of softdirty and uffd-wp manually.
> 
> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
> to fail instead of silently granting write access and bypassing the
> userspace fault handler. Note that FOLL_FORCE is not only used for debug
> access, but also triggered by applications without debug intentions, for
> example, when pinning pages via RDMA.
> 
> This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
> affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
> 
> Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
> let's just get rid of it.

I have to add here:

"Thanks to Nadav Amit for pointing out that the pte_dirty() check in
FOLL_FORCE code is problematic and might be exploitable."

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-08  7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand
  2022-08-08 16:02 ` David Hildenbrand
@ 2022-08-09 18:27 ` Linus Torvalds
  2022-08-09 18:45   ` David Hildenbrand
  2022-08-09 18:40 ` [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW Linus Torvalds
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 18:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

I'm still reading through this, but

 STOP DOING THIS

On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));

Using BUG_ON() for debugging is simply not ok.

And saying "but it's just a VM_BUG_ON()" does not change *anything*.
At least Fedora enables that unconditionally for normal people, it is
not some kind of "only VM people do this".

Really. BUG_ON() IS NOT FOR DEBUGGING.

Stop it. Now.

If you have a condition that must not happen, you either write that
condition into the code, or - if you are convinced it cannot happen -
you make it a WARN_ON_ONCE() so that people can report it to you.

The BUG_ON() will just make the machine die.

And for the facebooks and googles of the world, the WARN_ON() will be
sufficient.

               Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-08  7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand
  2022-08-08 16:02 ` David Hildenbrand
  2022-08-09 18:27 ` Linus Torvalds
@ 2022-08-09 18:40 ` Linus Torvalds
  2022-08-09 18:48   ` Jason Gunthorpe
  2022-08-09 18:48 ` Linus Torvalds
  2022-08-09 20:00 ` Linus Torvalds
  4 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 18:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
> to fail instead of silently granting write access and bypassing the
> userspace fault handler. Note that FOLL_FORCE is not only used for debug
> access, but also triggered by applications without debug intentions, for
> example, when pinning pages via RDMA.

So this made me go "Whaa?"

I didn't even realize that the media drivers and rdma used FOLL_FORCE.

That's just completely bogus.

Why do they do that?

It seems to be completely bogus, and seems to have no actual valid
reason for it. Looking through the history, it goes back to the
original code submission in 2006, and doesn't have a mention of why.

I think the original reason was that the code didn't have pinning, so
it used "do a write" as a pin mechanism - even for reads.

IOW, I think the non-ptrace use of FOLL_FORCE should just be removed.

                 Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 18:27 ` Linus Torvalds
@ 2022-08-09 18:45   ` David Hildenbrand
  2022-08-09 18:59     ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-08-09 18:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On 09.08.22 20:27, Linus Torvalds wrote:
> I'm still reading through this, but
> 
>  STOP DOING THIS
> 
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
> 
> Using BUG_ON() for debugging is simply not ok.
> 
> And saying "but it's just a VM_BUG_ON()" does not change *anything*.
> At least Fedora enables that unconditionally for normal people, it is
> not some kind of "only VM people do this".
> 
> Really. BUG_ON() IS NOT FOR DEBUGGING.

I totally agree with BUG_ON ... but if I get talked to in all-caps on a
Thursday evening and feel like I just touched the forbidden fruit, I
have to ask for details about VM_BUG_ON nowadays.

VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some
kind of debugging at least to me. I *know* that Fedora enables it and I
*know* that this will make Fedora crash.

I know why Fedora enables this debug option, but it somewhat destroys
the whole purpose of VM_BUG_ON nowadays, doesn't it?

For this case, this condition will never trigger and I consider it much
more a hint to the reader that we can rest assured that this condition
holds. And on production systems, it will get optimized out.
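
(For reference, IIRC include/linux/mmdebug.h has roughly

  #ifdef CONFIG_DEBUG_VM
  #define VM_BUG_ON(cond) BUG_ON(cond)
  #else
  #define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond)
  #endif

i.e., without CONFIG_DEBUG_VM the condition is only type-checked, never
evaluated at runtime.)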

Should we forbid any new usage of VM_BUG_ON just like we mostly do with
BUG_ON?

> 
> Stop it. Now.
> 
> If you have a condition that must not happen, you either write that
> condition into the code, or - if you are convinced it cannot happen -
> you make it a WARN_ON_ONCE() so that people can report it to you.

I can just turn that into a WARN_ON_ONCE() or even a VM_WARN_ON_ONCE().
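
For example (just a sketch):

  /* FOLL_FORCE only applies to COW mappings, see check_vma_flags(). */
  if (WARN_ON_ONCE(!is_cow_mapping(vma->vm_flags)))
          return false;

so we'd still bail out instead of crashing if the assertion ever fires.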

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-08  7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand
                   ` (2 preceding siblings ...)
  2022-08-09 18:40 ` [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW Linus Torvalds
@ 2022-08-09 18:48 ` Linus Torvalds
  2022-08-09 19:09   ` David Hildenbrand
  2022-08-09 20:00 ` Linus Torvalds
  4 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 18:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
> to fail instead of silently granting write access and bypassing the
> userspace fault handler.

This, btw, just makes me go "uffd-wp is broken garbage" once more.

It also makes me go "if uffd-wp can disallow ptrace writes, then why
doesn't regular write protect do it"?

IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that
absolutely must go away), but I get the strong feelign that we instead
should try to get rid of FOLL_FORCE entirely.

If some other user action can stop FOLL_FORCE anyway, then why do we
support it at all?

             Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 18:40 ` [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW Linus Torvalds
@ 2022-08-09 18:48   ` Jason Gunthorpe
  2022-08-09 18:53     ` David Hildenbrand
  2022-08-09 19:07     ` Linus Torvalds
  0 siblings, 2 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2022-08-09 18:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard

On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote:
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > For example, a write() via /proc/self/mem to a uffd-wp-protected range has
> > to fail instead of silently granting write access and bypassing the
> > userspace fault handler. Note that FOLL_FORCE is not only used for debug
> > access, but also triggered by applications without debug intentions, for
> > example, when pinning pages via RDMA.
> 
> So this made me go "Whaa?"
> 
> I didn't even realize that the media drivers and rdma used FOLL_FORCE.
> 
> That's just completely bogus.
> 
> Why do they do that?
> 
> It seems to be completely bogus, and seems to have no actual valid
> reason for it. Looking through the history, it goes back to the
> original code submission in 2006, and doesn't have a mention of why.

It is because of all this madness with COW.

Lets say an app does:

 buf = mmap(MAP_PRIVATE)
 rdma_pin_for_read(buf)
 buf[0] = 1 

Then the store to buf[0] will COW the page and the pin will become
decoherent.

The purpose of the FORCE is to force COW to happen early so this is
avoided.

Sadly there are real apps that end up working this way, e.g. because
they are using a buffer in .data or something.

I've hoped David's new work on this provided some kind of path to a
proper solution, as the need is very similar to all the other places
where we need to ensure there is no possibility of future COW.

So, this usage can be interpreted as a FOLL flag we don't have - some
kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive
sort of idea.
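
Spelled out a bit more (rough sketch; rdma_pin_for_read() stands in for
whatever driver path ends up calling pin_user_pages() without
FOLL_WRITE|FOLL_FORCE, and fd is the app's data file):

 buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

 /* The R/O pin maps and pins the pagecache page backing the file. */
 rdma_pin_for_read(buf, len);

 /* The first store COWs: buf now maps a fresh anon page, while the
  * device keeps doing DMA to the old pagecache page -> decoherent. */
 buf[0] = 1;

Passing FOLL_FORCE|FOLL_WRITE at pin time forces that COW to happen
*before* the pin, so CPU and device keep seeing the same page.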

Jason



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 18:48   ` Jason Gunthorpe
@ 2022-08-09 18:53     ` David Hildenbrand
  2022-08-09 19:07     ` Linus Torvalds
  1 sibling, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-08-09 18:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Linus Torvalds
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard

On 09.08.22 20:48, Jason Gunthorpe wrote:
> On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote:
>> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
>>> to fail instead of silently granting write access and bypassing the
>>> userspace fault handler. Note that FOLL_FORCE is not only used for debug
>>> access, but also triggered by applications without debug intentions, for
>>> example, when pinning pages via RDMA.
>>
>> So this made me go "Whaa?"
>>
>> I didn't even realize that the media drivers and rdma used FOLL_FORCE.
>>
>> That's just completely bogus.
>>
>> Why do they do that?
>>
>> It seems to be completely bogus, and seems to have no actual valid
>> reason for it. Looking through the history, it goes back to the
>> original code submission in 2006, and doesn't have a mention of why.
> 
> It is because of all this madness with COW.
> 
> Lets say an app does:
> 
>  buf = mmap(MAP_PRIVATE)
>  rdma_pin_for_read(buf)
>  buf[0] = 1 
> 
> Then the store to buf[0] will COW the page and the pin will become
> decoherent.
> 
> The purpose of the FORCE is to force COW to happen early so this is
> avoided.
> 
> Sadly there are real apps that end up working this way, e.g. because
> they are using a buffer in .data or something.
> 
> I've hoped David's new work on this provided some kind of path to a
> proper solution, as the need is very similar to all the other places
> where we need to ensure there is no possibility of future COW.
> 
> So, this usage can be interpreted as a FOLL flag we don't have - some
> kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive
> sort of idea.

Thanks Jason for the explanation.

I have patches in the works to no longer use FOLL_FORCE|FOLL_WRITE for
taking a reliable longerterm R/O pin in a MAP_PRIVATE mapping. The
patches are mostly done (and comparably simple), I merely deferred
sending them out because I stumbled over this issue first.

Some ugly corner cases of MAP_SHARED remain, but for most prominent use
cases, my upcoming patches should allow us to just have longterm R/O
pins working as expected.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 18:45   ` David Hildenbrand
@ 2022-08-09 18:59     ` Linus Torvalds
  2022-08-09 19:07       ` Jason Gunthorpe
  2022-08-11  7:13       ` [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() Ingo Molnar
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 18:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On Tue, Aug 9, 2022 at 11:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> I totally agree with BUG_ON ... but if I get talked to in all-caps on a
> Thursday evening and feel like I just touched the forbidden fruit, I
> have to ask for details about VM_BUG_ON nowadays.
>
> VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some
> kind of debugging at least to me. I *know* that Fedora enables it and I
> *know* that this will make Fedora crash.

No.

VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally no
different, the only difference is "we can make the code smaller
because these are less important".

The only possible case where BUG_ON can validly be used is "I have
some fundamental data corruption and cannot possibly return an error".

This kind of "I don't think this can happen" is _never_ an excuse for it.

Honestly, 99% of our existing BUG_ON() ones are completely bogus, and
left-over debug code that wasn't removed because they never triggered.
I've several times considered just using a coccinelle script to remove
every single BUG_ON() (and VM_BUG_ON()) as simply bogus. Because they
are pure noise.

I just tried to find a valid BUG_ON() that would make me go "yeah,
that's actually worth it", and couldn't really find one. Yeah, there
are several ones in the scheduler that make me go "ok, if that
triggers, the machine is dead anyway", so in that sense there are
certainly BUG_ON()s that don't _hurt_.

But as a very good approximation, the rule is "absolutely no new
BUG_ON() calls _ever_". Because I really cannot see a single case
where "proper error handling and WARN_ON_ONCE()" isn't the right
thing.

Now, that said, there is one very valid sub-form of BUG_ON():
BUILD_BUG_ON() is absolutely 100% fine.
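
Because that one breaks the *build* instead of the machine, e.g. something
like

   BUILD_BUG_ON(sizeof(unsigned long) < sizeof(void *));

simply refuses to compile if the assumption is ever violated.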

                Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 18:48   ` Jason Gunthorpe
  2022-08-09 18:53     ` David Hildenbrand
@ 2022-08-09 19:07     ` Linus Torvalds
  2022-08-09 19:20       ` David Hildenbrand
  1 sibling, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 19:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard

On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> It is because of all this madness with COW.

Yes, yes, but we have the proper long-term pinning now with
PG_anon_exclusive, and it actually gets the pinning right not just
over COW, but even over a fork - which that early write never did.

David, I thought all of that got properly merged? Is there something
still missing?

                     Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 18:59     ` Linus Torvalds
@ 2022-08-09 19:07       ` Jason Gunthorpe
  2022-08-09 19:21         ` Linus Torvalds
  2022-08-09 21:16         ` David Laight
  2022-08-11  7:13       ` [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() Ingo Molnar
  1 sibling, 2 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2022-08-09 19:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard

On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote:

> But as a very good approximation, the rule is "absolutely no new
> BUG_ON() calls _ever_". Because I really cannot see a single case
> where "proper error handling and WARN_ON_ONCE()" isn't the right
> thing.

Parallel to this discussion I've had ones where people more or less
say

 Since BUG_ON crashes the machine and Linus says that crashing the
 machine is bad, WARN_ON will also crash the machine if you set the
 panic_on_warn parameter, so it is also bad, thus we shouldn't use
 anything.

I've generally maintained that people who set the panic_on_warn *want*
these crashes, because that is the entire point of it. So we should
use WARN_ON with an error recovery for "can't happen" assertions like
these. I think it is what you are saying here.

Jason



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 18:48 ` Linus Torvalds
@ 2022-08-09 19:09   ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-08-09 19:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On 09.08.22 20:48, Linus Torvalds wrote:
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
>> to fail instead of silently granting write access and bypassing the
>> userspace fault handler.
> 
> This, btw, just makes me go "uffd-wp is broken garbage" once more.
> 
> It also makes me go "if uffd-wp can disallow ptrace writes, then why
> doesn't regular write protect do it"?

I remember that it's not just uffd-wp, it's also ordinary userfaultfd if
we have no page mapped, because we'd have to drop the mmap lock in order
to notify user space about the fault and wait for a resolution.

IIUC, we cannot tell what exactly user-space will do as a response to a
user write fault here (for example, QEMU VM snapshots have to copy page
content away such that the VM snapshot remains consistent and we won't
corrupt the snapshot), so we have to back off and fail the GUP. I'd say,
for ptrace that's even the right thing to do because one might deadlock
waiting on the user space thread that handles faults ... but that's a
little off-topic to this fix here. I'm just trying to keep the semantics
unchanged, as weird as they might be.


> 
> IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that
> absolutely must go away), but I get the strong feelign that we instead
> should try to get rid of FOLL_FORCE entirely.

I can resend v2 soonish, taking care of the VM_BUG_ON as you requested
if there are no other comments.

> 
> If some other user action can stop FOLL_FORCE anyway, then why do we
> support it at all?

My humble opinion is that debugging userfaultfd-managed memory is a
corner case and that we can hopefully stop using FOLL_FORCE outside of
debugging context soon.

Having that said, I do enjoy having the uffd and uffd-wp feature
available in user space (especially in QEMU). I don't always enjoy
having to handle such corner cases in the kernel.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 19:07     ` Linus Torvalds
@ 2022-08-09 19:20       ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-08-09 19:20 UTC (permalink / raw)
  To: Linus Torvalds, Jason Gunthorpe
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard

On 09.08.22 21:07, Linus Torvalds wrote:
> On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>> It is because of all this madness with COW.
> 
> Yes, yes, but we have the proper long-term pinning now with
> PG_anon_exclusive, and it actually gets the pinning right not just
> over COW, but even over a fork - which that early write never did.
> 
> David, I thought all of that got properly merged? Is there something
> still missing?

The only thing still missing to get R/O longterm pins in MAP_PRIVATE
correct is that we have to break COW when taking a R/O longterm pin and
*not* finding an anon page inside a private mapping. Regarding anon
pages I am not aware of issues (due to PG_anon_exclusive).

If anybody here wants to stare at a webpage, the following commit
explains the rough idea for MAP_PRIVATE:

https://github.com/davidhildenbrand/linux/commit/cd7989fb76d2513c86f01e6f7a74415eee5d3150

Once we have that in place, we can mostly get rid of
FOLL_FORCE|FOLL_WRITE for R/O longterm pins. There are some corner cases
though that need some additional thought which i am still working on.
FS-handled COW in MAP_SHARED mappings is just nasty (hello DAX).

(the wrong use of FOLL_GET instead of FOLL_PIN for O_DIRECT and friends
still persists, but that's a different thing to handle and it's only
problematic with concurrent fork() IIRC)

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 19:07       ` Jason Gunthorpe
@ 2022-08-09 19:21         ` Linus Torvalds
  2022-08-09 21:16         ` David Laight
  1 sibling, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 19:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard

On Tue, Aug 9, 2022 at 12:09 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>  Since BUG_ON crashes the machine and Linus says that crashing the
>  machine is bad, WARN_ON will also crash the machine if you set the
>  panic_on_warn parameter, so it is also bad, thus we shouldn't use
>  anything.

If you set 'panic_on_warn' you get to keep both pieces when something breaks.

The thing is, there are people who *do* want to stop immediately when
something goes wrong in the kernel.

Anybody doing large-scale virtualization presumably has all the
infrastructure to get debug info out of the virtual environment.

And people who run controlled loads in big server machine setups and
have a MIS department to manage said machines typically also prefer
for a machine to just crash over continuing.

So in those situations, a dead machine is still a dead machine, but
you get the information out, and panic_on_warn is fine, because panic
and reboot is fine.

And yes, that's actually a fairly common case. Things like syzkaller
etc *want* to abort on the first warning, because that's kind of the
point.

But while that kind of virtualized automation machinery is very very
common, and is a big deal, it's by no means the only deal, nor so
important that nothing else matters.

And if you are *not* in a farm, and if you are *not* using
virtualization, a dead machine is literally a useless brick. Nobody
has serial lines on individual machines any more. In most cases, the
hardware literally doesn't even exist any more.

So in that situation, you really cannot afford to take the approach of
"just kill the machine". If you are on a laptop and are doing power
management code, you generally cannot do that in a virtual
environment, and you already have enough problems with suspend and
resume being hard to debug, without people also going "oh, let's just
BUG_ON() and kill the machine".

Because the other side of that "we have a lot of machine farms doing
automated testing" is that those machine farms do not generally find a
lot of the exciting cases.

Almost every single merge window, I end up having to bisect and report
an oops or a WARN_ON(), because I actually run on real hardware. And
said problem was never seen in linux-next.

So we have two very different cases: the "virtual machine with good
logging where a dead machine is fine" - use 'panic_on_warn'. And the
actual real hardware with real drivers, running real loads by users.

Both are valid. But the second case means that BUG_ON() is basically
_never_ valid.

              Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-08  7:32 [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW David Hildenbrand
                   ` (3 preceding siblings ...)
  2022-08-09 18:48 ` Linus Torvalds
@ 2022-08-09 20:00 ` Linus Torvalds
  2022-08-09 20:06   ` David Hildenbrand
  2022-08-09 20:07   ` David Hildenbrand
  4 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 20:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>

So I've read through the patch several times, and it seems fine, but
this function (and the pmd version of it) just read oddly to me.

> +static inline bool can_follow_write_pte(pte_t pte, struct page *page,
> +                                       struct vm_area_struct *vma,
> +                                       unsigned int flags)
> +{
> +       if (pte_write(pte))
> +               return true;
> +       if (!(flags & FOLL_FORCE))
> +               return false;
> +
> +       /*
> +        * See check_vma_flags(): only COW mappings need that special
> +        * "force" handling when they lack VM_WRITE.
> +        */
> +       if (vma->vm_flags & VM_WRITE)
> +               return false;
> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));

So apart from the VM_BUG_ON(), this code just looks really strange -
even despite the comment. Just conceptually, the whole "if it's
writable, return that you cannot follow it for a write" just looks so
very very strange.

That doesn't make the code _wrong_, but considering how many times
this has had subtle bugs, let's not write code that looks strange.

So I would suggest that to protect against future bugs, we try to make
it be fairly clear and straightforward, and maybe even a bit overly
protective.

For example, let's kill the "shared mapping that you don't have write
permissions to" very explicitly and without any subtle code at all.
The vm_flags tests are cheap and easy, and we could very easily just
add some core ones to make any mistakes much less critical.

Now, making that 'is_cow_mapping()' check explicit at the very top of
this would already go a long way:

        /* FOLL_FORCE for writability only affects COW mappings */
        if (!is_cow_mapping(vma->vm_flags))
                return false;

but I'd actually go even further: in this case that "is_cow_mapping()"
helper to some degree actually hides what is going on.

So I'd actually prefer for that function to be written something like

        /* If the pte is writable, we can write to the page */
        if (pte_write(pte))
                return true;

        /* Maybe FOLL_FORCE is set to override it? */
        if (!(flags & FOLL_FORCE))
                return false;

        /* But FOLL_FORCE has no effect on shared mappings */
        if (vma->vm_flags & MAP_SHARED)
                return false;

        /* .. or read-only private ones */
        if (!(vma->vm_flags & MAP_MAYWRITE))
                return false;

        /* .. or already writable ones that just need to take a write fault */
        if (vma->vm_flags & MAP_WRITE)
                return false;

and the first two vm_flags tests above are basically doing that
"is_cow_mapping()", and maybe we could even have a comment to that
effect, but wouldn't it be nice to just write it out that way?

And after you've written it out like the above, now that

        if (!page || !PageAnon(page) || !PageAnonExclusive(page))
                return false;

makes you pretty safe from a data sharing perspective: it's most
definitely not a shared page at that point.

So if you write it that way, the only remaining issues are the magic
special soft-dirty and uffd ones, but at that point it's purely about
the semantics of those features, no longer about any possible "oh, we
fooled some shared page to be writable".

And I think the above is fairly legible without any subtle cases, and
the one-liner comments make it all fairly clear what it's testing.

Is any of this in any _technical_ way different from what your patch
did? No. It's literally just rewriting it to be a bit more explicit in
what it is doing, I think, and it makes that odd "it's not writable if
VM_WRITE is set" case a bit more explicit.

Hmm?

                  Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 20:00 ` Linus Torvalds
@ 2022-08-09 20:06   ` David Hildenbrand
  2022-08-09 20:07   ` David Hildenbrand
  1 sibling, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-08-09 20:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On 09.08.22 22:00, Linus Torvalds wrote:
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
> 
> So I've read through the patch several times, and it seems fine, but
> this function (and the pmd version of it) just read oddly to me.
> 
>> +static inline bool can_follow_write_pte(pte_t pte, struct page *page,
>> +                                       struct vm_area_struct *vma,
>> +                                       unsigned int flags)
>> +{
>> +       if (pte_write(pte))
>> +               return true;
>> +       if (!(flags & FOLL_FORCE))
>> +               return false;
>> +
>> +       /*
>> +        * See check_vma_flags(): only COW mappings need that special
>> +        * "force" handling when they lack VM_WRITE.
>> +        */
>> +       if (vma->vm_flags & VM_WRITE)
>> +               return false;
>> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
> 
> So apart from the VM_BUG_ON(), this code just looks really strange -
> even despite the comment. Just conceptually, the whole "if it's
> writable, return that you cannot follow it for a write" just looks so
> very very strange.
> 
> That doesn't make the code _wrong_, but considering how many times
> this has had subtle bugs, let's not write code that looks strange.
> 
> So I would suggest that to protect against future bugs, we try to make
> it be fairly clear and straightforward, and maybe even a bit overly
> protective.
> 
> For example, let's kill the "shared mapping that you don't have write
> permissions to" very explicitly and without any subtle code at all.
> The vm_flags tests are cheap and easy, and we could very easily just
> add some core ones to make any mistakes much less critical.
> 
> Now, making that 'is_cow_mapping()' check explicit at the very top of
> this would already go a long way:
> 
>         /* FOLL_FORCE for writability only affects COW mappings */
>         if (!is_cow_mapping(vma->vm_flags))
>                 return false;

I actually put the is_cow_mapping() check in there because
check_vma_flags() should make sure that we cannot possibly end up here
for anything but a COW mapping. But we can spell it out with comments,
it doesn't hurt.

> 
> but I'd actually go even further: in this case that "is_cow_mapping()"
> helper to some degree actually hides what is going on.
> 
> So I'd actually prefer for that function to be written something like
> 
>         /* If the pte is writable, we can write to the page */
>         if (pte_write(pte))
>                 return true;
> 
>         /* Maybe FOLL_FORCE is set to override it? */
>         if (!(flags & FOLL_FORCE))
>                 return false;
> 
>         /* But FOLL_FORCE has no effect on shared mappings */
>         if (vma->vm_flags & MAP_SHARED)
>                 return false;
> 
>         /* .. or read-only private ones */
>         if (!(vma->vm_flags & MAP_MAYWRITE))
>                 return false;
> 
>         /* .. or already writable ones that just need to take a write fault */
>         if (vma->vm_flags & MAP_WRITE)
>                 return false;
> 
> and the first two vm_flags tests above are basically doing that
> "is_cow_mapping()", and maybe we could even have a comment to that
> effect, but wouldn't it be nice to just write it out that way?
> 
> And after you've written it out like the above, now that
> 
>         if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>                 return false;
> 
> makes you pretty safe from a data sharing perspective: it's most
> definitely not a shared page at that point.
> 
> So if you write it that way, the only remaining issues are the magic
> special soft-dirty and uffd ones, but at that point it's purely about
> the semantics of those features, no longer about any possible "oh, we
> fooled some shared page to be writable".
> 
> And I think the above is fairly legible without any subtle cases, and
> the one-liner comments make it all fairly clear what it's testing.
> 
> Is any of this in any _technical_ way different from what your patch
> did? No. It's literally just rewriting it to be a bit more explicit in
> what it is doing, I think, and it makes that odd "it's not writable if
> VM_WRITE is set" case a bit more explicit.
> 
> Hmm?

No strong opinion. I'm happy as long as it's fixed, and the fix is robust.


-- 
Thanks,

David / dhildenb




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 20:00 ` Linus Torvalds
  2022-08-09 20:06   ` David Hildenbrand
@ 2022-08-09 20:07   ` David Hildenbrand
  2022-08-09 20:14     ` Linus Torvalds
  1 sibling, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-08-09 20:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On 09.08.22 22:00, Linus Torvalds wrote:
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
> 
> So I've read through the patch several times, and it seems fine, but
> this function (and the pmd version of it) just read oddly to me.
> 
>> +static inline bool can_follow_write_pte(pte_t pte, struct page *page,
>> +                                       struct vm_area_struct *vma,
>> +                                       unsigned int flags)
>> +{
>> +       if (pte_write(pte))
>> +               return true;
>> +       if (!(flags & FOLL_FORCE))
>> +               return false;
>> +
>> +       /*
>> +        * See check_vma_flags(): only COW mappings need that special
>> +        * "force" handling when they lack VM_WRITE.
>> +        */
>> +       if (vma->vm_flags & VM_WRITE)
>> +               return false;
>> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
> 
> So apart from the VM_BUG_ON(), this code just looks really strange -
> even despite the comment. Just conceptually, the whole "if it's
> writable, return that you cannot follow it for a write" just looks so
> very very strange.
> 
> That doesn't make the code _wrong_, but considering how many times
> this has had subtle bugs, let's not write code that looks strange.
> 
> So I would suggest that to protect against future bugs, we try to make
> it be fairly clear and straightforward, and maybe even a bit overly
> protective.
> 
> For example, let's kill the "shared mapping that you don't have write
> permissions to" very explicitly and without any subtle code at all.
> The vm_flags tests are cheap and easy, and we could very easily just
> add some core ones to make any mistakes much less critical.
> 
> Now, making that 'is_cow_mapping()' check explicit at the very top of
> this would already go a long way:
> 
>         /* FOLL_FORCE for writability only affects COW mappings */
>         if (!is_cow_mapping(vma->vm_flags))
>                 return false;
> 
> but I'd actually go even further: in this case that "is_cow_mapping()"
> helper to some degree actually hides what is going on.
> 
> So I'd actually prefer for that function to be written something like
> 
>         /* If the pte is writable, we can write to the page */
>         if (pte_write(pte))
>                 return true;
> 
>         /* Maybe FOLL_FORCE is set to override it? */
>         if (!(flags & FOLL_FORCE))
>                 return false;
> 
>         /* But FOLL_FORCE has no effect on shared mappings */
>         if (vma->vm_flags & MAP_SHARED)
>                 return false;

I'd actually rather check for MAP_MAYSHARE here, which is even stronger.
Thoughts?

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 20:07   ` David Hildenbrand
@ 2022-08-09 20:14     ` Linus Torvalds
  2022-08-09 20:20       ` David Hildenbrand
  2022-08-09 20:20       ` Linus Torvalds
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 20:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand <david@redhat.com> wrote:
>
> >         /* But FOLL_FORCE has no effect on shared mappings */
> >         if (vma->vm_flags & MAP_SHARED)
> >                 return false;
>
> I'd actually rather check for MAP_MAYSHARE here, which is even stronger.
> Thoughts?

Hmm. Adding the test for both is basically free (all those vm_flags
checks end up being a bit mask and compare), so no objections.

For some reason I thought VM_SHARED and VM_MAYSHARE end up always being
the same bits, and it was a mistake to make them two bits in the first
place (unlike the read/write/exec bits that are about mprotect).

But as there are two bits, I'm sure somebody ends up touching one and
not the other.

So yeah, I don't see any downside to just checking both bits.
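
IOW, something like

        /* FOLL_FORCE has no effect on shared mappings, however expressed */
        if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE))
                return false;

rather than testing only one of them.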

[ That said, is_cow_mapping() only checks VM_SHARED, so if they are
ever different, that's a potential source of confusion ]

               Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 20:14     ` Linus Torvalds
@ 2022-08-09 20:20       ` David Hildenbrand
  2022-08-09 20:30         ` Linus Torvalds
  2022-08-09 20:20       ` Linus Torvalds
  1 sibling, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-08-09 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On 09.08.22 22:14, Linus Torvalds wrote:
> On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>         /* But FOLL_FORCE has no effect on shared mappings */
>>>         if (vma->vm_flags & MAP_SHARED)
>>>                 return false;
>>
>> I'd actually rather check for MAP_MAYSHARE here, which is even stronger.
>> Thoughts?
> 
> Hmm. Adding the test for both is basically free (all those vm_flags
> checks end up being a bit mask and compare), so no objections.
> 
> For some reason I thought VM_SHARED and VM_MAYSHARE end up always being
> the same bits, and it was a mistake to make them two bits in the first
> place (unlike the read/write/exec bits that are about mprotect).
> 
> But as there are two bits, I'm sure somebody ends up touching one and
> not the other.
> 
> So yeah, I don't see any downside to just checking both bits.
> 
> [ That said, is_cow_mapping() only checks VM_SHARED, so if they are
> ever different, that's a potential source of confusion ]

IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file
mappings we only set VM_SHARED if the file allows writes (in which case
we can also set VM_MAYWRITE or even VM_WRITE). [don't ask me why, it's a
steady source of confusion]

That's why is_cow_mapping() works despite the weird MAP_SHARED vs.
VM_MAYSHARE logic.


I'd actually only check for VM_MAYSHARE here, which implies MAP_SHARED.
If someone ever messed that up, I guess we'd be in bigger trouble.

But whatever you prefer.
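
For illustration only, the two variants under discussion would look
roughly like this in the helper added by the patch:

	/* FOLL_FORCE has no effect on shared mappings */
	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))	/* test both bits */
		return false;

	/* ... or rely on VM_MAYSHARE alone, since MAP_SHARED always sets it */
	if (vma->vm_flags & VM_MAYSHARE)
		return false;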

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 20:14     ` Linus Torvalds
  2022-08-09 20:20       ` David Hildenbrand
@ 2022-08-09 20:20       ` Linus Torvalds
  2022-08-09 20:23         ` David Hildenbrand
  1 sibling, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 20:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But as there are two bits, I'm sure somebody ends up touching one and
> not the other.

Ugh. The nommu code certainly does odd things here. That just looks wrong.

And the hugetlb code does something horrible too, but never *sets* it,
so it looks like some odd subset of VM_SHARED.

                 Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 20:20       ` Linus Torvalds
@ 2022-08-09 20:23         ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-08-09 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On 09.08.22 22:20, Linus Torvalds wrote:
> On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> But as there are two bits, I'm sure somebody ends up touching one and
>> not the other.
> 
> Ugh. The nommu code certainly does odd things here. That just looks wrong.
> 
> And the hugetlb code does something horrible too, but never *sets* it,
> so it looks like some odd subset of VM_SHARED.

mm/mmap.c:do_mmap() contains the magic bit

	switch (flags & MAP_TYPE) {
	case MAP_SHARED:
		...
	case MAP_SHARED_VALIDATE:
		...
		vm_flags |= VM_SHARED | VM_MAYSHARE;
		if (!(file->f_mode & FMODE_WRITE))
			vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
		fallthrough;


VM_SHARED semantics are confusing.


-- 
Thanks,

David / dhildenb




* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 20:20       ` David Hildenbrand
@ 2022-08-09 20:30         ` Linus Torvalds
  2022-08-09 20:38           ` Linus Torvalds
  2022-08-09 20:42           ` David Hildenbrand
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 20:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
>
> IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file
> mappings we only set VM_SHARED if the file allows for writes

Heh.

This is a horrific hack, and probably should go away.

Yeah, we have that

                        if (!(file->f_mode & FMODE_WRITE))
                                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);


but I think that's _entirely_ historical.

Long long ago, in a galaxy far away, we didn't handle shared mmap()
very well. In fact, we used to not handle it at all.

But nntpd would use write() to update the spool file, and then read it
through a shared mmap.

And since our mmap() *was* coherent with people doing write() system
calls, but didn't handle actual dirty shared mmap, what Linux used to
do was to just say "Oh, you want a read-only shared file mmap? I can
do that - I'll just downgrade it to a read-only _private_ mapping, and
it actually ends up with the same semantics".

And here we are, 30 years later, and it still does that, but it leaves
the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a
shared mapping.

                 Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 20:30         ` Linus Torvalds
@ 2022-08-09 20:38           ` Linus Torvalds
  2022-08-09 20:42           ` David Hildenbrand
  1 sibling, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2022-08-09 20:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On Tue, Aug 9, 2022 at 1:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And here we are, 30 years later, and it still does that, but it leaves
> the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a
> shared mapping.

.. thinking about it, we still end up having some things that this helps with.

For example, because we clear the VM_SHARED flags for read-only shared
mappings, they don't end up going through mapping_{un}map_writable(),
and don't update i_mmap_writable, and don't cause issues with
mapping_deny_writable() or mapping_writably_mapped().
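
[ Rough sketch of the bookkeeping this refers to -- written from memory,
  so treat the exact call-site details as illustrative only:

	/* mm/mmap.c:mmap_region(): only VM_SHARED file mappings are
	   accounted as writable sharers of the inode */
	if (vm_flags & VM_SHARED)
		error = mapping_map_writable(file->f_mapping);

	/* include/linux/fs.h */
	static inline int mapping_map_writable(struct address_space *mapping)
	{
		return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
			0 : -EPERM;
	}
]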

So it ends up actually having random small semantic details due to
those almost three decades of history.

I'm sure there are other odd pieces like that.

                  Linus



* Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 20:30         ` Linus Torvalds
  2022-08-09 20:38           ` Linus Torvalds
@ 2022-08-09 20:42           ` David Hildenbrand
  1 sibling, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-08-09 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe

On 09.08.22 22:30, Linus Torvalds wrote:
> On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file
>> mappings we only set VM_SHARED if the file allows for writes
> 
> Heh.
> 
> This is a horrific hack, and probably should go away.
> 
> Yeah, we have that
> 
>                         if (!(file->f_mode & FMODE_WRITE))
>                                 vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
> 
> 
> but I think that's _entirely_ historical.
> 
> Long long ago, in a galaxy far away, we didn't handle shared mmap()
> very well. In fact, we used to not handle it at all.
> 
> But nntpd would use write() to update the spool file, and then read it
> through a shared mmap.
> 
> And since our mmap() *was* coherent with people doing write() system
> calls, but didn't handle actual dirty shared mmap, what Linux used to
> do was to just say "Oh, you want a read-only shared file mmap? I can
> do that - I'll just downgrade it to a read-only _private_ mapping, and
> it actually ends up with the same semantics".
> 
> And here we are, 30 years later, and it still does that, but it leaves
> the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a
> shared mapping.

I suspected that this code is full of legacy :)

What would make sense to me is to just have VM_SHARED and make it
correspond to MAP_SHARED; that would at least confuse me less. Once I
have some spare cycles I'll see how easy that might be to achieve.

-- 
Thanks,

David / dhildenb




* RE: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  2022-08-09 19:07       ` Jason Gunthorpe
  2022-08-09 19:21         ` Linus Torvalds
@ 2022-08-09 21:16         ` David Laight
  1 sibling, 0 replies; 33+ messages in thread
From: David Laight @ 2022-08-09 21:16 UTC (permalink / raw)
  To: 'Jason Gunthorpe', Linus Torvalds
  Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard

From: Jason Gunthorpe
> Sent: 09 August 2022 20:08
> 
> On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote:
> 
> > But as a very good approximation, the rule is "absolutely no new
> > BUG_ON() calls _ever_". Because I really cannot see a single case
> > where "proper error handling and WARN_ON_ONCE()" isn't the right
> > thing.
> 
> Parallel to this discussion I've had ones where people more or less
> say
> 
>  Since BUG_ON crashes the machine and Linus says that crashing the
>  machine is bad, WARN_ON will also crash the machine if you set the
>  panic_on_warn parameter, so it is also bad, thus we shouldn't use
>  anything.
> 
> I've generally maintained that people who set the panic_on_warn *want*
> these crashes, because that is the entire point of it. So we should
> use WARN_ON with an error recovery for "can't happen" assertions like
> these. I think it is what you are saying here.

They don't necessarily want the crashes; it is more that the people who
built the distribution think they want the crashes.

I have had issues with a customer system (with our drivers) randomly
locking up.
Someone had decided that 'PANIC_ON_OOPS' was a good idea but hadn't
enabled anything to actually take the dump.

So instead of a diagnosable problem (and a 'doh' moment) you
get several weeks of head scratching and a very annoyed user.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)




* [PATCH] sched/all: Change BUG_ON() instances to WARN_ON()
  2022-08-09 18:59     ` Linus Torvalds
  2022-08-09 19:07       ` Jason Gunthorpe
@ 2022-08-11  7:13       ` Ingo Molnar
  2022-08-11 20:43         ` Linus Torvalds
  1 sibling, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2022-08-11  7:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I just tried to find a valid BUG_ON() that would make me go "yeah, that's 
> actually worth it", and couldn't really find one. Yeah, there are several 
> ones in the scheduler that make me go "ok, if that triggers, the machine 
> is dead anyway", so in that sense there are certainly BUG_ON()s that 
> don't _hurt_.

That's a mostly accidental, historical accumulation of BUG_ON()s - I 
believe we can change all of them to WARN_ON() via the patch below.

As far as the scheduler is concerned, we don't need any BUG_ON()s.

[ This assumes that printk() itself is atomic and non-recursive wrt. the 
  scheduler in these code paths ... ]

Thanks,

	Ingo

===============>
From: Ingo Molnar <mingo@kernel.org>
Date: Thu, 11 Aug 2022 08:54:52 +0200
Subject: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON()

There's no good reason to crash a user's system with a BUG_ON():
chances are high that they'll never even see the crash message on
Xorg, and it won't make it into the syslog either.

By using a WARN_ON() we at least give the user a chance to report
any bugs triggered here - instead of getting silent hangs.

None of these WARN_ON()s are supposed to trigger, ever - so we ignore
cases where a NULL check is done via a BUG_ON() and we let a NULL
pointer through after a WARN_ON().

There's one exception: WARN_ON() arguments with side-effects,
such as locking - in this case we use the return value of the
WARN_ON(), such as in:

 -       BUG_ON(!lock_task_sighand(p, &flags));
 +       if (WARN_ON(!lock_task_sighand(p, &flags)))
 +               return;

Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/autogroup.c |  3 ++-
 kernel/sched/core.c      |  2 +-
 kernel/sched/cpupri.c    |  2 +-
 kernel/sched/deadline.c  | 26 +++++++++++++-------------
 kernel/sched/fair.c      | 10 +++++-----
 kernel/sched/rt.c        |  2 +-
 kernel/sched/sched.h     |  6 +++---
 7 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index 4ebaf97f7bd8..13f6b6da35a0 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -161,7 +161,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
 	struct task_struct *t;
 	unsigned long flags;
 
-	BUG_ON(!lock_task_sighand(p, &flags));
+	if (WARN_ON(!lock_task_sighand(p, &flags)))
+		return;
 
 	prev = p->signal->autogroup;
 	if (prev == ag) {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d3d61cbb6b3c..f84206bf42cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2328,7 +2328,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 	rq = cpu_rq(new_cpu);
 
 	rq_lock(rq, rf);
-	BUG_ON(task_cpu(p) != new_cpu);
+	WARN_ON(task_cpu(p) != new_cpu);
 	activate_task(rq, p, 0);
 	check_preempt_curr(rq, p, 0);
 
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index fa9ce9d83683..9f719e4ea081 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -147,7 +147,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
 	int task_pri = convert_prio(p->prio);
 	int idx, cpu;
 
-	BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);
+	WARN_ON(task_pri >= CPUPRI_NR_PRIORITIES);
 
 	for (idx = 0; idx < task_pri; idx++) {
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1d9c90958baa..fb234077c317 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -310,7 +310,7 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 {
 	struct rq *rq;
 
-	BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV);
+	WARN_ON(p->dl.flags & SCHED_FLAG_SUGOV);
 
 	if (task_on_rq_queued(p))
 		return;
@@ -607,7 +607,7 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *leftmost;
 
-	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+	WARN_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
 
 	leftmost = rb_add_cached(&p->pushable_dl_tasks,
 				 &rq->dl.pushable_dl_tasks_root,
@@ -684,7 +684,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 			 * Failed to find any suitable CPU.
 			 * The task will never come back!
 			 */
-			BUG_ON(dl_bandwidth_enabled());
+			WARN_ON(dl_bandwidth_enabled());
 
 			/*
 			 * If admission control is disabled we
@@ -830,7 +830,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	BUG_ON(pi_of(dl_se)->dl_runtime <= 0);
+	WARN_ON(pi_of(dl_se)->dl_runtime <= 0);
 
 	/*
 	 * This could be the case for a !-dl task that is boosted.
@@ -1616,7 +1616,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 
-	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+	WARN_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
 
 	rb_add_cached(&dl_se->rb_node, &dl_rq->root, __dl_less);
 
@@ -1640,7 +1640,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 static void
 enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
-	BUG_ON(on_dl_rq(dl_se));
+	WARN_ON(on_dl_rq(dl_se));
 
 	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
 
@@ -2017,7 +2017,7 @@ static struct task_struct *pick_task_dl(struct rq *rq)
 		return NULL;
 
 	dl_se = pick_next_dl_entity(dl_rq);
-	BUG_ON(!dl_se);
+	WARN_ON(!dl_se);
 	p = dl_task_of(dl_se);
 
 	return p;
@@ -2277,12 +2277,12 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 
 	p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
 
-	BUG_ON(rq->cpu != task_cpu(p));
-	BUG_ON(task_current(rq, p));
-	BUG_ON(p->nr_cpus_allowed <= 1);
+	WARN_ON(rq->cpu != task_cpu(p));
+	WARN_ON(task_current(rq, p));
+	WARN_ON(p->nr_cpus_allowed <= 1);
 
-	BUG_ON(!task_on_rq_queued(p));
-	BUG_ON(!dl_task(p));
+	WARN_ON(!task_on_rq_queued(p));
+	WARN_ON(!dl_task(p));
 
 	return p;
 }
@@ -2492,7 +2492,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 	struct root_domain *src_rd;
 	struct rq *rq;
 
-	BUG_ON(!dl_task(p));
+	WARN_ON(!dl_task(p));
 
 	rq = task_rq(p);
 	src_rd = rq->rd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da388657d5ac..00c01b3232b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2600,7 +2600,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 	if (!join)
 		return;
 
-	BUG_ON(irqs_disabled());
+	WARN_ON(irqs_disabled());
 	double_lock_irq(&my_grp->lock, &grp->lock);
 
 	for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
@@ -7279,7 +7279,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 	find_matching_se(&se, &pse);
-	BUG_ON(!pse);
+	WARN_ON(!pse);
 
 	cse_is_idle = se_is_idle(se);
 	pse_is_idle = se_is_idle(pse);
@@ -8159,7 +8159,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 {
 	lockdep_assert_rq_held(rq);
 
-	BUG_ON(task_rq(p) != rq);
+	WARN_ON(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	check_preempt_curr(rq, p, 0);
 }
@@ -10134,7 +10134,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		goto out_balanced;
 	}
 
-	BUG_ON(busiest == env.dst_rq);
+	WARN_ON(busiest == env.dst_rq);
 
 	schedstat_add(sd->lb_imbalance[idle], env.imbalance);
 
@@ -10430,7 +10430,7 @@ static int active_load_balance_cpu_stop(void *data)
 	 * we need to fix it. Originally reported by
 	 * Bjorn Helgaas on a 128-CPU setup.
 	 */
-	BUG_ON(busiest_rq == target_rq);
+	WARN_ON(busiest_rq == target_rq);
 
 	/* Search for an sd spanning us and the target CPU. */
 	rcu_read_lock();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 054b6711e961..acf9f5ce0c4a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -843,7 +843,7 @@ static void __disable_runtime(struct rq *rq)
 		 * We cannot be left wanting - that would mean some runtime
 		 * leaked out of the system.
 		 */
-		BUG_ON(want);
+		WARN_ON(want);
 balanced:
 		/*
 		 * Disable all the borrow logic by pretending we have inf
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b0bf2287dd9d..8e5df3bc3483 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2699,8 +2699,8 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	__acquires(rq1->lock)
 	__acquires(rq2->lock)
 {
-	BUG_ON(!irqs_disabled());
-	BUG_ON(rq1 != rq2);
+	WARN_ON(!irqs_disabled());
+	WARN_ON(rq1 != rq2);
 	raw_spin_rq_lock(rq1);
 	__acquire(rq2->lock);	/* Fake it out ;) */
 	double_rq_clock_clear_update(rq1, rq2);
@@ -2716,7 +2716,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	BUG_ON(rq1 != rq2);
+	WARN_ON(rq1 != rq2);
 	raw_spin_rq_unlock(rq1);
 	__release(rq2->lock);
 }



* Re: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON()
  2022-08-11  7:13       ` [PATCH] sched/all: Change BUG_ON() instances to WARN_ON() Ingo Molnar
@ 2022-08-11 20:43         ` Linus Torvalds
  2022-08-11 21:28           ` Matthew Wilcox
  2022-08-12  9:29           ` [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() Ingo Molnar
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2022-08-11 20:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman

On Thu, Aug 11, 2022 at 12:13 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> By using a WARN_ON() we at least give the user a chance to report
> any bugs triggered here - instead of getting silent hangs.
>
> None of these WARN_ON()s are supposed to trigger, ever - so we ignore
> cases where a NULL check is done via a BUG_ON() and we let a NULL
> pointer through after a WARN_ON().

May I suggest going one step further, and making these WARN_ON_ONCE() instead.

From personal experience, once some scheduler bug (or task struct
corruption) happens, it often *keeps* happening, and the logs just
fill up with more and more data, to the point where you lose sight of
the original report (and the machine can even get unusable just from
the logging).

WARN_ON_ONCE() can help that situation.

Now, obviously

 (a) WARN_ON_ONCE *can* also result in less information, and maybe
there are situations where having more - possibly different - cases of
the same thing triggering could be useful.

 (b) WARN_ON_ONCE historically generated a bit bigger code than
WARN_ON simply due to the extra "did this already trigger" check.

I *think* (b) is no longer true, and it's just a flag these days, but
I didn't actually check.
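
[ For reference, roughly what the two shapes look like -- written from
  memory, so double-check against include/asm-generic/bug.h:

	/* generic fallback: a per-site static "already warned" flag */
	#define WARN_ON_ONCE(condition) ({			\
		static bool __section(".data.once") __warned;	\
		int __ret_warn_once = !!(condition);		\
								\
		if (unlikely(__ret_warn_once && !__warned)) {	\
			__warned = true;			\
			WARN_ON(1);				\
		}						\
		unlikely(__ret_warn_once);			\
	})

	/* with __WARN_FLAGS() (e.g. x86): no extra inline state, just a
	   BUGFLAG_ONCE flag in the bug table entry */
	#define WARN_ON_ONCE(condition) ({				\
		int __ret_warn_on = !!(condition);			\
		if (unlikely(__ret_warn_on))				\
			__WARN_FLAGS(BUGFLAG_ONCE |			\
				     BUGFLAG_TAINT(TAINT_WARN));	\
		unlikely(__ret_warn_on);				\
	})
]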

so it's not like there aren't potential downsides, but in general I
think the sanest and most natural thing is to have BUG_ON() translate
to WARN_ON_ONCE().

For the "reboot-on-warn" people, it ends up being the same thing. And
for the rest of us, the "give me *one* warning" can end up making the
reporting a lot easier.

Obviously, with the "this never actually happens", the whole "once or
many times" is kind of moot. But if it never happens at all, to the
point where it doesn't even add a chance of helping debugging, maybe
the whole test should be removed entirely...

           Linus



* Re: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON()
  2022-08-11 20:43         ` Linus Torvalds
@ 2022-08-11 21:28           ` Matthew Wilcox
  2022-08-11 23:22             ` Jason Gunthorpe
  2022-08-12  9:29           ` [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() Ingo Molnar
  1 sibling, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2022-08-11 21:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Hildenbrand, linux-kernel, linux-mm, stable,
	Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen, Peter Xu,
	Hugh Dickins, Andrea Arcangeli, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman

On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote:
> May I suggest going one step further, and making these WARN_ON_ONCE() instead.
> 
> From personal experience, once some scheduler bug (or task struct
> corruption) happens, it often *keeps* happening, and the logs just
> fill up with more and more data, to the point where you lose sight of
> the original report (and the machine can even get unusable just from
> the logging).

I've been thinking about magically turning all the WARN_ON_ONCE() into
(effectively) WARN_ON_RATELIMIT().  I had some patches in that direction
a while ago but never got round to tidying them up for submission.



* Re: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON()
  2022-08-11 21:28           ` Matthew Wilcox
@ 2022-08-11 23:22             ` Jason Gunthorpe
  2022-08-14  1:10               ` John Hubbard
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2022-08-11 23:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Ingo Molnar, David Hildenbrand, linux-kernel,
	linux-mm, stable, Andrew Morton, Greg Kroah-Hartman,
	Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli,
	Vlastimil Babka, John Hubbard, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman

On Thu, Aug 11, 2022 at 10:28:27PM +0100, Matthew Wilcox wrote:
> On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote:
> > May I suggest going one step further, and making these WARN_ON_ONCE() instead.
> > 
> > From personal experience, once some scheduler bug (or task struct
> > corruption) happens, it often *keeps* happening, and the logs just
> > fill up with more and more data, to the point where you lose sight of
> > the original report (and the machine can even get unusable just from
> > the logging).
> 
> I've been thinking about magically turning all the WARN_ON_ONCE() into
> (effectively) WARN_ON_RATELIMIT().  I had some patches in that direction
> a while ago but never got round to tidying them up for submission.

I often wonder if we have a justification for WARN_ON to even exist; I
see a lot of pressure to make things into WARN_ON_ONCE based on the
logic that spamming makes it useless.

Maybe a global limit of 10 WARN_ONs per minute or something would be
interesting?
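
A minimal sketch of that idea with the existing ratelimit helpers (the
macro name and the exact budget are made up for illustration):

	#include <linux/ratelimit.h>

	/* hypothetical: one global budget of 10 warnings per minute */
	static DEFINE_RATELIMIT_STATE(global_warn_rs, 60 * HZ, 10);

	#define WARN_ON_GLOBAL_RATELIMIT(cond) \
		WARN_ON((cond) && __ratelimit(&global_warn_rs))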

Jason



* [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
  2022-08-11 20:43         ` Linus Torvalds
  2022-08-11 21:28           ` Matthew Wilcox
@ 2022-08-12  9:29           ` Ingo Molnar
       [not found]             ` <20220815144143.zjsiamw5y22bvgki@suse.de>
  1 sibling, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2022-08-12  9:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, linux-kernel, linux-mm, stable, Andrew Morton,
	Greg Kroah-Hartman, Axel Rasmussen, Peter Xu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Jason Gunthorpe, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Aug 11, 2022 at 12:13 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> > By using a WARN_ON() we at least give the user a chance to report
> > any bugs triggered here - instead of getting silent hangs.
> >
> > None of these WARN_ON()s are supposed to trigger, ever - so we ignore
> > cases where a NULL check is done via a BUG_ON() and we let a NULL
> > pointer through after a WARN_ON().
> 
> May I suggest going one step further, and making these WARN_ON_ONCE() instead.

Ok, agreed.

> From personal experience, once some scheduler bug (or task struct 
> corruption) happens, it often *keeps* happening, and the logs just fill 
> up with more and more data, to the point where you lose sight of the 
> original report (and the machine can even get unusable just from the 
> logging).
> 
> WARN_ON_ONCE() can help that situation.

Yeah, true.

> Now, obviously
> 
>  (a) WARN_ON_ONCE *can* also result in less information, and maybe there 
> are situations where having more - possibly different - cases of the same 
> thing triggering could be useful.

None of these warnings are supposed to happen absent serious random data 
corruption, ever™. So if, against expectations, one does trigger somewhere 
and the brevity of a single warning makes it hard to figure out, we can 
still reconsider on a case-by-case basis & increase verbosity.

>  (b) WARN_ON_ONCE historically generated a bit bigger code than
> WARN_ON simply due to the extra "did this already trigger" check.
> 
> I *think* (b) is no longer true, and it's just a flag these days, but I 
> didn't actually check.

Yeah, so on architectures that implement a smart __WARN_FLAGS() primitive, 
such as x86, overhead is pretty good:

#define __WARN_FLAGS(flags)                                     \
do {                                                            \
        __auto_type __flags = BUGFLAG_WARNING|(flags);          \
        instrumentation_begin();                                \
        _BUG_FLAGS(ASM_UD2, __flags, ASM_REACHABLE);            \
        instrumentation_end();                                  \
} while (0)

For them the runtime overhead of a WARN_ON_ONCE() is basically just the 
check itself:

# no checks:

ffffffff810a3ae0 <check_preempt_curr>:
ffffffff810a3ae0:       48 8b 87 20 09 00 00    mov    0x920(%rdi),%rax
ffffffff810a3ae7:       48 8b 8e 90 02 00 00    mov    0x290(%rsi),%rcx
ffffffff810a3aee:       53                      push   %rbx
ffffffff810a3aef:       48 89 fb                mov    %rdi,%rbx
ffffffff810a3af2:       48 3b 88 90 02 00 00    cmp    0x290(%rax),%rcx

# Single-branch WARN_ON_ONCE():
#
#  void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
#  {
# +       WARN_ON_ONCE(!rq);
# +
#         if (p->sched_class == rq->curr->sched_class)
#                 rq->curr->sched_class->check_preempt_curr(rq, p, flags);
#         else if (sched_class_above(p->sched_class, rq->curr->sched_class))

ffffffff810a3ae0 <check_preempt_curr>:
ffffffff810a3ae0:       53                      push   %rbx
ffffffff810a3ae1:       48 89 fb                mov    %rdi,%rbx
ffffffff810a3ae4:  |    48 85 ff                test   %rdi,%rdi
ffffffff810a3ae7:  |    74 69                   je     ffffffff810a3b52 <check_preempt_curr+0x72>
ffffffff810a3ae9:       48 8b 83 20 09 00 00    mov    0x920(%rbx),%rax
ffffffff810a3af0:       48 8b 8e 90 02 00 00    mov    0x290(%rsi),%rcx
ffffffff810a3af7:       48 3b 88 90 02 00 00    cmp    0x290(%rax),%rcx
...
ffffffff810a3b50:       eb d1                   jmp    ffffffff810a3b23 <check_preempt_curr+0x43>    # tail-call
ffffffff810a3b52:  |    0f 0b                   ud2    

So it's a test instruction and an unlikely-forward-branch, plus a UD2 trap 
squeezed into the function epilogue, often hidden by alignment holes.

As lightweight as it gets, and no in-line penalty from being a 'once' 
warning.

> so it's not like there aren't potential downsides, but in general I think 
> the sanest and most natural thing is to have BUG_ON() translate to 
> WARN_ON_ONCE().
>
> For the "reboot-on-warn" people, it ends up being the same thing. And for 
> the rest of us, the "give me *one* warning" can end up making the 
> reporting a lot easier.

Agreed - updated v2 patch attached.

> Obviously, with the "this never actually happens", the whole "once or 
> many times" is kind of moot. But if it never happens at all, to the point 
> where it doesn't even add a chance of helping debugging, maybe the whole 
> test should be removed entirely...

Yeah.

Thanks,

	Ingo

========================================

From: Ingo Molnar <mingo@kernel.org>
Date: Thu, 11 Aug 2022 08:54:52 +0200
Subject: [PATCH] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()

There's no good reason to crash a user's system with a BUG_ON():
chances are high that they'll never even see the crash message on
Xorg, and it won't make it into the syslog either.

By using a WARN_ON_ONCE() we at least give the user a chance to report
any bugs triggered here - instead of getting silent hangs.

None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
cases where a NULL check is done via a BUG_ON() and we let a NULL
pointer through after a WARN_ON_ONCE().

There's one exception: WARN_ON_ONCE() arguments with side-effects,
such as locking - in this case we use the return value of the
WARN_ON_ONCE(), such as in:

 -       BUG_ON(!lock_task_sighand(p, &flags));
 +       if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
 +               return;

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
---
 kernel/sched/autogroup.c |  3 ++-
 kernel/sched/core.c      |  2 +-
 kernel/sched/cpupri.c    |  2 +-
 kernel/sched/deadline.c  | 26 +++++++++++++-------------
 kernel/sched/fair.c      | 10 +++++-----
 kernel/sched/rt.c        |  2 +-
 kernel/sched/sched.h     |  6 +++---
 7 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index 4ebaf97f7bd8..991fc9002535 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -161,7 +161,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
 	struct task_struct *t;
 	unsigned long flags;
 
-	BUG_ON(!lock_task_sighand(p, &flags));
+	if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
+		return;
 
 	prev = p->signal->autogroup;
 	if (prev == ag) {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..813687a5f5cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2328,7 +2328,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 	rq = cpu_rq(new_cpu);
 
 	rq_lock(rq, rf);
-	BUG_ON(task_cpu(p) != new_cpu);
+	WARN_ON_ONCE(task_cpu(p) != new_cpu);
 	activate_task(rq, p, 0);
 	check_preempt_curr(rq, p, 0);
 
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index fa9ce9d83683..a286e726eb4b 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -147,7 +147,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
 	int task_pri = convert_prio(p->prio);
 	int idx, cpu;
 
-	BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);
+	WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES);
 
 	for (idx = 0; idx < task_pri; idx++) {
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0ab79d819a0d..962b169b05cf 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -310,7 +310,7 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 {
 	struct rq *rq;
 
-	BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV);
+	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
 
 	if (task_on_rq_queued(p))
 		return;
@@ -607,7 +607,7 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *leftmost;
 
-	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+	WARN_ON_ONCE(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
 
 	leftmost = rb_add_cached(&p->pushable_dl_tasks,
 				 &rq->dl.pushable_dl_tasks_root,
@@ -684,7 +684,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 			 * Failed to find any suitable CPU.
 			 * The task will never come back!
 			 */
-			BUG_ON(dl_bandwidth_enabled());
+			WARN_ON_ONCE(dl_bandwidth_enabled());
 
 			/*
 			 * If admission control is disabled we
@@ -830,7 +830,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	BUG_ON(pi_of(dl_se)->dl_runtime <= 0);
+	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
 
 	/*
 	 * This could be the case for a !-dl task that is boosted.
@@ -1616,7 +1616,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 
-	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
 
 	rb_add_cached(&dl_se->rb_node, &dl_rq->root, __dl_less);
 
@@ -1640,7 +1640,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 static void
 enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
-	BUG_ON(on_dl_rq(dl_se));
+	WARN_ON_ONCE(on_dl_rq(dl_se));
 
 	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
 
@@ -2017,7 +2017,7 @@ static struct task_struct *pick_task_dl(struct rq *rq)
 		return NULL;
 
 	dl_se = pick_next_dl_entity(dl_rq);
-	BUG_ON(!dl_se);
+	WARN_ON_ONCE(!dl_se);
 	p = dl_task_of(dl_se);
 
 	return p;
@@ -2277,12 +2277,12 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 
 	p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
 
-	BUG_ON(rq->cpu != task_cpu(p));
-	BUG_ON(task_current(rq, p));
-	BUG_ON(p->nr_cpus_allowed <= 1);
+	WARN_ON_ONCE(rq->cpu != task_cpu(p));
+	WARN_ON_ONCE(task_current(rq, p));
+	WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
 
-	BUG_ON(!task_on_rq_queued(p));
-	BUG_ON(!dl_task(p));
+	WARN_ON_ONCE(!task_on_rq_queued(p));
+	WARN_ON_ONCE(!dl_task(p));
 
 	return p;
 }
@@ -2492,7 +2492,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 	struct root_domain *src_rd;
 	struct rq *rq;
 
-	BUG_ON(!dl_task(p));
+	WARN_ON_ONCE(!dl_task(p));
 
 	rq = task_rq(p);
 	src_rd = rq->rd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 914096c5b1ae..28f10dccd194 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2600,7 +2600,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 	if (!join)
 		return;
 
-	BUG_ON(irqs_disabled());
+	WARN_ON_ONCE(irqs_disabled());
 	double_lock_irq(&my_grp->lock, &grp->lock);
 
 	for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
@@ -7279,7 +7279,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 	find_matching_se(&se, &pse);
-	BUG_ON(!pse);
+	WARN_ON_ONCE(!pse);
 
 	cse_is_idle = se_is_idle(se);
 	pse_is_idle = se_is_idle(pse);
@@ -8159,7 +8159,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 {
 	lockdep_assert_rq_held(rq);
 
-	BUG_ON(task_rq(p) != rq);
+	WARN_ON_ONCE(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	check_preempt_curr(rq, p, 0);
 }
@@ -10134,7 +10134,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		goto out_balanced;
 	}
 
-	BUG_ON(busiest == env.dst_rq);
+	WARN_ON_ONCE(busiest == env.dst_rq);
 
 	schedstat_add(sd->lb_imbalance[idle], env.imbalance);
 
@@ -10430,7 +10430,7 @@ static int active_load_balance_cpu_stop(void *data)
 	 * we need to fix it. Originally reported by
 	 * Bjorn Helgaas on a 128-CPU setup.
 	 */
-	BUG_ON(busiest_rq == target_rq);
+	WARN_ON_ONCE(busiest_rq == target_rq);
 
 	/* Search for an sd spanning us and the target CPU. */
 	rcu_read_lock();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 55f39c8f4203..2936fe55cef7 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -843,7 +843,7 @@ static void __disable_runtime(struct rq *rq)
 		 * We cannot be left wanting - that would mean some runtime
 		 * leaked out of the system.
 		 */
-		BUG_ON(want);
+		WARN_ON_ONCE(want);
 balanced:
 		/*
 		 * Disable all the borrow logic by pretending we have inf
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..7a44dceeb50a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2709,8 +2709,8 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	__acquires(rq1->lock)
 	__acquires(rq2->lock)
 {
-	BUG_ON(!irqs_disabled());
-	BUG_ON(rq1 != rq2);
+	WARN_ON_ONCE(!irqs_disabled());
+	WARN_ON_ONCE(rq1 != rq2);
 	raw_spin_rq_lock(rq1);
 	__acquire(rq2->lock);	/* Fake it out ;) */
 	double_rq_clock_clear_update(rq1, rq2);
@@ -2726,7 +2726,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	BUG_ON(rq1 != rq2);
+	WARN_ON_ONCE(rq1 != rq2);
 	raw_spin_rq_unlock(rq1);
 	__release(rq2->lock);
 }



* Re: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON()
  2022-08-11 23:22             ` Jason Gunthorpe
@ 2022-08-14  1:10               ` John Hubbard
  0 siblings, 0 replies; 33+ messages in thread
From: John Hubbard @ 2022-08-14  1:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Wilcox
  Cc: Linus Torvalds, Ingo Molnar, David Hildenbrand, linux-kernel,
	linux-mm, stable, Andrew Morton, Greg Kroah-Hartman,
	Axel Rasmussen, Peter Xu, Hugh Dickins, Andrea Arcangeli,
	Vlastimil Babka, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman

On 8/11/22 16:22, Jason Gunthorpe wrote:
> On Thu, Aug 11, 2022 at 10:28:27PM +0100, Matthew Wilcox wrote:
>> On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote:
>>> May I suggest going one step further, and making these WARN_ON_ONCE() instead.
>>>
>>> From personal experience, once some scheduler bug (or task struct
>>> corruption) happens, it often *keeps* happening, and the logs just
>>> fill up with more and more data, to the point where you lose sight of
>>> the original report (and the machine can even get unusable just from
>>> the logging).
>>
>> I've been thinking about magically turning all the WARN_ON_ONCE() into
>> (effectively) WARN_ON_RATELIMIT().  I had some patches in that direction
>> a while ago but never got round to tidying them up for submission.

If you do that, I'd like to suggest that you avoid using magic here, but
instead just rename at the call sites.

Because:

First and foremost, something named WARN_ON_ONCE() clearly has a solemn
responsibility to warn exactly "once times"! :)

Second, it's not yet clear (or is it?) that WARN_ON_ONCE() is always
worse than rate limiting. It's a trade-off, rather than a clear win for
either case, in my experience. The _ONCE variant can get overwritten
if the kernel log wraps, but the _RATELIMIT on the other hand, may be
excessive.

And finally, if it *is* agreed on here that WARN_ON_RATELIMIT() is
always better than WARN_ON_ONCE(), then there is still no harm in
spending a patch or two (coccinelle...) to rename WARN_ON_ONCE() -->
WARN_ON_RATELIMIT(), so that we end up with accurate names.

> 
> I often wonder if we have a justification for WARN_ON to even exist; I
> see a lot of pressure to make things into WARN_ON_ONCE based on the
> logic that spamming makes it useless.

Agreed. WARN_ON_ONCE() or WARN_ON_RATELIMIT(), take your pick. But not
WARN_ON_EVERY_TIME()--that usually causes serious problems in the
logs.


thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
       [not found]             ` <20220815144143.zjsiamw5y22bvgki@suse.de>
@ 2022-08-15 22:12               ` John Hubbard
  2022-08-21 11:28               ` Ingo Molnar
  1 sibling, 0 replies; 33+ messages in thread
From: John Hubbard @ 2022-08-15 22:12 UTC (permalink / raw)
  To: Mel Gorman, Ingo Molnar
  Cc: Linus Torvalds, David Hildenbrand, linux-kernel, linux-mm,
	stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen,
	Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox,
	Vlastimil Babka, Jason Gunthorpe, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt

On 8/15/22 07:41, Mel Gorman wrote:
> For the rest, I didn't see obvious recovery paths that would allow the
> system to run predictably. Any of them firing will have unpredictable
> consequences (e.g. move_queued_task firing would be fun if it was a per-cpu
> kthread). Depending on which warning triggers, the remaining life of the
> system may be very short, but maybe long enough to be logged even if the
> system locks up shortly afterwards.

Hi Mel,

Are you basically saying, "WARN_ON*() seems acceptable in mm/, because
we can at least get the problem logged before it locks up, probably"?

Or are you implying that we should instead introduce and use some new
PANIC_ON() or VM_PANIC_ON() set of macros that would allow proper "log
then reboot" behavior?


thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
       [not found]             ` <20220815144143.zjsiamw5y22bvgki@suse.de>
  2022-08-15 22:12               ` John Hubbard
@ 2022-08-21 11:28               ` Ingo Molnar
  1 sibling, 0 replies; 33+ messages in thread
From: Ingo Molnar @ 2022-08-21 11:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, David Hildenbrand, linux-kernel, linux-mm,
	stable, Andrew Morton, Greg Kroah-Hartman, Axel Rasmussen,
	Peter Xu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox,
	Vlastimil Babka, John Hubbard, Jason Gunthorpe, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt


* Mel Gorman <mgorman@suse.de> wrote:

> For the rest, I didn't see obvious recovery paths that would allow the 
> system to run predictably. Any of them firing will have unpredictable 
> consequences (e.g. move_queued_task firing would be fun if it was a 
> per-cpu kthread). Depending on which warning triggers, the remaining life 
> of the system may be very short, but maybe long enough to be logged even 
> if the system locks up shortly afterwards.

Correct. I'd prefer to keep all these warnings 'simple' - i.e. no attempted 
recovery & control flow, unless we ever expect these to trigger.

I.e. instead of adding a 'goto', I'd prefer to remove most of the ones 
you highlighted. But I wanted to keep this first patch simple.

Thanks,

	Ingo


