linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
@ 2021-04-29 12:25 David Hildenbrand
  2021-04-29 12:25 ` [PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER David Hildenbrand
                   ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-04-29 12:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

This is (obviously) for v5.13++; no need to rush with review, but I decided
to send it around right away.

Looking for places where the kernel might unconditionally read
PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
needs some more love to not touch some other pages we really don't want to
read -- i.e., hwpoisoned.

Examples for PageOffline() pages are pages inflated in a balloon,
memory unplugged via virtio-mem, and partially-present sections in
memory added by the Hyper-V balloon.

When reading pages inflated in a balloon, we essentially produce
unnecessary load in the hypervisor; holes in partially present sections in
case of Hyper-V are not accessible and already were a problem for
/proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In
the future, virtio-mem might disallow reading unplugged memory -- marked
as PageOffline() -- in some environments, resulting in undefined behavior
when accessed; therefore, I'm trying to identify and rework all these
(corner) cases.

With this series, there is really only access via /dev/mem, /proc/vmcore
and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case
use case -- we won't care for now if someone explicitly tries to do nasty
things by reading from/writing to physical addresses we better not touch.
/dev/mem is a use case we won't support for virtio-mem, at least for now,
so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
/proc/vmcore is really only a problem when dumping the old kernel via
something that's not makedumpfile (read: basically never), however, we'll
try sanitizing that as well in the second kernel in the future.

Tested via kcore_dump:
	https://github.com/schlafwandler/kcore_dump

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: linux-hyperv@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org

David Hildenbrand (7):
  fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
  fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
  mm: rename and move page_is_poisoned()
  fs/proc/kcore: don't read offline sections, logically offline pages
    and hwpoisoned pages
  mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize
    setting PageOffline()
  virtio-mem: use page_offline_(start|end) when setting PageOffline()
  fs/proc/kcore: use page_offline_(freeze|unfreeze)

 drivers/virtio/virtio_mem.c |  2 ++
 fs/proc/kcore.c             | 69 ++++++++++++++++++++++++++++++-------
 include/linux/kcore.h       |  3 --
 include/linux/page-flags.h  | 12 +++++++
 mm/gup.c                    |  6 +++-
 mm/internal.h               | 20 -----------
 mm/util.c                   | 40 +++++++++++++++++++++
 7 files changed, 115 insertions(+), 37 deletions(-)


base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
-- 
2.30.2


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
  2021-04-29 12:25 [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
@ 2021-04-29 12:25 ` David Hildenbrand
  2021-05-02  6:31   ` Mike Rapoport
  2021-04-29 12:25 ` [PATCH v1 2/7] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM David Hildenbrand
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-04-29 12:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()")
removed the last user of KCORE_REMAP.

Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
dumping vsyscall user page") removed the last user of KCORE_OTHER.

Let's drop both types. While at it, also drop vaddr in "struct
kcore_list", used by KCORE_REMAP only.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c       | 7 ++-----
 include/linux/kcore.h | 3 ---
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 4d2e64e9016c..09f77d3c6e15 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -380,11 +380,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			phdr->p_type = PT_LOAD;
 			phdr->p_flags = PF_R | PF_W | PF_X;
 			phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
-			if (m->type == KCORE_REMAP)
-				phdr->p_vaddr = (size_t)m->vaddr;
-			else
-				phdr->p_vaddr = (size_t)m->addr;
-			if (m->type == KCORE_RAM || m->type == KCORE_REMAP)
+			phdr->p_vaddr = (size_t)m->addr;
+			if (m->type == KCORE_RAM)
 				phdr->p_paddr = __pa(m->addr);
 			else if (m->type == KCORE_TEXT)
 				phdr->p_paddr = __pa_symbol(m->addr);
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index da676cdbd727..86c0f1d18998 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -11,14 +11,11 @@ enum kcore_type {
 	KCORE_RAM,
 	KCORE_VMEMMAP,
 	KCORE_USER,
-	KCORE_OTHER,
-	KCORE_REMAP,
 };
 
 struct kcore_list {
 	struct list_head list;
 	unsigned long addr;
-	unsigned long vaddr;
 	size_t size;
 	int type;
 };
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 2/7] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
  2021-04-29 12:25 [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
  2021-04-29 12:25 ` [PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER David Hildenbrand
@ 2021-04-29 12:25 ` David Hildenbrand
  2021-05-02  6:31   ` Mike Rapoport
  2021-04-29 12:25 ` [PATCH v1 3/7] mm: rename and move page_is_poisoned() David Hildenbrand
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-04-29 12:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

Let's resturcture the code, using switch-case, and checking pfn_is_ram()
only when we are dealing with KCORE_RAM.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 09f77d3c6e15..ed6fbb3bd50c 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -483,25 +483,36 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 				goto out;
 			}
 			m = NULL;	/* skip the list anchor */
-		} else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
-			if (clear_user(buffer, tsz)) {
-				ret = -EFAULT;
-				goto out;
-			}
-		} else if (m->type == KCORE_VMALLOC) {
+			goto skip;
+		}
+
+		switch (m->type) {
+		case KCORE_VMALLOC:
 			vread(buf, (char *)start, tsz);
 			/* we have to zero-fill user buffer even if no read */
 			if (copy_to_user(buffer, buf, tsz)) {
 				ret = -EFAULT;
 				goto out;
 			}
-		} else if (m->type == KCORE_USER) {
+			break;
+		case KCORE_USER:
 			/* User page is handled prior to normal kernel page: */
 			if (copy_to_user(buffer, (char *)start, tsz)) {
 				ret = -EFAULT;
 				goto out;
 			}
-		} else {
+			break;
+		case KCORE_RAM:
+			if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+				if (clear_user(buffer, tsz)) {
+					ret = -EFAULT;
+					goto out;
+				}
+				break;
+			}
+			fallthrough;
+		case KCORE_VMEMMAP:
+		case KCORE_TEXT:
 			if (kern_addr_valid(start)) {
 				/*
 				 * Using bounce buffer to bypass the
@@ -525,7 +536,15 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 					goto out;
 				}
 			}
+			break;
+		default:
+			pr_warn_once("Unhandled KCORE type: %d\n", m->type);
+			if (clear_user(buffer, tsz)) {
+				ret = -EFAULT;
+				goto out;
+			}
 		}
+skip:
 		buflen -= tsz;
 		*fpos += tsz;
 		buffer += tsz;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-04-29 12:25 [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
  2021-04-29 12:25 ` [PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER David Hildenbrand
  2021-04-29 12:25 ` [PATCH v1 2/7] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM David Hildenbrand
@ 2021-04-29 12:25 ` David Hildenbrand
  2021-05-02  6:32   ` Mike Rapoport
  2021-05-05 13:13   ` Michal Hocko
  2021-04-29 12:25 ` [PATCH v1 4/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-04-29 12:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
introduced page_is_poisoned(), however, v5 [1] of the patch used
"page_is_hwpoison()" and something went wrong while upstreaming. Rename the
function and move it to page-flags.h, from where it can be used in other
-- kcore -- context.

Move the comment to the place where it belongs and simplify.

[1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h |  7 +++++++
 mm/gup.c                   |  6 +++++-
 mm/internal.h              | 20 --------------------
 3 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 04a34c08e0a6..b8c56672a588 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -694,6 +694,13 @@ PAGEFLAG_FALSE(DoubleMap)
 	TESTSCFLAG_FALSE(DoubleMap)
 #endif
 
+static inline bool is_page_hwpoison(struct page *page)
+{
+	if (PageHWPoison(page))
+		return true;
+	return PageHuge(page) && PageHWPoison(compound_head(page));
+}
+
 /*
  * For pages that are never mapped to userspace (and aren't PageSlab),
  * page_type may be used.  Because it is initialised to -1, we invert the
diff --git a/mm/gup.c b/mm/gup.c
index ef7d2da9f03f..000f3303e7f2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,7 +1536,11 @@ struct page *get_dump_page(unsigned long addr)
 	if (locked)
 		mmap_read_unlock(mm);
 
-	if (ret == 1 && is_page_poisoned(page))
+	/*
+	 * We might have hwpoisoned pages still mapped into user space. Don't
+	 * read these pages when creating a coredump, access could be fatal.
+	 */
+	if (ret == 1 && is_page_hwpoison(page))
 		return NULL;
 
 	return (ret == 1) ? page : NULL;
diff --git a/mm/internal.h b/mm/internal.h
index cb3c5e0a7799..1432feec62df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,26 +97,6 @@ static inline void set_page_refcounted(struct page *page)
 	set_page_count(page, 1);
 }
 
-/*
- * When kernel touch the user page, the user page may be have been marked
- * poison but still mapped in user space, if without this page, the kernel
- * can guarantee the data integrity and operation success, the kernel is
- * better to check the posion status and avoid touching it, be good not to
- * panic, coredump for process fatal signal is a sample case matching this
- * scenario. Or if kernel can't guarantee the data integrity, it's better
- * not to call this function, let kernel touch the poison page and get to
- * panic.
- */
-static inline bool is_page_poisoned(struct page *page)
-{
-	if (PageHWPoison(page))
-		return true;
-	else if (PageHuge(page) && PageHWPoison(compound_head(page)))
-		return true;
-
-	return false;
-}
-
 extern unsigned long highest_memmap_pfn;
 
 /*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 4/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
  2021-04-29 12:25 [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-04-29 12:25 ` [PATCH v1 3/7] mm: rename and move page_is_poisoned() David Hildenbrand
@ 2021-04-29 12:25 ` David Hildenbrand
  2021-05-02  6:32   ` Mike Rapoport
  2021-04-29 12:25 ` [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline() David Hildenbrand
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-04-29 12:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

Let's avoid reading:

1) Offline memory sections: the content of offline memory sections is stale
   as the memory is effectively unused by the kernel. On s390x with standby
   memory, offline memory sections (belonging to offline storage
   increments) are not accessible. With virtio-mem and the hyper-v balloon,
   we can have unavailable memory chunks that should not be accessed inside
   offline memory sections. Last but not least, offline memory sections
   might contain hwpoisoned pages which we can no longer identify
   because the memmap is stale.

2) PG_offline pages: logically offline pages that are documented as
   "The content of these pages is effectively stale. Such pages should not
    be touched (read/write/dump/save) except by their owner.".
   Examples include pages inflated in a balloon or unavailble memory
   ranges inside hotplugged memory sections with virtio-mem or the hyper-v
   balloon.

3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
   As documented: "Accessing is not safe since it may cause another machine
   check. Don't touch!"

Reading /proc/kcore now performs similar checks as when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
It's also similar to hibernation code, however, we don't skip hwpoisoned
pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.

Note 1: we can race against memory offlining code, especially
memory going offline and getting unplugged: however, we will properly tear
down the identity mapping and handle faults gracefully when accessing
this memory from kcore code.

Note 2: we can race against drivers setting PageOffline() and turning
memory inaccessible in the hypervisor. We'll handle this in a follow-up
patch.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ed6fbb3bd50c..92ff1e4436cb 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -465,6 +465,9 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 
 	m = NULL;
 	while (buflen) {
+		struct page *page;
+		unsigned long pfn;
+
 		/*
 		 * If this is the first iteration or the address is not within
 		 * the previous entry, search for a matching entry.
@@ -503,7 +506,16 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			}
 			break;
 		case KCORE_RAM:
-			if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+			pfn = __pa(start) >> PAGE_SHIFT;
+			page = pfn_to_online_page(pfn);
+
+			/*
+			 * Don't read offline sections, logically offline pages
+			 * (e.g., inflated in a balloon), hwpoisoned pages,
+			 * and explicitly excluded physical ranges.
+			 */
+			if (!page || PageOffline(page) ||
+			    is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
 				if (clear_user(buffer, tsz)) {
 					ret = -EFAULT;
 					goto out;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()
  2021-04-29 12:25 [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-04-29 12:25 ` [PATCH v1 4/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
@ 2021-04-29 12:25 ` David Hildenbrand
  2021-05-02  6:33   ` Mike Rapoport
  2021-05-05 13:24   ` Michal Hocko
  2021-04-29 12:25 ` [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
  2021-04-29 12:25 ` [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze) David Hildenbrand
  6 siblings, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-04-29 12:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

A driver might set a page logically offline -- PageOffline() -- and
turn the page inaccessible in the hypervisor; after that, access to page
content can be fatal. One example is virtio-mem; while unplugged memory
-- marked as PageOffline() can currently be read in the hypervisor, this
will no longer be the case in the future; for example, when having
a virtio-mem device backed by huge pages in the hypervisor.

Some special PFN walkers -- i.e., /proc/kcore -- read content of random
pages after checking PageOffline(); however, these PFN walkers can race
with drivers that set PageOffline().

Let's introduce page_offline_(begin|end|freeze|unfreeze) for
synchronizing.

page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
synchronize with such drivers, achieving that a page cannot be set
PageOffline() while frozen.

page_offline_begin()/page_offline_end() is used by drivers that care about
such races when setting a page PageOffline().

For simplicity, use a rwsem for now; neither drivers nor users are
performance sensitive.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h |  5 +++++
 mm/util.c                  | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b8c56672a588..e3d00c72f459 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -767,6 +767,11 @@ PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+extern void page_offline_freeze(void);
+extern void page_offline_unfreeze(void);
+extern void page_offline_begin(void);
+extern void page_offline_end(void);
+
 /*
  * Marks pages in use as page tables.
  */
diff --git a/mm/util.c b/mm/util.c
index 54870226cea6..95395d4e4209 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1013,3 +1013,41 @@ void mem_dump_obj(void *object)
 	}
 	pr_cont(" non-slab/vmalloc memory.\n");
 }
+
+/*
+ * A driver might set a page logically offline -- PageOffline() -- and
+ * turn the page inaccessible in the hypervisor; after that, access to page
+ * content can be fatal.
+ *
+ * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
+ * pages after checking PageOffline(); however, these PFN walkers can race
+ * with drivers that set PageOffline().
+ *
+ * page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
+ * synchronize with such drivers, achieving that a page cannot be set
+ * PageOffline() while frozen.
+ *
+ * page_offline_begin()/page_offline_end() is used by drivers that care about
+ * such races when setting a page PageOffline().
+ */
+static DECLARE_RWSEM(page_offline_rwsem);
+
+void page_offline_freeze(void)
+{
+	down_read(&page_offline_rwsem);
+}
+
+void page_offline_unfreeze(void)
+{
+	up_read(&page_offline_rwsem);
+}
+
+void page_offline_begin(void)
+{
+	down_write(&page_offline_rwsem);
+}
+
+void page_offline_end(void)
+{
+	up_write(&page_offline_rwsem);
+}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()
  2021-04-29 12:25 [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-04-29 12:25 ` [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline() David Hildenbrand
@ 2021-04-29 12:25 ` David Hildenbrand
  2021-05-02  6:33   ` Mike Rapoport
  2021-05-03  8:23   ` Michael S. Tsirkin
  2021-04-29 12:25 ` [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze) David Hildenbrand
  6 siblings, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-04-29 12:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

Let's properly use page_offline_(start|end) to synchronize setting
PageOffline(), so we won't have valid page access to unplugged memory
regions from /proc/kcore.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/virtio/virtio_mem.c | 2 ++
 mm/util.c                   | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 10ec60d81e84..dc2a2e2b2ff8 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 static void virtio_mem_set_fake_offline(unsigned long pfn,
 					unsigned long nr_pages, bool onlined)
 {
+	page_offline_begin();
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
@@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
 			ClearPageReserved(page);
 		}
 	}
+	page_offline_end();
 }
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 95395d4e4209..d0e357bd65e6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1046,8 +1046,10 @@ void page_offline_begin(void)
 {
 	down_write(&page_offline_rwsem);
 }
+EXPORT_SYMBOL(page_offline_begin);
 
 void page_offline_end(void)
 {
 	up_write(&page_offline_rwsem);
 }
+EXPORT_SYMBOL(page_offline_end);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
  2021-04-29 12:25 [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
                   ` (5 preceding siblings ...)
  2021-04-29 12:25 ` [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
@ 2021-04-29 12:25 ` David Hildenbrand
  2021-05-02  6:34   ` Mike Rapoport
  6 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-04-29 12:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

Let's properly synchronize with drivers that set PageOffline(). Unfreeze
every now and then, so drivers that want to set PageOffline() can make
progress.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 92ff1e4436cb..3d7531f47389 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
 static ssize_t
 read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 {
+	size_t page_offline_frozen = 0;
 	char *buf = file->private_data;
 	size_t phdrs_offset, notes_offset, data_offset;
 	size_t phdrs_len, notes_len;
@@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			pfn = __pa(start) >> PAGE_SHIFT;
 			page = pfn_to_online_page(pfn);
 
+			/*
+			 * Don't race against drivers that set PageOffline()
+			 * and expect no further page access.
+			 */
+			if (page_offline_frozen == MAX_ORDER_NR_PAGES) {
+				page_offline_unfreeze();
+				page_offline_frozen = 0;
+				cond_resched();
+			}
+			if (!page_offline_frozen++)
+				page_offline_freeze();
+
 			/*
 			 * Don't read offline sections, logically offline pages
 			 * (e.g., inflated in a balloon), hwpoisoned pages,
@@ -565,6 +578,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 	}
 
 out:
+	if (page_offline_frozen)
+		page_offline_unfreeze();
 	up_read(&kclist_lock);
 	if (ret)
 		return ret;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 2/7] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
  2021-04-29 12:25 ` [PATCH v1 2/7] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM David Hildenbrand
@ 2021-05-02  6:31   ` Mike Rapoport
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Rapoport @ 2021-05-02  6:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Thu, Apr 29, 2021 at 02:25:14PM +0200, David Hildenbrand wrote:
> Let's resturcture the code, using switch-case, and checking pfn_is_ram()
> only when we are dealing with KCORE_RAM.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  fs/proc/kcore.c | 35 +++++++++++++++++++++++++++--------
>  1 file changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 09f77d3c6e15..ed6fbb3bd50c 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -483,25 +483,36 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  				goto out;
>  			}
>  			m = NULL;	/* skip the list anchor */
> -		} else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
> -			if (clear_user(buffer, tsz)) {
> -				ret = -EFAULT;
> -				goto out;
> -			}
> -		} else if (m->type == KCORE_VMALLOC) {
> +			goto skip;
> +		}
> +
> +		switch (m->type) {
> +		case KCORE_VMALLOC:
>  			vread(buf, (char *)start, tsz);
>  			/* we have to zero-fill user buffer even if no read */
>  			if (copy_to_user(buffer, buf, tsz)) {
>  				ret = -EFAULT;
>  				goto out;
>  			}
> -		} else if (m->type == KCORE_USER) {
> +			break;
> +		case KCORE_USER:
>  			/* User page is handled prior to normal kernel page: */
>  			if (copy_to_user(buffer, (char *)start, tsz)) {
>  				ret = -EFAULT;
>  				goto out;
>  			}
> -		} else {
> +			break;
> +		case KCORE_RAM:
> +			if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
> +				if (clear_user(buffer, tsz)) {
> +					ret = -EFAULT;
> +					goto out;
> +				}
> +				break;
> +			}
> +			fallthrough;
> +		case KCORE_VMEMMAP:
> +		case KCORE_TEXT:
>  			if (kern_addr_valid(start)) {
>  				/*
>  				 * Using bounce buffer to bypass the
> @@ -525,7 +536,15 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  					goto out;
>  				}
>  			}
> +			break;
> +		default:
> +			pr_warn_once("Unhandled KCORE type: %d\n", m->type);
> +			if (clear_user(buffer, tsz)) {
> +				ret = -EFAULT;
> +				goto out;
> +			}
>  		}
> +skip:
>  		buflen -= tsz;
>  		*fpos += tsz;
>  		buffer += tsz;
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
  2021-04-29 12:25 ` [PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER David Hildenbrand
@ 2021-05-02  6:31   ` Mike Rapoport
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Rapoport @ 2021-05-02  6:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Thu, Apr 29, 2021 at 02:25:13PM +0200, David Hildenbrand wrote:
> Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()")
> removed the last user of KCORE_REMAP.
> 
> Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
> dumping vsyscall user page") removed the last user of KCORE_OTHER.
> 
> Let's drop both types. While at it, also drop vaddr in "struct
> kcore_list", used by KCORE_REMAP only.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  fs/proc/kcore.c       | 7 ++-----
>  include/linux/kcore.h | 3 ---
>  2 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 4d2e64e9016c..09f77d3c6e15 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -380,11 +380,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  			phdr->p_type = PT_LOAD;
>  			phdr->p_flags = PF_R | PF_W | PF_X;
>  			phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
> -			if (m->type == KCORE_REMAP)
> -				phdr->p_vaddr = (size_t)m->vaddr;
> -			else
> -				phdr->p_vaddr = (size_t)m->addr;
> -			if (m->type == KCORE_RAM || m->type == KCORE_REMAP)
> +			phdr->p_vaddr = (size_t)m->addr;
> +			if (m->type == KCORE_RAM)
>  				phdr->p_paddr = __pa(m->addr);
>  			else if (m->type == KCORE_TEXT)
>  				phdr->p_paddr = __pa_symbol(m->addr);
> diff --git a/include/linux/kcore.h b/include/linux/kcore.h
> index da676cdbd727..86c0f1d18998 100644
> --- a/include/linux/kcore.h
> +++ b/include/linux/kcore.h
> @@ -11,14 +11,11 @@ enum kcore_type {
>  	KCORE_RAM,
>  	KCORE_VMEMMAP,
>  	KCORE_USER,
> -	KCORE_OTHER,
> -	KCORE_REMAP,
>  };
>  
>  struct kcore_list {
>  	struct list_head list;
>  	unsigned long addr;
> -	unsigned long vaddr;
>  	size_t size;
>  	int type;
>  };
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-04-29 12:25 ` [PATCH v1 3/7] mm: rename and move page_is_poisoned() David Hildenbrand
@ 2021-05-02  6:32   ` Mike Rapoport
  2021-05-05 13:13   ` Michal Hocko
  1 sibling, 0 replies; 37+ messages in thread
From: Mike Rapoport @ 2021-05-02  6:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Thu, Apr 29, 2021 at 02:25:15PM +0200, David Hildenbrand wrote:
> Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> introduced page_is_poisoned(), however, v5 [1] of the patch used
> "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> function and move it to page-flags.h, from where it can be used in other
> -- kcore -- context.
> 
> Move the comment to the place where it belongs and simplify.
> 
> [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  include/linux/page-flags.h |  7 +++++++
>  mm/gup.c                   |  6 +++++-
>  mm/internal.h              | 20 --------------------
>  3 files changed, 12 insertions(+), 21 deletions(-)

Nice :)

> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 04a34c08e0a6..b8c56672a588 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -694,6 +694,13 @@ PAGEFLAG_FALSE(DoubleMap)
>  	TESTSCFLAG_FALSE(DoubleMap)
>  #endif
>  
> +static inline bool is_page_hwpoison(struct page *page)
> +{
> +	if (PageHWPoison(page))
> +		return true;
> +	return PageHuge(page) && PageHWPoison(compound_head(page));
> +}
> +
>  /*
>   * For pages that are never mapped to userspace (and aren't PageSlab),
>   * page_type may be used.  Because it is initialised to -1, we invert the
> diff --git a/mm/gup.c b/mm/gup.c
> index ef7d2da9f03f..000f3303e7f2 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1536,7 +1536,11 @@ struct page *get_dump_page(unsigned long addr)
>  	if (locked)
>  		mmap_read_unlock(mm);
>  
> -	if (ret == 1 && is_page_poisoned(page))
> +	/*
> +	 * We might have hwpoisoned pages still mapped into user space. Don't
> +	 * read these pages when creating a coredump, access could be fatal.
> +	 */
> +	if (ret == 1 && is_page_hwpoison(page))
>  		return NULL;
>  
>  	return (ret == 1) ? page : NULL;
> diff --git a/mm/internal.h b/mm/internal.h
> index cb3c5e0a7799..1432feec62df 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -97,26 +97,6 @@ static inline void set_page_refcounted(struct page *page)
>  	set_page_count(page, 1);
>  }
>  
> -/*
> - * When kernel touch the user page, the user page may be have been marked
> - * poison but still mapped in user space, if without this page, the kernel
> - * can guarantee the data integrity and operation success, the kernel is
> - * better to check the posion status and avoid touching it, be good not to
> - * panic, coredump for process fatal signal is a sample case matching this
> - * scenario. Or if kernel can't guarantee the data integrity, it's better
> - * not to call this function, let kernel touch the poison page and get to
> - * panic.
> - */
> -static inline bool is_page_poisoned(struct page *page)
> -{
> -	if (PageHWPoison(page))
> -		return true;
> -	else if (PageHuge(page) && PageHWPoison(compound_head(page)))
> -		return true;
> -
> -	return false;
> -}
> -
>  extern unsigned long highest_memmap_pfn;
>  
>  /*
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 4/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
  2021-04-29 12:25 ` [PATCH v1 4/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
@ 2021-05-02  6:32   ` Mike Rapoport
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Rapoport @ 2021-05-02  6:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Thu, Apr 29, 2021 at 02:25:16PM +0200, David Hildenbrand wrote:
> Let's avoid reading:
> 
> 1) Offline memory sections: the content of offline memory sections is stale
>    as the memory is effectively unused by the kernel. On s390x with standby
>    memory, offline memory sections (belonging to offline storage
>    increments) are not accessible. With virtio-mem and the hyper-v balloon,
>    we can have unavailable memory chunks that should not be accessed inside
>    offline memory sections. Last but not least, offline memory sections
>    might contain hwpoisoned pages which we can no longer identify
>    because the memmap is stale.
> 
> 2) PG_offline pages: logically offline pages that are documented as
>    "The content of these pages is effectively stale. Such pages should not
>     be touched (read/write/dump/save) except by their owner.".
>    Examples include pages inflated in a balloon or unavailble memory
>    ranges inside hotplugged memory sections with virtio-mem or the hyper-v
>    balloon.
> 
> 3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
>    As documented: "Accessing is not safe since it may cause another machine
>    check. Don't touch!"
> 
> Reading /proc/kcore now performs similar checks as when reading
> /proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
> It's also similar to hibernation code, however, we don't skip hwpoisoned
> pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.
> 
> Note 1: we can race against memory offlining code, especially
> memory going offline and getting unplugged: however, we will properly tear
> down the identity mapping and handle faults gracefully when accessing
> this memory from kcore code.
> 
> Note 2: we can race against drivers setting PageOffline() and turning
> memory inaccessible in the hypervisor. We'll handle this in a follow-up
> patch.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  fs/proc/kcore.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index ed6fbb3bd50c..92ff1e4436cb 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -465,6 +465,9 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  
>  	m = NULL;
>  	while (buflen) {
> +		struct page *page;
> +		unsigned long pfn;
> +
>  		/*
>  		 * If this is the first iteration or the address is not within
>  		 * the previous entry, search for a matching entry.
> @@ -503,7 +506,16 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  			}
>  			break;
>  		case KCORE_RAM:
> -			if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
> +			pfn = __pa(start) >> PAGE_SHIFT;
> +			page = pfn_to_online_page(pfn);
> +
> +			/*
> +			 * Don't read offline sections, logically offline pages
> +			 * (e.g., inflated in a balloon), hwpoisoned pages,
> +			 * and explicitly excluded physical ranges.
> +			 */
> +			if (!page || PageOffline(page) ||
> +			    is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
>  				if (clear_user(buffer, tsz)) {
>  					ret = -EFAULT;
>  					goto out;
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()
  2021-04-29 12:25 ` [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline() David Hildenbrand
@ 2021-05-02  6:33   ` Mike Rapoport
  2021-05-03  8:11     ` David Hildenbrand
  2021-05-05 13:24   ` Michal Hocko
  1 sibling, 1 reply; 37+ messages in thread
From: Mike Rapoport @ 2021-05-02  6:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Thu, Apr 29, 2021 at 02:25:17PM +0200, David Hildenbrand wrote:
> A driver might set a page logically offline -- PageOffline() -- and
> turn the page inaccessible in the hypervisor; after that, access to page
> content can be fatal. One example is virtio-mem; while unplugged memory
> -- marked as PageOffline() can currently be read in the hypervisor, this
> will no longer be the case in the future; for example, when having
> a virtio-mem device backed by huge pages in the hypervisor.
> 
> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> pages after checking PageOffline(); however, these PFN walkers can race
> with drivers that set PageOffline().
> 
> Let's introduce page_offline_(begin|end|freeze|unfreeze) for

Bikeshed: freeze|thaw?

> synchronizing.
> 
> page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> synchronize with such drivers, achieving that a page cannot be set
> PageOffline() while frozen.
> 
> page_offline_begin()/page_offline_end() is used by drivers that care about
> such races when setting a page PageOffline().
> 
> For simplicity, use a rwsem for now; neither drivers nor users are
> performance sensitive.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/page-flags.h |  5 +++++
>  mm/util.c                  | 38 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index b8c56672a588..e3d00c72f459 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -767,6 +767,11 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> +extern void page_offline_freeze(void);
> +extern void page_offline_unfreeze(void);
> +extern void page_offline_begin(void);
> +extern void page_offline_end(void);
> +
>  /*
>   * Marks pages in use as page tables.
>   */
> diff --git a/mm/util.c b/mm/util.c
> index 54870226cea6..95395d4e4209 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1013,3 +1013,41 @@ void mem_dump_obj(void *object)
>  	}
>  	pr_cont(" non-slab/vmalloc memory.\n");
>  }
> +
> +/*
> + * A driver might set a page logically offline -- PageOffline() -- and
> + * turn the page inaccessible in the hypervisor; after that, access to page
> + * content can be fatal.
> + *
> + * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> + * pages after checking PageOffline(); however, these PFN walkers can race
> + * with drivers that set PageOffline().
> + *
> + * page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> + * synchronize with such drivers, achieving that a page cannot be set
> + * PageOffline() while frozen.
> + *
> + * page_offline_begin()/page_offline_end() is used by drivers that care about
> + * such races when setting a page PageOffline().
> + */
> +static DECLARE_RWSEM(page_offline_rwsem);
> +
> +void page_offline_freeze(void)
> +{
> +	down_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_unfreeze(void)
> +{
> +	up_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_begin(void)
> +{
> +	down_write(&page_offline_rwsem);
> +}
> +
> +void page_offline_end(void)
> +{
> +	up_write(&page_offline_rwsem);
> +}
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()
  2021-04-29 12:25 ` [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
@ 2021-05-02  6:33   ` Mike Rapoport
  2021-05-03  8:16     ` David Hildenbrand
  2021-05-03  8:23   ` Michael S. Tsirkin
  1 sibling, 1 reply; 37+ messages in thread
From: Mike Rapoport @ 2021-05-02  6:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Thu, Apr 29, 2021 at 02:25:18PM +0200, David Hildenbrand wrote:
> Let's properly use page_offline_(start|end) to synchronize setting
> PageOffline(), so we won't have valid page access to unplugged memory
> regions from /proc/kcore.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/virtio/virtio_mem.c | 2 ++
>  mm/util.c                   | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 10ec60d81e84..dc2a2e2b2ff8 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
>  static void virtio_mem_set_fake_offline(unsigned long pfn,
>  					unsigned long nr_pages, bool onlined)
>  {
> +	page_offline_begin();
>  	for (; nr_pages--; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
>  
> @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
>  			ClearPageReserved(page);
>  		}
>  	}
> +	page_offline_end();

I'm not really familiar with ballooning and memory hotplug, but is it the
only place that needs page_offline_{begin,end} ?

>  }
>  
>  /*
> diff --git a/mm/util.c b/mm/util.c
> index 95395d4e4209..d0e357bd65e6 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1046,8 +1046,10 @@ void page_offline_begin(void)
>  {
>  	down_write(&page_offline_rwsem);
>  }
> +EXPORT_SYMBOL(page_offline_begin);

Should have been a part of the previous patch.
  
>  void page_offline_end(void)
>  {
>  	up_write(&page_offline_rwsem);
>  }
> +EXPORT_SYMBOL(page_offline_end);

Ditto

> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
  2021-04-29 12:25 ` [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze) David Hildenbrand
@ 2021-05-02  6:34   ` Mike Rapoport
  2021-05-03  8:28     ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Rapoport @ 2021-05-02  6:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
> Let's properly synchronize with drivers that set PageOffline(). Unfreeze
> every now and then, so drivers that want to set PageOffline() can make
> progress.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  fs/proc/kcore.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 92ff1e4436cb..3d7531f47389 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
>  static ssize_t
>  read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  {
> +	size_t page_offline_frozen = 0;
>  	char *buf = file->private_data;
>  	size_t phdrs_offset, notes_offset, data_offset;
>  	size_t phdrs_len, notes_len;
> @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  			pfn = __pa(start) >> PAGE_SHIFT;
>  			page = pfn_to_online_page(pfn);

Can't this race with page offlining for the first time we get here?
 
> +			/*
> +			 * Don't race against drivers that set PageOffline()
> +			 * and expect no further page access.
> +			 */
> +			if (page_offline_frozen == MAX_ORDER_NR_PAGES) {
> +				page_offline_unfreeze();
> +				page_offline_frozen = 0;
> +				cond_resched();
> +			}
> +			if (!page_offline_frozen++)
> +				page_offline_freeze();
> +

Don't we need to freeze before doing pfn_to_online_page()?

>  			/*
>  			 * Don't read offline sections, logically offline pages
>  			 * (e.g., inflated in a balloon), hwpoisoned pages,
> @@ -565,6 +578,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  	}
>  
>  out:
> +	if (page_offline_frozen)
> +		page_offline_unfreeze();
>  	up_read(&kclist_lock);
>  	if (ret)
>  		return ret;
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()
  2021-05-02  6:33   ` Mike Rapoport
@ 2021-05-03  8:11     ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-05-03  8:11 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On 02.05.21 08:33, Mike Rapoport wrote:
> On Thu, Apr 29, 2021 at 02:25:17PM +0200, David Hildenbrand wrote:
>> A driver might set a page logically offline -- PageOffline() -- and
>> turn the page inaccessible in the hypervisor; after that, access to page
>> content can be fatal. One example is virtio-mem; while unplugged memory
>> -- marked as PageOffline() can currently be read in the hypervisor, this
>> will no longer be the case in the future; for example, when having
>> a virtio-mem device backed by huge pages in the hypervisor.
>>
>> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
>> pages after checking PageOffline(); however, these PFN walkers can race
>> with drivers that set PageOffline().
>>
>> Let's introduce page_offline_(begin|end|freeze|unfreeze) for
> 
> Bikeshed: freeze|thaw?
>

Sure :)


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()
  2021-05-02  6:33   ` Mike Rapoport
@ 2021-05-03  8:16     ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-05-03  8:16 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On 02.05.21 08:33, Mike Rapoport wrote:
> On Thu, Apr 29, 2021 at 02:25:18PM +0200, David Hildenbrand wrote:
>> Let's properly use page_offline_(start|end) to synchronize setting
>> PageOffline(), so we won't have valid page access to unplugged memory
>> regions from /proc/kcore.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   drivers/virtio/virtio_mem.c | 2 ++
>>   mm/util.c                   | 2 ++
>>   2 files changed, 4 insertions(+)
>>
>> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
>> index 10ec60d81e84..dc2a2e2b2ff8 100644
>> --- a/drivers/virtio/virtio_mem.c
>> +++ b/drivers/virtio/virtio_mem.c
>> @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
>>   static void virtio_mem_set_fake_offline(unsigned long pfn,
>>   					unsigned long nr_pages, bool onlined)
>>   {
>> +	page_offline_begin();
>>   	for (; nr_pages--; pfn++) {
>>   		struct page *page = pfn_to_page(pfn);
>>   
>> @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
>>   			ClearPageReserved(page);
>>   		}
>>   	}
>> +	page_offline_end();
> 
> I'm not really familiar with ballooning and memory hotplug, but is it the
> only place that needs page_offline_{begin,end} ?

Existing balloon implementations that I am aware of (Hyper-V, XEN, 
virtio-balloon, vmware-balloon) usually allow reading inflated memory; 
doing so might result in unnecessary overhead in the hypervisor, so we 
really want to avoid it -- but it's strictly not forbidden and has been 
working forever. So we barely care about races: if there would be a rare 
race, we'd still be able to read that memory.

For virtio-mem, it'll be different in the future when using shmem, huge 
pages, !anonymous private mappings, ... as backing storage for a VM; 
there will be a virtio spec extension to document that virtio-mem 
changes that indicate the new behavior won't allow reading unplugged 
memory and doing so will result in undefined behavior.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()
  2021-04-29 12:25 ` [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
  2021-05-02  6:33   ` Mike Rapoport
@ 2021-05-03  8:23   ` Michael S. Tsirkin
  1 sibling, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2021-05-03  8:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Jason Wang, Alexey Dobriyan,
	Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Thu, Apr 29, 2021 at 02:25:18PM +0200, David Hildenbrand wrote:
> Let's properly use page_offline_(start|end) to synchronize setting
> PageOffline(), so we won't have valid page access to unplugged memory
> regions from /proc/kcore.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>


the patch looks good to me as such

Acked-by: Michael S. Tsirkin <mst@redhat.com>

Feel free to merge with rest of patcgset - it seems to mostly
live in the fs/mm space.

IF you respin, maybe add the explanation you sent in response to Mike's comments
in the commit log.


> ---
>  drivers/virtio/virtio_mem.c | 2 ++
>  mm/util.c                   | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 10ec60d81e84..dc2a2e2b2ff8 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
>  static void virtio_mem_set_fake_offline(unsigned long pfn,
>  					unsigned long nr_pages, bool onlined)
>  {
> +	page_offline_begin();
>  	for (; nr_pages--; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
>  
> @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
>  			ClearPageReserved(page);
>  		}
>  	}
> +	page_offline_end();
>  }
>  
>  /*
> diff --git a/mm/util.c b/mm/util.c
> index 95395d4e4209..d0e357bd65e6 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1046,8 +1046,10 @@ void page_offline_begin(void)
>  {
>  	down_write(&page_offline_rwsem);
>  }
> +EXPORT_SYMBOL(page_offline_begin);
>  
>  void page_offline_end(void)
>  {
>  	up_write(&page_offline_rwsem);
>  }
> +EXPORT_SYMBOL(page_offline_end);
> -- 
> 2.30.2


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
  2021-05-02  6:34   ` Mike Rapoport
@ 2021-05-03  8:28     ` David Hildenbrand
  2021-05-03  9:28       ` Mike Rapoport
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-05-03  8:28 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On 02.05.21 08:34, Mike Rapoport wrote:
> On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
>> Let's properly synchronize with drivers that set PageOffline(). Unfreeze
>> every now and then, so drivers that want to set PageOffline() can make
>> progress.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   fs/proc/kcore.c | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
>> index 92ff1e4436cb..3d7531f47389 100644
>> --- a/fs/proc/kcore.c
>> +++ b/fs/proc/kcore.c
>> @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
>>   static ssize_t
>>   read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>   {
>> +	size_t page_offline_frozen = 0;
>>   	char *buf = file->private_data;
>>   	size_t phdrs_offset, notes_offset, data_offset;
>>   	size_t phdrs_len, notes_len;
>> @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>   			pfn = __pa(start) >> PAGE_SHIFT;
>>   			page = pfn_to_online_page(pfn);
> 
> Can't this race with page offlining for the first time we get here?


To clarify, we have three types of offline pages in the kernel ...

a) Pages part of an offline memory section; the memap is stale and not 
trustworthy. pfn_to_online_page() checks that. We *can* protect against 
memory offlining using get_online_mems()/put_online_mems(), but usually 
avoid doing so as the race window is very small (and a problem all over 
the kernel we basically never hit) and locking is rather expensive. In 
the future, we might switch to rcu to handle that more efficiently and 
avoiding these possible races.

b) PageOffline(): logically offline pages contained in an online memory 
section with a sane memmap. virtio-mem calls these pages "fake offline"; 
something like a "temporary" memory hole. The new mechanism I propose 
will be used to handle synchronization as races can be more severe, 
e.g., when reading actual page content here.

c) Soft offline pages: hwpoisoned pages that are not actually harmful 
yet, but could become harmful in the future. So we better try to remove 
the page from the page allcoator and try to migrate away existing users.


So page_offline_* handle "b) PageOffline()" only. There is a tiny race 
between pfn_to_online_page(pfn) and looking at the memmap as we have in 
many cases already throughout the kernel, to be tackled in the future.


(A better name for PageOffline() might make sense; PageSoftOffline() 
would be catchy but interferes with c). PageLogicallyOffline() is ugly; 
PageFakeOffline() might do)

>   
>> +			/*
>> +			 * Don't race against drivers that set PageOffline()
>> +			 * and expect no further page access.
>> +			 */
>> +			if (page_offline_frozen == MAX_ORDER_NR_PAGES) {
>> +				page_offline_unfreeze();
>> +				page_offline_frozen = 0;
>> +				cond_resched();
>> +			}
>> +			if (!page_offline_frozen++)
>> +				page_offline_freeze();
>> +
> 
> Don't we need to freeze before doing pfn_to_online_page()?

See my explanation above. Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
  2021-05-03  8:28     ` David Hildenbrand
@ 2021-05-03  9:28       ` Mike Rapoport
  2021-05-03 10:13         ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Rapoport @ 2021-05-03  9:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
> On 02.05.21 08:34, Mike Rapoport wrote:
> > On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
> > > Let's properly synchronize with drivers that set PageOffline(). Unfreeze
> > > every now and then, so drivers that want to set PageOffline() can make
> > > progress.
> > > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >   fs/proc/kcore.c | 15 +++++++++++++++
> > >   1 file changed, 15 insertions(+)
> > > 
> > > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > > index 92ff1e4436cb..3d7531f47389 100644
> > > --- a/fs/proc/kcore.c
> > > +++ b/fs/proc/kcore.c
> > > @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
> > >   static ssize_t
> > >   read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > >   {
> > > +	size_t page_offline_frozen = 0;
> > >   	char *buf = file->private_data;
> > >   	size_t phdrs_offset, notes_offset, data_offset;
> > >   	size_t phdrs_len, notes_len;
> > > @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > >   			pfn = __pa(start) >> PAGE_SHIFT;
> > >   			page = pfn_to_online_page(pfn);
> > 
> > Can't this race with page offlining for the first time we get here?
> 
> 
> To clarify, we have three types of offline pages in the kernel ...
> 
> a) Pages part of an offline memory section; the memap is stale and not
> trustworthy. pfn_to_online_page() checks that. We *can* protect against
> memory offlining using get_online_mems()/put_online_mems(), but usually
> avoid doing so as the race window is very small (and a problem all over the
> kernel we basically never hit) and locking is rather expensive. In the
> future, we might switch to rcu to handle that more efficiently and avoiding
> these possible races.
> 
> b) PageOffline(): logically offline pages contained in an online memory
> section with a sane memmap. virtio-mem calls these pages "fake offline";
> something like a "temporary" memory hole. The new mechanism I propose will
> be used to handle synchronization as races can be more severe, e.g., when
> reading actual page content here.
> 
> c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
> but could become harmful in the future. So we better try to remove the page
> from the page allcoator and try to migrate away existing users.
> 
> 
> So page_offline_* handle "b) PageOffline()" only. There is a tiny race
> between pfn_to_online_page(pfn) and looking at the memmap as we have in many
> cases already throughout the kernel, to be tackled in the future.

Right, but here you anyway add locking, so why exclude the first iteration?
 
> (A better name for PageOffline() might make sense; PageSoftOffline() would
> be catchy but interferes with c). PageLogicallyOffline() is ugly;
> PageFakeOffline() might do)
> 
> > > +			/*
> > > +			 * Don't race against drivers that set PageOffline()
> > > +			 * and expect no further page access.
> > > +			 */
> > > +			if (page_offline_frozen == MAX_ORDER_NR_PAGES) {
> > > +				page_offline_unfreeze();
> > > +				page_offline_frozen = 0;
> > > +				cond_resched();
> > > +			}
> > > +			if (!page_offline_frozen++)
> > > +				page_offline_freeze();
> > > +

BTW, did you consider something like

	if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
		page_offline_unfreeze();
		cond_resched();
		page_offline_freeze();
	}

We don't seem to care about page_offline_frozen overflows here, do we?

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
  2021-05-03  9:28       ` Mike Rapoport
@ 2021-05-03 10:13         ` David Hildenbrand
  2021-05-03 11:33           ` Mike Rapoport
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-05-03 10:13 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On 03.05.21 11:28, Mike Rapoport wrote:
> On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
>> On 02.05.21 08:34, Mike Rapoport wrote:
>>> On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
>>>> Let's properly synchronize with drivers that set PageOffline(). Unfreeze
>>>> every now and then, so drivers that want to set PageOffline() can make
>>>> progress.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    fs/proc/kcore.c | 15 +++++++++++++++
>>>>    1 file changed, 15 insertions(+)
>>>>
>>>> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
>>>> index 92ff1e4436cb..3d7531f47389 100644
>>>> --- a/fs/proc/kcore.c
>>>> +++ b/fs/proc/kcore.c
>>>> @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
>>>>    static ssize_t
>>>>    read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>>>    {
>>>> +	size_t page_offline_frozen = 0;
>>>>    	char *buf = file->private_data;
>>>>    	size_t phdrs_offset, notes_offset, data_offset;
>>>>    	size_t phdrs_len, notes_len;
>>>> @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>>>    			pfn = __pa(start) >> PAGE_SHIFT;
>>>>    			page = pfn_to_online_page(pfn);
>>>
>>> Can't this race with page offlining for the first time we get here?
>>
>>
>> To clarify, we have three types of offline pages in the kernel ...
>>
>> a) Pages part of an offline memory section; the memap is stale and not
>> trustworthy. pfn_to_online_page() checks that. We *can* protect against
>> memory offlining using get_online_mems()/put_online_mems(), but usually
>> avoid doing so as the race window is very small (and a problem all over the
>> kernel we basically never hit) and locking is rather expensive. In the
>> future, we might switch to rcu to handle that more efficiently and avoiding
>> these possible races.
>>
>> b) PageOffline(): logically offline pages contained in an online memory
>> section with a sane memmap. virtio-mem calls these pages "fake offline";
>> something like a "temporary" memory hole. The new mechanism I propose will
>> be used to handle synchronization as races can be more severe, e.g., when
>> reading actual page content here.
>>
>> c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
>> but could become harmful in the future. So we better try to remove the page
>> from the page allcoator and try to migrate away existing users.
>>
>>
>> So page_offline_* handle "b) PageOffline()" only. There is a tiny race
>> between pfn_to_online_page(pfn) and looking at the memmap as we have in many
>> cases already throughout the kernel, to be tackled in the future.
> 
> Right, but here you anyway add locking, so why exclude the first iteration?

What we're protecting is PageOffline() below. If I didn't mess up, we 
should always be calling page_offline_freeze() before calling 
PageOffline(). Or am I missing something?

> 
> BTW, did you consider something like

Yes, I played with something like that. We'd have to handle the first 
page_offline_freeze() freeze differently, though, and that's where 
things got a bit ugly in my attempts.

> 
> 	if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> 		page_offline_unfreeze();
> 		cond_resched();
> 		page_offline_freeze();
> 	}
> 
> We don't seem to care about page_offline_frozen overflows here, do we?

No, the buffer size is also size_t and gets incremented on a per-byte 
basis. The variant I have right now looked the cleanest to me. Happy to 
hear simpler alternatives.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
  2021-05-03 10:13         ` David Hildenbrand
@ 2021-05-03 11:33           ` Mike Rapoport
  2021-05-03 11:35             ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Rapoport @ 2021-05-03 11:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Mon, May 03, 2021 at 12:13:45PM +0200, David Hildenbrand wrote:
> On 03.05.21 11:28, Mike Rapoport wrote:
> > On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
> > > On 02.05.21 08:34, Mike Rapoport wrote:
> > > > On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
> > > > > Let's properly synchronize with drivers that set PageOffline(). Unfreeze
> > > > > every now and then, so drivers that want to set PageOffline() can make
> > > > > progress.
> > > > > 
> > > > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > > > ---
> > > > >    fs/proc/kcore.c | 15 +++++++++++++++
> > > > >    1 file changed, 15 insertions(+)
> > > > > 
> > > > > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > > > > index 92ff1e4436cb..3d7531f47389 100644
> > > > > --- a/fs/proc/kcore.c
> > > > > +++ b/fs/proc/kcore.c
> > > > > @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
> > > > >    static ssize_t
> > > > >    read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > > >    {
> > > > > +	size_t page_offline_frozen = 0;
> > > > >    	char *buf = file->private_data;
> > > > >    	size_t phdrs_offset, notes_offset, data_offset;
> > > > >    	size_t phdrs_len, notes_len;
> > > > > @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > > >    			pfn = __pa(start) >> PAGE_SHIFT;
> > > > >    			page = pfn_to_online_page(pfn);
> > > > 
> > > > Can't this race with page offlining for the first time we get here?
> > > 
> > > 
> > > To clarify, we have three types of offline pages in the kernel ...
> > > 
> > > a) Pages part of an offline memory section; the memap is stale and not
> > > trustworthy. pfn_to_online_page() checks that. We *can* protect against
> > > memory offlining using get_online_mems()/put_online_mems(), but usually
> > > avoid doing so as the race window is very small (and a problem all over the
> > > kernel we basically never hit) and locking is rather expensive. In the
> > > future, we might switch to rcu to handle that more efficiently and avoiding
> > > these possible races.
> > > 
> > > b) PageOffline(): logically offline pages contained in an online memory
> > > section with a sane memmap. virtio-mem calls these pages "fake offline";
> > > something like a "temporary" memory hole. The new mechanism I propose will
> > > be used to handle synchronization as races can be more severe, e.g., when
> > > reading actual page content here.
> > > 
> > > c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
> > > but could become harmful in the future. So we better try to remove the page
> > > from the page allcoator and try to migrate away existing users.
> > > 
> > > 
> > > So page_offline_* handle "b) PageOffline()" only. There is a tiny race
> > > between pfn_to_online_page(pfn) and looking at the memmap as we have in many
> > > cases already throughout the kernel, to be tackled in the future.
> > 
> > Right, but here you anyway add locking, so why exclude the first iteration?
> 
> What we're protecting is PageOffline() below. If I didn't mess up, we should
> always be calling page_offline_freeze() before calling PageOffline(). Or am
> I missing something?
 
Somehow I was under impression we are protecting both pfn_to_online_page()
and PageOffline().
 
> > BTW, did you consider something like
> 
> Yes, I played with something like that. We'd have to handle the first
> page_offline_freeze() freeze differently, though, and that's where things
> got a bit ugly in my attempts.
> 
> > 
> > 	if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> > 		page_offline_unfreeze();
> > 		cond_resched();
> > 		page_offline_freeze();
> > 	}
> > 
> > We don't seem to care about page_offline_frozen overflows here, do we?
> 
> No, the buffer size is also size_t and gets incremented on a per-byte basis.
> The variant I have right now looked the cleanest to me. Happy to hear
> simpler alternatives.

Well, locking for the first time before the while() loop and doing
resched-relock outside switch() would be definitely nicer, and it makes the
last unlock unconditional.

The cost of prevention of memory offline during reads of !KCORE_RAM parts
does not seem that significant to me, but I may be missing something.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
  2021-05-03 11:33           ` Mike Rapoport
@ 2021-05-03 11:35             ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-05-03 11:35 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On 03.05.21 13:33, Mike Rapoport wrote:
> On Mon, May 03, 2021 at 12:13:45PM +0200, David Hildenbrand wrote:
>> On 03.05.21 11:28, Mike Rapoport wrote:
>>> On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
>>>> On 02.05.21 08:34, Mike Rapoport wrote:
>>>>> On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
>>>>>> Let's properly synchronize with drivers that set PageOffline(). Unfreeze
>>>>>> every now and then, so drivers that want to set PageOffline() can make
>>>>>> progress.
>>>>>>
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>>     fs/proc/kcore.c | 15 +++++++++++++++
>>>>>>     1 file changed, 15 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
>>>>>> index 92ff1e4436cb..3d7531f47389 100644
>>>>>> --- a/fs/proc/kcore.c
>>>>>> +++ b/fs/proc/kcore.c
>>>>>> @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
>>>>>>     static ssize_t
>>>>>>     read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>>>>>     {
>>>>>> +	size_t page_offline_frozen = 0;
>>>>>>     	char *buf = file->private_data;
>>>>>>     	size_t phdrs_offset, notes_offset, data_offset;
>>>>>>     	size_t phdrs_len, notes_len;
>>>>>> @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>>>>>     			pfn = __pa(start) >> PAGE_SHIFT;
>>>>>>     			page = pfn_to_online_page(pfn);
>>>>>
>>>>> Can't this race with page offlining for the first time we get here?
>>>>
>>>>
>>>> To clarify, we have three types of offline pages in the kernel ...
>>>>
>>>> a) Pages part of an offline memory section; the memap is stale and not
>>>> trustworthy. pfn_to_online_page() checks that. We *can* protect against
>>>> memory offlining using get_online_mems()/put_online_mems(), but usually
>>>> avoid doing so as the race window is very small (and a problem all over the
>>>> kernel we basically never hit) and locking is rather expensive. In the
>>>> future, we might switch to rcu to handle that more efficiently and avoiding
>>>> these possible races.
>>>>
>>>> b) PageOffline(): logically offline pages contained in an online memory
>>>> section with a sane memmap. virtio-mem calls these pages "fake offline";
>>>> something like a "temporary" memory hole. The new mechanism I propose will
>>>> be used to handle synchronization as races can be more severe, e.g., when
>>>> reading actual page content here.
>>>>
>>>> c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
>>>> but could become harmful in the future. So we better try to remove the page
>>>> from the page allcoator and try to migrate away existing users.
>>>>
>>>>
>>>> So page_offline_* handle "b) PageOffline()" only. There is a tiny race
>>>> between pfn_to_online_page(pfn) and looking at the memmap as we have in many
>>>> cases already throughout the kernel, to be tackled in the future.
>>>
>>> Right, but here you anyway add locking, so why exclude the first iteration?
>>
>> What we're protecting is PageOffline() below. If I didn't mess up, we should
>> always be calling page_offline_freeze() before calling PageOffline(). Or am
>> I missing something?
>   
> Somehow I was under impression we are protecting both pfn_to_online_page()
> and PageOffline().
>   
>>> BTW, did you consider something like
>>
>> Yes, I played with something like that. We'd have to handle the first
>> page_offline_freeze() freeze differently, though, and that's where things
>> got a bit ugly in my attempts.
>>
>>>
>>> 	if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
>>> 		page_offline_unfreeze();
>>> 		cond_resched();
>>> 		page_offline_freeze();
>>> 	}
>>>
>>> We don't seem to care about page_offline_frozen overflows here, do we?
>>
>> No, the buffer size is also size_t and gets incremented on a per-byte basis.
>> The variant I have right now looked the cleanest to me. Happy to hear
>> simpler alternatives.
> 
> Well, locking for the first time before the while() loop and doing
> resched-relock outside switch() would be definitely nicer, and it makes the
> last unlock unconditional.
> 
> The cost of prevention of memory offline during reads of !KCORE_RAM parts
> does not seem that significant to me, but I may be missing something.

Also true, I'll have a look if I can just simplify that.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-04-29 12:25 ` [PATCH v1 3/7] mm: rename and move page_is_poisoned() David Hildenbrand
  2021-05-02  6:32   ` Mike Rapoport
@ 2021-05-05 13:13   ` Michal Hocko
  2021-05-05 13:17     ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-05-05 13:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On Thu 29-04-21 14:25:15, David Hildenbrand wrote:
> Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> introduced page_is_poisoned(), however, v5 [1] of the patch used
> "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> function and move it to page-flags.h, from where it can be used in other
> -- kcore -- context.
> 
> Move the comment to the place where it belongs and simplify.
> 
> [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

I do agree that being explicit about hwpoison is much better. Poisoned
page can be also an unitialized one and I believe this is the reason why
you are bringing that up.

But you've made me look at d3378e86d182 and I am wondering whether this
is really a valid patch. First of all it can leak a reference count
AFAICS. Moreover it doesn't really fix anything because the page can be
marked hwpoison right after the check is done. I do not think the race
is feasible to be closed. So shouldn't we rather revert it?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-05 13:13   ` Michal Hocko
@ 2021-05-05 13:17     ` David Hildenbrand
  2021-05-05 13:27       ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-05-05 13:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On 05.05.21 15:13, Michal Hocko wrote:
> On Thu 29-04-21 14:25:15, David Hildenbrand wrote:
>> Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
>> introduced page_is_poisoned(), however, v5 [1] of the patch used
>> "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
>> function and move it to page-flags.h, from where it can be used in other
>> -- kcore -- context.
>>
>> Move the comment to the place where it belongs and simplify.
>>
>> [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> I do agree that being explicit about hwpoison is much better. Poisoned
> page can be also an unitialized one and I believe this is the reason why
> you are bringing that up.

I'm bringing it up because I want to reuse that function as state above :)

> 
> But you've made me look at d3378e86d182 and I am wondering whether this
> is really a valid patch. First of all it can leak a reference count
> AFAICS. Moreover it doesn't really fix anything because the page can be
> marked hwpoison right after the check is done. I do not think the race
> is feasible to be closed. So shouldn't we rather revert it?

I am not sure if we really care about races here that much here? I mean, 
essentially we are racing with HW breaking asynchronously. Just because 
we would be synchronizing with SetPageHWPoison() wouldn't mean we can 
stop HW from breaking.

Long story short, this should be good enough for the cases we actually 
can handle? What am I missing?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()
  2021-04-29 12:25 ` [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline() David Hildenbrand
  2021-05-02  6:33   ` Mike Rapoport
@ 2021-05-05 13:24   ` Michal Hocko
  2021-05-05 15:10     ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-05-05 13:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On Thu 29-04-21 14:25:17, David Hildenbrand wrote:
> A driver might set a page logically offline -- PageOffline() -- and
> turn the page inaccessible in the hypervisor; after that, access to page
> content can be fatal. One example is virtio-mem; while unplugged memory
> -- marked as PageOffline() can currently be read in the hypervisor, this
> will no longer be the case in the future; for example, when having
> a virtio-mem device backed by huge pages in the hypervisor.
> 
> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> pages after checking PageOffline(); however, these PFN walkers can race
> with drivers that set PageOffline().
> 
> Let's introduce page_offline_(begin|end|freeze|unfreeze) for
> synchronizing.
> 
> page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> synchronize with such drivers, achieving that a page cannot be set
> PageOffline() while frozen.
> 
> page_offline_begin()/page_offline_end() is used by drivers that care about
> such races when setting a page PageOffline().
> 
> For simplicity, use a rwsem for now; neither drivers nor users are
> performance sensitive.

Please add a note to the PageOffline documentation as well. While are
adding the api close enough an explicit note there wouldn't hurt.

> Signed-off-by: David Hildenbrand <david@redhat.com>

As to the patch itself, I am slightly worried that other pfn walkers
might be less tolerant to the locking than the proc ones. On the other
hand most users shouldn't really care as they do not tend to touch the
memory content and PageOffline check without any synchronization should
be sufficient for those. Let's try this out and see where we get...

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/page-flags.h |  5 +++++
>  mm/util.c                  | 38 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index b8c56672a588..e3d00c72f459 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -767,6 +767,11 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> +extern void page_offline_freeze(void);
> +extern void page_offline_unfreeze(void);
> +extern void page_offline_begin(void);
> +extern void page_offline_end(void);
> +
>  /*
>   * Marks pages in use as page tables.
>   */
> diff --git a/mm/util.c b/mm/util.c
> index 54870226cea6..95395d4e4209 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1013,3 +1013,41 @@ void mem_dump_obj(void *object)
>  	}
>  	pr_cont(" non-slab/vmalloc memory.\n");
>  }
> +
> +/*
> + * A driver might set a page logically offline -- PageOffline() -- and
> + * turn the page inaccessible in the hypervisor; after that, access to page
> + * content can be fatal.
> + *
> + * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> + * pages after checking PageOffline(); however, these PFN walkers can race
> + * with drivers that set PageOffline().
> + *
> + * page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> + * synchronize with such drivers, achieving that a page cannot be set
> + * PageOffline() while frozen.
> + *
> + * page_offline_begin()/page_offline_end() is used by drivers that care about
> + * such races when setting a page PageOffline().
> + */
> +static DECLARE_RWSEM(page_offline_rwsem);
> +
> +void page_offline_freeze(void)
> +{
> +	down_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_unfreeze(void)
> +{
> +	up_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_begin(void)
> +{
> +	down_write(&page_offline_rwsem);
> +}
> +
> +void page_offline_end(void)
> +{
> +	up_write(&page_offline_rwsem);
> +}
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-05 13:17     ` David Hildenbrand
@ 2021-05-05 13:27       ` Michal Hocko
  2021-05-05 13:39         ` David Hildenbrand
  2021-05-06  0:56         ` Aili Yao
  0 siblings, 2 replies; 37+ messages in thread
From: Michal Hocko @ 2021-05-05 13:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On Wed 05-05-21 15:17:53, David Hildenbrand wrote:
> On 05.05.21 15:13, Michal Hocko wrote:
> > On Thu 29-04-21 14:25:15, David Hildenbrand wrote:
> > > Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> > > introduced page_is_poisoned(), however, v5 [1] of the patch used
> > > "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> > > function and move it to page-flags.h, from where it can be used in other
> > > -- kcore -- context.
> > > 
> > > Move the comment to the place where it belongs and simplify.
> > > 
> > > [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> > > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > 
> > I do agree that being explicit about hwpoison is much better. Poisoned
> > page can be also an unitialized one and I believe this is the reason why
> > you are bringing that up.
> 
> I'm bringing it up because I want to reuse that function as state above :)
> 
> > 
> > But you've made me look at d3378e86d182 and I am wondering whether this
> > is really a valid patch. First of all it can leak a reference count
> > AFAICS. Moreover it doesn't really fix anything because the page can be
> > marked hwpoison right after the check is done. I do not think the race
> > is feasible to be closed. So shouldn't we rather revert it?
> 
> I am not sure if we really care about races here that much here? I mean,
> essentially we are racing with HW breaking asynchronously. Just because we
> would be synchronizing with SetPageHWPoison() wouldn't mean we can stop HW
> from breaking.

Right

> Long story short, this should be good enough for the cases we actually can
> handle? What am I missing?

I am not sure I follow. My point is that I fail to see any added value
of the check as it doesn't prevent the race (it fundamentally cannot as
the page can be poisoned at any time) but the failure path doesn't
put_page which is incorrect even for hwpoison pages.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-05 13:27       ` Michal Hocko
@ 2021-05-05 13:39         ` David Hildenbrand
  2021-05-05 13:45           ` Michal Hocko
  2021-05-06  0:56         ` Aili Yao
  1 sibling, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-05-05 13:39 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-kernel, Michael S. Tsirkin, Jason Wang, Alexey Dobriyan,
	Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

>> Long story short, this should be good enough for the cases we actually can
>> handle? What am I missing?
> 
> I am not sure I follow. My point is that I fail to see any added value
> of the check as it doesn't prevent the race (it fundamentally cannot as
> the page can be poisoned at any time) but the failure path doesn't
> put_page which is incorrect even for hwpoison pages.

Oh, I think you are right. If we have a page and return NULL we would 
leak a reference.

Actually, we discussed in that thread handling this entirely 
differently, which resulted in a v7 [1]; however Andrew moved forward 
with this (outdated?) patch, maybe that was just a mistake?

Yes, I agree we should revert that patch for now.

Regarding the race comment: AFAIU e.g., [2], it's not really a problem 
with a race, but rather some corner case issue that can happen if we 
fail in memory_failure().


[1] https://lkml.kernel.org/r/20210406104123.451ee3c3@alex-virtual-machine
[2] 
https://lkml.kernel.org/r/20210331015258.GB22060@hori.linux.bs1.fc.nec.co.jp

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-05 13:39         ` David Hildenbrand
@ 2021-05-05 13:45           ` Michal Hocko
  2021-05-06  1:08             ` Aili Yao
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-05-05 13:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, linux-kernel, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On Wed 05-05-21 15:39:08, David Hildenbrand wrote:
> > > Long story short, this should be good enough for the cases we actually can
> > > handle? What am I missing?
> > 
> > I am not sure I follow. My point is that I fail to see any added value
> > of the check as it doesn't prevent the race (it fundamentally cannot as
> > the page can be poisoned at any time) but the failure path doesn't
> > put_page which is incorrect even for hwpoison pages.
> 
> Oh, I think you are right. If we have a page and return NULL we would leak a
> reference.
> 
> Actually, we discussed in that thread handling this entirely differently,
> which resulted in a v7 [1]; however Andrew moved forward with this
> (outdated?) patch, maybe that was just a mistake?
> 
> Yes, I agree we should revert that patch for now.

OK, Let me send the revert to Andrew.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()
  2021-05-05 13:24   ` Michal Hocko
@ 2021-05-05 15:10     ` David Hildenbrand
  2021-05-05 17:41       ` Mike Rapoport
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-05-05 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On 05.05.21 15:24, Michal Hocko wrote:
> On Thu 29-04-21 14:25:17, David Hildenbrand wrote:
>> A driver might set a page logically offline -- PageOffline() -- and
>> turn the page inaccessible in the hypervisor; after that, access to page
>> content can be fatal. One example is virtio-mem; while unplugged memory
>> -- marked as PageOffline() can currently be read in the hypervisor, this
>> will no longer be the case in the future; for example, when having
>> a virtio-mem device backed by huge pages in the hypervisor.
>>
>> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
>> pages after checking PageOffline(); however, these PFN walkers can race
>> with drivers that set PageOffline().
>>
>> Let's introduce page_offline_(begin|end|freeze|unfreeze) for
>> synchronizing.
>>
>> page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
>> synchronize with such drivers, achieving that a page cannot be set
>> PageOffline() while frozen.
>>
>> page_offline_begin()/page_offline_end() is used by drivers that care about
>> such races when setting a page PageOffline().
>>
>> For simplicity, use a rwsem for now; neither drivers nor users are
>> performance sensitive.
> 
> Please add a note to the PageOffline documentation as well. While are
> adding the api close enough an explicit note there wouldn't hurt.

Will do.

> 
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> As to the patch itself, I am slightly worried that other pfn walkers
> might be less tolerant to the locking than the proc ones. On the other
> hand most users shouldn't really care as they do not tend to touch the
> memory content and PageOffline check without any synchronization should
> be sufficient for those. Let's try this out and see where we get...

My thinking. Users that actually read random page content (as discussed 
in the cover letter) are

1. Hibernation
2. Dumping (/proc/kcore, /proc/vmcore)
3. Physical memory access bypassing the kernel via /dev/mem
4. Live debug tools (kgdb)

Other PFN walkers really shouldn't (and don't) access random page content.

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()
  2021-05-05 15:10     ` David Hildenbrand
@ 2021-05-05 17:41       ` Mike Rapoport
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Rapoport @ 2021-05-05 17:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-kernel, Andrew Morton, Michael S. Tsirkin,
	Jason Wang, Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On Wed, May 05, 2021 at 05:10:33PM +0200, David Hildenbrand wrote:
> On 05.05.21 15:24, Michal Hocko wrote:
> > On Thu 29-04-21 14:25:17, David Hildenbrand wrote:
> > > A driver might set a page logically offline -- PageOffline() -- and
> > > turn the page inaccessible in the hypervisor; after that, access to page
> > > content can be fatal. One example is virtio-mem; while unplugged memory
> > > -- marked as PageOffline() can currently be read in the hypervisor, this
> > > will no longer be the case in the future; for example, when having
> > > a virtio-mem device backed by huge pages in the hypervisor.
> > > 
> > > Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> > > pages after checking PageOffline(); however, these PFN walkers can race
> > > with drivers that set PageOffline().
> > > 
> > > Let's introduce page_offline_(begin|end|freeze|unfreeze) for
> > > synchronizing.
> > > 
> > > page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> > > synchronize with such drivers, achieving that a page cannot be set
> > > PageOffline() while frozen.
> > > 
> > > page_offline_begin()/page_offline_end() is used by drivers that care about
> > > such races when setting a page PageOffline().
> > > 
> > > For simplicity, use a rwsem for now; neither drivers nor users are
> > > performance sensitive.
> > 
> > Please add a note to the PageOffline documentation as well. While are
> > adding the api close enough an explicit note there wouldn't hurt.
> 
> Will do.
> 
> > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > 
> > As to the patch itself, I am slightly worried that other pfn walkers
> > might be less tolerant to the locking than the proc ones. On the other
> > hand most users shouldn't really care as they do not tend to touch the
> > memory content and PageOffline check without any synchronization should
> > be sufficient for those. Let's try this out and see where we get...
> 
> My thinking. Users that actually read random page content (as discussed in
> the cover letter) are
> 
> 1. Hibernation
> 2. Dumping (/proc/kcore, /proc/vmcore)
> 3. Physical memory access bypassing the kernel via /dev/mem
> 4. Live debug tools (kgdb)

I think you can add

5. Very old drivers
 
> Other PFN walkers really shouldn't (and don't) access random page content.
> 
> Thanks!
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-05 13:27       ` Michal Hocko
  2021-05-05 13:39         ` David Hildenbrand
@ 2021-05-06  0:56         ` Aili Yao
  2021-05-06  7:06           ` Michal Hocko
  1 sibling, 1 reply; 37+ messages in thread
From: Aili Yao @ 2021-05-06  0:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-kernel, Andrew Morton,
	Michael S. Tsirkin, Jason Wang, Alexey Dobriyan, Mike Rapoport,
	Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Jiri Bohac, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Naoya Horiguchi, linux-hyperv,
	virtualization, linux-fsdevel, linux-mm, yaoaili126

On Wed, 5 May 2021 15:27:39 +0200
Michal Hocko <mhocko@suse.com> wrote:

> On Wed 05-05-21 15:17:53, David Hildenbrand wrote:
> > On 05.05.21 15:13, Michal Hocko wrote:  
> > > On Thu 29-04-21 14:25:15, David Hildenbrand wrote:  
> > > > Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> > > > introduced page_is_poisoned(), however, v5 [1] of the patch used
> > > > "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> > > > function and move it to page-flags.h, from where it can be used in other
> > > > -- kcore -- context.
> > > > 
> > > > Move the comment to the place where it belongs and simplify.
> > > > 
> > > > [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> > > > 
> > > > Signed-off-by: David Hildenbrand <david@redhat.com>  
> > > 
> > > I do agree that being explicit about hwpoison is much better. Poisoned
> > > page can be also an unitialized one and I believe this is the reason why
> > > you are bringing that up.  
> > 
> > I'm bringing it up because I want to reuse that function as state above :)
> >   
> > > 
> > > But you've made me look at d3378e86d182 and I am wondering whether this
> > > is really a valid patch. First of all it can leak a reference count
> > > AFAICS. Moreover it doesn't really fix anything because the page can be
> > > marked hwpoison right after the check is done. I do not think the race
> > > is feasible to be closed. So shouldn't we rather revert it?  
> > 
> > I am not sure if we really care about races here that much here? I mean,
> > essentially we are racing with HW breaking asynchronously. Just because we
> > would be synchronizing with SetPageHWPoison() wouldn't mean we can stop HW
> > from breaking.  
> 
> Right
> 
> > Long story short, this should be good enough for the cases we actually can
> > handle? What am I missing?  
> 
> I am not sure I follow. My point is that I fail to see any added value
> of the check as it doesn't prevent the race (it fundamentally cannot as
> the page can be poisoned at any time) but the failure path doesn't
> put_page which is incorrect even for hwpoison pages.

Sorry, I have something to say:

I have noticed the ref count leak in the previous topic ,but  I don't think
it's a really matter. For memory recovery case for user pages, we will keep one
reference to the poison page so the error page will not be freed to buddy allocator.
which can be checked in memory_faulure() function.

For the case here, the reference count for error page may be greater than one as it's not
unmmapped successfully and may shared. we take a reference for that page will not result some
really mattering issue.

The only break I can think for this leak is that we will fail to unpoison the error page for test purpose.

Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-05 13:45           ` Michal Hocko
@ 2021-05-06  1:08             ` Aili Yao
  0 siblings, 0 replies; 37+ messages in thread
From: Aili Yao @ 2021-05-06  1:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Andrew Morton, linux-kernel,
	Michael S. Tsirkin, Jason Wang, Alexey Dobriyan, Mike Rapoport,
	Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Jiri Bohac, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Naoya Horiguchi, linux-hyperv,
	virtualization, linux-fsdevel, linux-mm, yaoaili126

On Wed, 5 May 2021 15:45:47 +0200
Michal Hocko <mhocko@suse.com> wrote:

> On Wed 05-05-21 15:39:08, David Hildenbrand wrote:
> > > > Long story short, this should be good enough for the cases we actually can
> > > > handle? What am I missing?  
> > > 
> > > I am not sure I follow. My point is that I fail to see any added value
> > > of the check as it doesn't prevent the race (it fundamentally cannot as
> > > the page can be poisoned at any time) but the failure path doesn't
> > > put_page which is incorrect even for hwpoison pages.  
> > 
> > Oh, I think you are right. If we have a page and return NULL we would leak a
> > reference.
> > 
> > Actually, we discussed in that thread handling this entirely differently,
> > which resulted in a v7 [1]; however Andrew moved forward with this
> > (outdated?) patch, maybe that was just a mistake?
> > 
> > Yes, I agree we should revert that patch for now.  
> 
> OK, Let me send the revert to Andrew.
> 

Got this!
Anyway, I will try to post a new patch for this issue based on the previous patch v7.

Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-06  0:56         ` Aili Yao
@ 2021-05-06  7:06           ` Michal Hocko
  2021-05-06  7:28             ` Aili Yao
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-05-06  7:06 UTC (permalink / raw)
  To: Aili Yao
  Cc: David Hildenbrand, linux-kernel, Andrew Morton,
	Michael S. Tsirkin, Jason Wang, Alexey Dobriyan, Mike Rapoport,
	Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Jiri Bohac, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Naoya Horiguchi, linux-hyperv,
	virtualization, linux-fsdevel, linux-mm, yaoaili126

On Thu 06-05-21 08:56:11, Aili Yao wrote:
> On Wed, 5 May 2021 15:27:39 +0200
> Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Wed 05-05-21 15:17:53, David Hildenbrand wrote:
> > > On 05.05.21 15:13, Michal Hocko wrote:  
> > > > On Thu 29-04-21 14:25:15, David Hildenbrand wrote:  
> > > > > Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> > > > > introduced page_is_poisoned(), however, v5 [1] of the patch used
> > > > > "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> > > > > function and move it to page-flags.h, from where it can be used in other
> > > > > -- kcore -- context.
> > > > > 
> > > > > Move the comment to the place where it belongs and simplify.
> > > > > 
> > > > > [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> > > > > 
> > > > > Signed-off-by: David Hildenbrand <david@redhat.com>  
> > > > 
> > > > I do agree that being explicit about hwpoison is much better. Poisoned
> > > > page can be also an unitialized one and I believe this is the reason why
> > > > you are bringing that up.  
> > > 
> > > I'm bringing it up because I want to reuse that function as state above :)
> > >   
> > > > 
> > > > But you've made me look at d3378e86d182 and I am wondering whether this
> > > > is really a valid patch. First of all it can leak a reference count
> > > > AFAICS. Moreover it doesn't really fix anything because the page can be
> > > > marked hwpoison right after the check is done. I do not think the race
> > > > is feasible to be closed. So shouldn't we rather revert it?  
> > > 
> > > I am not sure if we really care about races here that much here? I mean,
> > > essentially we are racing with HW breaking asynchronously. Just because we
> > > would be synchronizing with SetPageHWPoison() wouldn't mean we can stop HW
> > > from breaking.  
> > 
> > Right
> > 
> > > Long story short, this should be good enough for the cases we actually can
> > > handle? What am I missing?  
> > 
> > I am not sure I follow. My point is that I fail to see any added value
> > of the check as it doesn't prevent the race (it fundamentally cannot as
> > the page can be poisoned at any time) but the failure path doesn't
> > put_page which is incorrect even for hwpoison pages.
> 
> Sorry, I have something to say:
> 
> I have noticed the ref count leak in the previous topic ,but  I don't think
> it's a really matter. For memory recovery case for user pages, we will keep one
> reference to the poison page so the error page will not be freed to buddy allocator.
> which can be checked in memory_faulure() function.

So what would happen if those pages are hwpoisoned from userspace rather
than by HW. And repeatedly so?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-06  7:06           ` Michal Hocko
@ 2021-05-06  7:28             ` Aili Yao
  2021-05-06  7:55               ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Aili Yao @ 2021-05-06  7:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-kernel, Andrew Morton,
	Michael S. Tsirkin, Jason Wang, Alexey Dobriyan, Mike Rapoport,
	Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Jiri Bohac, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Naoya Horiguchi, linux-hyperv,
	virtualization, linux-fsdevel, linux-mm, yaoaili126

On Thu, 6 May 2021 09:06:14 +0200
Michal Hocko <mhocko@suse.com> wrote:

> On Thu 06-05-21 08:56:11, Aili Yao wrote:
> > On Wed, 5 May 2021 15:27:39 +0200
> > Michal Hocko <mhocko@suse.com> wrote:
> >   
> > > On Wed 05-05-21 15:17:53, David Hildenbrand wrote:  
> > > > On 05.05.21 15:13, Michal Hocko wrote:    
> > > > > On Thu 29-04-21 14:25:15, David Hildenbrand wrote:    
> > > > > > Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> > > > > > introduced page_is_poisoned(), however, v5 [1] of the patch used
> > > > > > "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> > > > > > function and move it to page-flags.h, from where it can be used in other
> > > > > > -- kcore -- context.
> > > > > > 
> > > > > > Move the comment to the place where it belongs and simplify.
> > > > > > 
> > > > > > [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> > > > > > 
> > > > > > Signed-off-by: David Hildenbrand <david@redhat.com>    
> > > > > 
> > > > > I do agree that being explicit about hwpoison is much better. Poisoned
> > > > > page can be also an unitialized one and I believe this is the reason why
> > > > > you are bringing that up.    
> > > > 
> > > > I'm bringing it up because I want to reuse that function as state above :)
> > > >     
> > > > > 
> > > > > But you've made me look at d3378e86d182 and I am wondering whether this
> > > > > is really a valid patch. First of all it can leak a reference count
> > > > > AFAICS. Moreover it doesn't really fix anything because the page can be
> > > > > marked hwpoison right after the check is done. I do not think the race
> > > > > is feasible to be closed. So shouldn't we rather revert it?    
> > > > 
> > > > I am not sure if we really care about races here that much here? I mean,
> > > > essentially we are racing with HW breaking asynchronously. Just because we
> > > > would be synchronizing with SetPageHWPoison() wouldn't mean we can stop HW
> > > > from breaking.    
> > > 
> > > Right
> > >   
> > > > Long story short, this should be good enough for the cases we actually can
> > > > handle? What am I missing?    
> > > 
> > > I am not sure I follow. My point is that I fail to see any added value
> > > of the check as it doesn't prevent the race (it fundamentally cannot as
> > > the page can be poisoned at any time) but the failure path doesn't
> > > put_page which is incorrect even for hwpoison pages.  
> > 
> > Sorry, I have something to say:
> > 
> > I have noticed the ref count leak in the previous topic ,but  I don't think
> > it's a really matter. For memory recovery case for user pages, we will keep one
> > reference to the poison page so the error page will not be freed to buddy allocator.
> > which can be checked in memory_faulure() function.  
> 
> So what would happen if those pages are hwpoisoned from userspace rather
> than by HW. And repeatedly so?

Sorry, I may be not totally understand what you mean.

Do you mean hard page offline from mcelog?
If yes, I think it's not for one real UC error but for CE storms.
when we access this page in kernel, the access may success even it was marked hwpoison.

Thanks!
Aili Yao 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-06  7:28             ` Aili Yao
@ 2021-05-06  7:55               ` Michal Hocko
  2021-05-06  8:52                 ` Aili Yao
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-05-06  7:55 UTC (permalink / raw)
  To: Aili Yao
  Cc: David Hildenbrand, linux-kernel, Andrew Morton,
	Michael S. Tsirkin, Jason Wang, Alexey Dobriyan, Mike Rapoport,
	Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Jiri Bohac, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Naoya Horiguchi, linux-hyperv,
	virtualization, linux-fsdevel, linux-mm, yaoaili126

On Thu 06-05-21 15:28:05, Aili Yao wrote:
> On Thu, 6 May 2021 09:06:14 +0200
> Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Thu 06-05-21 08:56:11, Aili Yao wrote:
> > > On Wed, 5 May 2021 15:27:39 +0200
> > > Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > > I am not sure I follow. My point is that I fail to see any added value
> > > > of the check as it doesn't prevent the race (it fundamentally cannot as
> > > > the page can be poisoned at any time) but the failure path doesn't
> > > > put_page which is incorrect even for hwpoison pages.  
> > > 
> > > Sorry, I have something to say:
> > > 
> > > I have noticed the ref count leak in the previous topic ,but  I don't think
> > > it's a really matter. For memory recovery case for user pages, we will keep one
> > > reference to the poison page so the error page will not be freed to buddy allocator.
> > > which can be checked in memory_faulure() function.  
> > 
> > So what would happen if those pages are hwpoisoned from userspace rather
> > than by HW. And repeatedly so?
> 
> Sorry, I may be not totally understand what you mean.
> 
> Do you mean hard page offline from mcelog?

No I mean soft hwpoison from userspace (e.g. by MADV_HWPOISON but there
are other interfaces AFAIK).

And just to be explicit. All those interfaces are root only
(CAP_SYS_ADMIN) so I am not really worried about any malitious abuse of
the reference leak. I am mostly concerned that this is obviously broken
without a good reason. The most trivial fix would have been to put_page
in the return path but as I've mentioned in other email thread the fix
really needs a deeper thought and consider other things.

Hope that clarifies this some more.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()
  2021-05-06  7:55               ` Michal Hocko
@ 2021-05-06  8:52                 ` Aili Yao
  0 siblings, 0 replies; 37+ messages in thread
From: Aili Yao @ 2021-05-06  8:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-kernel, Andrew Morton,
	Michael S. Tsirkin, Jason Wang, Alexey Dobriyan, Mike Rapoport,
	Matthew Wilcox (Oracle),
	Oscar Salvador, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Jiri Bohac, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Naoya Horiguchi, linux-hyperv,
	virtualization, linux-fsdevel, linux-mm, yaoaili126

On Thu, 6 May 2021 09:55:51 +0200
Michal Hocko <mhocko@suse.com> wrote:

> On Thu 06-05-21 15:28:05, Aili Yao wrote:
> > On Thu, 6 May 2021 09:06:14 +0200
> > Michal Hocko <mhocko@suse.com> wrote:
> >   
> > > On Thu 06-05-21 08:56:11, Aili Yao wrote:  
> > > > On Wed, 5 May 2021 15:27:39 +0200
> > > > Michal Hocko <mhocko@suse.com> wrote:  
> [...]
> > > > > I am not sure I follow. My point is that I fail to see any added value
> > > > > of the check as it doesn't prevent the race (it fundamentally cannot as
> > > > > the page can be poisoned at any time) but the failure path doesn't
> > > > > put_page which is incorrect even for hwpoison pages.    
> > > > 
> > > > Sorry, I have something to say:
> > > > 
> > > > I have noticed the ref count leak in the previous topic ,but  I don't think
> > > > it's a really matter. For memory recovery case for user pages, we will keep one
> > > > reference to the poison page so the error page will not be freed to buddy allocator.
> > > > which can be checked in memory_faulure() function.    
> > > 
> > > So what would happen if those pages are hwpoisoned from userspace rather
> > > than by HW. And repeatedly so?  
> > 
> > Sorry, I may be not totally understand what you mean.
> > 
> > Do you mean hard page offline from mcelog?  
> 
> No I mean soft hwpoison from userspace (e.g. by MADV_HWPOISON but there
> are other interfaces AFAIK).
> 
> And just to be explicit. All those interfaces are root only
> (CAP_SYS_ADMIN) so I am not really worried about any malitious abuse of
> the reference leak. I am mostly concerned that this is obviously broken
> without a good reason. The most trivial fix would have been to put_page
> in the return path but as I've mentioned in other email thread the fix
> really needs a deeper thought and consider other things.
> 
> Hope that clarifies this some more.

Thanks, got it!
Yes, there are some test scenarios that should be covered.

But for test, the default SIGBUS handlers is usually replaced, and the test process
may not hit the coredump code.

Anyway, there is a ref leak in the normal enviorments and better to be fixed.

Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2021-05-06  8:52 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-29 12:25 [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
2021-04-29 12:25 ` [PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER David Hildenbrand
2021-05-02  6:31   ` Mike Rapoport
2021-04-29 12:25 ` [PATCH v1 2/7] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM David Hildenbrand
2021-05-02  6:31   ` Mike Rapoport
2021-04-29 12:25 ` [PATCH v1 3/7] mm: rename and move page_is_poisoned() David Hildenbrand
2021-05-02  6:32   ` Mike Rapoport
2021-05-05 13:13   ` Michal Hocko
2021-05-05 13:17     ` David Hildenbrand
2021-05-05 13:27       ` Michal Hocko
2021-05-05 13:39         ` David Hildenbrand
2021-05-05 13:45           ` Michal Hocko
2021-05-06  1:08             ` Aili Yao
2021-05-06  0:56         ` Aili Yao
2021-05-06  7:06           ` Michal Hocko
2021-05-06  7:28             ` Aili Yao
2021-05-06  7:55               ` Michal Hocko
2021-05-06  8:52                 ` Aili Yao
2021-04-29 12:25 ` [PATCH v1 4/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
2021-05-02  6:32   ` Mike Rapoport
2021-04-29 12:25 ` [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline() David Hildenbrand
2021-05-02  6:33   ` Mike Rapoport
2021-05-03  8:11     ` David Hildenbrand
2021-05-05 13:24   ` Michal Hocko
2021-05-05 15:10     ` David Hildenbrand
2021-05-05 17:41       ` Mike Rapoport
2021-04-29 12:25 ` [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
2021-05-02  6:33   ` Mike Rapoport
2021-05-03  8:16     ` David Hildenbrand
2021-05-03  8:23   ` Michael S. Tsirkin
2021-04-29 12:25 ` [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze) David Hildenbrand
2021-05-02  6:34   ` Mike Rapoport
2021-05-03  8:28     ` David Hildenbrand
2021-05-03  9:28       ` Mike Rapoport
2021-05-03 10:13         ` David Hildenbrand
2021-05-03 11:33           ` Mike Rapoport
2021-05-03 11:35             ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).