linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
@ 2021-05-14 17:22 David Hildenbrand
  2021-05-14 17:22 ` [PATCH v2 1/6] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER David Hildenbrand
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-05-14 17:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

Looking for places where the kernel might unconditionally read
PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
needs some more love to not touch some other pages we really don't want to
read -- i.e., hwpoisoned ones.

Examples for PageOffline() pages are pages inflated in a balloon,
memory unplugged via virtio-mem, and partially-present sections in
memory added by the Hyper-V balloon.

When reading pages inflated in a balloon, we essentially produce
unnecessary load in the hypervisor; holes in partially present sections in
case of Hyper-V are not accessible and already were a problem for
/proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In
the future, virtio-mem might disallow reading unplugged memory -- marked
as PageOffline() -- in some environments, resulting in undefined behavior
when accessed; therefore, I'm trying to identify and rework all these
(corner) cases.

With this series, there is really only access via /dev/mem, /proc/vmcore
and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case
use case -- we won't care for now if someone explicitly tries to do nasty
things by reading from/writing to physical addresses we better not touch.
/dev/mem is a use case we won't support for virtio-mem, at least for now,
so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
/proc/vmcore is really only a problem when dumping the old kernel via
something that's not makedumpfile (read: basically never), however, we'll
try sanitizing that as well in the second kernel in the future.

Tested via kcore_dump:
	https://github.com/schlafwandler/kcore_dump

v1 -> v2:
- Dropped "mm: rename and move page_is_poisoned()"
- "fs/proc/kcore: don't read offline sections, logically offline pages ..."
-- Add is_page_hwpoison() in page-flags.h along with a comment
- "mm: introduce page_offline_(begin|end|freeze|thaw) to ..."
-- s/unfreeze/thaw/
-- Add a comment to PageOffline documentation in page-flags.h
- "virtio-mem: use page_offline_(start|end) when setting PageOffline()"
-- Extend patch description
- "fs/proc/kcore: use page_offline_(freeze|thaw)"
-- Simplify freeze/thaw logic
- Collected acks/rbs

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: linux-hyperv@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org

David Hildenbrand (6):
  fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
  fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
  fs/proc/kcore: don't read offline sections, logically offline pages
    and hwpoisoned pages
  mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize
    setting PageOffline()
  virtio-mem: use page_offline_(start|end) when setting PageOffline()
  fs/proc/kcore: use page_offline_(freeze|thaw)

 drivers/virtio/virtio_mem.c |  2 ++
 fs/proc/kcore.c             | 67 ++++++++++++++++++++++++++++++-------
 include/linux/kcore.h       |  3 --
 include/linux/page-flags.h  | 22 ++++++++++++
 mm/util.c                   | 40 ++++++++++++++++++++++
 5 files changed, 118 insertions(+), 16 deletions(-)


base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5
-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/6] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
  2021-05-14 17:22 [PATCH v2 0/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
@ 2021-05-14 17:22 ` David Hildenbrand
  2021-05-14 17:22 ` [PATCH v2 2/6] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM David Hildenbrand
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-05-14 17:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm, Mike Rapoport

Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()")
removed the last user of KCORE_REMAP.

Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
dumping vsyscall user page") removed the last user of KCORE_OTHER.

Let's drop both types. While at it, also drop vaddr in "struct
kcore_list", used by KCORE_REMAP only.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c       | 7 ++-----
 include/linux/kcore.h | 3 ---
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 4d2e64e9016c..09f77d3c6e15 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -380,11 +380,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			phdr->p_type = PT_LOAD;
 			phdr->p_flags = PF_R | PF_W | PF_X;
 			phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
-			if (m->type == KCORE_REMAP)
-				phdr->p_vaddr = (size_t)m->vaddr;
-			else
-				phdr->p_vaddr = (size_t)m->addr;
-			if (m->type == KCORE_RAM || m->type == KCORE_REMAP)
+			phdr->p_vaddr = (size_t)m->addr;
+			if (m->type == KCORE_RAM)
 				phdr->p_paddr = __pa(m->addr);
 			else if (m->type == KCORE_TEXT)
 				phdr->p_paddr = __pa_symbol(m->addr);
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index da676cdbd727..86c0f1d18998 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -11,14 +11,11 @@ enum kcore_type {
 	KCORE_RAM,
 	KCORE_VMEMMAP,
 	KCORE_USER,
-	KCORE_OTHER,
-	KCORE_REMAP,
 };
 
 struct kcore_list {
 	struct list_head list;
 	unsigned long addr;
-	unsigned long vaddr;
 	size_t size;
 	int type;
 };
-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 2/6] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
  2021-05-14 17:22 [PATCH v2 0/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
  2021-05-14 17:22 ` [PATCH v2 1/6] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER David Hildenbrand
@ 2021-05-14 17:22 ` David Hildenbrand
  2021-05-14 17:22 ` [PATCH v2 3/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-05-14 17:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm, Mike Rapoport

Let's resturcture the code, using switch-case, and checking pfn_is_ram()
only when we are dealing with KCORE_RAM.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 09f77d3c6e15..ed6fbb3bd50c 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -483,25 +483,36 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 				goto out;
 			}
 			m = NULL;	/* skip the list anchor */
-		} else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
-			if (clear_user(buffer, tsz)) {
-				ret = -EFAULT;
-				goto out;
-			}
-		} else if (m->type == KCORE_VMALLOC) {
+			goto skip;
+		}
+
+		switch (m->type) {
+		case KCORE_VMALLOC:
 			vread(buf, (char *)start, tsz);
 			/* we have to zero-fill user buffer even if no read */
 			if (copy_to_user(buffer, buf, tsz)) {
 				ret = -EFAULT;
 				goto out;
 			}
-		} else if (m->type == KCORE_USER) {
+			break;
+		case KCORE_USER:
 			/* User page is handled prior to normal kernel page: */
 			if (copy_to_user(buffer, (char *)start, tsz)) {
 				ret = -EFAULT;
 				goto out;
 			}
-		} else {
+			break;
+		case KCORE_RAM:
+			if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+				if (clear_user(buffer, tsz)) {
+					ret = -EFAULT;
+					goto out;
+				}
+				break;
+			}
+			fallthrough;
+		case KCORE_VMEMMAP:
+		case KCORE_TEXT:
 			if (kern_addr_valid(start)) {
 				/*
 				 * Using bounce buffer to bypass the
@@ -525,7 +536,15 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 					goto out;
 				}
 			}
+			break;
+		default:
+			pr_warn_once("Unhandled KCORE type: %d\n", m->type);
+			if (clear_user(buffer, tsz)) {
+				ret = -EFAULT;
+				goto out;
+			}
 		}
+skip:
 		buflen -= tsz;
 		*fpos += tsz;
 		buffer += tsz;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 3/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
  2021-05-14 17:22 [PATCH v2 0/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
  2021-05-14 17:22 ` [PATCH v2 1/6] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER David Hildenbrand
  2021-05-14 17:22 ` [PATCH v2 2/6] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM David Hildenbrand
@ 2021-05-14 17:22 ` David Hildenbrand
  2021-05-25  8:09   ` Oscar Salvador
  2021-05-14 17:22 ` [PATCH v2 4/6] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() David Hildenbrand
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-05-14 17:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm, Mike Rapoport

Let's avoid reading:

1) Offline memory sections: the content of offline memory sections is stale
   as the memory is effectively unused by the kernel. On s390x with standby
   memory, offline memory sections (belonging to offline storage
   increments) are not accessible. With virtio-mem and the hyper-v balloon,
   we can have unavailable memory chunks that should not be accessed inside
   offline memory sections. Last but not least, offline memory sections
   might contain hwpoisoned pages which we can no longer identify
   because the memmap is stale.

2) PG_offline pages: logically offline pages that are documented as
   "The content of these pages is effectively stale. Such pages should not
    be touched (read/write/dump/save) except by their owner.".
   Examples include pages inflated in a balloon or unavailble memory
   ranges inside hotplugged memory sections with virtio-mem or the hyper-v
   balloon.

3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
   As documented: "Accessing is not safe since it may cause another machine
   check. Don't touch!"

Introduce is_page_hwpoison(), adding a comment that it is inherently
racy but best we can really do.

Reading /proc/kcore now performs similar checks as when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
It's also similar to hibernation code, however, we don't skip hwpoisoned
pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.

Note 1: we can race against memory offlining code, especially
memory going offline and getting unplugged: however, we will properly tear
down the identity mapping and handle faults gracefully when accessing
this memory from kcore code.

Note 2: we can race against drivers setting PageOffline() and turning
memory inaccessible in the hypervisor. We'll handle this in a follow-up
patch.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c            | 14 +++++++++++++-
 include/linux/page-flags.h | 12 ++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ed6fbb3bd50c..92ff1e4436cb 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -465,6 +465,9 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 
 	m = NULL;
 	while (buflen) {
+		struct page *page;
+		unsigned long pfn;
+
 		/*
 		 * If this is the first iteration or the address is not within
 		 * the previous entry, search for a matching entry.
@@ -503,7 +506,16 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			}
 			break;
 		case KCORE_RAM:
-			if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+			pfn = __pa(start) >> PAGE_SHIFT;
+			page = pfn_to_online_page(pfn);
+
+			/*
+			 * Don't read offline sections, logically offline pages
+			 * (e.g., inflated in a balloon), hwpoisoned pages,
+			 * and explicitly excluded physical ranges.
+			 */
+			if (!page || PageOffline(page) ||
+			    is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
 				if (clear_user(buffer, tsz)) {
 					ret = -EFAULT;
 					goto out;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 04a34c08e0a6..daed82744f4b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -694,6 +694,18 @@ PAGEFLAG_FALSE(DoubleMap)
 	TESTSCFLAG_FALSE(DoubleMap)
 #endif
 
+/*
+ * Check if a page is currently marked HWPoisoned. Note that this check is
+ * best effort only and inherently racy: there is no way to synchronize with
+ * failing hardware.
+ */
+static inline bool is_page_hwpoison(struct page *page)
+{
+	if (PageHWPoison(page))
+		return true;
+	return PageHuge(page) && PageHWPoison(compound_head(page));
+}
+
 /*
  * For pages that are never mapped to userspace (and aren't PageSlab),
  * page_type may be used.  Because it is initialised to -1, we invert the
-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 4/6] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline()
  2021-05-14 17:22 [PATCH v2 0/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-05-14 17:22 ` [PATCH v2 3/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
@ 2021-05-14 17:22 ` David Hildenbrand
  2021-05-17  6:43   ` Mike Rapoport
  2021-05-25  8:16   ` Oscar Salvador
  2021-05-14 17:22 ` [PATCH v2 5/6] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
  2021-05-14 17:22 ` [PATCH v2 6/6] fs/proc/kcore: use page_offline_(freeze|thaw) David Hildenbrand
  5 siblings, 2 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-05-14 17:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

A driver might set a page logically offline -- PageOffline() -- and
turn the page inaccessible in the hypervisor; after that, access to page
content can be fatal. One example is virtio-mem; while unplugged memory
-- marked as PageOffline() can currently be read in the hypervisor, this
will no longer be the case in the future; for example, when having
a virtio-mem device backed by huge pages in the hypervisor.

Some special PFN walkers -- i.e., /proc/kcore -- read content of random
pages after checking PageOffline(); however, these PFN walkers can race
with drivers that set PageOffline().

Let's introduce page_offline_(begin|end|freeze|thaw) for
synchronizing.

page_offline_freeze()/page_offline_thaw() allows for a subsystem to
synchronize with such drivers, achieving that a page cannot be set
PageOffline() while frozen.

page_offline_begin()/page_offline_end() is used by drivers that care about
such races when setting a page PageOffline().

For simplicity, use a rwsem for now; neither drivers nor users are
performance sensitive.

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h | 10 ++++++++++
 mm/util.c                  | 40 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index daed82744f4b..ea2df9a247b3 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -769,9 +769,19 @@ PAGE_TYPE_OPS(Buddy, buddy)
  * relies on this feature is aware that re-onlining the memory block will
  * require to re-set the pages PageOffline() and not giving them to the
  * buddy via online_page_callback_t.
+ *
+ * There are drivers that mark a page PageOffline() and do not expect any
+ * further access to page content. PFN walkers that read content of random
+ * pages should check PageOffline() and synchronize with such drivers using
+ * page_offline_freeze()/page_offline_thaw().
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+extern void page_offline_freeze(void);
+extern void page_offline_thaw(void);
+extern void page_offline_begin(void);
+extern void page_offline_end(void);
+
 /*
  * Marks pages in use as page tables.
  */
diff --git a/mm/util.c b/mm/util.c
index a8bf17f18a81..a034525e7ba2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1010,3 +1010,43 @@ void mem_dump_obj(void *object)
 }
 EXPORT_SYMBOL_GPL(mem_dump_obj);
 #endif
+
+/*
+ * A driver might set a page logically offline -- PageOffline() -- and
+ * turn the page inaccessible in the hypervisor; after that, access to page
+ * content can be fatal.
+ *
+ * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
+ * pages after checking PageOffline(); however, these PFN walkers can race
+ * with drivers that set PageOffline().
+ *
+ * page_offline_freeze()/page_offline_thaw() allows for a subsystem to
+ * synchronize with such drivers, achieving that a page cannot be set
+ * PageOffline() while frozen.
+ *
+ * page_offline_begin()/page_offline_end() is used by drivers that care about
+ * such races when setting a page PageOffline().
+ */
+static DECLARE_RWSEM(page_offline_rwsem);
+
+void page_offline_freeze(void)
+{
+	down_read(&page_offline_rwsem);
+}
+
+void page_offline_thaw(void)
+{
+	up_read(&page_offline_rwsem);
+}
+
+void page_offline_begin(void)
+{
+	down_write(&page_offline_rwsem);
+}
+EXPORT_SYMBOL(page_offline_begin);
+
+void page_offline_end(void)
+{
+	up_write(&page_offline_rwsem);
+}
+EXPORT_SYMBOL(page_offline_end);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 5/6] virtio-mem: use page_offline_(start|end) when setting PageOffline()
  2021-05-14 17:22 [PATCH v2 0/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-05-14 17:22 ` [PATCH v2 4/6] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() David Hildenbrand
@ 2021-05-14 17:22 ` David Hildenbrand
  2021-05-17  6:43   ` Mike Rapoport
  2021-05-25  8:20   ` Oscar Salvador
  2021-05-14 17:22 ` [PATCH v2 6/6] fs/proc/kcore: use page_offline_(freeze|thaw) David Hildenbrand
  5 siblings, 2 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-05-14 17:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

Let's properly use page_offline_(start|end) to synchronize setting
PageOffline(), so we won't have valid page access to unplugged memory
regions from /proc/kcore.

Existing balloon implementations usually allow reading inflated memory;
doing so might result in unnecessary overhead in the hypervisor, which
is currently the case with virtio-mem.

For future virtio-mem use cases, it will be different when using shmem,
huge pages, !anonymous private mappings, ... as backing storage for a VM.
virtio-mem unplugged memory must no longer be accessed and access might
result in undefined behavior. There will be a virtio spec extension to
document this change, including a new feature flag indicating the
changed behavior. We really don't want to race against PFN walkers
reading random page content.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/virtio/virtio_mem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 10ec60d81e84..dc2a2e2b2ff8 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 static void virtio_mem_set_fake_offline(unsigned long pfn,
 					unsigned long nr_pages, bool onlined)
 {
+	page_offline_begin();
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
@@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
 			ClearPageReserved(page);
 		}
 	}
+	page_offline_end();
 }
 
 /*
-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 6/6] fs/proc/kcore: use page_offline_(freeze|thaw)
  2021-05-14 17:22 [PATCH v2 0/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-05-14 17:22 ` [PATCH v2 5/6] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
@ 2021-05-14 17:22 ` David Hildenbrand
  2021-05-17  6:44   ` Mike Rapoport
  2021-05-25  8:21   ` Oscar Salvador
  5 siblings, 2 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-05-14 17:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

Let's properly synchronize with drivers that set PageOffline().
Unfreeze/thaw every now and then, so drivers that want to set PageOffline()
can make progress.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 92ff1e4436cb..982e694aae77 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -313,6 +313,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 {
 	char *buf = file->private_data;
 	size_t phdrs_offset, notes_offset, data_offset;
+	size_t page_offline_frozen = 1;
 	size_t phdrs_len, notes_len;
 	struct kcore_list *m;
 	size_t tsz;
@@ -322,6 +323,11 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 	int ret = 0;
 
 	down_read(&kclist_lock);
+	/*
+	 * Don't race against drivers that set PageOffline() and expect no
+	 * further page access.
+	 */
+	page_offline_freeze();
 
 	get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
 	phdrs_offset = sizeof(struct elfhdr);
@@ -480,6 +486,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			}
 		}
 
+		if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
+			page_offline_thaw();
+			cond_resched();
+			page_offline_freeze();
+		}
+
 		if (&m->list == &kclist_head) {
 			if (clear_user(buffer, tsz)) {
 				ret = -EFAULT;
@@ -565,6 +577,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 	}
 
 out:
+	page_offline_thaw();
 	up_read(&kclist_lock);
 	if (ret)
 		return ret;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 4/6] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline()
  2021-05-14 17:22 ` [PATCH v2 4/6] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() David Hildenbrand
@ 2021-05-17  6:43   ` Mike Rapoport
  2021-05-17 15:18     ` David Hildenbrand
  2021-05-25  8:16   ` Oscar Salvador
  1 sibling, 1 reply; 15+ messages in thread
From: Mike Rapoport @ 2021-05-17  6:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Fri, May 14, 2021 at 07:22:45PM +0200, David Hildenbrand wrote:
> A driver might set a page logically offline -- PageOffline() -- and
> turn the page inaccessible in the hypervisor; after that, access to page
> content can be fatal. One example is virtio-mem; while unplugged memory
> -- marked as PageOffline() can currently be read in the hypervisor, this
> will no longer be the case in the future; for example, when having
> a virtio-mem device backed by huge pages in the hypervisor.
> 
> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> pages after checking PageOffline(); however, these PFN walkers can race
> with drivers that set PageOffline().
> 
> Let's introduce page_offline_(begin|end|freeze|thaw) for
> synchronizing.
> 
> page_offline_freeze()/page_offline_thaw() allows for a subsystem to
> synchronize with such drivers, achieving that a page cannot be set
> PageOffline() while frozen.
> 
> page_offline_begin()/page_offline_end() is used by drivers that care about
> such races when setting a page PageOffline().
> 
> For simplicity, use a rwsem for now; neither drivers nor users are
> performance sensitive.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

One nit below, otherwise

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  include/linux/page-flags.h | 10 ++++++++++
>  mm/util.c                  | 40 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index daed82744f4b..ea2df9a247b3 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -769,9 +769,19 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   * relies on this feature is aware that re-onlining the memory block will
>   * require to re-set the pages PageOffline() and not giving them to the
>   * buddy via online_page_callback_t.
> + *
> + * There are drivers that mark a page PageOffline() and do not expect any

Maybe "and expect there won't be any further access"...

> + * further access to page content. PFN walkers that read content of random
> + * pages should check PageOffline() and synchronize with such drivers using
> + * page_offline_freeze()/page_offline_thaw().
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> +extern void page_offline_freeze(void);
> +extern void page_offline_thaw(void);
> +extern void page_offline_begin(void);
> +extern void page_offline_end(void);
> +
>  /*
>   * Marks pages in use as page tables.
>   */
> diff --git a/mm/util.c b/mm/util.c
> index a8bf17f18a81..a034525e7ba2 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1010,3 +1010,43 @@ void mem_dump_obj(void *object)
>  }
>  EXPORT_SYMBOL_GPL(mem_dump_obj);
>  #endif
> +
> +/*
> + * A driver might set a page logically offline -- PageOffline() -- and
> + * turn the page inaccessible in the hypervisor; after that, access to page
> + * content can be fatal.
> + *
> + * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> + * pages after checking PageOffline(); however, these PFN walkers can race
> + * with drivers that set PageOffline().
> + *
> + * page_offline_freeze()/page_offline_thaw() allows for a subsystem to
> + * synchronize with such drivers, achieving that a page cannot be set
> + * PageOffline() while frozen.
> + *
> + * page_offline_begin()/page_offline_end() is used by drivers that care about
> + * such races when setting a page PageOffline().
> + */
> +static DECLARE_RWSEM(page_offline_rwsem);
> +
> +void page_offline_freeze(void)
> +{
> +	down_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_thaw(void)
> +{
> +	up_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_begin(void)
> +{
> +	down_write(&page_offline_rwsem);
> +}
> +EXPORT_SYMBOL(page_offline_begin);
> +
> +void page_offline_end(void)
> +{
> +	up_write(&page_offline_rwsem);
> +}
> +EXPORT_SYMBOL(page_offline_end);
> -- 
> 2.31.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/6] virtio-mem: use page_offline_(start|end) when setting PageOffline()
  2021-05-14 17:22 ` [PATCH v2 5/6] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
@ 2021-05-17  6:43   ` Mike Rapoport
  2021-05-25  8:20   ` Oscar Salvador
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Rapoport @ 2021-05-17  6:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Fri, May 14, 2021 at 07:22:46PM +0200, David Hildenbrand wrote:
> Let's properly use page_offline_(start|end) to synchronize setting
> PageOffline(), so we won't have valid page access to unplugged memory
> regions from /proc/kcore.
> 
> Existing balloon implementations usually allow reading inflated memory;
> doing so might result in unnecessary overhead in the hypervisor, which
> is currently the case with virtio-mem.
> 
> For future virtio-mem use cases, it will be different when using shmem,
> huge pages, !anonymous private mappings, ... as backing storage for a VM.
> virtio-mem unplugged memory must no longer be accessed and access might
> result in undefined behavior. There will be a virtio spec extension to
> document this change, including a new feature flag indicating the
> changed behavior. We really don't want to race against PFN walkers
> reading random page content.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/virtio/virtio_mem.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 10ec60d81e84..dc2a2e2b2ff8 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
>  static void virtio_mem_set_fake_offline(unsigned long pfn,
>  					unsigned long nr_pages, bool onlined)
>  {
> +	page_offline_begin();
>  	for (; nr_pages--; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
>  
> @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
>  			ClearPageReserved(page);
>  		}
>  	}
> +	page_offline_end();
>  }
>  
>  /*
> -- 
> 2.31.1
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 6/6] fs/proc/kcore: use page_offline_(freeze|thaw)
  2021-05-14 17:22 ` [PATCH v2 6/6] fs/proc/kcore: use page_offline_(freeze|thaw) David Hildenbrand
@ 2021-05-17  6:44   ` Mike Rapoport
  2021-05-25  8:21   ` Oscar Salvador
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Rapoport @ 2021-05-17  6:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On Fri, May 14, 2021 at 07:22:47PM +0200, David Hildenbrand wrote:
> Let's properly synchronize with drivers that set PageOffline().
> Unfreeze/thaw every now and then, so drivers that want to set PageOffline()
> can make progress.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  fs/proc/kcore.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 92ff1e4436cb..982e694aae77 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -313,6 +313,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  {
>  	char *buf = file->private_data;
>  	size_t phdrs_offset, notes_offset, data_offset;
> +	size_t page_offline_frozen = 1;
>  	size_t phdrs_len, notes_len;
>  	struct kcore_list *m;
>  	size_t tsz;
> @@ -322,6 +323,11 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  	int ret = 0;
>  
>  	down_read(&kclist_lock);
> +	/*
> +	 * Don't race against drivers that set PageOffline() and expect no
> +	 * further page access.
> +	 */
> +	page_offline_freeze();
>  
>  	get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
>  	phdrs_offset = sizeof(struct elfhdr);
> @@ -480,6 +486,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  			}
>  		}
>  
> +		if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> +			page_offline_thaw();
> +			cond_resched();
> +			page_offline_freeze();
> +		}
> +
>  		if (&m->list == &kclist_head) {
>  			if (clear_user(buffer, tsz)) {
>  				ret = -EFAULT;
> @@ -565,6 +577,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  	}
>  
>  out:
> +	page_offline_thaw();
>  	up_read(&kclist_lock);
>  	if (ret)
>  		return ret;
> -- 
> 2.31.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 4/6] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline()
  2021-05-17  6:43   ` Mike Rapoport
@ 2021-05-17 15:18     ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-05-17 15:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Matthew Wilcox (Oracle),
	Oscar Salvador, Michal Hocko, Roman Gushchin, Alex Shi,
	Steven Price, Mike Kravetz, Aili Yao, Jiri Bohac,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Naoya Horiguchi, linux-hyperv, virtualization, linux-fsdevel,
	linux-mm

On 17.05.21 08:43, Mike Rapoport wrote:
> On Fri, May 14, 2021 at 07:22:45PM +0200, David Hildenbrand wrote:
>> A driver might set a page logically offline -- PageOffline() -- and
>> turn the page inaccessible in the hypervisor; after that, access to page
>> content can be fatal. One example is virtio-mem; while unplugged memory
>> -- marked as PageOffline() can currently be read in the hypervisor, this
>> will no longer be the case in the future; for example, when having
>> a virtio-mem device backed by huge pages in the hypervisor.
>>
>> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
>> pages after checking PageOffline(); however, these PFN walkers can race
>> with drivers that set PageOffline().
>>
>> Let's introduce page_offline_(begin|end|freeze|thaw) for
>> synchronizing.
>>
>> page_offline_freeze()/page_offline_thaw() allows for a subsystem to
>> synchronize with such drivers, achieving that a page cannot be set
>> PageOffline() while frozen.
>>
>> page_offline_begin()/page_offline_end() is used by drivers that care about
>> such races when setting a page PageOffline().
>>
>> For simplicity, use a rwsem for now; neither drivers nor users are
>> performance sensitive.
>>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> One nit below, otherwise
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> 
>> ---
>>   include/linux/page-flags.h | 10 ++++++++++
>>   mm/util.c                  | 40 ++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 50 insertions(+)
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index daed82744f4b..ea2df9a247b3 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -769,9 +769,19 @@ PAGE_TYPE_OPS(Buddy, buddy)
>>    * relies on this feature is aware that re-onlining the memory block will
>>    * require to re-set the pages PageOffline() and not giving them to the
>>    * buddy via online_page_callback_t.
>> + *
>> + * There are drivers that mark a page PageOffline() and do not expect any
> 
> Maybe "and expect there won't be any further access"...
> 

Thanks, makes sense.

I'll wait a bit before I resend.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
  2021-05-14 17:22 ` [PATCH v2 3/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
@ 2021-05-25  8:09   ` Oscar Salvador
  0 siblings, 0 replies; 15+ messages in thread
From: Oscar Salvador @ 2021-05-25  8:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Michal Hocko, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm,
	Mike Rapoport

On Fri, May 14, 2021 at 07:22:44PM +0200, David Hildenbrand wrote:
> Let's avoid reading:
> 
> 1) Offline memory sections: the content of offline memory sections is stale
>    as the memory is effectively unused by the kernel. On s390x with standby
>    memory, offline memory sections (belonging to offline storage
>    increments) are not accessible. With virtio-mem and the hyper-v balloon,
>    we can have unavailable memory chunks that should not be accessed inside
>    offline memory sections. Last but not least, offline memory sections
>    might contain hwpoisoned pages which we can no longer identify
>    because the memmap is stale.
> 
> 2) PG_offline pages: logically offline pages that are documented as
>    "The content of these pages is effectively stale. Such pages should not
>     be touched (read/write/dump/save) except by their owner.".
>    Examples include pages inflated in a balloon or unavailble memory
>    ranges inside hotplugged memory sections with virtio-mem or the hyper-v
>    balloon.
> 
> 3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
>    As documented: "Accessing is not safe since it may cause another machine
>    check. Don't touch!"
> 
> Introduce is_page_hwpoison(), adding a comment that it is inherently
> racy but best we can really do.
> 
> Reading /proc/kcore now performs similar checks as when reading
> /proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
> It's also similar to hibernation code, however, we don't skip hwpoisoned
> pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.
> 
> Note 1: we can race against memory offlining code, especially
> memory going offline and getting unplugged: however, we will properly tear
> down the identity mapping and handle faults gracefully when accessing
> this memory from kcore code.
> 
> Note 2: we can race against drivers setting PageOffline() and turning
> memory inaccessible in the hypervisor. We'll handle this in a follow-up
> patch.
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 4/6] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline()
  2021-05-14 17:22 ` [PATCH v2 4/6] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() David Hildenbrand
  2021-05-17  6:43   ` Mike Rapoport
@ 2021-05-25  8:16   ` Oscar Salvador
  1 sibling, 0 replies; 15+ messages in thread
From: Oscar Salvador @ 2021-05-25  8:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Michal Hocko, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On Fri, May 14, 2021 at 07:22:45PM +0200, David Hildenbrand wrote:
> A driver might set a page logically offline -- PageOffline() -- and
> turn the page inaccessible in the hypervisor; after that, access to page
> content can be fatal. One example is virtio-mem; while unplugged memory
> -- marked as PageOffline() can currently be read in the hypervisor, this
> will no longer be the case in the future; for example, when having
> a virtio-mem device backed by huge pages in the hypervisor.
> 
> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> pages after checking PageOffline(); however, these PFN walkers can race
> with drivers that set PageOffline().
> 
> Let's introduce page_offline_(begin|end|freeze|thaw) for
> synchronizing.
> 
> page_offline_freeze()/page_offline_thaw() allows for a subsystem to
> synchronize with such drivers, achieving that a page cannot be set
> PageOffline() while frozen.
> 
> page_offline_begin()/page_offline_end() is used by drivers that care about
> such races when setting a page PageOffline().
> 
> For simplicity, use a rwsem for now; neither drivers nor users are
> performance sensitive.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/6] virtio-mem: use page_offline_(start|end) when setting PageOffline()
  2021-05-14 17:22 ` [PATCH v2 5/6] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
  2021-05-17  6:43   ` Mike Rapoport
@ 2021-05-25  8:20   ` Oscar Salvador
  1 sibling, 0 replies; 15+ messages in thread
From: Oscar Salvador @ 2021-05-25  8:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Michal Hocko, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On Fri, May 14, 2021 at 07:22:46PM +0200, David Hildenbrand wrote:
> Let's properly use page_offline_(start|end) to synchronize setting
> PageOffline(), so we won't have valid page access to unplugged memory
> regions from /proc/kcore.
> 
> Existing balloon implementations usually allow reading inflated memory;
> doing so might result in unnecessary overhead in the hypervisor, which
> is currently the case with virtio-mem.
> 
> For future virtio-mem use cases, it will be different when using shmem,
> huge pages, !anonymous private mappings, ... as backing storage for a VM.
> virtio-mem unplugged memory must no longer be accessed and access might
> result in undefined behavior. There will be a virtio spec extension to
> document this change, including a new feature flag indicating the
> changed behavior. We really don't want to race against PFN walkers
> reading random page content.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 6/6] fs/proc/kcore: use page_offline_(freeze|thaw)
  2021-05-14 17:22 ` [PATCH v2 6/6] fs/proc/kcore: use page_offline_(freeze|thaw) David Hildenbrand
  2021-05-17  6:44   ` Mike Rapoport
@ 2021-05-25  8:21   ` Oscar Salvador
  1 sibling, 0 replies; 15+ messages in thread
From: Oscar Salvador @ 2021-05-25  8:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Michael S. Tsirkin, Jason Wang,
	Alexey Dobriyan, Mike Rapoport, Matthew Wilcox (Oracle),
	Michal Hocko, Roman Gushchin, Alex Shi, Steven Price,
	Mike Kravetz, Aili Yao, Jiri Bohac, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv, virtualization, linux-fsdevel, linux-mm

On Fri, May 14, 2021 at 07:22:47PM +0200, David Hildenbrand wrote:
> Let's properly synchronize with drivers that set PageOffline().
> Unfreeze/thaw every now and then, so drivers that want to set PageOffline()
> can make progress.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-05-25  8:21 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-14 17:22 [PATCH v2 0/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
2021-05-14 17:22 ` [PATCH v2 1/6] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER David Hildenbrand
2021-05-14 17:22 ` [PATCH v2 2/6] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM David Hildenbrand
2021-05-14 17:22 ` [PATCH v2 3/6] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages David Hildenbrand
2021-05-25  8:09   ` Oscar Salvador
2021-05-14 17:22 ` [PATCH v2 4/6] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() David Hildenbrand
2021-05-17  6:43   ` Mike Rapoport
2021-05-17 15:18     ` David Hildenbrand
2021-05-25  8:16   ` Oscar Salvador
2021-05-14 17:22 ` [PATCH v2 5/6] virtio-mem: use page_offline_(start|end) when " David Hildenbrand
2021-05-17  6:43   ` Mike Rapoport
2021-05-25  8:20   ` Oscar Salvador
2021-05-14 17:22 ` [PATCH v2 6/6] fs/proc/kcore: use page_offline_(freeze|thaw) David Hildenbrand
2021-05-17  6:44   ` Mike Rapoport
2021-05-25  8:21   ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).