Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
@ 2019-10-09  9:12 David Hildenbrand
  2019-10-09  9:37 ` Michal Hocko
  2019-10-09  9:57 ` Naoya Horiguchi
  0 siblings, 2 replies; 14+ messages in thread
From: David Hildenbrand @ 2019-10-09  9:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Qian Cai, Alexey Dobriyan,
	Naoya Horiguchi, Andrew Morton, Stephen Rothwell, Michal Hocko,
	Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport,
	Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe,
	Ira Weiny, Aneesh Kumar K.V, linux-fsdevel

There are various places where we access uninitialized memmaps, namely:
- /proc/kpagecount
- /proc/kpageflags
- /proc/kpagecgroup
- memory_failure() - which reuses stable_page_flags() from fs/proc/page.c

We have initialized memmaps either when the section is online or when
the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
garbage and in the worst case trigger kernel BUGs, especially with
CONFIG_PAGE_POISONING.

For example, not onlining a DIMM during boot and calling /proc/kpagecount
with CONFIG_PAGE_POISONING:
:/# cat /proc/kpagecount > tmp.test
[   95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
[   95.601238] #PF: supervisor read access in kernel mode
[   95.601675] #PF: error_code(0x0000) - not-present page
[   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
[   95.602596] Oops: 0000 [#1] SMP NOPTI
[   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
[   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
[   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
[   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
[   95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
[   95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
[   95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
[   95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
[   95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
[   95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
[   95.609924] FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
[   95.610599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
[   95.611686] Call Trace:
[   95.611906]  proc_reg_read+0x3c/0x60
[   95.612228]  vfs_read+0xc5/0x180
[   95.612505]  ksys_read+0x68/0xe0
[   95.612785]  do_syscall_64+0x5c/0xa0
[   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note that there are still two possible races as far as I can see:
- pfn_to_online_page() succeeding but the memory getting offlined and
  removed. get_online_mems() could help once we run into this.
- pfn_zone_device() succeeding but the memmap not being fully
  initialized yet. As the memmap is initialized outside of the memory
  hoptlug lock, get_online_mems() can't help.

Let's keep the existing interfaces working with ZONE_DEVICE memory. We
can later come back and fix these rare races and eventually speed-up the
ZONE_DEVICE detection. This patch now also makes sure we don't dump data
about memory blocks that are already offline again.

Reported-by: Qian Cai <cai@lca.pw>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/page.c           | 12 ++++++------
 include/linux/memremap.h | 11 +++++++++--
 mm/memory-failure.c      | 22 ++++++++++++++++------
 mm/memremap.c            | 19 +++++++++++--------
 4 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index decd3fe39674..76502af461e2 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -42,7 +42,8 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
 		return -EINVAL;
 
 	while (count > 0) {
-		if (pfn_valid(pfn))
+		if (pfn_valid(pfn) &&
+		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
 			ppage = pfn_to_page(pfn);
 		else
 			ppage = NULL;
@@ -97,9 +98,6 @@ u64 stable_page_flags(struct page *page)
 	if (!page)
 		return BIT_ULL(KPF_NOPAGE);
 
-	if (pfn_zone_device_reserved(page_to_pfn(page)))
-		return BIT_ULL(KPF_RESERVED);
-
 	k = page->flags;
 	u = 0;
 
@@ -218,7 +216,8 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
 		return -EINVAL;
 
 	while (count > 0) {
-		if (pfn_valid(pfn))
+		if (pfn_valid(pfn) &&
+		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
 			ppage = pfn_to_page(pfn);
 		else
 			ppage = NULL;
@@ -263,7 +262,8 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
 		return -EINVAL;
 
 	while (count > 0) {
-		if (pfn_valid(pfn))
+		if (pfn_valid(pfn) &&
+		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
 			ppage = pfn_to_page(pfn);
 		else
 			ppage = NULL;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c676e33205d3..c076bb163c2f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -123,7 +123,8 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-bool pfn_zone_device_reserved(unsigned long pfn);
+bool pfn_zone_device(unsigned long pfn);
+bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -134,7 +135,13 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
 #else
-static inline bool pfn_zone_device_reserved(unsigned long pfn)
+static inline bool pfn_zone_device(unsigned long pfn)
+{
+	return false;
+}
+
+static inline bool __pfn_zone_device(unsigned long pfn,
+				     struct dev_pagemap *pgmap)
 {
 	return false;
 }
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7ef849da8278..2b4cc6b67720 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1161,6 +1161,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	loff_t start;
 	dax_entry_t cookie;
 
+	/* memmaps of driver reserved memory is not initialized */
+	if (!__pfn_zone_device(pfn, pgmap)) {
+		pr_err("Memory failure: %#lx: driver reserved memory\n",
+			pfn);
+		rc = -ENXIO;
+		goto out;
+	}
+
 	/*
 	 * Prevent the inode from being freed while we are interrogating
 	 * the address_space, typically this would be handled by
@@ -1253,17 +1261,19 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
 
-	if (!pfn_valid(pfn)) {
+	p = pfn_to_online_page(pfn);
+	if (!p) {
+		if (pfn_valid(pfn)) {
+			pgmap = get_dev_pagemap(pfn, NULL);
+			if (pgmap)
+				return memory_failure_dev_pagemap(pfn, flags,
+								  pgmap);
+		}
 		pr_err("Memory failure: %#lx: memory outside kernel control\n",
 			pfn);
 		return -ENXIO;
 	}
 
-	pgmap = get_dev_pagemap(pfn, NULL);
-	if (pgmap)
-		return memory_failure_dev_pagemap(pfn, flags, pgmap);
-
-	p = pfn_to_page(pfn);
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);
 	if (TestSetPageHWPoison(p)) {
diff --git a/mm/memremap.c b/mm/memremap.c
index 7fed8bd32a18..9f3bb223aec7 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -73,21 +73,24 @@ static unsigned long pfn_next(unsigned long pfn)
 	return pfn + 1;
 }
 
+bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap)
+{
+	return pfn >= pfn_first(pgmap) && pfn <= pfn_end(pgmap);
+}
+
 /*
- * This returns true if the page is reserved by ZONE_DEVICE driver.
+ * Returns true if the page was initialized to the ZONE_DEVICE (especially,
+ * is not reserved for driver usage).
  */
-bool pfn_zone_device_reserved(unsigned long pfn)
+bool pfn_zone_device(unsigned long pfn)
 {
 	struct dev_pagemap *pgmap;
-	struct vmem_altmap *altmap;
-	bool ret = false;
+	bool ret;
 
 	pgmap = get_dev_pagemap(pfn, NULL);
 	if (!pgmap)
-		return ret;
-	altmap = pgmap_altmap(pgmap);
-	if (altmap && pfn < (altmap->base_pfn + altmap->reserve))
-		ret = true;
+		return false;
+	ret = __pfn_zone_device(pfn, pgmap);
 	put_dev_pagemap(pgmap);
 
 	return ret;
-- 
2.21.0



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09  9:12 [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c David Hildenbrand
@ 2019-10-09  9:37 ` Michal Hocko
  2019-10-09 10:19   ` David Hildenbrand
  2019-10-09  9:57 ` Naoya Horiguchi
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-10-09  9:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan,
	Naoya Horiguchi, Andrew Morton, Stephen Rothwell,
	Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport,
	Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe,
	Ira Weiny, Aneesh Kumar K.V, linux-fsdevel

On Wed 09-10-19 11:12:04, David Hildenbrand wrote:
> There are various places where we access uninitialized memmaps, namely:
> - /proc/kpagecount
> - /proc/kpageflags
> - /proc/kpagecgroup
> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c
> 
> We have initialized memmaps either when the section is online or when
> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
> garbage and in the worst case trigger kernel BUGs, especially with
> CONFIG_PAGE_POISONING.
> 
> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> with CONFIG_PAGE_POISONING:
> :/# cat /proc/kpagecount > tmp.test
> [   95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
> [   95.601238] #PF: supervisor read access in kernel mode
> [   95.601675] #PF: error_code(0x0000) - not-present page
> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> [   95.602596] Oops: 0000 [#1] SMP NOPTI
> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> [   95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
> [   95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
> [   95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
> [   95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
> [   95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
> [   95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
> [   95.609924] FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
> [   95.610599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
> [   95.611686] Call Trace:
> [   95.611906]  proc_reg_read+0x3c/0x60
> [   95.612228]  vfs_read+0xc5/0x180
> [   95.612505]  ksys_read+0x68/0xe0
> [   95.612785]  do_syscall_64+0x5c/0xa0
> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Note that there are still two possible races as far as I can see:
> - pfn_to_online_page() succeeding but the memory getting offlined and
>   removed. get_online_mems() could help once we run into this.
> - pfn_zone_device() succeeding but the memmap not being fully
>   initialized yet. As the memmap is initialized outside of the memory
>   hoptlug lock, get_online_mems() can't help.
> 
> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
> can later come back and fix these rare races and eventually speed-up the
> ZONE_DEVICE detection. This patch now also makes sure we don't dump data
> about memory blocks that are already offline again.

As I've already expressed already I am not really happy to have explicit
checks for zone device pages in pfn walkers. This is simply too fragile.

pfn_to_online_page makes sense because offline pages are not really in a
defined state. This would be worth a patch of its own. I remember there
was a discussion about the uninitialized zone device memmaps. It would
be really good to summarize this discussion in the changelog and
conclude why the explicit check is really good and what were other
alternatives considered.

Thanks!

> Reported-by: Qian Cai <cai@lca.pw>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  fs/proc/page.c           | 12 ++++++------
>  include/linux/memremap.h | 11 +++++++++--
>  mm/memory-failure.c      | 22 ++++++++++++++++------
>  mm/memremap.c            | 19 +++++++++++--------
>  4 files changed, 42 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index decd3fe39674..76502af461e2 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -42,7 +42,8 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
>  		return -EINVAL;
>  
>  	while (count > 0) {
> -		if (pfn_valid(pfn))
> +		if (pfn_valid(pfn) &&
> +		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>  			ppage = pfn_to_page(pfn);
>  		else
>  			ppage = NULL;
> @@ -97,9 +98,6 @@ u64 stable_page_flags(struct page *page)
>  	if (!page)
>  		return BIT_ULL(KPF_NOPAGE);
>  
> -	if (pfn_zone_device_reserved(page_to_pfn(page)))
> -		return BIT_ULL(KPF_RESERVED);
> -
>  	k = page->flags;
>  	u = 0;
>  
> @@ -218,7 +216,8 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
>  		return -EINVAL;
>  
>  	while (count > 0) {
> -		if (pfn_valid(pfn))
> +		if (pfn_valid(pfn) &&
> +		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>  			ppage = pfn_to_page(pfn);
>  		else
>  			ppage = NULL;
> @@ -263,7 +262,8 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
>  		return -EINVAL;
>  
>  	while (count > 0) {
> -		if (pfn_valid(pfn))
> +		if (pfn_valid(pfn) &&
> +		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>  			ppage = pfn_to_page(pfn);
>  		else
>  			ppage = NULL;
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index c676e33205d3..c076bb163c2f 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -123,7 +123,8 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -bool pfn_zone_device_reserved(unsigned long pfn);
> +bool pfn_zone_device(unsigned long pfn);
> +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>  void memunmap_pages(struct dev_pagemap *pgmap);
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -134,7 +135,13 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
>  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
>  #else
> -static inline bool pfn_zone_device_reserved(unsigned long pfn)
> +static inline bool pfn_zone_device(unsigned long pfn)
> +{
> +	return false;
> +}
> +
> +static inline bool __pfn_zone_device(unsigned long pfn,
> +				     struct dev_pagemap *pgmap)
>  {
>  	return false;
>  }
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 7ef849da8278..2b4cc6b67720 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1161,6 +1161,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	loff_t start;
>  	dax_entry_t cookie;
>  
> +	/* memmaps of driver reserved memory is not initialized */
> +	if (!__pfn_zone_device(pfn, pgmap)) {
> +		pr_err("Memory failure: %#lx: driver reserved memory\n",
> +			pfn);
> +		rc = -ENXIO;
> +		goto out;
> +	}
> +
>  	/*
>  	 * Prevent the inode from being freed while we are interrogating
>  	 * the address_space, typically this would be handled by
> @@ -1253,17 +1261,19 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure on page %lx", pfn);
>  
> -	if (!pfn_valid(pfn)) {
> +	p = pfn_to_online_page(pfn);
> +	if (!p) {
> +		if (pfn_valid(pfn)) {
> +			pgmap = get_dev_pagemap(pfn, NULL);
> +			if (pgmap)
> +				return memory_failure_dev_pagemap(pfn, flags,
> +								  pgmap);
> +		}
>  		pr_err("Memory failure: %#lx: memory outside kernel control\n",
>  			pfn);
>  		return -ENXIO;
>  	}
>  
> -	pgmap = get_dev_pagemap(pfn, NULL);
> -	if (pgmap)
> -		return memory_failure_dev_pagemap(pfn, flags, pgmap);
> -
> -	p = pfn_to_page(pfn);
>  	if (PageHuge(p))
>  		return memory_failure_hugetlb(pfn, flags);
>  	if (TestSetPageHWPoison(p)) {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 7fed8bd32a18..9f3bb223aec7 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -73,21 +73,24 @@ static unsigned long pfn_next(unsigned long pfn)
>  	return pfn + 1;
>  }
>  
> +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap)
> +{
> +	return pfn >= pfn_first(pgmap) && pfn <= pfn_end(pgmap);
> +}
> +
>  /*
> - * This returns true if the page is reserved by ZONE_DEVICE driver.
> + * Returns true if the page was initialized to the ZONE_DEVICE (especially,
> + * is not reserved for driver usage).
>   */
> -bool pfn_zone_device_reserved(unsigned long pfn)
> +bool pfn_zone_device(unsigned long pfn)
>  {
>  	struct dev_pagemap *pgmap;
> -	struct vmem_altmap *altmap;
> -	bool ret = false;
> +	bool ret;
>  
>  	pgmap = get_dev_pagemap(pfn, NULL);
>  	if (!pgmap)
> -		return ret;
> -	altmap = pgmap_altmap(pgmap);
> -	if (altmap && pfn < (altmap->base_pfn + altmap->reserve))
> -		ret = true;
> +		return false;
> +	ret = __pfn_zone_device(pfn, pgmap);
>  	put_dev_pagemap(pgmap);
>  
>  	return ret;
> -- 
> 2.21.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09  9:12 [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c David Hildenbrand
  2019-10-09  9:37 ` Michal Hocko
@ 2019-10-09  9:57 ` Naoya Horiguchi
  2019-10-10  7:30   ` David Hildenbrand
  1 sibling, 1 reply; 14+ messages in thread
From: Naoya Horiguchi @ 2019-10-09  9:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan, Andrew Morton,
	Stephen Rothwell, Michal Hocko, Toshiki Fukasawa,
	Konstantin Khlebnikov, Mike Rapoport, Anthony Yznaga,
	Jason Gunthorpe, Dan Williams, Logan Gunthorpe, Ira Weiny,
	Aneesh Kumar K.V, linux-fsdevel

Hi David,

On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote:
> There are various places where we access uninitialized memmaps, namely:
> - /proc/kpagecount
> - /proc/kpageflags
> - /proc/kpagecgroup
> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c

Ah right, memory_failure is another victim of this bug.

> 
> We have initialized memmaps either when the section is online or when
> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
> garbage and in the worst case trigger kernel BUGs, especially with
> CONFIG_PAGE_POISONING.
> 
> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> with CONFIG_PAGE_POISONING:
> :/# cat /proc/kpagecount > tmp.test
> [   95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
> [   95.601238] #PF: supervisor read access in kernel mode
> [   95.601675] #PF: error_code(0x0000) - not-present page
> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> [   95.602596] Oops: 0000 [#1] SMP NOPTI
> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> [   95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
> [   95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
> [   95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
> [   95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
> [   95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
> [   95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
> [   95.609924] FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
> [   95.610599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
> [   95.611686] Call Trace:
> [   95.611906]  proc_reg_read+0x3c/0x60
> [   95.612228]  vfs_read+0xc5/0x180
> [   95.612505]  ksys_read+0x68/0xe0
> [   95.612785]  do_syscall_64+0x5c/0xa0
> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Note that there are still two possible races as far as I can see:
> - pfn_to_online_page() succeeding but the memory getting offlined and
>   removed. get_online_mems() could help once we run into this.
> - pfn_zone_device() succeeding but the memmap not being fully
>   initialized yet. As the memmap is initialized outside of the memory
>   hoptlug lock, get_online_mems() can't help.
> 
> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
> can later come back and fix these rare races and eventually speed-up the
> ZONE_DEVICE detection.

Actually, Toshiki is writing code to refactor and optimize the pfn walking
part, where we find the pfn ranges covered by zone devices by running over
xarray pgmap_array and use the range info to reduce pointer dereferences
to speed up pfn walk. I hope he will share it soon.

Thanks,
Naoya Horiguchi

> This patch now also makes sure we don't dump data
> about memory blocks that are already offline again.
> 
> Reported-by: Qian Cai <cai@lca.pw>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  fs/proc/page.c           | 12 ++++++------
>  include/linux/memremap.h | 11 +++++++++--
>  mm/memory-failure.c      | 22 ++++++++++++++++------
>  mm/memremap.c            | 19 +++++++++++--------
>  4 files changed, 42 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index decd3fe39674..76502af461e2 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -42,7 +42,8 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
>  		return -EINVAL;
>  
>  	while (count > 0) {
> -		if (pfn_valid(pfn))
> +		if (pfn_valid(pfn) &&
> +		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>  			ppage = pfn_to_page(pfn);
>  		else
>  			ppage = NULL;
> @@ -97,9 +98,6 @@ u64 stable_page_flags(struct page *page)
>  	if (!page)
>  		return BIT_ULL(KPF_NOPAGE);
>  
> -	if (pfn_zone_device_reserved(page_to_pfn(page)))
> -		return BIT_ULL(KPF_RESERVED);
> -
>  	k = page->flags;
>  	u = 0;
>  
> @@ -218,7 +216,8 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
>  		return -EINVAL;
>  
>  	while (count > 0) {
> -		if (pfn_valid(pfn))
> +		if (pfn_valid(pfn) &&
> +		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>  			ppage = pfn_to_page(pfn);
>  		else
>  			ppage = NULL;
> @@ -263,7 +262,8 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
>  		return -EINVAL;
>  
>  	while (count > 0) {
> -		if (pfn_valid(pfn))
> +		if (pfn_valid(pfn) &&
> +		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>  			ppage = pfn_to_page(pfn);
>  		else
>  			ppage = NULL;
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index c676e33205d3..c076bb163c2f 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -123,7 +123,8 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -bool pfn_zone_device_reserved(unsigned long pfn);
> +bool pfn_zone_device(unsigned long pfn);
> +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>  void memunmap_pages(struct dev_pagemap *pgmap);
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -134,7 +135,13 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
>  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
>  #else
> -static inline bool pfn_zone_device_reserved(unsigned long pfn)
> +static inline bool pfn_zone_device(unsigned long pfn)
> +{
> +	return false;
> +}
> +
> +static inline bool __pfn_zone_device(unsigned long pfn,
> +				     struct dev_pagemap *pgmap)
>  {
>  	return false;
>  }
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 7ef849da8278..2b4cc6b67720 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1161,6 +1161,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	loff_t start;
>  	dax_entry_t cookie;
>  
> +	/* memmaps of driver reserved memory is not initialized */
> +	if (!__pfn_zone_device(pfn, pgmap)) {
> +		pr_err("Memory failure: %#lx: driver reserved memory\n",
> +			pfn);
> +		rc = -ENXIO;
> +		goto out;
> +	}
> +
>  	/*
>  	 * Prevent the inode from being freed while we are interrogating
>  	 * the address_space, typically this would be handled by
> @@ -1253,17 +1261,19 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure on page %lx", pfn);
>  
> -	if (!pfn_valid(pfn)) {
> +	p = pfn_to_online_page(pfn);
> +	if (!p) {
> +		if (pfn_valid(pfn)) {
> +			pgmap = get_dev_pagemap(pfn, NULL);
> +			if (pgmap)
> +				return memory_failure_dev_pagemap(pfn, flags,
> +								  pgmap);
> +		}
>  		pr_err("Memory failure: %#lx: memory outside kernel control\n",
>  			pfn);
>  		return -ENXIO;
>  	}
>  
> -	pgmap = get_dev_pagemap(pfn, NULL);
> -	if (pgmap)
> -		return memory_failure_dev_pagemap(pfn, flags, pgmap);
> -
> -	p = pfn_to_page(pfn);
>  	if (PageHuge(p))
>  		return memory_failure_hugetlb(pfn, flags);
>  	if (TestSetPageHWPoison(p)) {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 7fed8bd32a18..9f3bb223aec7 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -73,21 +73,24 @@ static unsigned long pfn_next(unsigned long pfn)
>  	return pfn + 1;
>  }
>  
> +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap)
> +{
> +	return pfn >= pfn_first(pgmap) && pfn <= pfn_end(pgmap);
> +}
> +
>  /*
> - * This returns true if the page is reserved by ZONE_DEVICE driver.
> + * Returns true if the page was initialized to the ZONE_DEVICE (especially,
> + * is not reserved for driver usage).
>   */
> -bool pfn_zone_device_reserved(unsigned long pfn)
> +bool pfn_zone_device(unsigned long pfn)
>  {
>  	struct dev_pagemap *pgmap;
> -	struct vmem_altmap *altmap;
> -	bool ret = false;
> +	bool ret;
>  
>  	pgmap = get_dev_pagemap(pfn, NULL);
>  	if (!pgmap)
> -		return ret;
> -	altmap = pgmap_altmap(pgmap);
> -	if (altmap && pfn < (altmap->base_pfn + altmap->reserve))
> -		ret = true;
> +		return false;
> +	ret = __pfn_zone_device(pfn, pgmap);
>  	put_dev_pagemap(pgmap);
>  
>  	return ret;
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09  9:37 ` Michal Hocko
@ 2019-10-09 10:19   ` David Hildenbrand
  2019-10-09 11:24     ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2019-10-09 10:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan,
	Naoya Horiguchi, Andrew Morton, Stephen Rothwell,
	Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport,
	Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe,
	Ira Weiny, Aneesh Kumar K.V, linux-fsdevel

On 09.10.19 11:37, Michal Hocko wrote:
> On Wed 09-10-19 11:12:04, David Hildenbrand wrote:
>> There are various places where we access uninitialized memmaps, namely:
>> - /proc/kpagecount
>> - /proc/kpageflags
>> - /proc/kpagecgroup
>> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c
>>
>> We have initialized memmaps either when the section is online or when
>> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
>> garbage and in the worst case trigger kernel BUGs, especially with
>> CONFIG_PAGE_POISONING.
>>
>> For example, not onlining a DIMM during boot and calling /proc/kpagecount
>> with CONFIG_PAGE_POISONING:
>> :/# cat /proc/kpagecount > tmp.test
>> [   95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
>> [   95.601238] #PF: supervisor read access in kernel mode
>> [   95.601675] #PF: error_code(0x0000) - not-present page
>> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
>> [   95.602596] Oops: 0000 [#1] SMP NOPTI
>> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
>> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
>> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
>> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
>> [   95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
>> [   95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
>> [   95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
>> [   95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
>> [   95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
>> [   95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
>> [   95.609924] FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
>> [   95.610599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
>> [   95.611686] Call Trace:
>> [   95.611906]  proc_reg_read+0x3c/0x60
>> [   95.612228]  vfs_read+0xc5/0x180
>> [   95.612505]  ksys_read+0x68/0xe0
>> [   95.612785]  do_syscall_64+0x5c/0xa0
>> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Note that there are still two possible races as far as I can see:
>> - pfn_to_online_page() succeeding but the memory getting offlined and
>>   removed. get_online_mems() could help once we run into this.
>> - pfn_zone_device() succeeding but the memmap not being fully
>>   initialized yet. As the memmap is initialized outside of the memory
>>   hoptlug lock, get_online_mems() can't help.
>>
>> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
>> can later come back and fix these rare races and eventually speed-up the
>> ZONE_DEVICE detection. This patch now also makes sure we don't dump data
>> about memory blocks that are already offline again.
> 
> As I've already expressed already I am not really happy to have explicit
> checks for zone device pages in pfn walkers. This is simply too fragile.
> 
> pfn_to_online_page makes sense because offline pages are not really in a
> defined state. This would be worth a patch of its own. I remember there

The issue is, once I check for pfn_to_online_page(), these functions
can't handle ZONE_DEVICE at all anymore. Especially in regards to
memory_failure() I don't think this is acceptable. So while I
(personally) only care about adding pfn_to_online_page() checks, the
in-this-sense-fragile-subsection ZONE_DEVICE implementation requires me
to introduce a temporary check for initialized memmaps.

> was a discussion about the uninitialized zone device memmaps. It would
> be really good to summarize this discussion in the changelog and
> conclude why the explicit check is really good and what were other
> alternatives considered.

Yeah, I also expressed my feelings and the issues to be solved by
ZONE_DEVICE people in https://lkml.org/lkml/2019/9/6/114. However, the
discussion stalled and nobody really proposed a solution or followed up.

I want to have this fixed now, as it is much easier to trigger with
"[PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing
memory" for !ZONE_DEVICE.

As we already do have a similar check (pfn_zone_device_reserved()) that
I merely convert for now, I will summarize and make it clearer that this
is something that should be reworked by ZONE_DEVICE people rather sooner
than later.

Thanks!

> 
> Thanks!
> 
>> Reported-by: Qian Cai <cai@lca.pw>
>> Cc: Alexey Dobriyan <adobriyan@gmail.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
>> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
>> Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Logan Gunthorpe <logang@deltatee.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>> Cc: linux-fsdevel@vger.kernel.org
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  fs/proc/page.c           | 12 ++++++------
>>  include/linux/memremap.h | 11 +++++++++--
>>  mm/memory-failure.c      | 22 ++++++++++++++++------
>>  mm/memremap.c            | 19 +++++++++++--------
>>  4 files changed, 42 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/proc/page.c b/fs/proc/page.c
>> index decd3fe39674..76502af461e2 100644
>> --- a/fs/proc/page.c
>> +++ b/fs/proc/page.c
>> @@ -42,7 +42,8 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
>>  		return -EINVAL;
>>  
>>  	while (count > 0) {
>> -		if (pfn_valid(pfn))
>> +		if (pfn_valid(pfn) &&
>> +		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>>  			ppage = pfn_to_page(pfn);
>>  		else
>>  			ppage = NULL;
>> @@ -97,9 +98,6 @@ u64 stable_page_flags(struct page *page)
>>  	if (!page)
>>  		return BIT_ULL(KPF_NOPAGE);
>>  
>> -	if (pfn_zone_device_reserved(page_to_pfn(page)))
>> -		return BIT_ULL(KPF_RESERVED);
>> -
>>  	k = page->flags;
>>  	u = 0;
>>  
>> @@ -218,7 +216,8 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
>>  		return -EINVAL;
>>  
>>  	while (count > 0) {
>> -		if (pfn_valid(pfn))
>> +		if (pfn_valid(pfn) &&
>> +		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>>  			ppage = pfn_to_page(pfn);
>>  		else
>>  			ppage = NULL;
>> @@ -263,7 +262,8 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
>>  		return -EINVAL;
>>  
>>  	while (count > 0) {
>> -		if (pfn_valid(pfn))
>> +		if (pfn_valid(pfn) &&
>> +		    (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>>  			ppage = pfn_to_page(pfn);
>>  		else
>>  			ppage = NULL;
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index c676e33205d3..c076bb163c2f 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -123,7 +123,8 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>>  }
>>  
>>  #ifdef CONFIG_ZONE_DEVICE
>> -bool pfn_zone_device_reserved(unsigned long pfn);
>> +bool pfn_zone_device(unsigned long pfn);
>> +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap);
>>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>  void memunmap_pages(struct dev_pagemap *pgmap);
>>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>> @@ -134,7 +135,13 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
>>  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
>>  #else
>> -static inline bool pfn_zone_device_reserved(unsigned long pfn)
>> +static inline bool pfn_zone_device(unsigned long pfn)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline bool __pfn_zone_device(unsigned long pfn,
>> +				     struct dev_pagemap *pgmap)
>>  {
>>  	return false;
>>  }
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 7ef849da8278..2b4cc6b67720 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1161,6 +1161,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>>  	loff_t start;
>>  	dax_entry_t cookie;
>>  
>> +	/* memmaps of driver reserved memory is not initialized */
>> +	if (!__pfn_zone_device(pfn, pgmap)) {
>> +		pr_err("Memory failure: %#lx: driver reserved memory\n",
>> +			pfn);
>> +		rc = -ENXIO;
>> +		goto out;
>> +	}
>> +
>>  	/*
>>  	 * Prevent the inode from being freed while we are interrogating
>>  	 * the address_space, typically this would be handled by
>> @@ -1253,17 +1261,19 @@ int memory_failure(unsigned long pfn, int flags)
>>  	if (!sysctl_memory_failure_recovery)
>>  		panic("Memory failure on page %lx", pfn);
>>  
>> -	if (!pfn_valid(pfn)) {
>> +	p = pfn_to_online_page(pfn);
>> +	if (!p) {
>> +		if (pfn_valid(pfn)) {
>> +			pgmap = get_dev_pagemap(pfn, NULL);
>> +			if (pgmap)
>> +				return memory_failure_dev_pagemap(pfn, flags,
>> +								  pgmap);
>> +		}
>>  		pr_err("Memory failure: %#lx: memory outside kernel control\n",
>>  			pfn);
>>  		return -ENXIO;
>>  	}
>>  
>> -	pgmap = get_dev_pagemap(pfn, NULL);
>> -	if (pgmap)
>> -		return memory_failure_dev_pagemap(pfn, flags, pgmap);
>> -
>> -	p = pfn_to_page(pfn);
>>  	if (PageHuge(p))
>>  		return memory_failure_hugetlb(pfn, flags);
>>  	if (TestSetPageHWPoison(p)) {
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 7fed8bd32a18..9f3bb223aec7 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -73,21 +73,24 @@ static unsigned long pfn_next(unsigned long pfn)
>>  	return pfn + 1;
>>  }
>>  
>> +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap)
>> +{
>> +	return pfn >= pfn_first(pgmap) && pfn <= pfn_end(pgmap);
>> +}
>> +
>>  /*
>> - * This returns true if the page is reserved by ZONE_DEVICE driver.
>> + * Returns true if the page was initialized to the ZONE_DEVICE (especially,
>> + * is not reserved for driver usage).
>>   */
>> -bool pfn_zone_device_reserved(unsigned long pfn)
>> +bool pfn_zone_device(unsigned long pfn)
>>  {
>>  	struct dev_pagemap *pgmap;
>> -	struct vmem_altmap *altmap;
>> -	bool ret = false;
>> +	bool ret;
>>  
>>  	pgmap = get_dev_pagemap(pfn, NULL);
>>  	if (!pgmap)
>> -		return ret;
>> -	altmap = pgmap_altmap(pgmap);
>> -	if (altmap && pfn < (altmap->base_pfn + altmap->reserve))
>> -		ret = true;
>> +		return false;
>> +	ret = __pfn_zone_device(pfn, pgmap);
>>  	put_dev_pagemap(pgmap);
>>  
>>  	return ret;
>> -- 
>> 2.21.0
> 


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09 10:19   ` David Hildenbrand
@ 2019-10-09 11:24     ` Michal Hocko
  2019-10-09 12:58       ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-10-09 11:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan,
	Naoya Horiguchi, Andrew Morton, Stephen Rothwell,
	Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport,
	Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe,
	Ira Weiny, Aneesh Kumar K.V, linux-fsdevel

On Wed 09-10-19 12:19:59, David Hildenbrand wrote:
[...]
> > pfn_to_online_page makes sense because offline pages are not really in a
> > defined state. This would be worth a patch of its own. I remember there
> 
> The issue is, once I check for pfn_to_online_page(), these functions
> can't handle ZONE_DEVICE at all anymore. Especially in regards to
> memory_failure() I don't think this is acceptable.

Could you be more specific please? I am not sure I am following.

> So while I
> (personally) only care about adding pfn_to_online_page() checks, the
> in-this-sense-fragile-subsection ZONE_DEVICE implementation requires me
> to introduce a temporary check for initialized memmaps.
> 
> > was a discussion about the uninitialized zone device memmaps. It would
> > be really good to summarize this discussion in the changelog and
> > conclude why the explicit check is really good and what were other
> > alternatives considered.
> 
> Yeah, I also expressed my feelings and the issues to be solved by
> ZONE_DEVICE people in https://lkml.org/lkml/2019/9/6/114. However, the
> discussion stalled and nobody really proposed a solution or followed up.

I will try to get back to that discussion but is there any technical
reason that prevents any conclusion or it is just stuck on a lack of
time of the participants?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09 11:24     ` Michal Hocko
@ 2019-10-09 12:58       ` David Hildenbrand
  2019-10-09 13:22         ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2019-10-09 12:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan,
	Naoya Horiguchi, Andrew Morton, Stephen Rothwell,
	Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport,
	Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe,
	Ira Weiny, Aneesh Kumar K.V, linux-fsdevel

On 09.10.19 13:24, Michal Hocko wrote:
> On Wed 09-10-19 12:19:59, David Hildenbrand wrote:
> [...]
>>> pfn_to_online_page makes sense because offline pages are not really in a
>>> defined state. This would be worth a patch of its own. I remember there
>>
>> The issue is, once I check for pfn_to_online_page(), these functions
>> can't handle ZONE_DEVICE at all anymore. Especially in regards to
>> memory_failure() I don't think this is acceptable.
> > Could you be more specific please? I am not sure I am following.

I wasn't quite clear, let me try to be more precise:

if (pfn_to_online_page(pfn)) {
	/* memmap initialized */
} else if (pfn_valid(pfn)) {
	/* ???
	 * a) offline memory. memmap garbage.
	 * b) memremap memory: memmap initialized to ZONE_DEVICE.
	 * c) memremap memory: reserved for driver. memmap garbage.
	 * (d) memremap memory: memmap currently initializing - garbage)
	 */
}

To distinguish between a) and b/c), we can currently only use
get_dev_pagemap(pfn, NULL). To distinguish between b) and c), we can
currently only use pfn_zone_device_reserved().

That implies, that - right now - if we want to fix what is described in
the patch without introducing more users of get_dev_pagemap(pfn, NULL),
we will not be able to support ZONE_DEVICE in:
- /proc/kpagecount
- /proc/kpageflags
- /proc/kpagecgroup

if (pfn_to_online_page(pfn)) {
	/* memmap initialized */
} else {
	/* skip */
}

Now, memory_failure() already contains a get_dev_pagemap(pfn, NULL)
check and adding pfn_to_online_page(pfn) would also work.

I would be fine with this, but it means that - for now - the three
/proc/ files won't be able to deal with ZONE_DEVICE memory.

> 
>> So while I
>> (personally) only care about adding pfn_to_online_page() checks, the
>> in-this-sense-fragile-subsection ZONE_DEVICE implementation requires me
>> to introduce a temporary check for initialized memmaps.
>>
>>> was a discussion about the uninitialized zone device memmaps. It would
>>> be really good to summarize this discussion in the changelog and
>>> conclude why the explicit check is really good and what were other
>>> alternatives considered.
>>
>> Yeah, I also expressed my feelings and the issues to be solved by
>> ZONE_DEVICE people in https://lkml.org/lkml/2019/9/6/114. However, the
>> discussion stalled and nobody really proposed a solution or followed up.
> 
> I will try to get back to that discussion but is there any technical
> reason that prevents any conclusion or it is just stuck on a lack of
> time of the participants?

I think it was both. People not responding to questions and not having
decided on a solution.


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09 12:58       ` David Hildenbrand
@ 2019-10-09 13:22         ` Michal Hocko
  2019-10-09 13:24           ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-10-09 13:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan,
	Naoya Horiguchi, Andrew Morton, Stephen Rothwell,
	Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport,
	Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe,
	Ira Weiny, Aneesh Kumar K.V, linux-fsdevel

On Wed 09-10-19 14:58:24, David Hildenbrand wrote:
[...]
> I would be fine with this, but it means that - for now - the three
> /proc/ files won't be able to deal with ZONE_DEVICE memory.

Thanks for the clarification. Is this an actual problem though? Do we
have any consumers of the functionality?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09 13:22         ` Michal Hocko
@ 2019-10-09 13:24           ` David Hildenbrand
  2019-10-09 13:29             ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2019-10-09 13:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan,
	Naoya Horiguchi, Andrew Morton, Stephen Rothwell,
	Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport,
	Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe,
	Ira Weiny, Aneesh Kumar K.V, linux-fsdevel

On 09.10.19 15:22, Michal Hocko wrote:
> On Wed 09-10-19 14:58:24, David Hildenbrand wrote:
> [...]
>> I would be fine with this, but it means that - for now - the three
>> /proc/ files won't be able to deal with ZONE_DEVICE memory.
> 
> Thanks for the clarification. Is this an actual problem though? Do we
> have any consumers of the functionality?
> 

I don't know, that's why I was being careful in not changing its behavior.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09 13:24           ` David Hildenbrand
@ 2019-10-09 13:29             ` Michal Hocko
  2019-10-09 13:32               ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-10-09 13:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan,
	Naoya Horiguchi, Andrew Morton, Stephen Rothwell,
	Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport,
	Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe,
	Ira Weiny, Aneesh Kumar K.V, linux-fsdevel

On Wed 09-10-19 15:24:05, David Hildenbrand wrote:
> On 09.10.19 15:22, Michal Hocko wrote:
> > On Wed 09-10-19 14:58:24, David Hildenbrand wrote:
> > [...]
> >> I would be fine with this, but it means that - for now - the three
> >> /proc/ files won't be able to deal with ZONE_DEVICE memory.
> > 
> > Thanks for the clarification. Is this an actual problem though? Do we
> > have any consumers of the functionality?
> > 
> 
> I don't know, that's why I was being careful in not changing its behavior.

Can we simply go with pfn_to_online_page. I would be quite surprised if
anybody was really examining zone device memory ranges as they are
static IIUC. If there is some usecase I am pretty sure we will learn
that and can address it accordingly.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09 13:29             ` Michal Hocko
@ 2019-10-09 13:32               ` David Hildenbrand
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2019-10-09 13:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan,
	Naoya Horiguchi, Andrew Morton, Stephen Rothwell,
	Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport,
	Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe,
	Ira Weiny, Aneesh Kumar K.V, linux-fsdevel

On 09.10.19 15:29, Michal Hocko wrote:
> On Wed 09-10-19 15:24:05, David Hildenbrand wrote:
>> On 09.10.19 15:22, Michal Hocko wrote:
>>> On Wed 09-10-19 14:58:24, David Hildenbrand wrote:
>>> [...]
>>>> I would be fine with this, but it means that - for now - the three
>>>> /proc/ files won't be able to deal with ZONE_DEVICE memory.
>>>
>>> Thanks for the clarification. Is this an actual problem though? Do we
>>> have any consumers of the functionality?
>>>
>>
>> I don't know, that's why I was being careful in not changing its behavior.
> 
> Can we simply go with pfn_to_online_page. I would be quite surprised if
> anybody was really examining zone device memory ranges as they are
> static IIUC. If there is some usecase I am pretty sure we will learn
> that and can address it accordingly.
> 

I consider it mostly a debug interface either way. Will rework, test and
resend. Thanks!

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-09  9:57 ` Naoya Horiguchi
@ 2019-10-10  7:30   ` David Hildenbrand
  2019-10-11  0:11     ` Naoya Horiguchi
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2019-10-10  7:30 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan, Andrew Morton,
	Stephen Rothwell, Michal Hocko, Toshiki Fukasawa,
	Konstantin Khlebnikov, Mike Rapoport, Anthony Yznaga,
	Jason Gunthorpe, Dan Williams, Logan Gunthorpe, Ira Weiny,
	Aneesh Kumar K.V, linux-fsdevel

On 09.10.19 11:57, Naoya Horiguchi wrote:
> Hi David,
> 
> On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote:
>> There are various places where we access uninitialized memmaps, namely:
>> - /proc/kpagecount
>> - /proc/kpageflags
>> - /proc/kpagecgroup
>> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c
> 
> Ah right, memory_failure is another victim of this bug.
> 
>>
>> We have initialized memmaps either when the section is online or when
>> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
>> garbage and in the worst case trigger kernel BUGs, especially with
>> CONFIG_PAGE_POISONING.
>>
>> For example, not onlining a DIMM during boot and calling /proc/kpagecount
>> with CONFIG_PAGE_POISONING:
>> :/# cat /proc/kpagecount > tmp.test
>> [   95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
>> [   95.601238] #PF: supervisor read access in kernel mode
>> [   95.601675] #PF: error_code(0x0000) - not-present page
>> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
>> [   95.602596] Oops: 0000 [#1] SMP NOPTI
>> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
>> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
>> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
>> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
>> [   95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
>> [   95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
>> [   95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
>> [   95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
>> [   95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
>> [   95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
>> [   95.609924] FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
>> [   95.610599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
>> [   95.611686] Call Trace:
>> [   95.611906]  proc_reg_read+0x3c/0x60
>> [   95.612228]  vfs_read+0xc5/0x180
>> [   95.612505]  ksys_read+0x68/0xe0
>> [   95.612785]  do_syscall_64+0x5c/0xa0
>> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Note that there are still two possible races as far as I can see:
>> - pfn_to_online_page() succeeding but the memory getting offlined and
>>   removed. get_online_mems() could help once we run into this.
>> - pfn_zone_device() succeeding but the memmap not being fully
>>   initialized yet. As the memmap is initialized outside of the memory
>>   hoptlug lock, get_online_mems() can't help.
>>
>> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
>> can later come back and fix these rare races and eventually speed-up the
>> ZONE_DEVICE detection.
> 
> Actually, Toshiki is writing code to refactor and optimize the pfn walking
> part, where we find the pfn ranges covered by zone devices by running over
> xarray pgmap_array and use the range info to reduce pointer dereferences
> to speed up pfn walk. I hope he will share it soon.

AFAIKT, Michal is not a friend of special-casing PFN walkers in that
way. We should have a mechanism to detect if a memmap was initialized
without having to go via pgmap, special-casing. See my other mail where
I draft one basic approach.


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-10  7:30   ` David Hildenbrand
@ 2019-10-11  0:11     ` Naoya Horiguchi
  2019-10-11  0:50       ` Naoya Horiguchi
  0 siblings, 1 reply; 14+ messages in thread
From: Naoya Horiguchi @ 2019-10-11  0:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan, Andrew Morton,
	Stephen Rothwell, Michal Hocko, Toshiki Fukasawa,
	Konstantin Khlebnikov, Mike Rapoport, Anthony Yznaga,
	Jason Gunthorpe, Dan Williams, Logan Gunthorpe, Ira Weiny,
	Aneesh Kumar K.V, linux-fsdevel

On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote:
> On 09.10.19 11:57, Naoya Horiguchi wrote:
> > Hi David,
> > 
> > On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote:
> >> There are various places where we access uninitialized memmaps, namely:
> >> - /proc/kpagecount
> >> - /proc/kpageflags
> >> - /proc/kpagecgroup
> >> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c
> > 
> > Ah right, memory_failure is another victim of this bug.
> > 
> >>
> >> We have initialized memmaps either when the section is online or when
> >> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
> >> garbage and in the worst case trigger kernel BUGs, especially with
> >> CONFIG_PAGE_POISONING.
> >>
> >> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> >> with CONFIG_PAGE_POISONING:
> >> :/# cat /proc/kpagecount > tmp.test
> >> [   95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
> >> [   95.601238] #PF: supervisor read access in kernel mode
> >> [   95.601675] #PF: error_code(0x0000) - not-present page
> >> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> >> [   95.602596] Oops: 0000 [#1] SMP NOPTI
> >> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
> >> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> >> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> >> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> >> [   95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
> >> [   95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
> >> [   95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
> >> [   95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
> >> [   95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
> >> [   95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
> >> [   95.609924] FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
> >> [   95.610599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [   95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
> >> [   95.611686] Call Trace:
> >> [   95.611906]  proc_reg_read+0x3c/0x60
> >> [   95.612228]  vfs_read+0xc5/0x180
> >> [   95.612505]  ksys_read+0x68/0xe0
> >> [   95.612785]  do_syscall_64+0x5c/0xa0
> >> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >> Note that there are still two possible races as far as I can see:
> >> - pfn_to_online_page() succeeding but the memory getting offlined and
> >>   removed. get_online_mems() could help once we run into this.
> >> - pfn_zone_device() succeeding but the memmap not being fully
> >>   initialized yet. As the memmap is initialized outside of the memory
> >>   hoptlug lock, get_online_mems() can't help.
> >>
> >> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
> >> can later come back and fix these rare races and eventually speed-up the
> >> ZONE_DEVICE detection.
> > 
> > Actually, Toshiki is writing code to refactor and optimize the pfn walking
> > part, where we find the pfn ranges covered by zone devices by running over
> > xarray pgmap_array and use the range info to reduce pointer dereferences
> > to speed up pfn walk. I hope he will share it soon.
> 
> AFAIKT, Michal is not a friend of special-casing PFN walkers in that
> way. We should have a mechanism to detect if a memmap was initialized
> without having to go via pgmap, special-casing. See my other mail where
> I draft one basic approach.

OK, so considering your v2 approach, we could have another pfn_to_page()
variant like pfn_to_zone_device_page(), where we check that a given pfn
belongs to the memory section backed by zone memory, then another check if
the pfn has initialized memmap or not, and return NULL if memmap not
initialied.  We'll try this approach then, but if you find problems/concerns,
please let me know.

Thanks,
Naoya Horiguchi


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-11  0:11     ` Naoya Horiguchi
@ 2019-10-11  0:50       ` Naoya Horiguchi
  2019-10-11  9:53         ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Naoya Horiguchi @ 2019-10-11  0:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan, Andrew Morton,
	Stephen Rothwell, Michal Hocko, Toshiki Fukasawa,
	Konstantin Khlebnikov, Mike Rapoport, Anthony Yznaga,
	Jason Gunthorpe, Dan Williams, Logan Gunthorpe, Ira Weiny,
	Aneesh Kumar K.V, linux-fsdevel

On Fri, Oct 11, 2019 at 12:11:25AM +0000, Horiguchi Naoya(^[$BKY8}^[(B ^[$BD>Li^[(B) wrote:
> On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote:
> > On 09.10.19 11:57, Naoya Horiguchi wrote:
> > > Hi David,
> > > 
> > > On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote:
> > >> There are various places where we access uninitialized memmaps, namely:
> > >> - /proc/kpagecount
> > >> - /proc/kpageflags
> > >> - /proc/kpagecgroup
> > >> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c
> > > 
> > > Ah right, memory_failure is another victim of this bug.
> > > 
> > >>
> > >> We have initialized memmaps either when the section is online or when
> > >> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
> > >> garbage and in the worst case trigger kernel BUGs, especially with
> > >> CONFIG_PAGE_POISONING.
> > >>
> > >> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> > >> with CONFIG_PAGE_POISONING:
> > >> :/# cat /proc/kpagecount > tmp.test
> > >> [   95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
> > >> [   95.601238] #PF: supervisor read access in kernel mode
> > >> [   95.601675] #PF: error_code(0x0000) - not-present page
> > >> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> > >> [   95.602596] Oops: 0000 [#1] SMP NOPTI
> > >> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
> > >> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> > >> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> > >> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> > >> [   95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
> > >> [   95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
> > >> [   95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
> > >> [   95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
> > >> [   95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
> > >> [   95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
> > >> [   95.609924] FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
> > >> [   95.610599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [   95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
> > >> [   95.611686] Call Trace:
> > >> [   95.611906]  proc_reg_read+0x3c/0x60
> > >> [   95.612228]  vfs_read+0xc5/0x180
> > >> [   95.612505]  ksys_read+0x68/0xe0
> > >> [   95.612785]  do_syscall_64+0x5c/0xa0
> > >> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >>
> > >> Note that there are still two possible races as far as I can see:
> > >> - pfn_to_online_page() succeeding but the memory getting offlined and
> > >>   removed. get_online_mems() could help once we run into this.
> > >> - pfn_zone_device() succeeding but the memmap not being fully
> > >>   initialized yet. As the memmap is initialized outside of the memory
> > >>   hoptlug lock, get_online_mems() can't help.
> > >>
> > >> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
> > >> can later come back and fix these rare races and eventually speed-up the
> > >> ZONE_DEVICE detection.
> > > 
> > > Actually, Toshiki is writing code to refactor and optimize the pfn walking
> > > part, where we find the pfn ranges covered by zone devices by running over
> > > xarray pgmap_array and use the range info to reduce pointer dereferences
> > > to speed up pfn walk. I hope he will share it soon.
> > 
> > AFAIKT, Michal is not a friend of special-casing PFN walkers in that
> > way. We should have a mechanism to detect if a memmap was initialized
> > without having to go via pgmap, special-casing. See my other mail where
> > I draft one basic approach.
> 
> OK, so considering your v2 approach, we could have another pfn_to_page()
> variant like pfn_to_zone_device_page(), where we check that a given pfn
> belongs to the memory section backed by zone memory, then another check if
> the pfn has initialized memmap or not, and return NULL if memmap not
> initialied.  We'll try this approach then, but if you find problems/concerns,
> please let me know.

Sorry, you already mentioned detail here,
https://lore.kernel.org/lkml/c6198acd-8ff7-c40c-cb4e-f0f12f841b38@redhat.com/


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
  2019-10-11  0:50       ` Naoya Horiguchi
@ 2019-10-11  9:53         ` David Hildenbrand
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2019-10-11  9:53 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Qian Cai, Alexey Dobriyan, Andrew Morton,
	Stephen Rothwell, Michal Hocko, Toshiki Fukasawa,
	Konstantin Khlebnikov, Mike Rapoport, Anthony Yznaga,
	Jason Gunthorpe, Dan Williams, Logan Gunthorpe, Ira Weiny,
	Aneesh Kumar K.V, linux-fsdevel

On 11.10.19 02:50, Naoya Horiguchi wrote:
> On Fri, Oct 11, 2019 at 12:11:25AM +0000, Horiguchi Naoya(^[$BKY8}^[(B ^[$BD>Li^[(B) wrote:
>> On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote:
>>> On 09.10.19 11:57, Naoya Horiguchi wrote:
>>>> Hi David,
>>>>
>>>> On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote:
>>>>> There are various places where we access uninitialized memmaps, namely:
>>>>> - /proc/kpagecount
>>>>> - /proc/kpageflags
>>>>> - /proc/kpagecgroup
>>>>> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c
>>>>
>>>> Ah right, memory_failure is another victim of this bug.
>>>>
>>>>>
>>>>> We have initialized memmaps either when the section is online or when
>>>>> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
>>>>> garbage and in the worst case trigger kernel BUGs, especially with
>>>>> CONFIG_PAGE_POISONING.
>>>>>
>>>>> For example, not onlining a DIMM during boot and calling /proc/kpagecount
>>>>> with CONFIG_PAGE_POISONING:
>>>>> :/# cat /proc/kpagecount > tmp.test
>>>>> [   95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
>>>>> [   95.601238] #PF: supervisor read access in kernel mode
>>>>> [   95.601675] #PF: error_code(0x0000) - not-present page
>>>>> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
>>>>> [   95.602596] Oops: 0000 [#1] SMP NOPTI
>>>>> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
>>>>> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
>>>>> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
>>>>> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
>>>>> [   95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
>>>>> [   95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
>>>>> [   95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
>>>>> [   95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
>>>>> [   95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
>>>>> [   95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
>>>>> [   95.609924] FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
>>>>> [   95.610599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [   95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
>>>>> [   95.611686] Call Trace:
>>>>> [   95.611906]  proc_reg_read+0x3c/0x60
>>>>> [   95.612228]  vfs_read+0xc5/0x180
>>>>> [   95.612505]  ksys_read+0x68/0xe0
>>>>> [   95.612785]  do_syscall_64+0x5c/0xa0
>>>>> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>
>>>>> Note that there are still two possible races as far as I can see:
>>>>> - pfn_to_online_page() succeeding but the memory getting offlined and
>>>>>    removed. get_online_mems() could help once we run into this.
>>>>> - pfn_zone_device() succeeding but the memmap not being fully
>>>>>    initialized yet. As the memmap is initialized outside of the memory
>>>>>    hoptlug lock, get_online_mems() can't help.
>>>>>
>>>>> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
>>>>> can later come back and fix these rare races and eventually speed-up the
>>>>> ZONE_DEVICE detection.
>>>>
>>>> Actually, Toshiki is writing code to refactor and optimize the pfn walking
>>>> part, where we find the pfn ranges covered by zone devices by running over
>>>> xarray pgmap_array and use the range info to reduce pointer dereferences
>>>> to speed up pfn walk. I hope he will share it soon.
>>>
>>> AFAIKT, Michal is not a friend of special-casing PFN walkers in that
>>> way. We should have a mechanism to detect if a memmap was initialized
>>> without having to go via pgmap, special-casing. See my other mail where
>>> I draft one basic approach.
>>
>> OK, so considering your v2 approach, we could have another pfn_to_page()
>> variant like pfn_to_zone_device_page(), where we check that a given pfn
>> belongs to the memory section backed by zone memory, then another check if
>> the pfn has initialized memmap or not, and return NULL if memmap not
>> initialied.  We'll try this approach then, but if you find problems/concerns,
>> please let me know.
> 
> Sorry, you already mentioned detail here,
> https://lore.kernel.org/lkml/c6198acd-8ff7-c40c-cb4e-f0f12f841b38@redhat.com/
> 

I'm planning on sending a proper writeup of the overall approach and  
pitfalls maybe next week. Feel free to ping me in case I forget.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, back to index

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-09  9:12 [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c David Hildenbrand
2019-10-09  9:37 ` Michal Hocko
2019-10-09 10:19   ` David Hildenbrand
2019-10-09 11:24     ` Michal Hocko
2019-10-09 12:58       ` David Hildenbrand
2019-10-09 13:22         ` Michal Hocko
2019-10-09 13:24           ` David Hildenbrand
2019-10-09 13:29             ` Michal Hocko
2019-10-09 13:32               ` David Hildenbrand
2019-10-09  9:57 ` Naoya Horiguchi
2019-10-10  7:30   ` David Hildenbrand
2019-10-11  0:11     ` Naoya Horiguchi
2019-10-11  0:50       ` Naoya Horiguchi
2019-10-11  9:53         ` David Hildenbrand

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org linux-mm@archiver.kernel.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/ public-inbox