* [PATCH v2] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
From: Omar Sandoval @ 2022-04-06 20:36 UTC
  To: linux-mm, kexec, Andrew Morton
  Cc: Uladzislau Rezki, Christoph Hellwig, Baoquan He, x86, kernel-team

From: Omar Sandoval <osandov@fb.com>

Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
purge the vmap areas instead of doing it lazily.

Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
context") moved the purging from the vunmap() caller to a worker thread.
Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
(possibly forever). For example, consider the following scenario:

1. Thread reads from /proc/vmcore. This eventually calls
   __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
   vmap_lazy_nr to lazy_max_pages() + 1.
2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
   pages (one page plus the guard page) to the purge list and
   vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
   drain_vmap_work is scheduled.
3. Thread returns from the kernel and is scheduled out.
4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
   frees the 2 pages on the purge list. vmap_lazy_nr is now
   lazy_max_pages() + 1.
5. This is still over the threshold, so it tries to purge areas again,
   but doesn't find anything.
6. Repeat 5.

If the system is running with only one CPU (which is typical for kdump)
and preemption is disabled, then this will never make forward progress:
there aren't any more pages to purge, so it hangs. If there is more than
one CPU or preemption is enabled, then the worker thread will spin
forever in the background. (Note that if there were already pages to be
purged at the time that set_iounmap_nonlazy() was called, this bug is
avoided.)
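
To make this concrete, here is a minimal sketch of the worker loop
(paraphrased from mm/vmalloc.c as of 5.18-rc1; locking and purge-list
handling are simplified):

	/*
	 * __purge_vmap_area_lazy() decrements vmap_lazy_nr only for the
	 * areas it actually frees, so the extra count added by
	 * set_iounmap_nonlazy() is never drained and the loop condition
	 * never becomes false.
	 */
	static void drain_vmap_area_work(struct work_struct *work)
	{
		unsigned long nr_lazy;

		do {
			mutex_lock(&vmap_purge_lock);
			__purge_vmap_area_lazy(ULONG_MAX, 0);
			mutex_unlock(&vmap_purge_lock);

			/* Recheck if further work is required. */
			nr_lazy = atomic_long_read(&vmap_lazy_nr);
		} while (nr_lazy > lazy_max_pages());
	}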

This can be reproduced with anything that reads from /proc/vmcore
multiple times. E.g., vmcore-dmesg /proc/vmcore.

It turns out that improvements to vmap() over the years have obsoleted
the need for this "optimization". I benchmarked
`dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system
with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that
avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed
entirely:

  |5.17  |5.18+fix|5.18+removal
4k|40.86s|  40.09s|      26.73s
1M|24.47s|  23.98s|      21.84s

The removal was the fastest (by a wide margin with 4k reads). This patch
removes set_iounmap_nonlazy().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Changes from v1:

- Remove set_iounmap_nonlazy() entirely instead of fixing it.

 arch/x86/include/asm/io.h       |  2 --
 arch/x86/kernel/crash_dump_64.c |  1 -
 mm/vmalloc.c                    | 11 -----------
 3 files changed, 14 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index f6d91ecb8026..e9736af126b2 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -210,8 +210,6 @@ void __iomem *ioremap(resource_size_t offset, unsigned long size);
 extern void iounmap(volatile void __iomem *addr);
 #define iounmap iounmap
 
-extern void set_iounmap_nonlazy(void);
-
 #ifdef __KERNEL__
 
 void memcpy_fromio(void *, const volatile void __iomem *, size_t);
diff --git a/arch/x86/kernel/crash_dump_64.c b/arch/x86/kernel/crash_dump_64.c
index a7f617a3981d..97529552dd24 100644
--- a/arch/x86/kernel/crash_dump_64.c
+++ b/arch/x86/kernel/crash_dump_64.c
@@ -37,7 +37,6 @@ static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
 	} else
 		memcpy(buf, vaddr + offset, csize);
 
-	set_iounmap_nonlazy();
 	iounmap((void __iomem *)vaddr);
 	return csize;
 }
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e163372d3967..0b17498a34f1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1671,17 +1671,6 @@ static DEFINE_MUTEX(vmap_purge_lock);
 /* for per-CPU blocks */
 static void purge_fragmented_blocks_allcpus(void);
 
-#ifdef CONFIG_X86_64
-/*
- * called before a call to iounmap() if the caller wants vm_area_struct's
- * immediately freed.
- */
-void set_iounmap_nonlazy(void)
-{
-	atomic_long_set(&vmap_lazy_nr, lazy_max_pages()+1);
-}
-#endif /* CONFIG_X86_64 */
-
 /*
  * Purges all lazily-freed vmap areas.
  */
-- 
2.35.1



* Re: [PATCH v2] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
From: Christoph Hellwig @ 2022-04-07  5:38 UTC
  To: Omar Sandoval
  Cc: linux-mm, kexec, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Baoquan He, x86, kernel-team

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v2] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
From: Uladzislau Rezki @ 2022-04-07  8:11 UTC
  To: Omar Sandoval
  Cc: linux-mm, kexec, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Baoquan He, x86, kernel-team

> From: Omar Sandoval <osandov@fb.com>
> 
> Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
> vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
> lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
> purge the vmap areas instead of doing it lazily.
> 
> Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
> context") moved the purging from the vunmap() caller to a worker thread.
> Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
> (possibly forever). For example, consider the following scenario:
> 
> 1. Thread reads from /proc/vmcore. This eventually calls
>    __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
>    vmap_lazy_nr to lazy_max_pages() + 1.
> 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
>    pages (one page plus the guard page) to the purge list and
>    vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
>    drain_vmap_work is scheduled.
> 3. Thread returns from the kernel and is scheduled out.
> 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
>    frees the 2 pages on the purge list. vmap_lazy_nr is now
>    lazy_max_pages() + 1.
> 5. This is still over the threshold, so it tries to purge areas again,
>    but doesn't find anything.
> 6. Repeat 5.
> 
> If the system is running with only one CPU (which is typical for kdump)
> and preemption is disabled, then this will never make forward progress:
> there aren't any more pages to purge, so it hangs. If there is more than
> one CPU or preemption is enabled, then the worker thread will spin
> forever in the background. (Note that if there were already pages to be
> purged at the time that set_iounmap_nonlazy() was called, this bug is
> avoided.)
> 
> This can be reproduced with anything that reads from /proc/vmcore
> multiple times. E.g., vmcore-dmesg /proc/vmcore.
> 
> It turns out that improvements to vmap() over the years have obsoleted
> the need for this "optimization". I benchmarked
> `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system
> with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that
> avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed
> entirely:
> 
>   |5.17  |5.18+fix|5.18+removal
> 4k|40.86s|  40.09s|      26.73s
> 1M|24.47s|  23.98s|      21.84s
> 
> The removal was the fastest (by a wide margin with 4k reads). This patch
> removes set_iounmap_nonlazy().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> Changes from v1:
> 
> - Remove set_iounmap_nonlazy() entirely instead of fixing it.
> 
>  arch/x86/include/asm/io.h       |  2 --
>  arch/x86/kernel/crash_dump_64.c |  1 -
>  mm/vmalloc.c                    | 11 -----------
>  3 files changed, 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index f6d91ecb8026..e9736af126b2 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -210,8 +210,6 @@ void __iomem *ioremap(resource_size_t offset, unsigned long size);
>  extern void iounmap(volatile void __iomem *addr);
>  #define iounmap iounmap
>  
> -extern void set_iounmap_nonlazy(void);
> -
>  #ifdef __KERNEL__
>  
>  void memcpy_fromio(void *, const volatile void __iomem *, size_t);
> diff --git a/arch/x86/kernel/crash_dump_64.c b/arch/x86/kernel/crash_dump_64.c
> index a7f617a3981d..97529552dd24 100644
> --- a/arch/x86/kernel/crash_dump_64.c
> +++ b/arch/x86/kernel/crash_dump_64.c
> @@ -37,7 +37,6 @@ static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
>  	} else
>  		memcpy(buf, vaddr + offset, csize);
>  
> -	set_iounmap_nonlazy();
>  	iounmap((void __iomem *)vaddr);
>  	return csize;
>  }
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e163372d3967..0b17498a34f1 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1671,17 +1671,6 @@ static DEFINE_MUTEX(vmap_purge_lock);
>  /* for per-CPU blocks */
>  static void purge_fragmented_blocks_allcpus(void);
>  
> -#ifdef CONFIG_X86_64
> -/*
> - * called before a call to iounmap() if the caller wants vm_area_struct's
> - * immediately freed.
> - */
> -void set_iounmap_nonlazy(void)
> -{
> -	atomic_long_set(&vmap_lazy_nr, lazy_max_pages()+1);
> -}
> -#endif /* CONFIG_X86_64 */
> -
>  /*
>   * Purges all lazily-freed vmap areas.
>   */
> -- 
> 2.35.1
>
Much better way of fixing it :)

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki


* Re: [PATCH v2] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
From: Chris Down @ 2022-04-07 14:36 UTC
  To: Omar Sandoval
  Cc: linux-mm, kexec, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Baoquan He, x86, kernel-team

Omar Sandoval writes:
>From: Omar Sandoval <osandov@fb.com>
>
>Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
>vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
>lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
>purge the vmap areas instead of doing it lazily.
>
>Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
>context") moved the purging from the vunmap() caller to a worker thread.
>Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
>(possibly forever). For example, consider the following scenario:
>
>1. Thread reads from /proc/vmcore. This eventually calls
>   __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
>   vmap_lazy_nr to lazy_max_pages() + 1.
>2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
>   pages (one page plus the guard page) to the purge list and
>   vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
>   drain_vmap_work is scheduled.
>3. Thread returns from the kernel and is scheduled out.
>4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
>   frees the 2 pages on the purge list. vmap_lazy_nr is now
>   lazy_max_pages() + 1.
>5. This is still over the threshold, so it tries to purge areas again,
>   but doesn't find anything.
>6. Repeat 5.
>
>If the system is running with only one CPU (which is typical for kdump)
>and preemption is disabled, then this will never make forward progress:
>there aren't any more pages to purge, so it hangs. If there is more than
>one CPU or preemption is enabled, then the worker thread will spin
>forever in the background. (Note that if there were already pages to be
>purged at the time that set_iounmap_nonlazy() was called, this bug is
>avoided.)
>
>This can be reproduced with anything that reads from /proc/vmcore
>multiple times. E.g., vmcore-dmesg /proc/vmcore.
>
>It turns out that improvements to vmap() over the years have obsoleted
>the need for this "optimization". I benchmarked
>`dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system
>with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that
>avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed
>entirely:
>
>  |5.17  |5.18+fix|5.18+removal
>4k|40.86s|  40.09s|      26.73s
>1M|24.47s|  23.98s|      21.84s
>
>The removal was the fastest (by a wide margin with 4k reads). This patch
>removes set_iounmap_nonlazy().
>
>Signed-off-by: Omar Sandoval <osandov@fb.com>

It probably doesn't matter, but maybe worth adding in a Fixes tag just to make 
sure anyone getting this without context understands that 690467c81b1a 
("mm/vmalloc: Move draining areas out of caller context") shouldn't reach 
further rcs without this. Unlikely that would happen anyway, though.

Nice use of a bug as an impetus to clean things up :-) Thanks!

Acked-by: Chris Down <chris@chrisdown.name>

>---
>Changes from v1:
>
>- Remove set_iounmap_nonlazy() entirely instead of fixing it.
>
> arch/x86/include/asm/io.h       |  2 --
> arch/x86/kernel/crash_dump_64.c |  1 -
> mm/vmalloc.c                    | 11 -----------
> 3 files changed, 14 deletions(-)
>
>diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>index f6d91ecb8026..e9736af126b2 100644
>--- a/arch/x86/include/asm/io.h
>+++ b/arch/x86/include/asm/io.h
>@@ -210,8 +210,6 @@ void __iomem *ioremap(resource_size_t offset, unsigned long size);
> extern void iounmap(volatile void __iomem *addr);
> #define iounmap iounmap
>
>-extern void set_iounmap_nonlazy(void);
>-
> #ifdef __KERNEL__
>
> void memcpy_fromio(void *, const volatile void __iomem *, size_t);
>diff --git a/arch/x86/kernel/crash_dump_64.c b/arch/x86/kernel/crash_dump_64.c
>index a7f617a3981d..97529552dd24 100644
>--- a/arch/x86/kernel/crash_dump_64.c
>+++ b/arch/x86/kernel/crash_dump_64.c
>@@ -37,7 +37,6 @@ static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
> 	} else
> 		memcpy(buf, vaddr + offset, csize);
>
>-	set_iounmap_nonlazy();
> 	iounmap((void __iomem *)vaddr);
> 	return csize;
> }
>diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>index e163372d3967..0b17498a34f1 100644
>--- a/mm/vmalloc.c
>+++ b/mm/vmalloc.c
>@@ -1671,17 +1671,6 @@ static DEFINE_MUTEX(vmap_purge_lock);
> /* for per-CPU blocks */
> static void purge_fragmented_blocks_allcpus(void);
>
>-#ifdef CONFIG_X86_64
>-/*
>- * called before a call to iounmap() if the caller wants vm_area_struct's
>- * immediately freed.
>- */
>-void set_iounmap_nonlazy(void)
>-{
>-	atomic_long_set(&vmap_lazy_nr, lazy_max_pages()+1);
>-}
>-#endif /* CONFIG_X86_64 */
>-
> /*
>  * Purges all lazily-freed vmap areas.
>  */
>-- 
>2.35.1
>
>


* Re: [PATCH v2] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
From: Baoquan He @ 2022-04-08  3:02 UTC
  To: Chris Down, Omar Sandoval
  Cc: linux-mm, kexec, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, x86, kernel-team

On 04/07/22 at 03:36pm, Chris Down wrote:
> Omar Sandoval writes:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
> > vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
> > lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
> > purge the vmap areas instead of doing it lazily.
> > 
> > Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
> > context") moved the purging from the vunmap() caller to a worker thread.
> > Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
> > (possibly forever). For example, consider the following scenario:
> > 
> > 1. Thread reads from /proc/vmcore. This eventually calls
> >   __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
> >   vmap_lazy_nr to lazy_max_pages() + 1.
> > 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
> >   pages (one page plus the guard page) to the purge list and
> >   vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
> >   drain_vmap_work is scheduled.
> > 3. Thread returns from the kernel and is scheduled out.
> > 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
> >   frees the 2 pages on the purge list. vmap_lazy_nr is now
> >   lazy_max_pages() + 1.
> > 5. This is still over the threshold, so it tries to purge areas again,
> >   but doesn't find anything.
> > 6. Repeat 5.
> > 
> > If the system is running with only one CPU (which is typical for kdump)
> > and preemption is disabled, then this will never make forward progress:
> > there aren't any more pages to purge, so it hangs. If there is more than
> > one CPU or preemption is enabled, then the worker thread will spin
> > forever in the background. (Note that if there were already pages to be
> > purged at the time that set_iounmap_nonlazy() was called, this bug is
> > avoided.)
> > 
> > This can be reproduced with anything that reads from /proc/vmcore
> > multiple times. E.g., vmcore-dmesg /proc/vmcore.
> > 
> > It turns out that improvements to vmap() over the years have obsoleted
> > the need for this "optimization". I benchmarked
> > `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system
> > with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that
> > avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed
> > entirely:
> > 
> >  |5.17  |5.18+fix|5.18+removal
> > 4k|40.86s|  40.09s|      26.73s
> > 1M|24.47s|  23.98s|      21.84s
> > 
> > The removal was the fastest (by a wide margin with 4k reads). This patch
> > removes set_iounmap_nonlazy().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> 
> It probably doesn't matter, but maybe worth adding in a Fixes tag just to
> make sure anyone getting this without context understands that 690467c81b1a
> ("mm/vmalloc: Move draining areas out of caller context") shouldn't reach
> further rcs without this. Unlikely that would happen anyway, though.
> 
> Nice use of a bug as an impetus to clean things up :-) Thanks!

Since the Red Hat mail server is having issues, the body of the patch
shows up empty in my mail client, so I'm replying here to comment.

As I replied to Omar in v1, I think this is a great fix. It would also
be great to state whether this is a real issue that breaks things,
which would warrant a 'Fixes' tag and a stable Cc like
"Cc: <stable@vger.kernel.org> # 5.17", or an improvement found by code
inspection.

I raise this because in distros, e.g. in our RHEL 8, we maintain old
kernels and backport necessary patches into them, and patches with a
'Fixes' tag are definitely good candidates. This is important for LTS
kernels too.

Thanks
Baoquan


> 
> > ---
> > Changes from v1:
> > 
> > - Remove set_iounmap_nonlazy() entirely instead of fixing it.
> > 
> > arch/x86/include/asm/io.h       |  2 --
> > arch/x86/kernel/crash_dump_64.c |  1 -
> > mm/vmalloc.c                    | 11 -----------
> > 3 files changed, 14 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> > index f6d91ecb8026..e9736af126b2 100644
> > --- a/arch/x86/include/asm/io.h
> > +++ b/arch/x86/include/asm/io.h
> > @@ -210,8 +210,6 @@ void __iomem *ioremap(resource_size_t offset, unsigned long size);
> > extern void iounmap(volatile void __iomem *addr);
> > #define iounmap iounmap
> > 
> > -extern void set_iounmap_nonlazy(void);
> > -
> > #ifdef __KERNEL__
> > 
> > void memcpy_fromio(void *, const volatile void __iomem *, size_t);
> > diff --git a/arch/x86/kernel/crash_dump_64.c b/arch/x86/kernel/crash_dump_64.c
> > index a7f617a3981d..97529552dd24 100644
> > --- a/arch/x86/kernel/crash_dump_64.c
> > +++ b/arch/x86/kernel/crash_dump_64.c
> > @@ -37,7 +37,6 @@ static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
> > 	} else
> > 		memcpy(buf, vaddr + offset, csize);
> > 
> > -	set_iounmap_nonlazy();
> > 	iounmap((void __iomem *)vaddr);
> > 	return csize;
> > }
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index e163372d3967..0b17498a34f1 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1671,17 +1671,6 @@ static DEFINE_MUTEX(vmap_purge_lock);
> > /* for per-CPU blocks */
> > static void purge_fragmented_blocks_allcpus(void);
> > 
> > -#ifdef CONFIG_X86_64
> > -/*
> > - * called before a call to iounmap() if the caller wants vm_area_struct's
> > - * immediately freed.
> > - */
> > -void set_iounmap_nonlazy(void)
> > -{
> > -	atomic_long_set(&vmap_lazy_nr, lazy_max_pages()+1);
> > -}
> > -#endif /* CONFIG_X86_64 */
> > -
> > /*
> >  * Purges all lazily-freed vmap areas.
> >  */
> > -- 
> > 2.35.1
> > 
> > 
> 



* Re: [PATCH v2] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
From: Omar Sandoval @ 2022-04-13 16:24 UTC
  To: Baoquan He
  Cc: Chris Down, linux-mm, kexec, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, x86, kernel-team

On Fri, Apr 08, 2022 at 11:02:47AM +0800, Baoquan He wrote:
> On 04/07/22 at 03:36pm, Chris Down wrote:
> > Omar Sandoval writes:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
> > > vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
> > > lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
> > > purge the vmap areas instead of doing it lazily.
> > > 
> > > Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
> > > context") moved the purging from the vunmap() caller to a worker thread.
> > > Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
> > > (possibly forever). For example, consider the following scenario:
> > > 
> > > 1. Thread reads from /proc/vmcore. This eventually calls
> > >   __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
> > >   vmap_lazy_nr to lazy_max_pages() + 1.
> > > 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
> > >   pages (one page plus the guard page) to the purge list and
> > >   vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
> > >   drain_vmap_work is scheduled.
> > > 3. Thread returns from the kernel and is scheduled out.
> > > 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
> > >   frees the 2 pages on the purge list. vmap_lazy_nr is now
> > >   lazy_max_pages() + 1.
> > > 5. This is still over the threshold, so it tries to purge areas again,
> > >   but doesn't find anything.
> > > 6. Repeat 5.
> > > 
> > > If the system is running with only one CPU (which is typical for kdump)
> > > and preemption is disabled, then this will never make forward progress:
> > > there aren't any more pages to purge, so it hangs. If there is more than
> > > one CPU or preemption is enabled, then the worker thread will spin
> > > forever in the background. (Note that if there were already pages to be
> > > purged at the time that set_iounmap_nonlazy() was called, this bug is
> > > avoided.)
> > > 
> > > This can be reproduced with anything that reads from /proc/vmcore
> > > multiple times. E.g., vmcore-dmesg /proc/vmcore.
> > > 
> > > It turns out that improvements to vmap() over the years have obsoleted
> > > the need for this "optimization". I benchmarked
> > > `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system
> > > with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that
> > > avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed
> > > entirely:
> > > 
> > >  |5.17  |5.18+fix|5.18+removal
> > > 4k|40.86s|  40.09s|      26.73s
> > > 1M|24.47s|  23.98s|      21.84s
> > > 
> > > The removal was the fastest (by a wide margin with 4k reads). This patch
> > > removes set_iounmap_nonlazy().
> > > 
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > 
> > It probably doesn't matter, but maybe worth adding in a Fixes tag just to
> > make sure anyone getting this without context understands that 690467c81b1a
> > ("mm/vmalloc: Move draining areas out of caller context") shouldn't reach
> > further rcs without this. Unlikely that would happen anyway, though.
> > 
> > Nice use of a bug as an impetus to clean things up :-) Thanks!
> 
> Since the Red Hat mail server is having issues, the body of the patch
> shows up empty in my mail client, so I'm replying here to comment.
> 
> As I replied to Omar in v1, I think this is a great fix. It would also
> be great to state whether this is a real issue that breaks things,
> which would warrant a 'Fixes' tag and a stable Cc like
> "Cc: <stable@vger.kernel.org> # 5.17", or an improvement found by code
> inspection.
> 
> I raise this because in distros, e.g. in our RHEL 8, we maintain old
> kernels and backport necessary patches into them, and patches with a
> 'Fixes' tag are definitely good candidates. This is important for LTS
> kernels too.
> 
> Thanks
> Baoquan

Hi, Baoquan,

Sorry I missed your replies. I'll answer your questions from your first
email.

> I am wondering if this is a real issue you hit, or one you just found
> by code inspection

I hit this issue with the test suite for drgn
(https://github.com/osandov/drgn). We run the test cases in a virtual
machine on various kernel versions
(https://github.com/osandov/drgn/tree/main/vmtest). Part of the test
suite crashes the kernel to run some tests against /proc/vmcore
(https://github.com/osandov/drgn/blob/13144eda119790cdbc11f360c15a04efdf81ae9a/setup.py#L213,
https://github.com/osandov/drgn/blob/main/vmtest/enter_kdump.py,
https://github.com/osandov/drgn/tree/main/tests/linux_kernel/vmcore).
When I tried v5.18-rc1 configured with !SMP and !PREEMPT, that part of
the test suite got stuck, which is how I found this issue.

> I am wondering how your vmcore dumping is handled. I'm asking because
> we usually use the makedumpfile utility

In production at Facebook, we don't run drgn directly against
/proc/vmcore. We use makedumpfile and inspect the captured file with
drgn once we reboot.

> While using makedumpfile, we use mmap, which maps 4M at a time by
> default, and then process the content. So copy_oldmem_page() may only
> be called while reading the elfcorehdr and notes.

We also use vmcore-dmesg
(https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/vmcore-dmesg)
on /proc/vmcore before calling makedumpfile. From what I can tell, that
uses read()/pread()
(https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/util_lib/elf_info.c),
so it would also hit this issue.
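
For illustration, here is a minimal sketch of the x86 read path
(simplified from arch/x86/kernel/crash_dump_64.c; the userbuf and
encrypted-memory cases are elided), showing that every read()-based
pass over /proc/vmcore goes through an ioremap()/iounmap() pair for
each page it copies, whereas the mmap() path does not:

	static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf,
					  size_t csize, unsigned long offset)
	{
		void *vaddr;

		/* Map exactly one page of the old kernel's memory. */
		vaddr = (void *)ioremap_cache(pfn << PAGE_SHIFT, PAGE_SIZE);
		if (!vaddr)
			return -ENOMEM;
		memcpy(buf, vaddr + offset, csize);
		/* With set_iounmap_nonlazy() still present, it ran right
		 * before this iounmap(), once per page copied. */
		iounmap((void __iomem *)vaddr);
		return csize;
	}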

I'll send a v3 adding Fixes: 690467c81b1a ("mm/vmalloc: Move draining
areas out of caller context"). I don't think a stable tag is necessary
since this was introduced in v5.18-rc1 and hasn't been backported as far
as I can tell.

Thanks,
Omar


* Re: [PATCH v2] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
From: Baoquan He @ 2022-04-14 10:32 UTC
  To: Omar Sandoval
  Cc: Chris Down, linux-mm, kexec, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, x86, kernel-team

On 04/13/22 at 09:24am, Omar Sandoval wrote:
> On Fri, Apr 08, 2022 at 11:02:47AM +0800, Baoquan He wrote:
......
> > Since the Red Hat mail server is having issues, the body of the patch
> > shows up empty in my mail client, so I'm replying here to comment.
> > 
> > As I replied to Omar in v1, I think this is a great fix. It would also
> > be great to state whether this is a real issue that breaks things,
> > which would warrant a 'Fixes' tag and a stable Cc like
> > "Cc: <stable@vger.kernel.org> # 5.17", or an improvement found by code
> > inspection.
> > 
> > I raise this because in distros, e.g. in our RHEL 8, we maintain old
> > kernels and backport necessary patches into them, and patches with a
> > 'Fixes' tag are definitely good candidates. This is important for LTS
> > kernels too.
> > 
> > Thanks
> > Baoquan
> 
> Hi, Baoquan,
> 
> Sorry I missed your replies. I'll answer your questions from your first
> email.
> 
> > I am wondering if this is a real issue you hit, or one you just found
> > by code inspection
> 
> I hit this issue with the test suite for drgn
> (https://github.com/osandov/drgn). We run the test cases in a virtual
> machine on various kernel versions
> (https://github.com/osandov/drgn/tree/main/vmtest). Part of the test
> suite crashes the kernel to run some tests against /proc/vmcore
> (https://github.com/osandov/drgn/blob/13144eda119790cdbc11f360c15a04efdf81ae9a/setup.py#L213,
> https://github.com/osandov/drgn/blob/main/vmtest/enter_kdump.py,
> https://github.com/osandov/drgn/tree/main/tests/linux_kernel/vmcore).
> When I tried v5.18-rc1 configured with !SMP and !PREEMPT, that part of
> the test suite got stuck, which is how I found this issue.
> 
> > I am wondering how your vmcore dumping is handled. I'm asking because
> > we usually use the makedumpfile utility
> 
> In production at Facebook, we don't run drgn directly against
> /proc/vmcore. We use makedumpfile and inspect the captured file with
> drgn once we reboot.
> 
> > While using makedumpfile, we use mmap, which maps 4M at a time by
> > default, and then process the content. So copy_oldmem_page() may only
> > be called while reading the elfcorehdr and notes.
> 
> We also use vmcore-dmesg
> (https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/vmcore-dmesg)
> on /proc/vmcore before calling makedumpfile. From what I can tell, that
> uses read()/pread()
> (https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/util_lib/elf_info.c),
> so it would also hit this issue.

Thanks for these details, and great patch. The situation and motivation
are clear to me now.

We also use vmcore-dmesg to collect the dmesg log before running
makedumpfile. Hitting this may be a low-probability event, but it's
worth adding the Fixes tag just in case.

> 
> I'll send a v3 adding Fixes: 690467c81b1a ("mm/vmalloc: Move draining
> areas out of caller context"). I don't think a stable tag is necessary
> since this was introduced in v5.18-rc1 and hasn't been backported as far
> as I can tell.
> 
> Thanks,
> Omar
> 


