linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
@ 2021-06-22 22:50 Rafael Aquini
  2021-06-23  8:43 ` Nicholas Piggin
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Rafael Aquini @ 2021-06-22 22:50 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, linux-kernel

On non-preemptible kernel builds the watchdog can complain
about soft lockups when vfree() is called against large
vmalloc areas:

[  210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded
[  238.654842] watchdog: BUG: soft lockup - CPU#181 stuck for 26s! [rmmod:5203]
[  238.662716] Modules linked in: kvmalloc_test(OE-) ...
[  238.772671] CPU: 181 PID: 5203 Comm: rmmod Tainted: G S         OE     5.13.0-rc7+ #1
[  238.781413] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0553.D01.1809190614 09/19/2018
[  238.792383] RIP: 0010:free_unref_page+0x52/0x60
[  238.797447] Code: 48 c1 fd 06 48 89 ee e8 9c d0 ff ff 84 c0 74 19 9c 41 5c fa 48 89 ee 48 89 df e8 b9 ea ff ff 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 29 77
[  238.818406] RSP: 0018:ffffb4d87868fe98 EFLAGS: 00000206
[  238.824236] RAX: 0000000000000000 RBX: 000000001da0c945 RCX: ffffb4d87868fe40
[  238.832200] RDX: ffffd79d3beed108 RSI: ffffd7998501dc08 RDI: ffff9c6fbffd7010
[  238.840166] RBP: 000000000d518cbd R08: ffffd7998501dc08 R09: 0000000000000001
[  238.848131] R10: 0000000000000000 R11: ffffd79d3beee088 R12: 0000000000000202
[  238.856095] R13: ffff9e5be3eceec0 R14: 0000000000000000 R15: 0000000000000000
[  238.864059] FS:  00007fe082c2d740(0000) GS:ffff9f4c69b40000(0000) knlGS:0000000000000000
[  238.873089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  238.879503] CR2: 000055a000611128 CR3: 000000f6094f6006 CR4: 00000000007706e0
[  238.887467] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  238.895433] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  238.903397] PKRU: 55555554
[  238.906417] Call Trace:
[  238.909149]  __vunmap+0x17c/0x220
[  238.912851]  __x64_sys_delete_module+0x13a/0x250
[  238.918008]  ? syscall_trace_enter.isra.20+0x13c/0x1b0
[  238.923746]  do_syscall_64+0x39/0x80
[  238.927740]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Like in other range zapping routines that iterate over
a large list, let's just add cond_resched() within __vunmap()'s
page-releasing loop in order to avoid the watchdog splats.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 mm/vmalloc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a13ac524f6ff..cd4b23d65748 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2564,6 +2564,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
 
 			BUG_ON(!page);
 			__free_pages(page, page_order);
+			cond_resched();
 		}
 		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
 
-- 
2.26.3
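
The pattern in the patch above can be tried outside the kernel with a
rough user-space analogy. This is only a sketch under stated assumptions:
release_pages() is a hypothetical name, free() stands in for
__free_pages(), assert() stands in for BUG_ON(), and sched_yield() stands
in for cond_resched(). The stand-in is not exact: cond_resched() only
gives up the CPU when a reschedule is pending, while sched_yield() enters
the scheduler on every call.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * User-space analogy of the page-releasing loop in __vunmap().
 * Frees every entry in pages[] and offers the CPU to other tasks
 * after each one. Returns the number of yield points passed.
 */
size_t release_pages(void **pages, size_t nr_pages)
{
	size_t yields = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		assert(pages[i] != NULL);	/* analogous to BUG_ON(!page) */
		free(pages[i]);			/* analogous to __free_pages() */
		pages[i] = NULL;
		sched_yield();			/* analogous to cond_resched() */
		yields++;
	}
	return yields;
}
```

The yield sits inside the loop, after each release, so the time between
two scheduling opportunities is bounded by the cost of freeing one entry
rather than by the size of the whole area.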



* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-22 22:50 [PATCH] mm: vmalloc: add cond_resched() in __vunmap() Rafael Aquini
@ 2021-06-23  8:43 ` Nicholas Piggin
  2021-06-23 17:30   ` Rafael Aquini
  2021-06-23 11:27 ` Uladzislau Rezki
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Nicholas Piggin @ 2021-06-23  8:43 UTC (permalink / raw)
  To: Rafael Aquini, linux-mm; +Cc: Andrew Morton, linux-kernel

Excerpts from Rafael Aquini's message of June 23, 2021 8:50 am:
> On non-preemptible kernel builds the watchdog can complain
> about soft lockups when vfree() is called against large
> vmalloc areas:
> 
> [  210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded

This is vfree()ing 2TB of memory? Maybe not too realistic, but
I guess it doesn't hurt to add.

Acked-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick
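
For scale, the size in the quoted log line, 2199023255552 bytes, is 2^41
bytes (2 TiB). The thread does not state the page size backing the area,
so the 4096-byte page used below is an assumption, and pages_for_bytes()
is a hypothetical helper written only for this calculation:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of pages needed to back `bytes`, rounding up to a whole page.
 * page_size must be non-zero.
 */
uint64_t pages_for_bytes(uint64_t bytes, uint64_t page_size)
{
	assert(page_size != 0);
	return bytes / page_size + (bytes % page_size != 0);
}
```

With 4 KiB pages, pages_for_bytes(2199023255552ULL, 4096) is 536870912,
i.e. 2^29 iterations of the page-releasing loop without a scheduling
point before the patch. Larger backing pages would reduce that count
proportionally.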


* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-22 22:50 [PATCH] mm: vmalloc: add cond_resched() in __vunmap() Rafael Aquini
  2021-06-23  8:43 ` Nicholas Piggin
@ 2021-06-23 11:27 ` Uladzislau Rezki
  2021-06-23 17:34   ` Rafael Aquini
  2021-06-23 12:11 ` Aaron Tomlin
  2021-06-24 12:21 ` Michal Hocko
  3 siblings, 1 reply; 14+ messages in thread
From: Uladzislau Rezki @ 2021-06-23 11:27 UTC (permalink / raw)
  To: Rafael Aquini; +Cc: linux-mm, Andrew Morton, linux-kernel

> On non-preemptible kernel builds the watchdog can complain
> about soft lockups when vfree() is called against large
> vmalloc areas:
> 
> [  210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded
> [  238.654842] watchdog: BUG: soft lockup - CPU#181 stuck for 26s! [rmmod:5203]
> [  238.662716] Modules linked in: kvmalloc_test(OE-) ...
> [  238.772671] CPU: 181 PID: 5203 Comm: rmmod Tainted: G S         OE     5.13.0-rc7+ #1
> [  238.781413] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0553.D01.1809190614 09/19/2018
> [  238.792383] RIP: 0010:free_unref_page+0x52/0x60
> [  238.797447] Code: 48 c1 fd 06 48 89 ee e8 9c d0 ff ff 84 c0 74 19 9c 41 5c fa 48 89 ee 48 89 df e8 b9 ea ff ff 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 29 77
> [  238.818406] RSP: 0018:ffffb4d87868fe98 EFLAGS: 00000206
> [  238.824236] RAX: 0000000000000000 RBX: 000000001da0c945 RCX: ffffb4d87868fe40
> [  238.832200] RDX: ffffd79d3beed108 RSI: ffffd7998501dc08 RDI: ffff9c6fbffd7010
> [  238.840166] RBP: 000000000d518cbd R08: ffffd7998501dc08 R09: 0000000000000001
> [  238.848131] R10: 0000000000000000 R11: ffffd79d3beee088 R12: 0000000000000202
> [  238.856095] R13: ffff9e5be3eceec0 R14: 0000000000000000 R15: 0000000000000000
> [  238.864059] FS:  00007fe082c2d740(0000) GS:ffff9f4c69b40000(0000) knlGS:0000000000000000
> [  238.873089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  238.879503] CR2: 000055a000611128 CR3: 000000f6094f6006 CR4: 00000000007706e0
> [  238.887467] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  238.895433] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  238.903397] PKRU: 55555554
> [  238.906417] Call Trace:
> [  238.909149]  __vunmap+0x17c/0x220
> [  238.912851]  __x64_sys_delete_module+0x13a/0x250
> [  238.918008]  ? syscall_trace_enter.isra.20+0x13c/0x1b0
> [  238.923746]  do_syscall_64+0x39/0x80
> [  238.927740]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Like in other range zapping routines that iterate over
> a large list, let's just add cond_resched() within __vunmap()'s
> page-releasing loop in order to avoid the watchdog splats.
> 
> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
>  mm/vmalloc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a13ac524f6ff..cd4b23d65748 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2564,6 +2564,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
>  
>  			BUG_ON(!page);
>  			__free_pages(page, page_order);
> +			cond_resched();
>  		}
>  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
>  
> -- 
> 2.26.3
> 
I have a question about the test case you ran to trigger this soft lockup.

Is that the test_vmalloc.sh test suite or something local? Do you use huge
vmalloc mappings, so that high-order pages are used?

Thanks!

--
Vlad Rezki


* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-22 22:50 [PATCH] mm: vmalloc: add cond_resched() in __vunmap() Rafael Aquini
  2021-06-23  8:43 ` Nicholas Piggin
  2021-06-23 11:27 ` Uladzislau Rezki
@ 2021-06-23 12:11 ` Aaron Tomlin
  2021-06-24 12:21 ` Michal Hocko
  3 siblings, 0 replies; 14+ messages in thread
From: Aaron Tomlin @ 2021-06-23 12:11 UTC (permalink / raw)
  To: Rafael Aquini; +Cc: linux-mm, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 848 bytes --]

On Tue 2021-06-22 18:50 -0400, Rafael Aquini wrote:
> Like in other range zapping routines that iterate over
> a large list, let's just add cond_resched() within __vunmap()'s
> page-releasing loop in order to avoid the watchdog splats.
> 
> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
>  mm/vmalloc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a13ac524f6ff..cd4b23d65748 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2564,6 +2564,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
>  
>  			BUG_ON(!page);
>  			__free_pages(page, page_order);
> +			cond_resched();
>  		}
>  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
>  
> -- 
> 2.26.3
> 

Good catch.

Reviewed-by: Aaron Tomlin <atomlin@redhat.com>

-- 
Aaron Tomlin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-23  8:43 ` Nicholas Piggin
@ 2021-06-23 17:30   ` Rafael Aquini
  0 siblings, 0 replies; 14+ messages in thread
From: Rafael Aquini @ 2021-06-23 17:30 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linux-mm, Andrew Morton, linux-kernel

On Wed, Jun 23, 2021 at 06:43:50PM +1000, Nicholas Piggin wrote:
> Excerpts from Rafael Aquini's message of June 23, 2021 8:50 am:
> > On non-preemptible kernel builds the watchdog can complain
> > about soft lockups when vfree() is called against large
> > vmalloc areas:
> > 
> > [  210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded
> 
> This is vfree()ing 2TB of memory? Maybe not too realistic, but
> I guess it doesn't hurt to add.
>

Andrew recently addressed https://bugzilla.kernel.org/show_bug.cgi?id=210023
via 34fe653716b0d ("mm/vmalloc.c:__vmalloc_area_node(): avoid 32-bit overflow").

The person who filed that bug also filed a case requesting that the issue
be addressed in RHEL. So the case for large vmaps such as the one in the
test case is real, although perhaps not a frequent one.

Realistically, though, that vunmap loop can still burn a considerable amount
of CPU time even when vfreeing smaller areas. So, as you noted, it's better
to play nice, back off a bit, and give others a chance to run as well.

> Acked-by: Nicholas Piggin <npiggin@gmail.com>

Thanks!

-- Rafael




* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-23 11:27 ` Uladzislau Rezki
@ 2021-06-23 17:34   ` Rafael Aquini
  2021-06-23 20:14     ` Uladzislau Rezki
  0 siblings, 1 reply; 14+ messages in thread
From: Rafael Aquini @ 2021-06-23 17:34 UTC (permalink / raw)
  To: Uladzislau Rezki; +Cc: linux-mm, Andrew Morton, linux-kernel

On Wed, Jun 23, 2021 at 01:27:04PM +0200, Uladzislau Rezki wrote:
> > On non-preemptible kernel builds the watchdog can complain
> > about soft lockups when vfree() is called against large
> > vmalloc areas:
> > 
> > [  210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded
> > [  238.654842] watchdog: BUG: soft lockup - CPU#181 stuck for 26s! [rmmod:5203]
> > [  238.662716] Modules linked in: kvmalloc_test(OE-) ...
> > [  238.772671] CPU: 181 PID: 5203 Comm: rmmod Tainted: G S         OE     5.13.0-rc7+ #1
> > [  238.781413] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0553.D01.1809190614 09/19/2018
> > [  238.792383] RIP: 0010:free_unref_page+0x52/0x60
> > [  238.797447] Code: 48 c1 fd 06 48 89 ee e8 9c d0 ff ff 84 c0 74 19 9c 41 5c fa 48 89 ee 48 89 df e8 b9 ea ff ff 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 29 77
> > [  238.818406] RSP: 0018:ffffb4d87868fe98 EFLAGS: 00000206
> > [  238.824236] RAX: 0000000000000000 RBX: 000000001da0c945 RCX: ffffb4d87868fe40
> > [  238.832200] RDX: ffffd79d3beed108 RSI: ffffd7998501dc08 RDI: ffff9c6fbffd7010
> > [  238.840166] RBP: 000000000d518cbd R08: ffffd7998501dc08 R09: 0000000000000001
> > [  238.848131] R10: 0000000000000000 R11: ffffd79d3beee088 R12: 0000000000000202
> > [  238.856095] R13: ffff9e5be3eceec0 R14: 0000000000000000 R15: 0000000000000000
> > [  238.864059] FS:  00007fe082c2d740(0000) GS:ffff9f4c69b40000(0000) knlGS:0000000000000000
> > [  238.873089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  238.879503] CR2: 000055a000611128 CR3: 000000f6094f6006 CR4: 00000000007706e0
> > [  238.887467] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  238.895433] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  238.903397] PKRU: 55555554
> > [  238.906417] Call Trace:
> > [  238.909149]  __vunmap+0x17c/0x220
> > [  238.912851]  __x64_sys_delete_module+0x13a/0x250
> > [  238.918008]  ? syscall_trace_enter.isra.20+0x13c/0x1b0
> > [  238.923746]  do_syscall_64+0x39/0x80
> > [  238.927740]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > 
> > Like in other range zapping routines that iterate over
> > a large list, let's just add cond_resched() within __vunmap()'s
> > page-releasing loop in order to avoid the watchdog splats.
> > 
> > Signed-off-by: Rafael Aquini <aquini@redhat.com>
> > ---
> >  mm/vmalloc.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index a13ac524f6ff..cd4b23d65748 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2564,6 +2564,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
> >  
> >  			BUG_ON(!page);
> >  			__free_pages(page, page_order);
> > +			cond_resched();
> >  		}
> >  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> >  
> > -- 
> > 2.26.3
> > 
> I have a question about the test case you ran to trigger this soft lockup.
> 
> Is that the test_vmalloc.sh test suite or something local? Do you use huge
> vmalloc mappings, so that high-order pages are used?
>

Vlad,

It's a variant of the simple testcase presented with Kernel Bug 210023:
https://bugzilla.kernel.org/show_bug.cgi?id=210023#c7

Cheers,
-- Rafael



* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-23 17:34   ` Rafael Aquini
@ 2021-06-23 20:14     ` Uladzislau Rezki
  0 siblings, 0 replies; 14+ messages in thread
From: Uladzislau Rezki @ 2021-06-23 20:14 UTC (permalink / raw)
  To: Rafael Aquini; +Cc: Uladzislau Rezki, linux-mm, Andrew Morton, linux-kernel

On Wed, Jun 23, 2021 at 01:34:33PM -0400, Rafael Aquini wrote:
> On Wed, Jun 23, 2021 at 01:27:04PM +0200, Uladzislau Rezki wrote:
> > > On non-preemptible kernel builds the watchdog can complain
> > > about soft lockups when vfree() is called against large
> > > vmalloc areas:
> > > 
> > > [  210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded
> > > [  238.654842] watchdog: BUG: soft lockup - CPU#181 stuck for 26s! [rmmod:5203]
> > > [  238.662716] Modules linked in: kvmalloc_test(OE-) ...
> > > [  238.772671] CPU: 181 PID: 5203 Comm: rmmod Tainted: G S         OE     5.13.0-rc7+ #1
> > > [  238.781413] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0553.D01.1809190614 09/19/2018
> > > [  238.792383] RIP: 0010:free_unref_page+0x52/0x60
> > > [  238.797447] Code: 48 c1 fd 06 48 89 ee e8 9c d0 ff ff 84 c0 74 19 9c 41 5c fa 48 89 ee 48 89 df e8 b9 ea ff ff 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 29 77
> > > [  238.818406] RSP: 0018:ffffb4d87868fe98 EFLAGS: 00000206
> > > [  238.824236] RAX: 0000000000000000 RBX: 000000001da0c945 RCX: ffffb4d87868fe40
> > > [  238.832200] RDX: ffffd79d3beed108 RSI: ffffd7998501dc08 RDI: ffff9c6fbffd7010
> > > [  238.840166] RBP: 000000000d518cbd R08: ffffd7998501dc08 R09: 0000000000000001
> > > [  238.848131] R10: 0000000000000000 R11: ffffd79d3beee088 R12: 0000000000000202
> > > [  238.856095] R13: ffff9e5be3eceec0 R14: 0000000000000000 R15: 0000000000000000
> > > [  238.864059] FS:  00007fe082c2d740(0000) GS:ffff9f4c69b40000(0000) knlGS:0000000000000000
> > > [  238.873089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  238.879503] CR2: 000055a000611128 CR3: 000000f6094f6006 CR4: 00000000007706e0
> > > [  238.887467] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [  238.895433] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [  238.903397] PKRU: 55555554
> > > [  238.906417] Call Trace:
> > > [  238.909149]  __vunmap+0x17c/0x220
> > > [  238.912851]  __x64_sys_delete_module+0x13a/0x250
> > > [  238.918008]  ? syscall_trace_enter.isra.20+0x13c/0x1b0
> > > [  238.923746]  do_syscall_64+0x39/0x80
> > > [  238.927740]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > 
> > > Like in other range zapping routines that iterate over
> > > a large list, let's just add cond_resched() within __vunmap()'s
> > > page-releasing loop in order to avoid the watchdog splats.
> > > 
> > > Signed-off-by: Rafael Aquini <aquini@redhat.com>
> > > ---
> > >  mm/vmalloc.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index a13ac524f6ff..cd4b23d65748 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2564,6 +2564,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
> > >  
> > >  			BUG_ON(!page);
> > >  			__free_pages(page, page_order);
> > > +			cond_resched();
> > >  		}
> > >  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> > >  
> > > -- 
> > > 2.26.3
> > > 
> > I have a question about the test case you ran to trigger this soft lockup.
> > 
> > Is that the test_vmalloc.sh test suite or something local? Do you use huge
> > vmalloc mappings, so that high-order pages are used?
> >
> 
> Vlad,
> 
> It's a variant of the simple testcase presented with Kernel Bug 210023:
> https://bugzilla.kernel.org/show_bug.cgi?id=210023#c7
> 
OK, now I see how you get a ~23-second soft lockup :)

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Thanks!

--
Vlad Rezki


* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-22 22:50 [PATCH] mm: vmalloc: add cond_resched() in __vunmap() Rafael Aquini
                   ` (2 preceding siblings ...)
  2021-06-23 12:11 ` Aaron Tomlin
@ 2021-06-24 12:21 ` Michal Hocko
  2021-06-24 14:23   ` Uladzislau Rezki
  3 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2021-06-24 12:21 UTC (permalink / raw)
  To: Rafael Aquini; +Cc: linux-mm, Andrew Morton, linux-kernel

On Tue 22-06-21 18:50:30, Rafael Aquini wrote:
> On non-preemptible kernel builds the watchdog can complain
> about soft lockups when vfree() is called against large
> vmalloc areas:
> 
> [  210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded
> [  238.654842] watchdog: BUG: soft lockup - CPU#181 stuck for 26s! [rmmod:5203]
> [  238.662716] Modules linked in: kvmalloc_test(OE-) ...
> [  238.772671] CPU: 181 PID: 5203 Comm: rmmod Tainted: G S         OE     5.13.0-rc7+ #1
> [  238.781413] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0553.D01.1809190614 09/19/2018
> [  238.792383] RIP: 0010:free_unref_page+0x52/0x60
> [  238.797447] Code: 48 c1 fd 06 48 89 ee e8 9c d0 ff ff 84 c0 74 19 9c 41 5c fa 48 89 ee 48 89 df e8 b9 ea ff ff 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 29 77
> [  238.818406] RSP: 0018:ffffb4d87868fe98 EFLAGS: 00000206
> [  238.824236] RAX: 0000000000000000 RBX: 000000001da0c945 RCX: ffffb4d87868fe40
> [  238.832200] RDX: ffffd79d3beed108 RSI: ffffd7998501dc08 RDI: ffff9c6fbffd7010
> [  238.840166] RBP: 000000000d518cbd R08: ffffd7998501dc08 R09: 0000000000000001
> [  238.848131] R10: 0000000000000000 R11: ffffd79d3beee088 R12: 0000000000000202
> [  238.856095] R13: ffff9e5be3eceec0 R14: 0000000000000000 R15: 0000000000000000
> [  238.864059] FS:  00007fe082c2d740(0000) GS:ffff9f4c69b40000(0000) knlGS:0000000000000000
> [  238.873089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  238.879503] CR2: 000055a000611128 CR3: 000000f6094f6006 CR4: 00000000007706e0
> [  238.887467] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  238.895433] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  238.903397] PKRU: 55555554
> [  238.906417] Call Trace:
> [  238.909149]  __vunmap+0x17c/0x220
> [  238.912851]  __x64_sys_delete_module+0x13a/0x250
> [  238.918008]  ? syscall_trace_enter.isra.20+0x13c/0x1b0
> [  238.923746]  do_syscall_64+0x39/0x80
> [  238.927740]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Like in other range zapping routines that iterate over
> a large list, let's just add cond_resched() within __vunmap()'s
> page-releasing loop in order to avoid the watchdog splats.

cond_resched() makes a lot of sense. We do not want vmalloc to be visible
to userspace (e.g. by stalling it), so all time-consuming operations
should yield regularly whenever possible. I would expect any subsystem
that needs huge vmalloc areas to keep them for the whole boot lifetime,
so such large vfrees should be really rare.

> Signed-off-by: Rafael Aquini <aquini@redhat.com>

Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a13ac524f6ff..cd4b23d65748 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2564,6 +2564,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
>  
>  			BUG_ON(!page);
>  			__free_pages(page, page_order);
> +			cond_resched();
>  		}
>  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
>  
> -- 
> 2.26.3

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-24 12:21 ` Michal Hocko
@ 2021-06-24 14:23   ` Uladzislau Rezki
  2021-06-25  8:51     ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Uladzislau Rezki @ 2021-06-24 14:23 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Mel Gorman, Matthew Wilcox, Rafael Aquini
  Cc: Rafael Aquini, linux-mm, Andrew Morton, linux-kernel

On Thu, Jun 24, 2021 at 02:21:21PM +0200, Michal Hocko wrote:
> On Tue 22-06-21 18:50:30, Rafael Aquini wrote:
> > On non-preemptible kernel builds the watchdog can complain
> > about soft lockups when vfree() is called against large
> > vmalloc areas:
> > 
> > [  210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded
> > [  238.654842] watchdog: BUG: soft lockup - CPU#181 stuck for 26s! [rmmod:5203]
> > [  238.662716] Modules linked in: kvmalloc_test(OE-) ...
> > [  238.772671] CPU: 181 PID: 5203 Comm: rmmod Tainted: G S         OE     5.13.0-rc7+ #1
> > [  238.781413] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0553.D01.1809190614 09/19/2018
> > [  238.792383] RIP: 0010:free_unref_page+0x52/0x60
> > [  238.797447] Code: 48 c1 fd 06 48 89 ee e8 9c d0 ff ff 84 c0 74 19 9c 41 5c fa 48 89 ee 48 89 df e8 b9 ea ff ff 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 29 77
> > [  238.818406] RSP: 0018:ffffb4d87868fe98 EFLAGS: 00000206
> > [  238.824236] RAX: 0000000000000000 RBX: 000000001da0c945 RCX: ffffb4d87868fe40
> > [  238.832200] RDX: ffffd79d3beed108 RSI: ffffd7998501dc08 RDI: ffff9c6fbffd7010
> > [  238.840166] RBP: 000000000d518cbd R08: ffffd7998501dc08 R09: 0000000000000001
> > [  238.848131] R10: 0000000000000000 R11: ffffd79d3beee088 R12: 0000000000000202
> > [  238.856095] R13: ffff9e5be3eceec0 R14: 0000000000000000 R15: 0000000000000000
> > [  238.864059] FS:  00007fe082c2d740(0000) GS:ffff9f4c69b40000(0000) knlGS:0000000000000000
> > [  238.873089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  238.879503] CR2: 000055a000611128 CR3: 000000f6094f6006 CR4: 00000000007706e0
> > [  238.887467] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  238.895433] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  238.903397] PKRU: 55555554
> > [  238.906417] Call Trace:
> > [  238.909149]  __vunmap+0x17c/0x220
> > [  238.912851]  __x64_sys_delete_module+0x13a/0x250
> > [  238.918008]  ? syscall_trace_enter.isra.20+0x13c/0x1b0
> > [  238.923746]  do_syscall_64+0x39/0x80
> > [  238.927740]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > 
> > Like in other range zapping routines that iterate over
> > a large list, let's just add cond_resched() within __vunmap()'s
> > page-releasing loop in order to avoid the watchdog splats.
> 
> cond_resched() makes a lot of sense. We do not want vmalloc to be visible
> to userspace (e.g. by stalling it), so all time-consuming operations
> should yield regularly whenever possible. I would expect any subsystem
> that needs huge vmalloc areas to keep them for the whole boot lifetime,
> so such large vfrees should be really rare.
> 
There is at least one more place with a potentially similar issue. I see that
the bulk allocator disables IRQs while obtaining pages and filling an array.

So I suspect that if we request a huge allocation via vmalloc, the same soft
lockup could occur, for example with 10G allocations running simultaneously
on different CPUs.

I will try to reproduce it on a !CONFIG_PREEMPT kernel and post feedback here.

--
Vlad Rezki
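
One conceivable way to bound the time spent per allocator call is to cap
each bulk request and offer the CPU between requests. The user-space
sketch below only illustrates that idea. It is not a fix proposed in this
message and not the kernel's bulk allocator API: bulk_alloc(),
alloc_in_batches() and MAX_BATCH are hypothetical names, malloc() stands
in for page allocation, sched_yield() stands in for cond_resched(), and
user space has no equivalent of the IRQ-disabled section.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_BATCH 100	/* hypothetical cap, chosen only for illustration */

/*
 * Hypothetical stand-in for a bulk page allocator: fills up to `want`
 * slots and returns how many it actually filled.
 */
size_t bulk_alloc(void **slots, size_t want)
{
	size_t got = 0;

	while (got < want) {
		slots[got] = malloc(4096);
		if (!slots[got])
			break;
		got++;
	}
	return got;
}

/*
 * Fill pages[] in requests of at most MAX_BATCH entries, yielding the
 * CPU between requests. Returns the number of entries filled and
 * stores the number of bulk requests issued in *batches.
 */
size_t alloc_in_batches(void **pages, size_t nr_pages, size_t *batches)
{
	size_t done = 0;

	*batches = 0;
	while (done < nr_pages) {
		size_t want = nr_pages - done;
		size_t got;

		if (want > MAX_BATCH)
			want = MAX_BATCH;

		got = bulk_alloc(pages + done, want);
		done += got;
		(*batches)++;

		if (got < want)
			break;		/* allocator ran dry, stop early */
		sched_yield();		/* stands in for cond_resched() */
	}
	return done;
}
```

With the cap at 100, a request for 250 entries is served as three bulk
requests (100, 100 and 50), so no single call works on more than
MAX_BATCH entries before other tasks get a chance to run.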


* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-24 14:23   ` Uladzislau Rezki
@ 2021-06-25  8:51     ` Michal Hocko
  2021-06-25 16:00       ` Uladzislau Rezki
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2021-06-25  8:51 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Mel Gorman, Matthew Wilcox, Rafael Aquini,
	linux-mm, linux-kernel

On Thu 24-06-21 16:23:39, Uladzislau Rezki wrote:
> There is at least one more place with a potentially similar issue. I see that
> the bulk allocator disables irqs while obtaining pages and filling an array.
> 
> So I suspect that if we request a huge size to allocate over vmalloc, the same
> soft lockup should occur. For example, 10G allocations made simultaneously on
> different CPUs.

I haven't paid close attention to the changes regarding the bulk
allocator, but my high-level understanding is that it only allocates
from pcp lists, so the number of allocatable pages is quite limited.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-25  8:51     ` Michal Hocko
@ 2021-06-25 16:00       ` Uladzislau Rezki
  2021-06-28 12:46         ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Uladzislau Rezki @ 2021-06-25 16:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, Andrew Morton, Mel Gorman, Matthew Wilcox,
	Rafael Aquini, linux-mm, linux-kernel

On Fri, Jun 25, 2021 at 10:51:08AM +0200, Michal Hocko wrote:
> On Thu 24-06-21 16:23:39, Uladzislau Rezki wrote:
> > There is at least one more place with a potentially similar issue. I see that
> > the bulk allocator disables irqs while obtaining pages and filling an array.
> > 
> > So I suspect that if we request a huge size to allocate over vmalloc, the same
> > soft lockup should occur. For example, 10G allocations made simultaneously on
> > different CPUs.
> 
> I haven't paid close attention to the changes regarding the bulk
> allocator, but my high-level understanding is that it only allocates
> from pcp lists, so the number of allocatable pages is quite limited.

I am able to trigger it. To simulate it I run 10 threads that allocate and vfree
~1GB (262144 pages) of vmalloced memory at the same time:

<snip>
[   62.512621] RIP: 0010:__alloc_pages_bulk+0xa9f/0xbb0
[   62.512628] Code: ff 8b 44 24 48 44 29 f8 83 f8 01 0f 84 ea fe ff ff e9 07 f6 ff ff 48 8b 44 24 60 48 89 28 e9 00 f9 ff ff fb 66 0f 1f 44 00 00 <e9> e8 fd ff ff 65 48 01 51 10 e9 3e fe ff ff 48 8b 44 24 78 4d 89
[   62.512629] RSP: 0018:ffffa7bfc29ffd20 EFLAGS: 00000206
[   62.512631] RAX: 0000000000000200 RBX: ffffcd5405421888 RCX: ffff8c36ffdeb928
[   62.512632] RDX: 0000000000040000 RSI: ffffa896f06b2ff8 RDI: ffffcd5405421880
[   62.512633] RBP: ffffcd5405421880 R08: 000000000000007d R09: ffffffffffffffff
[   62.512634] R10: ffffffff9d63c084 R11: 00000000ffffffff R12: ffff8c373ffaeb80
[   62.512635] R13: ffff8c36ffdf65f8 R14: ffff8c373ffaeb80 R15: 0000000000040000
[   62.512637] FS:  0000000000000000(0000) GS:ffff8c36ffdc0000(0000) knlGS:0000000000000000
[   62.512638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   62.512639] CR2: 000055c8e2fe8610 CR3: 0000000c13e10000 CR4: 00000000000006e0
[   62.512641] Call Trace:
[   62.512646]  __vmalloc_node_range+0x11c/0x2d0
[   62.512649]  ? full_fit_alloc_test+0x140/0x140 [test_vmalloc]
[   62.512654]  __vmalloc_node+0x4b/0x70
[   62.512656]  ? fix_size_alloc_test+0x44/0x60 [test_vmalloc]
[   62.512659]  fix_size_alloc_test+0x44/0x60 [test_vmalloc]
[   62.512662]  test_func+0xe7/0x1f0 [test_vmalloc]
[   62.512666]  ? fix_align_alloc_test+0x50/0x50 [test_vmalloc]
[   62.512668]  kthread+0x11a/0x140
[   62.512671]  ? set_kthread_struct+0x40/0x40
[   62.512672]  ret_from_fork+0x22/0x30
<snip>

As for how much a bulk allocator can allocate, it is quite a lot. In my case I see
that 262144 pages can be obtained in one bulk call; if the pcp-list is empty it is
refilled.
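
As a sanity check on the numbers above, here is a minimal user-space C sketch.
It assumes 4 KiB base pages (the x86-64 default; the page size is not stated in
this thread), and the names SKETCH_PAGE_SIZE, pages_to_bytes() and
report_test_size() are made up for illustration. 262144 is 0x40000, which is
also the value that shows up in RDX and R15 in the trace; that those registers
hold the requested page count is only a guess.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: 4 KiB base pages, the x86-64 default. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Convert a page count to a size in bytes. */
static unsigned long long pages_to_bytes(unsigned long long nr_pages)
{
	return nr_pages * SKETCH_PAGE_SIZE;
}

/* Print the size of one test allocation and return it in bytes. */
static unsigned long long report_test_size(void)
{
	const unsigned long long nr_pages = 262144;	/* from the test above */
	const unsigned long long bytes = pages_to_bytes(nr_pages);

	/* 262144 == 0x40000, the value visible in RDX and R15 of the trace. */
	assert(nr_pages == 0x40000ULL);

	printf("%llu pages = %llu bytes = %llu GiB\n",
	       nr_pages, bytes, bytes >> 30);
	return bytes;
}
```

Calling report_test_size() from a main() shows that one such request is exactly
1 GiB, so the 10 threads together ask for about 10 GiB.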

On the other hand, allocating 1GB on 10 CPUs simultaneously is not a common test
case in the real world.

Not sure if we can do something about it and whether it is worth fixing. At least
we can invoke the bulk allocator several times, doing it per specific batch, for
example 50 pages.
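
The batching idea could look roughly like the user-space sketch below. It is not
kernel code: sketch_alloc_bulk() is a hypothetical stand-in for the bulk
allocator built on malloc(), sched_yield() stands in for cond_resched(), and
SKETCH_BATCH is just the 50 pages mentioned above.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_BATCH 50U	/* batch size taken from the suggestion above */

/*
 * Hypothetical stand-in for the bulk page allocator: fills pages[0..nr-1]
 * with malloc()ed 4 KiB buffers and returns how many entries were filled.
 */
static unsigned int sketch_alloc_bulk(unsigned int nr, void **pages)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		pages[i] = malloc(4096);
		if (!pages[i])
			break;
	}
	return i;
}

/*
 * Populate pages[] in batches of at most SKETCH_BATCH entries, yielding
 * between batches. Returns the number of pages obtained; *nr_calls counts
 * how many times the bulk allocator was invoked.
 */
static unsigned int sketch_populate(unsigned int nr_pages, void **pages,
				    unsigned int *nr_calls)
{
	unsigned int done = 0;

	*nr_calls = 0;
	while (done < nr_pages) {
		unsigned int want = nr_pages - done;
		unsigned int got;

		if (want > SKETCH_BATCH)
			want = SKETCH_BATCH;

		got = sketch_alloc_bulk(want, pages + done);
		(*nr_calls)++;
		done += got;
		if (got < want)
			break;		/* allocator ran dry, stop early */

		sched_yield();		/* user-space analogue of cond_resched() */
	}
	return done;
}
```

With this shape a request for 120 pages turns into three bulk calls (50, 50 and
20 pages), and no single call works on more than SKETCH_BATCH pages.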

Any thoughts about it?

--
Vlad Rezki

* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-25 16:00       ` Uladzislau Rezki
@ 2021-06-28 12:46         ` Michal Hocko
  2021-06-28 16:40           ` Uladzislau Rezki
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2021-06-28 12:46 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Mel Gorman, Matthew Wilcox, Rafael Aquini,
	linux-mm, linux-kernel

On Fri 25-06-21 18:00:11, Uladzislau Rezki wrote:
> On Fri, Jun 25, 2021 at 10:51:08AM +0200, Michal Hocko wrote:
> > On Thu 24-06-21 16:23:39, Uladzislau Rezki wrote:
> > > There is at least one more place with a potentially similar issue. I see that
> > > the bulk allocator disables irqs while obtaining pages and filling an array.
> > > 
> > > So I suspect that if we request a huge size to allocate over vmalloc, the same
> > > soft lockup should occur. For example, 10G allocations made simultaneously on
> > > different CPUs.
> > 
> > I haven't paid close attention to the changes regarding the bulk
> > allocator, but my high-level understanding is that it only allocates
> > from pcp lists, so the number of allocatable pages is quite limited.
> 
> I am able to trigger it. To simulate it I run 10 threads that allocate and vfree
> ~1GB (262144 pages) of vmalloced memory at the same time:
> 
> <snip>
> [   62.512621] RIP: 0010:__alloc_pages_bulk+0xa9f/0xbb0
> [   62.512628] Code: ff 8b 44 24 48 44 29 f8 83 f8 01 0f 84 ea fe ff ff e9 07 f6 ff ff 48 8b 44 24 60 48 89 28 e9 00 f9 ff ff fb 66 0f 1f 44 00 00 <e9> e8 fd ff ff 65 48 01 51 10 e9 3e fe ff ff 48 8b 44 24 78 4d 89
> [   62.512629] RSP: 0018:ffffa7bfc29ffd20 EFLAGS: 00000206
> [   62.512631] RAX: 0000000000000200 RBX: ffffcd5405421888 RCX: ffff8c36ffdeb928
> [   62.512632] RDX: 0000000000040000 RSI: ffffa896f06b2ff8 RDI: ffffcd5405421880
> [   62.512633] RBP: ffffcd5405421880 R08: 000000000000007d R09: ffffffffffffffff
> [   62.512634] R10: ffffffff9d63c084 R11: 00000000ffffffff R12: ffff8c373ffaeb80
> [   62.512635] R13: ffff8c36ffdf65f8 R14: ffff8c373ffaeb80 R15: 0000000000040000
> [   62.512637] FS:  0000000000000000(0000) GS:ffff8c36ffdc0000(0000) knlGS:0000000000000000
> [   62.512638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   62.512639] CR2: 000055c8e2fe8610 CR3: 0000000c13e10000 CR4: 00000000000006e0
> [   62.512641] Call Trace:
> [   62.512646]  __vmalloc_node_range+0x11c/0x2d0
> [   62.512649]  ? full_fit_alloc_test+0x140/0x140 [test_vmalloc]
> [   62.512654]  __vmalloc_node+0x4b/0x70
> [   62.512656]  ? fix_size_alloc_test+0x44/0x60 [test_vmalloc]
> [   62.512659]  fix_size_alloc_test+0x44/0x60 [test_vmalloc]
> [   62.512662]  test_func+0xe7/0x1f0 [test_vmalloc]
> [   62.512666]  ? fix_align_alloc_test+0x50/0x50 [test_vmalloc]
> [   62.512668]  kthread+0x11a/0x140
> [   62.512671]  ? set_kthread_struct+0x40/0x40
> [   62.512672]  ret_from_fork+0x22/0x30
> <snip>
> 
> As for how much a bulk allocator can allocate, it is quite a lot. In my case I see
> that 262144 pages can be obtained in one bulk call; if the pcp-list is empty it is
> refilled.

Hmm, that is surprising. I would have to take a closer look but I
thought the pcp list won't get refilled while there is a consumer on
that cpu. So it should really be just about the number of pages on pcp
lists. 1GB worth of memory there sounds way too much.

> On the other hand, allocating 1GB on 10 CPUs simultaneously is not a common test
> case in the real world.
> 
> Not sure if we can do something about it and whether it is worth fixing. At least
> we can invoke the bulk allocator several times, doing it per specific batch, for
> example 50 pages.
> 
> Any thoughts about it?

On the other hand, the bulk allocator is meant to be optimized for speed
and it assumes a certain level of reasonableness from its callers, so it
makes some sense to do reasonably sized batches at the vmalloc end.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-28 12:46         ` Michal Hocko
@ 2021-06-28 16:40           ` Uladzislau Rezki
  2021-06-29 14:11             ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Uladzislau Rezki @ 2021-06-28 16:40 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Uladzislau Rezki, Andrew Morton, Mel Gorman, Matthew Wilcox,
	Rafael Aquini, linux-mm, linux-kernel

On Mon, Jun 28, 2021 at 02:46:06PM +0200, Michal Hocko wrote:
> On Fri 25-06-21 18:00:11, Uladzislau Rezki wrote:
> > On Fri, Jun 25, 2021 at 10:51:08AM +0200, Michal Hocko wrote:
> > > On Thu 24-06-21 16:23:39, Uladzislau Rezki wrote:
> > > > There is at least one more place with a potentially similar issue. I see that
> > > > the bulk allocator disables irqs while obtaining pages and filling an array.
> > > > 
> > > > So I suspect that if we request a huge size to allocate over vmalloc, the same
> > > > soft lockup should occur. For example, 10G allocations made simultaneously on
> > > > different CPUs.
> > > 
> > > I haven't paid close attention to the changes regarding the bulk
> > > allocator, but my high-level understanding is that it only allocates
> > > from pcp lists, so the number of allocatable pages is quite limited.
> > 
> > I am able to trigger it. To simulate it I run 10 threads that allocate and vfree
> > ~1GB (262144 pages) of vmalloced memory at the same time:
> > 
> > <snip>
> > [   62.512621] RIP: 0010:__alloc_pages_bulk+0xa9f/0xbb0
> > [   62.512628] Code: ff 8b 44 24 48 44 29 f8 83 f8 01 0f 84 ea fe ff ff e9 07 f6 ff ff 48 8b 44 24 60 48 89 28 e9 00 f9 ff ff fb 66 0f 1f 44 00 00 <e9> e8 fd ff ff 65 48 01 51 10 e9 3e fe ff ff 48 8b 44 24 78 4d 89
> > [   62.512629] RSP: 0018:ffffa7bfc29ffd20 EFLAGS: 00000206
> > [   62.512631] RAX: 0000000000000200 RBX: ffffcd5405421888 RCX: ffff8c36ffdeb928
> > [   62.512632] RDX: 0000000000040000 RSI: ffffa896f06b2ff8 RDI: ffffcd5405421880
> > [   62.512633] RBP: ffffcd5405421880 R08: 000000000000007d R09: ffffffffffffffff
> > [   62.512634] R10: ffffffff9d63c084 R11: 00000000ffffffff R12: ffff8c373ffaeb80
> > [   62.512635] R13: ffff8c36ffdf65f8 R14: ffff8c373ffaeb80 R15: 0000000000040000
> > [   62.512637] FS:  0000000000000000(0000) GS:ffff8c36ffdc0000(0000) knlGS:0000000000000000
> > [   62.512638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   62.512639] CR2: 000055c8e2fe8610 CR3: 0000000c13e10000 CR4: 00000000000006e0
> > [   62.512641] Call Trace:
> > [   62.512646]  __vmalloc_node_range+0x11c/0x2d0
> > [   62.512649]  ? full_fit_alloc_test+0x140/0x140 [test_vmalloc]
> > [   62.512654]  __vmalloc_node+0x4b/0x70
> > [   62.512656]  ? fix_size_alloc_test+0x44/0x60 [test_vmalloc]
> > [   62.512659]  fix_size_alloc_test+0x44/0x60 [test_vmalloc]
> > [   62.512662]  test_func+0xe7/0x1f0 [test_vmalloc]
> > [   62.512666]  ? fix_align_alloc_test+0x50/0x50 [test_vmalloc]
> > [   62.512668]  kthread+0x11a/0x140
> > [   62.512671]  ? set_kthread_struct+0x40/0x40
> > [   62.512672]  ret_from_fork+0x22/0x30
> > <snip>
> > 
> > As for how much a bulk allocator can allocate, it is quite a lot. In my case I see
> > that 262144 pages can be obtained in one bulk call; if the pcp-list is empty it is
> > refilled.
> 
> Hmm, that is surprising. I would have to take a closer look but I
> thought the pcp list won't get refilled while there is a consumer on
> that cpu. So it should really be just about the number of pages on pcp
> lists. 1GB worth of memory there sounds way too much.
> 
> > On the other hand, allocating 1GB on 10 CPUs simultaneously is not a common test
> > case in the real world.
> > 
> > Not sure if we can do something about it and whether it is worth fixing. At least
> > we can invoke the bulk allocator several times, doing it per specific batch, for
> > example 50 pages.
> > 
> > Any thoughts about it?
> 
> On the other hand, the bulk allocator is meant to be optimized for speed
> and it assumes a certain level of reasonableness from its callers, so it
> makes some sense to do reasonably sized batches at the vmalloc end.
>
OK, I see your point. We can certainly do that at the vmalloc end.

Another option could be:

if a pcp list is fully consumed, so that refilling is required to proceed with
populating the array, leave the atomic section (enable irqs) and take a breath
by invoking cond_resched()?

Thoughts?

--
Vlad Rezki


* Re: [PATCH] mm: vmalloc: add cond_resched() in __vunmap()
  2021-06-28 16:40           ` Uladzislau Rezki
@ 2021-06-29 14:11             ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2021-06-29 14:11 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Mel Gorman, Andrew Morton, Matthew Wilcox, Rafael Aquini,
	linux-mm, linux-kernel

On Mon 28-06-21 18:40:02, Uladzislau Rezki wrote:
> Another option could be:
> 
> if a pcp list is fully consumed, so that refilling is required to proceed with
> populating the array, leave the atomic section (enable irqs) and take a breath
> by invoking cond_resched()?

I do not know. I would rather not meddle with the pcp batches. It is a hot
path and the caller can decide the fallback mechanism.
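
One way to read "the caller can decide the fallback mechanism" is sketched below
in user-space C. Everything in it is hypothetical: struct sketch_pcp models a
per-CPU list that holds a fixed number of pages and is never refilled by the
bulk path, and sketch_fill() is a made-up caller that takes whatever the bulk
call returns and gets the rest one page at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a per-CPU page list with a limited page count. */
struct sketch_pcp {
	unsigned int count;
};

/*
 * Hypothetical bulk allocator: hands out at most pcp->count pages and
 * never refills the list, so it may return fewer pages than requested.
 */
static unsigned int sketch_bulk(struct sketch_pcp *pcp, unsigned int nr,
				void **pages)
{
	unsigned int i;

	for (i = 0; i < nr && pcp->count > 0; i++) {
		pages[i] = malloc(4096);
		if (!pages[i])
			break;
		pcp->count--;
	}
	return i;
}

/* Hypothetical slow path: one page per call. */
static void *sketch_alloc_one(void)
{
	return malloc(4096);
}

/*
 * Caller-side policy: try the bulk path once, then fall back to single
 * allocations for whatever is still missing. *nr_slow counts the pages
 * that came from the fallback.
 */
static unsigned int sketch_fill(struct sketch_pcp *pcp, unsigned int nr,
				void **pages, unsigned int *nr_slow)
{
	unsigned int done = sketch_bulk(pcp, nr, pages);

	*nr_slow = 0;
	while (done < nr) {
		pages[done] = sketch_alloc_one();
		if (!pages[done])
			break;
		done++;
		(*nr_slow)++;
	}
	return done;
}
```

The bulk path stays simple in this arrangement, and the decision about what to
do on a short return lives entirely in the caller.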
-- 
Michal Hocko
SUSE Labs

Thread overview: 14+ messages
2021-06-22 22:50 [PATCH] mm: vmalloc: add cond_resched() in __vunmap() Rafael Aquini
2021-06-23  8:43 ` Nicholas Piggin
2021-06-23 17:30   ` Rafael Aquini
2021-06-23 11:27 ` Uladzislau Rezki
2021-06-23 17:34   ` Rafael Aquini
2021-06-23 20:14     ` Uladzislau Rezki
2021-06-23 12:11 ` Aaron Tomlin
2021-06-24 12:21 ` Michal Hocko
2021-06-24 14:23   ` Uladzislau Rezki
2021-06-25  8:51     ` Michal Hocko
2021-06-25 16:00       ` Uladzislau Rezki
2021-06-28 12:46         ` Michal Hocko
2021-06-28 16:40           ` Uladzislau Rezki
2021-06-29 14:11             ` Michal Hocko
