* [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
@ 2022-03-02  1:38 wangjianxing
  2022-03-02 23:34 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: wangjianxing @ 2022-03-02  1:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, wangjianxing

Freeing a large list of pages may starve rcu_sched on non-preemptible
kernels:

rcu: rcu_sched kthread starved for 5359 jiffies! g454793 f0x0
RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=19
[...]
Call Trace:
  free_unref_page_list+0x19c/0x270
  release_pages+0x3cc/0x498
  tlb_flush_mmu_free+0x44/0x70
  zap_pte_range+0x450/0x738
  unmap_page_range+0x108/0x240
  unmap_vmas+0x74/0xf0
  unmap_region+0xb0/0x120
  do_munmap+0x264/0x438
  vm_munmap+0x58/0xa0
  sys_munmap+0x10/0x20
  syscall_common+0x24/0x38

Signed-off-by: wangjianxing <wangjianxing@loongson.cn>
---
 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6..1b96421c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3479,6 +3479,9 @@ void free_unref_page_list(struct list_head *list)
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
 			local_unlock_irqrestore(&pagesets.lock, flags);
+
+			cond_resched();
+
 			batch_count = 0;
 			local_lock_irqsave(&pagesets.lock, flags);
 		}
-- 
2.27.0
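
For context, the hunk above sits in the lock-batching loop of
free_unref_page_list(). A rough sketch of that loop with the patch
applied, simplified from v5.17-era mm/page_alloc.c (prepare/commit
details elided), looks like this:

void free_unref_page_list(struct list_head *list)
{
	struct page *page, *next;
	unsigned long flags;
	int batch_count = 0;

	local_lock_irqsave(&pagesets.lock, flags);
	list_for_each_entry_safe(page, next, list, lru) {
		/* ... move the page onto the per-cpu free list ... */

		/*
		 * Guard against excessive lock hold / IRQ-off times:
		 * every SWAP_CLUSTER_MAX (32) pages, drop the pcp lock,
		 * restore IRQs and -- with this patch -- offer to
		 * reschedule before continuing.
		 */
		if (++batch_count == SWAP_CLUSTER_MAX) {
			local_unlock_irqrestore(&pagesets.lock, flags);
			cond_resched();	/* the added scheduling point */
			batch_count = 0;
			local_lock_irqsave(&pagesets.lock, flags);
		}
	}
	local_unlock_irqrestore(&pagesets.lock, flags);
}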



* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-02  1:38 [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list wangjianxing
@ 2022-03-02 23:34 ` Andrew Morton
  2022-03-03  2:02   ` wangjianxing
  2022-03-08 16:04   ` Vlastimil Babka
  2022-03-08 16:05 ` Vlastimil Babka
  2022-03-08 16:19 ` Matthew Wilcox
  2 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2022-03-02 23:34 UTC (permalink / raw)
  To: wangjianxing; +Cc: linux-mm, linux-kernel

On Tue,  1 Mar 2022 20:38:25 -0500 wangjianxing <wangjianxing@loongson.cn> wrote:

> Freeing a large list of pages may starve rcu_sched on
> non-preemptible kernels:
> 
> rcu: rcu_sched kthread starved for 5359 jiffies! g454793 f0x0
> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=19
> [...]
> Call Trace:
>   free_unref_page_list+0x19c/0x270
>   release_pages+0x3cc/0x498
>   tlb_flush_mmu_free+0x44/0x70
>   zap_pte_range+0x450/0x738
>   unmap_page_range+0x108/0x240
>   unmap_vmas+0x74/0xf0
>   unmap_region+0xb0/0x120
>   do_munmap+0x264/0x438
>   vm_munmap+0x58/0xa0
>   sys_munmap+0x10/0x20
>   syscall_common+0x24/0x38

Thanks.

How did this large list of pages come about?

Will people be seeing this message in upstream kernels, or is it
specific to some caller code which you have added?

Please always include details such as this so that others can determine
whether the fix should be backported into -stable kernels.



* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-02 23:34 ` Andrew Morton
@ 2022-03-03  2:02   ` wangjianxing
  2022-03-08 16:04   ` Vlastimil Babka
  1 sibling, 0 replies; 11+ messages in thread
From: wangjianxing @ 2022-03-03  2:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On 03/03/2022 07:34 AM, Andrew Morton wrote:
> On Tue,  1 Mar 2022 20:38:25 -0500 wangjianxing <wangjianxing@loongson.cn> wrote:
>
>> Freeing a large list of pages may starve rcu_sched on
>> non-preemptible kernels:
>>
>> rcu: rcu_sched kthread starved for 5359 jiffies! g454793 f0x0
>> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=19
>> [...]
>> Call Trace:
>>    free_unref_page_list+0x19c/0x270
>>    release_pages+0x3cc/0x498
>>    tlb_flush_mmu_free+0x44/0x70
>>    zap_pte_range+0x450/0x738
>>    unmap_page_range+0x108/0x240
>>    unmap_vmas+0x74/0xf0
>>    unmap_region+0xb0/0x120
>>    do_munmap+0x264/0x438
>>    vm_munmap+0x58/0xa0
>>    sys_munmap+0x10/0x20
>>    syscall_common+0x24/0x38
> Thanks.
>
> How did this large list of pages come about?
>
> Will people be seeing this message in upstream kernels, or is it
> specific to some caller code which you have added?
>
> Please always include details such as this so that others can determine
> whether the fix should be backported into -stable kernels.
Thanks.

I increased the CPU overcommit ratio in a KVM hypervisor to about
1:2~1:3: each VM gets the same number of vCPUs as the host has CPUs,
and 2 or 3 such VMs are set up.
Running ltpstress in each VM, with both host and guest on
non-preemptible kernels, the VM dmesg throws some rcu_sched warnings.

The ltp version is 20180926, but so far I have not analyzed the
ltpstress code in depth.



* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-02 23:34 ` Andrew Morton
  2022-03-03  2:02   ` wangjianxing
@ 2022-03-08 16:04   ` Vlastimil Babka
  1 sibling, 0 replies; 11+ messages in thread
From: Vlastimil Babka @ 2022-03-08 16:04 UTC (permalink / raw)
  To: Andrew Morton, wangjianxing
  Cc: linux-mm, linux-kernel, Michal Hocko, Mel Gorman

On 3/3/22 00:34, Andrew Morton wrote:
> On Tue,  1 Mar 2022 20:38:25 -0500 wangjianxing <wangjianxing@loongson.cn> wrote:
> 
>> Freeing a large list of pages may starve rcu_sched on
>> non-preemptible kernels:
>>
>> rcu: rcu_sched kthread starved for 5359 jiffies! g454793 f0x0
>> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=19
>> [...]
>> Call Trace:
>>   free_unref_page_list+0x19c/0x270
>>   release_pages+0x3cc/0x498
>>   tlb_flush_mmu_free+0x44/0x70
>>   zap_pte_range+0x450/0x738
>>   unmap_page_range+0x108/0x240
>>   unmap_vmas+0x74/0xf0
>>   unmap_region+0xb0/0x120
>>   do_munmap+0x264/0x438
>>   vm_munmap+0x58/0xa0
>>   sys_munmap+0x10/0x20
>>   syscall_common+0x24/0x38
> 
> Thanks.
> 
> How did this large list of pages come about?

Looks like it came from TLB batching. But I got lost in the maze of it
trying to figure out how large the batch can grow.

> Will people be seeing this message in upstream kernels, or is it
> specific to some caller code which you have added?
> 
> Please always include details such as this so that others can determine
> whether the fix should be backported into -stable kernels.
> 
> 



* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-02  1:38 [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list wangjianxing
  2022-03-02 23:34 ` Andrew Morton
@ 2022-03-08 16:05 ` Vlastimil Babka
  2022-03-08 16:19 ` Matthew Wilcox
  2 siblings, 0 replies; 11+ messages in thread
From: Vlastimil Babka @ 2022-03-08 16:05 UTC (permalink / raw)
  To: wangjianxing, Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Mel Gorman

On 3/2/22 02:38, wangjianxing wrote:
> Freeing a large list of pages may starve rcu_sched on
> non-preemptible kernels:
> 
> rcu: rcu_sched kthread starved for 5359 jiffies! g454793 f0x0
> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=19
> [...]
> Call Trace:
>   free_unref_page_list+0x19c/0x270
>   release_pages+0x3cc/0x498
>   tlb_flush_mmu_free+0x44/0x70
>   zap_pte_range+0x450/0x738
>   unmap_page_range+0x108/0x240
>   unmap_vmas+0x74/0xf0
>   unmap_region+0xb0/0x120
>   do_munmap+0x264/0x438
>   vm_munmap+0x58/0xa0
>   sys_munmap+0x10/0x20
>   syscall_common+0x24/0x38
> 
> Signed-off-by: wangjianxing <wangjianxing@loongson.cn>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6..1b96421c8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3479,6 +3479,9 @@ void free_unref_page_list(struct list_head *list)
>  		 */
>  		if (++batch_count == SWAP_CLUSTER_MAX) {
>  			local_unlock_irqrestore(&pagesets.lock, flags);
> +
> +			cond_resched();
> +
>  			batch_count = 0;
>  			local_lock_irqsave(&pagesets.lock, flags);
>  		}



* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-02  1:38 [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list wangjianxing
  2022-03-02 23:34 ` Andrew Morton
  2022-03-08 16:05 ` Vlastimil Babka
@ 2022-03-08 16:19 ` Matthew Wilcox
  2022-03-10  1:05   ` Andrew Morton
  2 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2022-03-08 16:19 UTC (permalink / raw)
  To: wangjianxing; +Cc: Andrew Morton, linux-mm, linux-kernel

On Tue, Mar 01, 2022 at 08:38:25PM -0500, wangjianxing wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6..1b96421c8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3479,6 +3479,9 @@ void free_unref_page_list(struct list_head *list)
>  		 */
>  		if (++batch_count == SWAP_CLUSTER_MAX) {
>  			local_unlock_irqrestore(&pagesets.lock, flags);
> +
> +			cond_resched();

This isn't safe.  This path can be called from interrupt context
(otherwise we'd be using local_unlock_irq() instead of irqrestore()).
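
As a rough illustration of the distinction (lock name as in the patch;
a sketch, not the full local_lock API):

	/* _irq variants: only legal if the caller has IRQs enabled */
	local_lock_irq(&pagesets.lock);
	local_unlock_irq(&pagesets.lock);	/* unconditionally re-enables IRQs */

	/* _irqsave variants: make no assumption about the caller */
	local_lock_irqsave(&pagesets.lock, flags);
	local_unlock_irqrestore(&pagesets.lock, flags);	/* restores saved state */

The saving/restoring pair is the hint that some caller may arrive with
IRQs already disabled -- such as an interrupt handler, where
cond_resched() must never be called.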



* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-08 16:19 ` Matthew Wilcox
@ 2022-03-10  1:05   ` Andrew Morton
  2022-03-10  2:48     ` wangjianxing
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2022-03-10  1:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: wangjianxing, linux-mm, linux-kernel

On Tue, 8 Mar 2022 16:19:33 +0000 Matthew Wilcox <willy@infradead.org> wrote:

> On Tue, Mar 01, 2022 at 08:38:25PM -0500, wangjianxing wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3589febc6..1b96421c8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3479,6 +3479,9 @@ void free_unref_page_list(struct list_head *list)
> >  		 */
> >  		if (++batch_count == SWAP_CLUSTER_MAX) {
> >  			local_unlock_irqrestore(&pagesets.lock, flags);
> > +
> > +			cond_resched();
> 
> This isn't safe.  This path can be called from interrupt context
> (otherwise we'd be using local_unlock_irq() instead of irqrestore()).

What a shame it is that we don't document our interfaces :(

I can't immediately find such callers, but I could imagine
put_pages_list() (which didn't document its interface this way either)
being called from IRQ.

And drivers/iommu/dma-iommu.c:fq_ring_free() calls put_pages_list()
from inside spin_lock().  
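
Roughly, the path in question (simplified from v5.17-era
drivers/iommu/dma-iommu.c; surrounding details elided):

	spin_lock_irqsave(&fq->lock, flags);
	fq_ring_free(cookie, fq);	/* -> put_pages_list()
					 * -> free_unref_page_list()
					 * -> cond_resched() with fq->lock held */
	spin_unlock_irqrestore(&fq->lock, flags);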


* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-10  1:05   ` Andrew Morton
@ 2022-03-10  2:48     ` wangjianxing
  2022-03-10  3:29       ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: wangjianxing @ 2022-03-10  2:48 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox; +Cc: linux-mm, linux-kernel, vbabka

spin_lock() calls preempt_disable(), and interrupt context goes through
__irq_enter()/local_bh_disable(), which also raise the preempt count by
an offset.

cond_resched() first checks whether preempt_count == 0, so it won't
schedule in either of those contexts.

Is this right?

Alternatively, could we add a condition to avoid calling cond_resched()
in interrupt context or under spin_lock()?

+ if (preemptible())
+       cond_resched();
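
For reference, the check described above is roughly this (trimmed from
v5.17-era kernel/sched/core.c and asm-generic/preempt.h):

	static __always_inline bool should_resched(int preempt_offset)
	{
		return unlikely(preempt_count() == preempt_offset &&
				tif_need_resched());
	}

	int __sched __cond_resched(void)
	{
		if (should_resched(0)) {	/* i.e. preempt_count() == 0 */
			preempt_schedule_common();
			return 1;
		}
		return 0;
	}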

On 03/10/2022 09:05 AM, Andrew Morton wrote:
> On Tue, 8 Mar 2022 16:19:33 +0000 Matthew Wilcox <willy@infradead.org> wrote:
>
>> On Tue, Mar 01, 2022 at 08:38:25PM -0500, wangjianxing wrote:
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 3589febc6..1b96421c8 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -3479,6 +3479,9 @@ void free_unref_page_list(struct list_head *list)
>>>   		 */
>>>   		if (++batch_count == SWAP_CLUSTER_MAX) {
>>>   			local_unlock_irqrestore(&pagesets.lock, flags);
>>> +
>>> +			cond_resched();
>> This isn't safe.  This path can be called from interrupt context
>> (otherwise we'd be using local_unlock_irq() instead of irqrestore()).
> What a shame it is that we don't document our interfaces :(
>
> I can't immediately find such callers, but I could imagine
> put_pages_list() (which didn't document its interface this way either)
> being called from IRQ.
>
> And drivers/iommu/dma-iommu.c:fq_ring_free() calls put_pages_list()
> from inside spin_lock().



* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-10  2:48     ` wangjianxing
@ 2022-03-10  3:29       ` Andrew Morton
  2022-03-10  9:11         ` Vlastimil Babka
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2022-03-10  3:29 UTC (permalink / raw)
  To: wangjianxing; +Cc: Matthew Wilcox, linux-mm, linux-kernel, vbabka

On Thu, 10 Mar 2022 10:48:41 +0800 wangjianxing <wangjianxing@loongson.cn> wrote:

> spin_lock() calls preempt_disable(), and interrupt context goes through
> __irq_enter()/local_bh_disable(), which also raise the preempt count by
> an offset.
> 
> cond_resched() first checks whether preempt_count == 0, so it won't
> schedule in either of those contexts.
> 
> Is this right?
> 
> Alternatively, could we add a condition to avoid calling cond_resched()
> in interrupt context or under spin_lock()?
> 
> + if (preemptible())
> +       cond_resched();
> 

None of this works with CONFIG_PREEMPTION=n.
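
With CONFIG_PREEMPTION=n (and hence, normally, CONFIG_PREEMPT_COUNT=n),
the relevant primitives compile away; roughly, from the
!CONFIG_PREEMPT_COUNT side of include/linux/preempt.h:

	#define preempt_disable()	barrier()	/* spin_lock() counts nothing */
	#define preempt_enable()	barrier()
	#define preemptible()		0

So "if (preemptible())" is dead code there, and cond_resched() cannot
tell that a spinlock is held: preempt_count() stays zero, and the kernel
could schedule while atomic.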


* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-10  3:29       ` Andrew Morton
@ 2022-03-10  9:11         ` Vlastimil Babka
  2022-03-11  3:22           ` wangjianxing
  0 siblings, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2022-03-10  9:11 UTC (permalink / raw)
  To: Andrew Morton, wangjianxing; +Cc: Matthew Wilcox, linux-mm, linux-kernel

On 3/10/22 04:29, Andrew Morton wrote:
> On Thu, 10 Mar 2022 10:48:41 +0800 wangjianxing <wangjianxing@loongson.cn> wrote:
> 
>> spin_lock() calls preempt_disable(), and interrupt context goes through
>> __irq_enter()/local_bh_disable(), which also raise the preempt count by
>> an offset.
>> 
>> cond_resched() first checks whether preempt_count == 0, so it won't
>> schedule in either of those contexts.
>> 
>> Is this right?
>> 
>> Alternatively, could we add a condition to avoid calling cond_resched()
>> in interrupt context or under spin_lock()?
>> 
>> + if (preemptible())
>> +       cond_resched();
>> 
> 
> None of this works with CONFIG_PREEMPTION=n.

Yeah I think we have at least two options.

1) check all callers, maybe realize all have enabled interrupts anyway,
rewrite the locking to only assume those

2) find out how long the tlb batches actually are and make them smaller


* Re: [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list
  2022-03-10  9:11         ` Vlastimil Babka
@ 2022-03-11  3:22           ` wangjianxing
  0 siblings, 0 replies; 11+ messages in thread
From: wangjianxing @ 2022-03-11  3:22 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton; +Cc: Matthew Wilcox, linux-mm, linux-kernel


> 2) find out how long the tlb batches actually are and make them smaller

PAGE_SIZE is 16k rather than 4k in my kernel configuration, and the
batch count maxes out around 2015.
MAX_GATHER_BATCH currently scales with PAGE_SIZE; I will cap the
batch's maximum size so that it no longer depends on PAGE_SIZE.

#define MAX_GATHER_BATCH	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
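
As a rough cross-check of those numbers (assuming the v5.17-era layout,
where struct mmu_gather_batch is a next pointer plus two unsigned ints,
i.e. 16 bytes on 64-bit):

	/*
	 *  4K pages:  (4096  - 16) / sizeof(void *) =  510 pages per batch
	 * 16K pages:  (16384 - 16) / sizeof(void *) = 2046 pages per batch
	 *
	 * in the same ballpark as the ~2015 observed above.
	 */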

On 03/10/2022 05:11 PM, Vlastimil Babka wrote:
> On 3/10/22 04:29, Andrew Morton wrote:
>> On Thu, 10 Mar 2022 10:48:41 +0800 wangjianxing <wangjianxing@loongson.cn> wrote:
>>
>>> spin_lock() calls preempt_disable(), and interrupt context goes through
>>> __irq_enter()/local_bh_disable(), which also raise the preempt count by
>>> an offset.
>>>
>>> cond_resched() first checks whether preempt_count == 0, so it won't
>>> schedule in either of those contexts.
>>>
>>> Is this right?
>>>
>>> Alternatively, could we add a condition to avoid calling cond_resched()
>>> in interrupt context or under spin_lock()?
>>>
>>> + if (preemptible())
>>> +       cond_resched();
>>>
>> None of this works with CONFIG_PREEMPTION=n.
> Yeah I think we have at least two options.
>
> 1) check all callers, maybe realize all have enabled interrupts anyway,
> rewrite the locking to only assume those
>
> 2) find out how long the tlb batches actually are and make them smaller




end of thread

Thread overview: 11+ messages
2022-03-02  1:38 [PATCH 1/1] mm/page_alloc: add scheduling point to free_unref_page_list wangjianxing
2022-03-02 23:34 ` Andrew Morton
2022-03-03  2:02   ` wangjianxing
2022-03-08 16:04   ` Vlastimil Babka
2022-03-08 16:05 ` Vlastimil Babka
2022-03-08 16:19 ` Matthew Wilcox
2022-03-10  1:05   ` Andrew Morton
2022-03-10  2:48     ` wangjianxing
2022-03-10  3:29       ` Andrew Morton
2022-03-10  9:11         ` Vlastimil Babka
2022-03-11  3:22           ` wangjianxing
