[PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
@ 2017-12-14 19:53 Yang Shi
  2017-12-15  4:34 ` Anshuman Khandual
  2017-12-15 10:27 ` Michal Hocko
  0 siblings, 2 replies; 11+ messages in thread
From: Yang Shi @ 2017-12-14 19:53 UTC (permalink / raw)
  To: kirill.shutemov, mhocko, hughd, aarcange, akpm
  Cc: Yang Shi, linux-mm, linux-kernel

In the current design, khugepaged need acquire mmap_sem before scanning
mm, but in some corner case, khugepaged may scan the current running
process which might be modifying memory mapping, so khugepaged might
block in uninterruptible state. But, the process might hold the mmap_sem
for long time when modifying a huge memory space, then it may trigger
the below khugepaged hung issue:

INFO: task khugepaged:270 blocked for more than 120 seconds. 
Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
khugepaged D 0 270 2 0x00000000 
ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
Call Trace: 
[<ffffffff817079a5>] ? __schedule+0x235/0x6e0 
[<ffffffff81707e86>] schedule+0x36/0x80
[<ffffffff8170a970>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81384998>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff8170a1c0>] down_read+0x20/0x40
[<ffffffff81226836>] khugepaged+0x476/0x11d0
[<ffffffff810c9d0e>] ? idle_balance+0x1ce/0x300
[<ffffffff810d0850>] ? prepare_to_wait_event+0x100/0x100
[<ffffffff812263c0>] ? collapse_shmem+0xbf0/0xbf0
[<ffffffff810a8d46>] kthread+0xe6/0x100
[<ffffffff810a8c60>] ? kthread_park+0x60/0x60
[<ffffffff8170cd15>] ret_from_fork+0x25/0x30

So, it sounds pointless to just block for waiting for the semaphore for
khugepaged, here replace down_read() to down_read_trylock() to move to
scan next mm quickly instead of just blocking on the semaphore so that
other processes can get more chances to install THP.
Then khugepaged can come back to scan the skipped mm when finish the
current round full_scan.

And, it soudns the change can improve khugepaged efficiency a little
bit.

The below is the test result with running LTP on a 24 cores 4GB memory 2
nodes NUMA VM:

				pristine	 w/ trylock
full_scan                         197               187
pages_collapsed                   21                26
thp_fault_alloc                   40818             44466
thp_fault_fallback                18413             16679
thp_collapse_alloc                21                150
thp_collapse_alloc_failed         14                16
thp_file_alloc                    369               369

Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/khugepaged.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ea4ff25..ecc2b68 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1674,7 +1674,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	spin_unlock(&khugepaged_mm_lock);
 
 	mm = mm_slot->mm;
-	down_read(&mm->mmap_sem);
+	/*
+ 	 * Not wait for semaphore to avoid long time waiting, just move
+ 	 * to the next mm on the list.
+ 	 */
+	if (unlikely(!down_read_trylock(&mm->mmap_sem)))
+		goto breakouterloop_mmap_sem;
 	if (unlikely(khugepaged_test_exit(mm)))
 		vma = NULL;
 	else
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-14 19:53 [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block Yang Shi
@ 2017-12-15  4:34 ` Anshuman Khandual
  2017-12-15  9:33   ` Kirill A. Shutemov
  2017-12-15 10:27 ` Michal Hocko
  1 sibling, 1 reply; 11+ messages in thread
From: Anshuman Khandual @ 2017-12-15  4:34 UTC (permalink / raw)
  To: Yang Shi, kirill.shutemov, mhocko, hughd, aarcange, akpm
  Cc: linux-mm, linux-kernel

On 12/15/2017 01:23 AM, Yang Shi wrote:
> In the current design, khugepaged need acquire mmap_sem before scanning
> mm, but in some corner case, khugepaged may scan the current running
> process which might be modifying memory mapping, so khugepaged might
> block in uninterruptible state. But, the process might hold the mmap_sem
> for long time when modifying a huge memory space, then it may trigger
> the below khugepaged hung issue:
> 
> INFO: task khugepaged:270 blocked for more than 120 seconds. 
> Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> khugepaged D 0 270 2 0x00000000 
> ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
> ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
> 0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
> Call Trace: 
> [<ffffffff817079a5>] ? __schedule+0x235/0x6e0 
> [<ffffffff81707e86>] schedule+0x36/0x80
> [<ffffffff8170a970>] rwsem_down_read_failed+0xf0/0x150
> [<ffffffff81384998>] call_rwsem_down_read_failed+0x18/0x30
> [<ffffffff8170a1c0>] down_read+0x20/0x40
> [<ffffffff81226836>] khugepaged+0x476/0x11d0
> [<ffffffff810c9d0e>] ? idle_balance+0x1ce/0x300
> [<ffffffff810d0850>] ? prepare_to_wait_event+0x100/0x100
> [<ffffffff812263c0>] ? collapse_shmem+0xbf0/0xbf0
> [<ffffffff810a8d46>] kthread+0xe6/0x100
> [<ffffffff810a8c60>] ? kthread_park+0x60/0x60
> [<ffffffff8170cd15>] ret_from_fork+0x25/0x30
> 
> So, it sounds pointless to just block for waiting for the semaphore for
> khugepaged, here replace down_read() to down_read_trylock() to move to
> scan next mm quickly instead of just blocking on the semaphore so that
> other processes can get more chances to install THP.
> Then khugepaged can come back to scan the skipped mm when finish the
> current round full_scan.

That may be too harsh on the process which now has to wait for a complete
round of full scan before the khugepaged comes back. What if the mmap_sem
contention because of VMA changes in the process was just temporary ?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-15  4:34 ` Anshuman Khandual
@ 2017-12-15  9:33   ` Kirill A. Shutemov
  0 siblings, 0 replies; 11+ messages in thread
From: Kirill A. Shutemov @ 2017-12-15  9:33 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Yang Shi, kirill.shutemov, mhocko, hughd, aarcange, akpm,
	linux-mm, linux-kernel

On Fri, Dec 15, 2017 at 10:04:27AM +0530, Anshuman Khandual wrote:
> On 12/15/2017 01:23 AM, Yang Shi wrote:
> > In the current design, khugepaged need acquire mmap_sem before scanning
> > mm, but in some corner case, khugepaged may scan the current running
> > process which might be modifying memory mapping, so khugepaged might
> > block in uninterruptible state. But, the process might hold the mmap_sem
> > for long time when modifying a huge memory space, then it may trigger
> > the below khugepaged hung issue:
> > 
> > INFO: task khugepaged:270 blocked for more than 120 seconds. 
> > Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> > khugepaged D 0 270 2 0x00000000 
> > ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
> > ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
> > 0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
> > Call Trace: 
> > [<ffffffff817079a5>] ? __schedule+0x235/0x6e0 
> > [<ffffffff81707e86>] schedule+0x36/0x80
> > [<ffffffff8170a970>] rwsem_down_read_failed+0xf0/0x150
> > [<ffffffff81384998>] call_rwsem_down_read_failed+0x18/0x30
> > [<ffffffff8170a1c0>] down_read+0x20/0x40
> > [<ffffffff81226836>] khugepaged+0x476/0x11d0
> > [<ffffffff810c9d0e>] ? idle_balance+0x1ce/0x300
> > [<ffffffff810d0850>] ? prepare_to_wait_event+0x100/0x100
> > [<ffffffff812263c0>] ? collapse_shmem+0xbf0/0xbf0
> > [<ffffffff810a8d46>] kthread+0xe6/0x100
> > [<ffffffff810a8c60>] ? kthread_park+0x60/0x60
> > [<ffffffff8170cd15>] ret_from_fork+0x25/0x30

What holds the lock for this long? I think the other side also worth fixing.

> > 
> > So, it sounds pointless to just block for waiting for the semaphore for
> > khugepaged, here replace down_read() to down_read_trylock() to move to
> > scan next mm quickly instead of just blocking on the semaphore so that
> > other processes can get more chances to install THP.
> > Then khugepaged can come back to scan the skipped mm when finish the
> > current round full_scan.
> 
> That may be too harsh on the process which now has to wait for a complete
> round of full scan before the khugepaged comes back. What if the mmap_sem
> contention because of VMA changes in the process was just temporary ?

It's always temporary. Unless something is very broken. :)

If the mmap_sem is taken for write, it may also mean that memory layout of the
process is not yet settled and we can just waste the time collapsing the pages
that about to go away.

And it's better for khugepaged to do the job than just waiting for the lock.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-14 19:53 [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block Yang Shi
  2017-12-15  4:34 ` Anshuman Khandual
@ 2017-12-15 10:27 ` Michal Hocko
  2017-12-15 20:04   ` Yang Shi
  1 sibling, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2017-12-15 10:27 UTC (permalink / raw)
  To: Yang Shi; +Cc: kirill.shutemov, hughd, aarcange, akpm, linux-mm, linux-kernel

On Fri 15-12-17 03:53:23, Yang Shi wrote:
> In the current design, khugepaged need acquire mmap_sem before scanning
> mm, but in some corner case, khugepaged may scan the current running
> process which might be modifying memory mapping, so khugepaged might
> block in uninterruptible state. But, the process might hold the mmap_sem
> for long time when modifying a huge memory space, then it may trigger
> the below khugepaged hung issue:
> 
> INFO: task khugepaged:270 blocked for more than 120 seconds. 
> Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> khugepaged D 0 270 2 0x00000000 
> ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
> ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
> 0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
> Call Trace: 
> [<ffffffff817079a5>] ? __schedule+0x235/0x6e0 
> [<ffffffff81707e86>] schedule+0x36/0x80
> [<ffffffff8170a970>] rwsem_down_read_failed+0xf0/0x150
> [<ffffffff81384998>] call_rwsem_down_read_failed+0x18/0x30
> [<ffffffff8170a1c0>] down_read+0x20/0x40
> [<ffffffff81226836>] khugepaged+0x476/0x11d0
> [<ffffffff810c9d0e>] ? idle_balance+0x1ce/0x300
> [<ffffffff810d0850>] ? prepare_to_wait_event+0x100/0x100
> [<ffffffff812263c0>] ? collapse_shmem+0xbf0/0xbf0
> [<ffffffff810a8d46>] kthread+0xe6/0x100
> [<ffffffff810a8c60>] ? kthread_park+0x60/0x60
> [<ffffffff8170cd15>] ret_from_fork+0x25/0x30

I am definitely interested in what the holder of the write lock does
here for such a long time. 
 
> So, it sounds pointless to just block for waiting for the semaphore for
> khugepaged, here replace down_read() to down_read_trylock() to move to
> scan next mm quickly instead of just blocking on the semaphore so that
> other processes can get more chances to install THP.
> Then khugepaged can come back to scan the skipped mm when finish the
> current round full_scan.
> 
> And, it soudns the change can improve khugepaged efficiency a little
> bit.
> 
> The below is the test result with running LTP on a 24 cores 4GB memory 2
> nodes NUMA VM:
> 
> 				pristine	 w/ trylock
> full_scan                         197               187
> pages_collapsed                   21                26
> thp_fault_alloc                   40818             44466
> thp_fault_fallback                18413             16679
> thp_collapse_alloc                21                150
> thp_collapse_alloc_failed         14                16
> thp_file_alloc                    369               369
> 
> Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>

The patch makes sense to me
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/khugepaged.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ea4ff25..ecc2b68 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1674,7 +1674,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>  	spin_unlock(&khugepaged_mm_lock);
>  
>  	mm = mm_slot->mm;
> -	down_read(&mm->mmap_sem);
> +	/*
> + 	 * Not wait for semaphore to avoid long time waiting, just move
> + 	 * to the next mm on the list.
> + 	 */
> +	if (unlikely(!down_read_trylock(&mm->mmap_sem)))
> +		goto breakouterloop_mmap_sem;
>  	if (unlikely(khugepaged_test_exit(mm)))
>  		vma = NULL;
>  	else
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-15 10:27 ` Michal Hocko
@ 2017-12-15 20:04   ` Yang Shi
  2017-12-16 11:45     ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Yang Shi @ 2017-12-15 20:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kirill.shutemov, hughd, aarcange, akpm, linux-mm, linux-kernel

Hi Kirill & Michal,

Since both of you raised the same question about who holds the semaphore 
for that long time, I just reply here to both of you.

The backtrace shows vm-scalability is running with 300G memory and it is 
doing munmap as below:

[188995.241865] CPU: 15 PID: 8063 Comm: usemem Tainted: G            E 
4.9.65-006.ali3000.alios7.x86_64 #1
[188995.242252] Hardware name: Huawei Technologies Co., Ltd. Tecal 
RH2288H V2-12L/BC11SRSG1, BIOS RMIBV368 11/01/2013
[188995.242637] task: ffff883f610a5b00 task.stack: ffffc90037280000
[188995.242838] RIP: 0010:[<ffffffff811e2319>] .c [<ffffffff811e2319>] 
unmap_page_range+0x619/0x940
[188995.243231] RSP: 0018:ffffc90037283c98  EFLAGS: 00000282
[188995.243429] RAX: 00002b760ac57000 RBX: 00002b760ac56000 RCX: 
0000000003eb13ca
[188995.243820] RDX: ffffea003971e420 RSI: 00002b760ac56000 RDI: 
ffff8837cb832e80
[188995.244211] RBP: ffffc90037283d78 R08: ffff883ebf8fc3c0 R09: 
0000000000008000
[188995.244600] R10: 00000000826b7e00 R11: 0000000000000000 R12: 
ffff8821e70f72b0
[188995.244993] R13: ffffea00fac4f280 R14: ffffc90037283e00 R15: 
00002b760ac57000
[188995.245390] FS:  00002b34b4861700(0000) GS:ffff883f7d3c0000(0000) 
knlGS:0000000000000000
[188995.245788] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[188995.245990] CR2: 00002b7092160fed CR3: 0000000977850000 CR4: 
00000000001406e0
[188995.246388] Stack:
[188995.246581]  00002b92f71edfff.c 00002b7fffffffff.c 
00002b92f71ee000.c ffff8809778502b0.c
[188995.246981]  00002b763fffffff.c ffff8802e1895ec0.c 
ffffc90037283d48.c ffff883f610a5b00.c
[188995.247365]  ffffc90037283d70.c 00002b8000000000.c 
ffffc00000000fff.c ffffea00879c3df0.c
[188995.247759] Call Trace:
[188995.247957]  [<ffffffff811e26bd>] unmap_single_vma+0x7d/0xe0
[188995.248161]  [<ffffffff811e2a11>] unmap_vmas+0x51/0xa0
[188995.248367]  [<ffffffff811e98ed>] unmap_region+0xbd/0x130
[188995.248571]  [<ffffffff8170b04c>] ? 
rwsem_down_write_failed_killable+0x31c/0x3f0
[188995.248961]  [<ffffffff811eb94c>] do_munmap+0x26c/0x420
[188995.249162]  [<ffffffff811ebbc0>] SyS_munmap+0x50/0x70
[188995.249361]  [<ffffffff8170cab7>] entry_SYSCALL_64_fastpath+0x1a/0xa9

By analyzing vmcore, khugepaged is waiting for vm-scalability process's 
mmap_sem.

unmap_vmas will unmap every vma in the memory space, it sounds the test 
generated huge amount of vmas.

Shall we add "cond_resched()" in unmap_vmas(), i.e for every 100 vmas? 
It may improve the responsiveness a little bit for non-preempt kernel, 
although it still can't release the semaphore.

Thanks,
Yang

On 12/15/17 2:27 AM, Michal Hocko wrote:
> On Fri 15-12-17 03:53:23, Yang Shi wrote:
>> In the current design, khugepaged need acquire mmap_sem before scanning
>> mm, but in some corner case, khugepaged may scan the current running
>> process which might be modifying memory mapping, so khugepaged might
>> block in uninterruptible state. But, the process might hold the mmap_sem
>> for long time when modifying a huge memory space, then it may trigger
>> the below khugepaged hung issue:
>>
>> INFO: task khugepaged:270 blocked for more than 120 seconds.
>> Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> khugepaged D 0 270 2 0x00000000
>> ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
>> ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
>> 0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
>> Call Trace:
>> [<ffffffff817079a5>] ? __schedule+0x235/0x6e0
>> [<ffffffff81707e86>] schedule+0x36/0x80
>> [<ffffffff8170a970>] rwsem_down_read_failed+0xf0/0x150
>> [<ffffffff81384998>] call_rwsem_down_read_failed+0x18/0x30
>> [<ffffffff8170a1c0>] down_read+0x20/0x40
>> [<ffffffff81226836>] khugepaged+0x476/0x11d0
>> [<ffffffff810c9d0e>] ? idle_balance+0x1ce/0x300
>> [<ffffffff810d0850>] ? prepare_to_wait_event+0x100/0x100
>> [<ffffffff812263c0>] ? collapse_shmem+0xbf0/0xbf0
>> [<ffffffff810a8d46>] kthread+0xe6/0x100
>> [<ffffffff810a8c60>] ? kthread_park+0x60/0x60
>> [<ffffffff8170cd15>] ret_from_fork+0x25/0x30
> 
> I am definitely interested in what the holder of the write lock does
> here for such a long time.
>   
>> So, it sounds pointless to just block for waiting for the semaphore for
>> khugepaged, here replace down_read() to down_read_trylock() to move to
>> scan next mm quickly instead of just blocking on the semaphore so that
>> other processes can get more chances to install THP.
>> Then khugepaged can come back to scan the skipped mm when finish the
>> current round full_scan.
>>
>> And, it soudns the change can improve khugepaged efficiency a little
>> bit.
>>
>> The below is the test result with running LTP on a 24 cores 4GB memory 2
>> nodes NUMA VM:
>>
>> 				pristine	 w/ trylock
>> full_scan                         197               187
>> pages_collapsed                   21                26
>> thp_fault_alloc                   40818             44466
>> thp_fault_fallback                18413             16679
>> thp_collapse_alloc                21                150
>> thp_collapse_alloc_failed         14                16
>> thp_file_alloc                    369               369
>>
>> Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
> 
> The patch makes sense to me
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
>> ---
>>   mm/khugepaged.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index ea4ff25..ecc2b68 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1674,7 +1674,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>>   	spin_unlock(&khugepaged_mm_lock);
>>   
>>   	mm = mm_slot->mm;
>> -	down_read(&mm->mmap_sem);
>> +	/*
>> + 	 * Not wait for semaphore to avoid long time waiting, just move
>> + 	 * to the next mm on the list.
>> + 	 */
>> +	if (unlikely(!down_read_trylock(&mm->mmap_sem)))
>> +		goto breakouterloop_mmap_sem;
>>   	if (unlikely(khugepaged_test_exit(mm)))
>>   		vma = NULL;
>>   	else
>> -- 
>> 1.8.3.1
>>
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-15 20:04   ` Yang Shi
@ 2017-12-16 11:45     ` Michal Hocko
  2017-12-16 20:09       ` Kirill A. Shutemov
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2017-12-16 11:45 UTC (permalink / raw)
  To: Yang Shi; +Cc: kirill.shutemov, hughd, aarcange, akpm, linux-mm, linux-kernel

On Sat 16-12-17 04:04:10, Yang Shi wrote:
> Hi Kirill & Michal,
> 
> Since both of you raised the same question about who holds the semaphore for
> that long time, I just reply here to both of you.
> 
> The backtrace shows vm-scalability is running with 300G memory and it is
> doing munmap as below:
> 
> [188995.241865] CPU: 15 PID: 8063 Comm: usemem Tainted: G            E
> 4.9.65-006.ali3000.alios7.x86_64 #1
> [188995.242252] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2288H
> V2-12L/BC11SRSG1, BIOS RMIBV368 11/01/2013
> [188995.242637] task: ffff883f610a5b00 task.stack: ffffc90037280000
> [188995.242838] RIP: 0010:[<ffffffff811e2319>] .c [<ffffffff811e2319>]
> unmap_page_range+0x619/0x940
> [188995.243231] RSP: 0018:ffffc90037283c98  EFLAGS: 00000282
> [188995.243429] RAX: 00002b760ac57000 RBX: 00002b760ac56000 RCX:
> 0000000003eb13ca
> [188995.243820] RDX: ffffea003971e420 RSI: 00002b760ac56000 RDI:
> ffff8837cb832e80
> [188995.244211] RBP: ffffc90037283d78 R08: ffff883ebf8fc3c0 R09:
> 0000000000008000
> [188995.244600] R10: 00000000826b7e00 R11: 0000000000000000 R12:
> ffff8821e70f72b0
> [188995.244993] R13: ffffea00fac4f280 R14: ffffc90037283e00 R15:
> 00002b760ac57000
> [188995.245390] FS:  00002b34b4861700(0000) GS:ffff883f7d3c0000(0000)
> knlGS:0000000000000000
> [188995.245788] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [188995.245990] CR2: 00002b7092160fed CR3: 0000000977850000 CR4:
> 00000000001406e0
> [188995.246388] Stack:
> [188995.246581]  00002b92f71edfff.c 00002b7fffffffff.c 00002b92f71ee000.c
> ffff8809778502b0.c
> [188995.246981]  00002b763fffffff.c ffff8802e1895ec0.c ffffc90037283d48.c
> ffff883f610a5b00.c
> [188995.247365]  ffffc90037283d70.c 00002b8000000000.c ffffc00000000fff.c
> ffffea00879c3df0.c
> [188995.247759] Call Trace:
> [188995.247957]  [<ffffffff811e26bd>] unmap_single_vma+0x7d/0xe0
> [188995.248161]  [<ffffffff811e2a11>] unmap_vmas+0x51/0xa0
> [188995.248367]  [<ffffffff811e98ed>] unmap_region+0xbd/0x130
> [188995.248571]  [<ffffffff8170b04c>] ?
> rwsem_down_write_failed_killable+0x31c/0x3f0
> [188995.248961]  [<ffffffff811eb94c>] do_munmap+0x26c/0x420
> [188995.249162]  [<ffffffff811ebbc0>] SyS_munmap+0x50/0x70
> [188995.249361]  [<ffffffff8170cab7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
> 
> By analyzing vmcore, khugepaged is waiting for vm-scalability process's
> mmap_sem.

OK, I see.
 
> unmap_vmas will unmap every vma in the memory space, it sounds the test
> generated huge amount of vmas.

I would expect that it just takes some time to munmap 300G address
range.

> Shall we add "cond_resched()" in unmap_vmas(), i.e for every 100 vmas? It
> may improve the responsiveness a little bit for non-preempt kernel, although
> it still can't release the semaphore.

We already do, once per pmd (see zap_pmd_range).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-16 11:45     ` Michal Hocko
@ 2017-12-16 20:09       ` Kirill A. Shutemov
  2017-12-16 22:22         ` Yang Shi
  2017-12-18  8:41         ` Michal Hocko
  0 siblings, 2 replies; 11+ messages in thread
From: Kirill A. Shutemov @ 2017-12-16 20:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, kirill.shutemov, hughd, aarcange, akpm, linux-mm, linux-kernel

On Sat, Dec 16, 2017 at 12:45:25PM +0100, Michal Hocko wrote:
> On Sat 16-12-17 04:04:10, Yang Shi wrote:
> > Hi Kirill & Michal,
> > 
> > Since both of you raised the same question about who holds the semaphore for
> > that long time, I just reply here to both of you.
> > 
> > The backtrace shows vm-scalability is running with 300G memory and it is
> > doing munmap as below:
> > 
> > [188995.241865] CPU: 15 PID: 8063 Comm: usemem Tainted: G            E
> > 4.9.65-006.ali3000.alios7.x86_64 #1
> > [188995.242252] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2288H
> > V2-12L/BC11SRSG1, BIOS RMIBV368 11/01/2013
> > [188995.242637] task: ffff883f610a5b00 task.stack: ffffc90037280000
> > [188995.242838] RIP: 0010:[<ffffffff811e2319>] .c [<ffffffff811e2319>]
> > unmap_page_range+0x619/0x940
> > [188995.243231] RSP: 0018:ffffc90037283c98  EFLAGS: 00000282
> > [188995.243429] RAX: 00002b760ac57000 RBX: 00002b760ac56000 RCX:
> > 0000000003eb13ca
> > [188995.243820] RDX: ffffea003971e420 RSI: 00002b760ac56000 RDI:
> > ffff8837cb832e80
> > [188995.244211] RBP: ffffc90037283d78 R08: ffff883ebf8fc3c0 R09:
> > 0000000000008000
> > [188995.244600] R10: 00000000826b7e00 R11: 0000000000000000 R12:
> > ffff8821e70f72b0
> > [188995.244993] R13: ffffea00fac4f280 R14: ffffc90037283e00 R15:
> > 00002b760ac57000
> > [188995.245390] FS:  00002b34b4861700(0000) GS:ffff883f7d3c0000(0000)
> > knlGS:0000000000000000
> > [188995.245788] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [188995.245990] CR2: 00002b7092160fed CR3: 0000000977850000 CR4:
> > 00000000001406e0
> > [188995.246388] Stack:
> > [188995.246581]  00002b92f71edfff.c 00002b7fffffffff.c 00002b92f71ee000.c
> > ffff8809778502b0.c
> > [188995.246981]  00002b763fffffff.c ffff8802e1895ec0.c ffffc90037283d48.c
> > ffff883f610a5b00.c
> > [188995.247365]  ffffc90037283d70.c 00002b8000000000.c ffffc00000000fff.c
> > ffffea00879c3df0.c
> > [188995.247759] Call Trace:
> > [188995.247957]  [<ffffffff811e26bd>] unmap_single_vma+0x7d/0xe0
> > [188995.248161]  [<ffffffff811e2a11>] unmap_vmas+0x51/0xa0
> > [188995.248367]  [<ffffffff811e98ed>] unmap_region+0xbd/0x130
> > [188995.248571]  [<ffffffff8170b04c>] ?
> > rwsem_down_write_failed_killable+0x31c/0x3f0
> > [188995.248961]  [<ffffffff811eb94c>] do_munmap+0x26c/0x420
> > [188995.249162]  [<ffffffff811ebbc0>] SyS_munmap+0x50/0x70
> > [188995.249361]  [<ffffffff8170cab7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
> > 
> > By analyzing vmcore, khugepaged is waiting for vm-scalability process's
> > mmap_sem.
> 
> OK, I see.
>  
> > unmap_vmas will unmap every vma in the memory space, it sounds the test
> > generated huge amount of vmas.
> 
> I would expect that it just takes some time to munmap 300G address
> range.
> 
> > Shall we add "cond_resched()" in unmap_vmas(), i.e for every 100 vmas? It
> > may improve the responsiveness a little bit for non-preempt kernel, although
> > it still can't release the semaphore.
> 
> We already do, once per pmd (see zap_pmd_range).

It doesn't help. We would need to find a way to drop mmap_sem, if we're
holding it way too long. And doing it on per-vma count basis is not right
call. It won't address issue with single huge vma.

Do we have any instrumentation that would help detect starvation on a
rw_semaphore?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-16 20:09       ` Kirill A. Shutemov
@ 2017-12-16 22:22         ` Yang Shi
  2017-12-18  8:41         ` Michal Hocko
  1 sibling, 0 replies; 11+ messages in thread
From: Yang Shi @ 2017-12-16 22:22 UTC (permalink / raw)
  To: Kirill A. Shutemov, Michal Hocko
  Cc: kirill.shutemov, hughd, aarcange, akpm, linux-mm, linux-kernel



On 12/16/17 12:09 PM, Kirill A. Shutemov wrote:
> On Sat, Dec 16, 2017 at 12:45:25PM +0100, Michal Hocko wrote:
>> On Sat 16-12-17 04:04:10, Yang Shi wrote:
>>> Hi Kirill & Michal,
>>>
>>> Since both of you raised the same question about who holds the semaphore for
>>> that long time, I just reply here to both of you.
>>>
>>> The backtrace shows vm-scalability is running with 300G memory and it is
>>> doing munmap as below:
>>>
>>> [188995.241865] CPU: 15 PID: 8063 Comm: usemem Tainted: G            E
>>> 4.9.65-006.ali3000.alios7.x86_64 #1
>>> [188995.242252] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2288H
>>> V2-12L/BC11SRSG1, BIOS RMIBV368 11/01/2013
>>> [188995.242637] task: ffff883f610a5b00 task.stack: ffffc90037280000
>>> [188995.242838] RIP: 0010:[<ffffffff811e2319>] .c [<ffffffff811e2319>]
>>> unmap_page_range+0x619/0x940
>>> [188995.243231] RSP: 0018:ffffc90037283c98  EFLAGS: 00000282
>>> [188995.243429] RAX: 00002b760ac57000 RBX: 00002b760ac56000 RCX:
>>> 0000000003eb13ca
>>> [188995.243820] RDX: ffffea003971e420 RSI: 00002b760ac56000 RDI:
>>> ffff8837cb832e80
>>> [188995.244211] RBP: ffffc90037283d78 R08: ffff883ebf8fc3c0 R09:
>>> 0000000000008000
>>> [188995.244600] R10: 00000000826b7e00 R11: 0000000000000000 R12:
>>> ffff8821e70f72b0
>>> [188995.244993] R13: ffffea00fac4f280 R14: ffffc90037283e00 R15:
>>> 00002b760ac57000
>>> [188995.245390] FS:  00002b34b4861700(0000) GS:ffff883f7d3c0000(0000)
>>> knlGS:0000000000000000
>>> [188995.245788] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [188995.245990] CR2: 00002b7092160fed CR3: 0000000977850000 CR4:
>>> 00000000001406e0
>>> [188995.246388] Stack:
>>> [188995.246581]  00002b92f71edfff.c 00002b7fffffffff.c 00002b92f71ee000.c
>>> ffff8809778502b0.c
>>> [188995.246981]  00002b763fffffff.c ffff8802e1895ec0.c ffffc90037283d48.c
>>> ffff883f610a5b00.c
>>> [188995.247365]  ffffc90037283d70.c 00002b8000000000.c ffffc00000000fff.c
>>> ffffea00879c3df0.c
>>> [188995.247759] Call Trace:
>>> [188995.247957]  [<ffffffff811e26bd>] unmap_single_vma+0x7d/0xe0
>>> [188995.248161]  [<ffffffff811e2a11>] unmap_vmas+0x51/0xa0
>>> [188995.248367]  [<ffffffff811e98ed>] unmap_region+0xbd/0x130
>>> [188995.248571]  [<ffffffff8170b04c>] ?
>>> rwsem_down_write_failed_killable+0x31c/0x3f0
>>> [188995.248961]  [<ffffffff811eb94c>] do_munmap+0x26c/0x420
>>> [188995.249162]  [<ffffffff811ebbc0>] SyS_munmap+0x50/0x70
>>> [188995.249361]  [<ffffffff8170cab7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
>>>
>>> By analyzing vmcore, khugepaged is waiting for vm-scalability process's
>>> mmap_sem.
>>
>> OK, I see.
>>   
>>> unmap_vmas will unmap every vma in the memory space, it sounds the test
>>> generated huge amount of vmas.
>>
>> I would expect that it just takes some time to munmap 300G address
>> range.
>>
>>> Shall we add "cond_resched()" in unmap_vmas(), i.e for every 100 vmas? It
>>> may improve the responsiveness a little bit for non-preempt kernel, although
>>> it still can't release the semaphore.
>>
>> We already do, once per pmd (see zap_pmd_range).
> 
> It doesn't help. We would need to find a way to drop mmap_sem, if we're
> holding it way too long. And doing it on per-vma count basis is not right
> call. It won't address issue with single huge vma.

Yes, I agree. We have to find a way to release the semaphore at some 
point instead of holding it for the whole unmap.

> 
> Do we have any instrumentation that would help detect starvation on a
> rw_semaphore?

ebpf might help?

Thanks,
Yang

> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-16 20:09       ` Kirill A. Shutemov
  2017-12-16 22:22         ` Yang Shi
@ 2017-12-18  8:41         ` Michal Hocko
  2017-12-18 10:14           ` Kirill A. Shutemov
  2017-12-20  1:35           ` Yang Shi
  1 sibling, 2 replies; 11+ messages in thread
From: Michal Hocko @ 2017-12-18  8:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Yang Shi, kirill.shutemov, hughd, aarcange, akpm, linux-mm, linux-kernel

On Sat 16-12-17 23:09:25, Kirill A. Shutemov wrote:
> On Sat, Dec 16, 2017 at 12:45:25PM +0100, Michal Hocko wrote:
> > On Sat 16-12-17 04:04:10, Yang Shi wrote:
[...]
> > > Shall we add "cond_resched()" in unmap_vmas(), i.e for every 100 vmas? It
> > > may improve the responsiveness a little bit for non-preempt kernel, although
> > > it still can't release the semaphore.
> > 
> > We already do, once per pmd (see zap_pmd_range).
> 
> It doesn't help. We would need to find a way to drop mmap_sem, if we're
> holding it way too long. And doing it on per-vma count basis is not right
> call. It won't address issue with single huge vma.

Absolutely agreed. I just wanted to point out that a new cond_resched is
not really needed. One way to reduce the lock starvation is to use range
locking.

> Do we have any instrumentation that would help detect starvation on a
> rw_semaphore?

I am afraid we don't.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-18  8:41         ` Michal Hocko
@ 2017-12-18 10:14           ` Kirill A. Shutemov
  2017-12-20  1:35           ` Yang Shi
  1 sibling, 0 replies; 11+ messages in thread
From: Kirill A. Shutemov @ 2017-12-18 10:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, kirill.shutemov, hughd, aarcange, akpm, linux-mm, linux-kernel

On Mon, Dec 18, 2017 at 09:41:19AM +0100, Michal Hocko wrote:
> On Sat 16-12-17 23:09:25, Kirill A. Shutemov wrote:
> > On Sat, Dec 16, 2017 at 12:45:25PM +0100, Michal Hocko wrote:
> > > On Sat 16-12-17 04:04:10, Yang Shi wrote:
> [...]
> > > > Shall we add "cond_resched()" in unmap_vmas(), i.e for every 100 vmas? It
> > > > may improve the responsiveness a little bit for non-preempt kernel, although
> > > > it still can't release the semaphore.
> > > 
> > > We already do, once per pmd (see zap_pmd_range).
> > 
> > It doesn't help. We would need to find a way to drop mmap_sem, if we're
> > holding it way too long. And doing it on per-vma count basis is not right
> > call. It won't address issue with single huge vma.
> 
> Absolutely agreed. I just wanted to point out that a new cond_resched is
> not really needed. One way to reduce the lock starvation is to use range
> locking.
> 
> > Do we have any instrumentation that would help detect starvation on a
> > rw_semaphore?
> 
> I am afraid we don't.

I guess we have enough info in mmu_gather to decide if we are doing munmap way
too long. Although, getting everything right would be tricky...

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
  2017-12-18  8:41         ` Michal Hocko
  2017-12-18 10:14           ` Kirill A. Shutemov
@ 2017-12-20  1:35           ` Yang Shi
  1 sibling, 0 replies; 11+ messages in thread
From: Yang Shi @ 2017-12-20  1:35 UTC (permalink / raw)
  To: Michal Hocko, Kirill A. Shutemov
  Cc: kirill.shutemov, hughd, aarcange, akpm, linux-mm, linux-kernel



On 12/18/17 12:41 AM, Michal Hocko wrote:
> On Sat 16-12-17 23:09:25, Kirill A. Shutemov wrote:
>> On Sat, Dec 16, 2017 at 12:45:25PM +0100, Michal Hocko wrote:
>>> On Sat 16-12-17 04:04:10, Yang Shi wrote:
> [...]
>>>> Shall we add "cond_resched()" in unmap_vmas(), i.e for every 100 vmas? It
>>>> may improve the responsiveness a little bit for non-preempt kernel, although
>>>> it still can't release the semaphore.
>>>
>>> We already do, once per pmd (see zap_pmd_range).
>>
>> It doesn't help. We would need to find a way to drop mmap_sem, if we're
>> holding it way too long. And doing it on per-vma count basis is not right
>> call. It won't address issue with single huge vma.
> 
> Absolutely agreed. I just wanted to point out that a new cond_resched is
> not really needed. One way to reduce the lock starvation is to use range
> locking.

It looks the range locking development is stalled?

Yang

> 
>> Do we have any instrumentation that would help detect starvation on a
>> rw_semaphore?
> 
> I am afraid we don't.
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-12-20  1:35 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-14 19:53 [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block Yang Shi
2017-12-15  4:34 ` Anshuman Khandual
2017-12-15  9:33   ` Kirill A. Shutemov
2017-12-15 10:27 ` Michal Hocko
2017-12-15 20:04   ` Yang Shi
2017-12-16 11:45     ` Michal Hocko
2017-12-16 20:09       ` Kirill A. Shutemov
2017-12-16 22:22         ` Yang Shi
2017-12-18  8:41         ` Michal Hocko
2017-12-18 10:14           ` Kirill A. Shutemov
2017-12-20  1:35           ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).