* swap_cluster_info lockdep splat
From: Minchan Kim @ 2017-02-16  5:22 UTC
  To: Huang, Ying; +Cc: Andrew Morton, Tim Chen, Hugh Dickins, linux-kernel, linux-mm

Hi Huang,

After the swap_cluster_info lock was changed from a bit lock to a spinlock,
my zram test failed with the message below. It looks like a nested-lock
problem, so it will need some lockdep work.

Thanks.

=============================================
[ INFO: possible recursive locking detected ]
4.10.0-rc8-next-20170214-zram #24 Not tainted
---------------------------------------------
as/6557 is trying to acquire lock:
 (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811ddd03>] cluster_list_add_tail.part.31+0x33/0x70

but task is already holding lock:
 (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&((cluster_info + ci)->lock))->rlock);
  lock(&(&((cluster_info + ci)->lock))->rlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by as/6557:
 #0:  (&(&cache->free_lock)->rlock){......}, at: [<ffffffff811c206b>] free_swap_slot+0x8b/0x110
 #1:  (&(&p->lock)->rlock){+.+.-.}, at: [<ffffffff811df295>] swapcache_free_entries+0x75/0x330
 #2:  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330

stack backtrace:
CPU: 3 PID: 6557 Comm: as Not tainted 4.10.0-rc8-next-20170214-zram #24
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x85/0xc2
 __lock_acquire+0x15ea/0x1640
 lock_acquire+0x100/0x1f0
 ? cluster_list_add_tail.part.31+0x33/0x70
 _raw_spin_lock+0x38/0x50
 ? cluster_list_add_tail.part.31+0x33/0x70
 cluster_list_add_tail.part.31+0x33/0x70
 swapcache_free_entries+0x2f9/0x330
 free_swap_slot+0xf8/0x110
 swapcache_free+0x36/0x40
 delete_from_swap_cache+0x5f/0xa0
 try_to_free_swap+0x6e/0xa0
 free_pages_and_swap_cache+0x7d/0xb0
 tlb_flush_mmu_free+0x36/0x60
 tlb_finish_mmu+0x1c/0x50
 exit_mmap+0xc7/0x150
 mmput+0x51/0x110
 do_exit+0x2b2/0xc30
 ? trace_hardirqs_on_caller+0x129/0x1b0
 do_group_exit+0x50/0xd0
 SyS_exit_group+0x14/0x20
 entry_SYSCALL_64_fastpath+0x23/0xc6
RIP: 0033:0x2b9a2dbdf309
RSP: 002b:00007ffe71887528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00002b9a2dbdf309
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00002b9a2ded8858 R08: 000000000000003c R09: 00000000000000e7
R10: ffffffffffffff60 R11: 0000000000000246 R12: 00002b9a2ded8858
R13: 00002b9a2dedde80 R14: 000000000255f770 R15: 0000000000000001

* Re: swap_cluster_info lockdep splat
From: Huang, Ying @ 2017-02-16  7:13 UTC
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, Tim Chen, Hugh Dickins, linux-kernel,
	linux-mm

Hi, Minchan,

Minchan Kim <minchan@kernel.org> writes:

> Hi Huang,
>
> With changing from bit lock to spinlock of swap_cluster_info, my zram
> test failed with below message. It seems nested lock problem so need to
> play with lockdep.

Thanks a lot for testing and for the report.  There is at least one nested
lock acquisition in cluster_list_add_tail(), and a comment there describes
why it is safe.  I will try to reproduce this and fix it.

Best Regards,
Huang, Ying

> Thanks.
>
> =============================================
> [ INFO: possible recursive locking detected ]
> 4.10.0-rc8-next-20170214-zram #24 Not tainted
> ---------------------------------------------
> as/6557 is trying to acquire lock:
>  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811ddd03>] cluster_list_add_tail.part.31+0x33/0x70
>
> but task is already holding lock:
>  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(&(&((cluster_info + ci)->lock))->rlock);
>   lock(&(&((cluster_info + ci)->lock))->rlock);
>
>  *** DEADLOCK ***
>
>  May be due to missing lock nesting notation
>
> 3 locks held by as/6557:
>  #0:  (&(&cache->free_lock)->rlock){......}, at: [<ffffffff811c206b>] free_swap_slot+0x8b/0x110
>  #1:  (&(&p->lock)->rlock){+.+.-.}, at: [<ffffffff811df295>] swapcache_free_entries+0x75/0x330
>  #2:  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
>
> stack backtrace:
> CPU: 3 PID: 6557 Comm: as Not tainted 4.10.0-rc8-next-20170214-zram #24
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> Call Trace:
>  dump_stack+0x85/0xc2
>  __lock_acquire+0x15ea/0x1640
>  lock_acquire+0x100/0x1f0
>  ? cluster_list_add_tail.part.31+0x33/0x70
>  _raw_spin_lock+0x38/0x50
>  ? cluster_list_add_tail.part.31+0x33/0x70
>  cluster_list_add_tail.part.31+0x33/0x70
>  swapcache_free_entries+0x2f9/0x330
>  free_swap_slot+0xf8/0x110
>  swapcache_free+0x36/0x40
>  delete_from_swap_cache+0x5f/0xa0
>  try_to_free_swap+0x6e/0xa0
>  free_pages_and_swap_cache+0x7d/0xb0
>  tlb_flush_mmu_free+0x36/0x60
>  tlb_finish_mmu+0x1c/0x50
>  exit_mmap+0xc7/0x150
>  mmput+0x51/0x110
>  do_exit+0x2b2/0xc30
>  ? trace_hardirqs_on_caller+0x129/0x1b0
>  do_group_exit+0x50/0xd0
>  SyS_exit_group+0x14/0x20
>  entry_SYSCALL_64_fastpath+0x23/0xc6
> RIP: 0033:0x2b9a2dbdf309
> RSP: 002b:00007ffe71887528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00002b9a2dbdf309
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 00002b9a2ded8858 R08: 000000000000003c R09: 00000000000000e7
> R10: ffffffffffffff60 R11: 0000000000000246 R12: 00002b9a2ded8858
> R13: 00002b9a2dedde80 R14: 000000000255f770 R15: 0000000000000001
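
The splat names the same lock class twice, although cluster_list_add_tail()
presumably locks a different swap_cluster_info than the one that
swapcache_free_entries() already holds. Below is a minimal sketch of that kind
of check, written as a toy userspace model and not taken from lockdep itself.
Every name in it (toy_lock, toy_task, toy_acquire, toy_release) is
hypothetical; the only assumption is that validation is keyed on lock class
plus subclass rather than on the lock instance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; all names are hypothetical and not part of the kernel. */
#define MAX_HELD 8

struct toy_lock {
	int class_id;		/* shared by all locks initialised at one site */
};

struct held_entry {
	int class_id;
	unsigned int subclass;
};

struct toy_task {
	struct held_entry held[MAX_HELD];
	size_t depth;
};

/*
 * Record an acquisition. Returns false in the situation where lockdep would
 * report "possible recursive locking": a lock with the same class and the
 * same subclass is already held by this task.
 */
static bool toy_acquire(struct toy_task *t, const struct toy_lock *l,
			unsigned int subclass)
{
	assert(t->depth < MAX_HELD);
	for (size_t i = 0; i < t->depth; i++) {
		if (t->held[i].class_id == l->class_id &&
		    t->held[i].subclass == subclass)
			return false;
	}
	t->held[t->depth].class_id = l->class_id;
	t->held[t->depth].subclass = subclass;
	t->depth++;
	return true;
}

/* Drop the most recently recorded acquisition. */
static void toy_release(struct toy_task *t)
{
	assert(t->depth > 0);
	t->depth--;
}
```

In this model, two distinct locks that share a class cannot both be taken with
subclass 0, while taking the second one with subclass 1 is accepted. That is
the distinction a nesting annotation is meant to express.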

* Re: swap_cluster_info lockdep splat
From: Huang, Ying @ 2017-02-16  8:44 UTC
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, Tim Chen, Hugh Dickins, linux-kernel,
	linux-mm

Hi, Minchan,

Minchan Kim <minchan@kernel.org> writes:

> Hi Huang,
>
> With changing from bit lock to spinlock of swap_cluster_info, my zram
> test failed with below message. It seems nested lock problem so need to
> play with lockdep.

Sorry, I could not reproduce the warning in my tests.  Could you try the
patch below?  And could you share your test case?

Best Regards,
Huang, Ying

------------------------------------------------------------->
From 2b9e2f78a6e389442f308c4f9e8d5ac40fe6aa2f Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Thu, 16 Feb 2017 16:38:17 +0800
Subject: [PATCH] mm, swap: Annotate nested locking for cluster lock

cluster_list_add_tail() takes a second cluster lock while one is already
held, which makes lockdep complain as shown below.  The nested locking is
safe because both cluster locks are only acquired while
swap_info_struct->lock is held.  Annotate the nested acquisition with
spin_lock_nested() to silence the lockdep warning.

=============================================
[ INFO: possible recursive locking detected ]
4.10.0-rc8-next-20170214-zram #24 Not tainted
---------------------------------------------
as/6557 is trying to acquire lock:
 (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811ddd03>] cluster_list_add_tail.part.31+0x33/0x70

but task is already holding lock:
 (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&((cluster_info + ci)->lock))->rlock);
  lock(&(&((cluster_info + ci)->lock))->rlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by as/6557:
 #0:  (&(&cache->free_lock)->rlock){......}, at: [<ffffffff811c206b>] free_swap_slot+0x8b/0x110
 #1:  (&(&p->lock)->rlock){+.+.-.}, at: [<ffffffff811df295>] swapcache_free_entries+0x75/0x330
 #2:  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330

stack backtrace:
CPU: 3 PID: 6557 Comm: as Not tainted 4.10.0-rc8-next-20170214-zram #24
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x85/0xc2
 __lock_acquire+0x15ea/0x1640
 lock_acquire+0x100/0x1f0
 ? cluster_list_add_tail.part.31+0x33/0x70
 _raw_spin_lock+0x38/0x50
 ? cluster_list_add_tail.part.31+0x33/0x70
 cluster_list_add_tail.part.31+0x33/0x70
 swapcache_free_entries+0x2f9/0x330
 free_swap_slot+0xf8/0x110
 swapcache_free+0x36/0x40
 delete_from_swap_cache+0x5f/0xa0
 try_to_free_swap+0x6e/0xa0
 free_pages_and_swap_cache+0x7d/0xb0
 tlb_flush_mmu_free+0x36/0x60
 tlb_finish_mmu+0x1c/0x50
 exit_mmap+0xc7/0x150
 mmput+0x51/0x110
 do_exit+0x2b2/0xc30
 ? trace_hardirqs_on_caller+0x129/0x1b0
 do_group_exit+0x50/0xd0
 SyS_exit_group+0x14/0x20
 entry_SYSCALL_64_fastpath+0x23/0xc6
RIP: 0033:0x2b9a2dbdf309
RSP: 002b:00007ffe71887528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00002b9a2dbdf309
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00002b9a2ded8858 R08: 000000000000003c R09: 00000000000000e7
R10: ffffffffffffff60 R11: 0000000000000246 R12: 00002b9a2ded8858
R13: 00002b9a2dedde80 R14: 000000000255f770 R15: 0000000000000001

Reported-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h | 6 ++++++
 mm/swapfile.c        | 8 +++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d12b381821f..ef044ea8fe79 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -166,6 +166,12 @@ enum {
 #define COUNT_CONTINUED	0x80	/* See swap_map continuation for full count */
 #define SWAP_MAP_SHMEM	0xbf	/* Owned by shmem/tmpfs, in first swap_map */
 
+enum swap_cluster_lock_class
+{
+	SWAP_CLUSTER_LOCK_NORMAL,  /* implicitly used by plain spin_lock() APIs. */
+	SWAP_CLUSTER_LOCK_NESTED,
+};
+
 /*
  * We use this to track usage of a cluster. A cluster is a block of swap disk
  * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5ac2cb40dbd3..0a52e9b2f843 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -263,6 +263,12 @@ static inline void __lock_cluster(struct swap_cluster_info *ci)
 	spin_lock(&ci->lock);
 }
 
+static inline void __lock_cluster_nested(struct swap_cluster_info *ci,
+					 unsigned subclass)
+{
+	spin_lock_nested(&ci->lock, subclass);
+}
+
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -336,7 +342,7 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
 		 * only acquired when we held swap_info_struct->lock
 		 */
 		ci_tail = ci + tail;
-		__lock_cluster(ci_tail);
+		__lock_cluster_nested(ci_tail, SWAP_CLUSTER_LOCK_NESTED);
 		cluster_set_next(ci_tail, idx);
 		unlock_cluster(ci_tail);
 		cluster_set_next_flag(&list->tail, idx, 0);
-- 
2.11.0
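
Why a second cluster lock is involved at all can be read off the last hunk:
appending cluster idx rewrites the next link stored in the current tail
cluster. A rough userspace sketch of that step follows. The toy_* names, the
NO_CLUSTER marker and the boolean standing in for the spinlock are assumptions
made for illustration, and the empty-list branch is a guess at the surrounding
code, not a copy of mm/swapfile.c.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_CLUSTER (-1)		/* hypothetical "empty" marker */

struct toy_cluster {
	int next;		/* index of the next cluster on the list */
	bool locked;		/* stand-in for the per-cluster spinlock */
};

struct toy_list {
	int head;
	int tail;
};

static void toy_cluster_lock(struct toy_cluster *c)
{
	/* Taking the same instance twice would be a real deadlock. */
	assert(!c->locked);
	c->locked = true;
}

static void toy_cluster_unlock(struct toy_cluster *c)
{
	assert(c->locked);
	c->locked = false;
}

/*
 * Append cluster idx to the list. The caller is assumed to hold the lock of
 * ci[idx] already, and idx is assumed not to be the current tail.
 */
static void toy_list_add_tail(struct toy_list *list, struct toy_cluster *ci,
			      int idx)
{
	assert(ci[idx].locked);
	if (list->head == NO_CLUSTER) {
		list->head = idx;
		list->tail = idx;
		return;
	}

	struct toy_cluster *ci_tail = ci + list->tail;

	toy_cluster_lock(ci_tail);	/* second cluster lock: the nested one */
	ci_tail->next = idx;
	toy_cluster_unlock(ci_tail);
	list->tail = idx;
}
```

The two locks belong to different array elements, so in this model the nested
acquisition never hits the same instance. What still has to be ruled out is
two tasks taking the pair in opposite order, which is the job of the outer
swap_info_struct->lock mentioned in the changelog.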

* Re: swap_cluster_info lockdep splat
From: Hugh Dickins @ 2017-02-16 19:00 UTC
  To: Huang, Ying
  Cc: Minchan Kim, Andrew Morton, Tim Chen, Hugh Dickins, linux-kernel,
	linux-mm

On Thu, 16 Feb 2017, Huang, Ying wrote:

> Hi, Minchan,
> 
> Minchan Kim <minchan@kernel.org> writes:
> 
> > Hi Huang,
> >
> > With changing from bit lock to spinlock of swap_cluster_info, my zram
> > test failed with below message. It seems nested lock problem so need to
> > play with lockdep.
> 
> Sorry, I could not reproduce the warning in my tests.  Could you try the
> patches as below?   And could you share your test case?
> 
> Best Regards,
> Huang, Ying
> 
> ------------------------------------------------------------->
> From 2b9e2f78a6e389442f308c4f9e8d5ac40fe6aa2f Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Thu, 16 Feb 2017 16:38:17 +0800
> Subject: [PATCH] mm, swap: Annotate nested locking for cluster lock
> 
> There is a nested locking in cluster_list_add_tail() for cluster lock,
> which caused lockdep to complain as below.  The nested locking is safe
> because both cluster locks are only acquired when we held the
> swap_info_struct->lock.  Annotated the nested locking via
> spin_lock_nested() to fix the complain of lockdep.
> 
> =============================================
> [ INFO: possible recursive locking detected ]
> 4.10.0-rc8-next-20170214-zram #24 Not tainted
> ---------------------------------------------
> as/6557 is trying to acquire lock:
>  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811ddd03>] cluster_list_add_tail.part.31+0x33/0x70
> 
> but task is already holding lock:
>  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(&(&((cluster_info + ci)->lock))->rlock);
>   lock(&(&((cluster_info + ci)->lock))->rlock);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 3 locks held by as/6557:
>  #0:  (&(&cache->free_lock)->rlock){......}, at: [<ffffffff811c206b>] free_swap_slot+0x8b/0x110
>  #1:  (&(&p->lock)->rlock){+.+.-.}, at: [<ffffffff811df295>] swapcache_free_entries+0x75/0x330
>  #2:  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
> 
> stack backtrace:
> CPU: 3 PID: 6557 Comm: as Not tainted 4.10.0-rc8-next-20170214-zram #24
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> Call Trace:
>  dump_stack+0x85/0xc2
>  __lock_acquire+0x15ea/0x1640
>  lock_acquire+0x100/0x1f0
>  ? cluster_list_add_tail.part.31+0x33/0x70
>  _raw_spin_lock+0x38/0x50
>  ? cluster_list_add_tail.part.31+0x33/0x70
>  cluster_list_add_tail.part.31+0x33/0x70
>  swapcache_free_entries+0x2f9/0x330
>  free_swap_slot+0xf8/0x110
>  swapcache_free+0x36/0x40
>  delete_from_swap_cache+0x5f/0xa0
>  try_to_free_swap+0x6e/0xa0
>  free_pages_and_swap_cache+0x7d/0xb0
>  tlb_flush_mmu_free+0x36/0x60
>  tlb_finish_mmu+0x1c/0x50
>  exit_mmap+0xc7/0x150
>  mmput+0x51/0x110
>  do_exit+0x2b2/0xc30
>  ? trace_hardirqs_on_caller+0x129/0x1b0
>  do_group_exit+0x50/0xd0
>  SyS_exit_group+0x14/0x20
>  entry_SYSCALL_64_fastpath+0x23/0xc6
> RIP: 0033:0x2b9a2dbdf309
> RSP: 002b:00007ffe71887528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00002b9a2dbdf309
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 00002b9a2ded8858 R08: 000000000000003c R09: 00000000000000e7
> R10: ffffffffffffff60 R11: 0000000000000246 R12: 00002b9a2ded8858
> R13: 00002b9a2dedde80 R14: 000000000255f770 R15: 0000000000000001
> 
> Reported-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  include/linux/swap.h | 6 ++++++
>  mm/swapfile.c        | 8 +++++++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4d12b381821f..ef044ea8fe79 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -166,6 +166,12 @@ enum {
>  #define COUNT_CONTINUED	0x80	/* See swap_map continuation for full count */
>  #define SWAP_MAP_SHMEM	0xbf	/* Owned by shmem/tmpfs, in first swap_map */
>  
> +enum swap_cluster_lock_class
> +{
> +	SWAP_CLUSTER_LOCK_NORMAL,  /* implicitly used by plain spin_lock() APIs. */
> +	SWAP_CLUSTER_LOCK_NESTED,
> +};
> +
>  /*
>   * We use this to track usage of a cluster. A cluster is a block of swap disk
>   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5ac2cb40dbd3..0a52e9b2f843 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -263,6 +263,12 @@ static inline void __lock_cluster(struct swap_cluster_info *ci)
>  	spin_lock(&ci->lock);
>  }
>  
> +static inline void __lock_cluster_nested(struct swap_cluster_info *ci,
> +					 unsigned subclass)
> +{
> +	spin_lock_nested(&ci->lock, subclass);
> +}
> +
>  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
>  						     unsigned long offset)
>  {
> @@ -336,7 +342,7 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
>  		 * only acquired when we held swap_info_struct->lock
>  		 */
>  		ci_tail = ci + tail;
> -		__lock_cluster(ci_tail);
> +		__lock_cluster_nested(ci_tail, SWAP_CLUSTER_LOCK_NESTED);
>  		cluster_set_next(ci_tail, idx);
>  		unlock_cluster(ci_tail);
>  		cluster_set_next_flag(&list->tail, idx, 0);
> -- 
> 2.11.0

I do not understand your zest for putting wrappers around every little
thing, making it all harder to follow than it need be.  Here's the patch
I've been running with (but you have a leak somewhere, and I don't have
time to search out and fix it: please try sustained swapping and swapoff).

[PATCH] mm, swap: Annotate nested locking for cluster lock

Fix swap cluster lockdep warnings.

Reported-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/swapfile.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

--- 4.10-rc7-mm1/mm/swapfile.c	2017-02-08 10:56:23.359358518 -0800
+++ linux/mm/swapfile.c	2017-02-08 11:25:55.513241067 -0800
@@ -258,11 +258,6 @@ static inline void cluster_set_null(stru
 	info->data = 0;
 }
 
-static inline void __lock_cluster(struct swap_cluster_info *ci)
-{
-	spin_lock(&ci->lock);
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -271,7 +266,7 @@ static inline struct swap_cluster_info *
 	ci = si->cluster_info;
 	if (ci) {
 		ci += offset / SWAPFILE_CLUSTER;
-		__lock_cluster(ci);
+		spin_lock(&ci->lock);
 	}
 	return ci;
 }
@@ -336,7 +331,7 @@ static void cluster_list_add_tail(struct
 		 * only acquired when we held swap_info_struct->lock
 		 */
 		ci_tail = ci + tail;
-		__lock_cluster(ci_tail);
+		spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
 		cluster_set_next(ci_tail, idx);
 		unlock_cluster(ci_tail);
 		cluster_set_next_flag(&list->tail, idx, 0);
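
For comparison with the quoted patch: both variants appear to hand lockdep the
same subclass value, assuming SINGLE_DEPTH_NESTING is 1 as in
include/linux/lockdep.h. A small standalone check of that assumption, with the
enum copied from the quoted patch and a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: mirrors the definition in include/linux/lockdep.h. */
#define SINGLE_DEPTH_NESTING 1

/* Copied from the enum proposed in the quoted patch. */
enum swap_cluster_lock_class {
	SWAP_CLUSTER_LOCK_NORMAL,	/* 0: what plain spin_lock() uses */
	SWAP_CLUSTER_LOCK_NESTED,	/* 1 */
};

static_assert(SWAP_CLUSTER_LOCK_NORMAL == 0,
	      "plain spin_lock() corresponds to subclass 0");

/* Hypothetical helper: true if both patches pass the same subclass. */
static bool same_nesting_subclass(void)
{
	return SWAP_CLUSTER_LOCK_NESTED == SINGLE_DEPTH_NESTING;
}
```

If that holds, the difference between the two patches is only in how the call
is spelled, not in what lockdep sees.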


* Re: swap_cluster_info lockdep splat
  2017-02-16 19:00     ` Hugh Dickins
@ 2017-02-16 19:34       ` Tim Chen
  -1 siblings, 0 replies; 23+ messages in thread
From: Tim Chen @ 2017-02-16 19:34 UTC (permalink / raw)
  To: Hugh Dickins, Huang, Ying
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm


> I do not understand your zest for putting wrappers around every little
> thing, making it all harder to follow than it need be.  Here's the patch
> I've been running with (but you have a leak somewhere, and I don't have
> time to search out and fix it: please try sustained swapping and swapoff).
> 

Hugh, I'm trying to duplicate your test case.  So you were swapping,
then did swapoff and swapon on the swap device, and restarted swapping?

Tim

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: swap_cluster_info lockdep splat
  2017-02-16 19:00     ` Hugh Dickins
@ 2017-02-16 23:45       ` Minchan Kim
  -1 siblings, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2017-02-16 23:45 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Huang, Ying, Andrew Morton, Tim Chen, linux-kernel, linux-mm

Hi Huang and Hugh,

Thanks for the quick response!

On Thu, Feb 16, 2017 at 11:00:00AM -0800, Hugh Dickins wrote:
> On Thu, 16 Feb 2017, Huang, Ying wrote:
> 
> > Hi, Minchan,
> > 
> > Minchan Kim <minchan@kernel.org> writes:
> > 
> > > Hi Huang,
> > >
> > > With changing from bit lock to spinlock of swap_cluster_info, my zram
> > > test failed with below message. It seems nested lock problem so need to
> > > play with lockdep.
> > 
> > Sorry, I could not reproduce the warning in my tests.  Could you try the
> > patches as below?   And could you share your test case?

It's a simple kernel build test on a small-memory system:
4 cores and 750M of memory, with a 4G zram swap device.

> > 
> > Best Regards,
> > Huang, Ying
> > 
> > ------------------------------------------------------------->
> > From 2b9e2f78a6e389442f308c4f9e8d5ac40fe6aa2f Mon Sep 17 00:00:00 2001
> > From: Huang Ying <ying.huang@intel.com>
> > Date: Thu, 16 Feb 2017 16:38:17 +0800
> > Subject: [PATCH] mm, swap: Annotate nested locking for cluster lock
> > 
> > There is a nested locking in cluster_list_add_tail() for cluster lock,
> > which caused lockdep to complain as below.  The nested locking is safe
> > because both cluster locks are only acquired when we held the
> > swap_info_struct->lock.  Annotated the nested locking via
> > spin_lock_nested() to fix the complain of lockdep.
> > 
> > =============================================
> > [ INFO: possible recursive locking detected ]
> > 4.10.0-rc8-next-20170214-zram #24 Not tainted
> > ---------------------------------------------
> > as/6557 is trying to acquire lock:
> >  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811ddd03>] cluster_list_add_tail.part.31+0x33/0x70
> > 
> > but task is already holding lock:
> >  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
> > 
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> > 
> >        CPU0
> >        ----
> >   lock(&(&((cluster_info + ci)->lock))->rlock);
> >   lock(&(&((cluster_info + ci)->lock))->rlock);
> > 
> >  *** DEADLOCK ***
> > 
> >  May be due to missing lock nesting notation
> > 
> > 3 locks held by as/6557:
> >  #0:  (&(&cache->free_lock)->rlock){......}, at: [<ffffffff811c206b>] free_swap_slot+0x8b/0x110
> >  #1:  (&(&p->lock)->rlock){+.+.-.}, at: [<ffffffff811df295>] swapcache_free_entries+0x75/0x330
> >  #2:  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
> > 
> > stack backtrace:
> > CPU: 3 PID: 6557 Comm: as Not tainted 4.10.0-rc8-next-20170214-zram #24
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > Call Trace:
> >  dump_stack+0x85/0xc2
> >  __lock_acquire+0x15ea/0x1640
> >  lock_acquire+0x100/0x1f0
> >  ? cluster_list_add_tail.part.31+0x33/0x70
> >  _raw_spin_lock+0x38/0x50
> >  ? cluster_list_add_tail.part.31+0x33/0x70
> >  cluster_list_add_tail.part.31+0x33/0x70
> >  swapcache_free_entries+0x2f9/0x330
> >  free_swap_slot+0xf8/0x110
> >  swapcache_free+0x36/0x40
> >  delete_from_swap_cache+0x5f/0xa0
> >  try_to_free_swap+0x6e/0xa0
> >  free_pages_and_swap_cache+0x7d/0xb0
> >  tlb_flush_mmu_free+0x36/0x60
> >  tlb_finish_mmu+0x1c/0x50
> >  exit_mmap+0xc7/0x150
> >  mmput+0x51/0x110
> >  do_exit+0x2b2/0xc30
> >  ? trace_hardirqs_on_caller+0x129/0x1b0
> >  do_group_exit+0x50/0xd0
> >  SyS_exit_group+0x14/0x20
> >  entry_SYSCALL_64_fastpath+0x23/0xc6
> > RIP: 0033:0x2b9a2dbdf309
> > RSP: 002b:00007ffe71887528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00002b9a2dbdf309
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > RBP: 00002b9a2ded8858 R08: 000000000000003c R09: 00000000000000e7
> > R10: ffffffffffffff60 R11: 0000000000000246 R12: 00002b9a2ded8858
> > R13: 00002b9a2dedde80 R14: 000000000255f770 R15: 0000000000000001
> > 
> > Reported-by: Minchan Kim <minchan@kernel.org>
> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> > ---
> >  include/linux/swap.h | 6 ++++++
> >  mm/swapfile.c        | 8 +++++++-
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 4d12b381821f..ef044ea8fe79 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -166,6 +166,12 @@ enum {
> >  #define COUNT_CONTINUED	0x80	/* See swap_map continuation for full count */
> >  #define SWAP_MAP_SHMEM	0xbf	/* Owned by shmem/tmpfs, in first swap_map */
> >  
> > +enum swap_cluster_lock_class
> > +{
> > +	SWAP_CLUSTER_LOCK_NORMAL,  /* implicitly used by plain spin_lock() APIs. */
> > +	SWAP_CLUSTER_LOCK_NESTED,
> > +};
> > +
> >  /*
> >   * We use this to track usage of a cluster. A cluster is a block of swap disk
> >   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 5ac2cb40dbd3..0a52e9b2f843 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -263,6 +263,12 @@ static inline void __lock_cluster(struct swap_cluster_info *ci)
> >  	spin_lock(&ci->lock);
> >  }
> >  
> > +static inline void __lock_cluster_nested(struct swap_cluster_info *ci,
> > +					 unsigned subclass)
> > +{
> > +	spin_lock_nested(&ci->lock, subclass);
> > +}
> > +
> >  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> >  						     unsigned long offset)
> >  {
> > @@ -336,7 +342,7 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
> >  		 * only acquired when we held swap_info_struct->lock
> >  		 */
> >  		ci_tail = ci + tail;
> > -		__lock_cluster(ci_tail);
> > +		__lock_cluster_nested(ci_tail, SWAP_CLUSTER_LOCK_NESTED);
> >  		cluster_set_next(ci_tail, idx);
> >  		unlock_cluster(ci_tail);
> >  		cluster_set_next_flag(&list->tail, idx, 0);
> > -- 
> > 2.11.0
> 
> I do not understand your zest for putting wrappers around every little
> thing, making it all harder to follow than it need be.  Here's the patch
> I've been running with (but you have a leak somewhere, and I don't have
> time to search out and fix it: please try sustained swapping and swapoff).
> 
> [PATCH] mm, swap: Annotate nested locking for cluster lock
> 
> Fix swap cluster lockdep warnings.
> 
> Reported-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Actually, before reporting, I tested the hunk below and confirmed that
lockdep no longer warns with it. But I doubted whether the lockdep subclass
is okay for the non-nested case (i.e., setup_swap_map_and_extents).
I guess it's no problem, but I wasn't sure, so I just reported it without
fixing it myself. :)
If it's no problem, I'm sure both of your patches would work well,
but I prefer Hugh's patch, which keeps it simple and clear.

Thanks.

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5ac2cb4..348b9c5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -263,6 +263,11 @@ static inline void __lock_cluster(struct swap_cluster_info *ci)
 	spin_lock(&ci->lock);
 }
 
+static inline void __lock_cluster_nested(struct swap_cluster_info *ci)
+{
+	spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING);
+}
+
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -336,7 +341,7 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
 		 * only acquired when we held swap_info_struct->lock
 		 */
 		ci_tail = ci + tail;
-		__lock_cluster(ci_tail);
+		__lock_cluster_nested(ci_tail);
 		cluster_set_next(ci_tail, idx);
 		unlock_cluster(ci_tail);
 		cluster_set_next_flag(&list->tail, idx, 0);
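
To make the subclass question above concrete, here is a minimal userspace
sketch. It is a toy model under stated assumptions, not the kernel's lockdep:
it assumes the checker only compares (class, subclass) pairs of locks held by
the current context, and every toy_* name is hypothetical. The only value
taken from the kernel is SINGLE_DEPTH_NESTING being 1.

```c
#include <stddef.h>

/*
 * Toy model, NOT the kernel's lockdep: it only remembers which
 * (class, subclass) pairs the current context holds. All names here
 * are hypothetical and exist only for this illustration.
 */
#define TOY_MAX_HELD			8
#define TOY_SINGLE_DEPTH_NESTING	1	/* same value as SINGLE_DEPTH_NESTING */

struct toy_held_lock {
	int class_id;	/* one id per lock-init site, e.g. all cluster locks */
	int subclass;	/* 0 for plain spin_lock(), N for spin_lock_nested(.., N) */
};

static struct toy_held_lock toy_held[TOY_MAX_HELD];
static size_t toy_nr_held;

/*
 * Returns 1 if the acquisition is accepted, 0 if the toy checker would
 * report "possible recursive locking", -1 if the held-lock table is full.
 */
static int toy_acquire(int class_id, int subclass)
{
	size_t i;

	/* Same class and same subclass already held: looks recursive. */
	for (i = 0; i < toy_nr_held; i++) {
		if (toy_held[i].class_id == class_id &&
		    toy_held[i].subclass == subclass)
			return 0;
	}
	if (toy_nr_held == TOY_MAX_HELD)
		return -1;
	toy_held[toy_nr_held].class_id = class_id;
	toy_held[toy_nr_held].subclass = subclass;
	toy_nr_held++;
	return 1;
}

/* Drop everything, as if all held locks were released. */
static void toy_release_all(void)
{
	toy_nr_held = 0;
}
```

With one class id standing in for all cluster locks, this model rejects
subclass 0 taken twice (the splat in this thread), accepts subclass 0 followed
by TOY_SINGLE_DEPTH_NESTING (the annotated tail lock), and also accepts
TOY_SINGLE_DEPTH_NESTING taken with nothing else held, which corresponds to
the non-nested case.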

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: swap_cluster_info lockdep splat
  2017-02-16 19:00     ` Hugh Dickins
@ 2017-02-17  0:38       ` Huang, Ying
  -1 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-02-17  0:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Huang, Ying, Minchan Kim, Andrew Morton, Tim Chen, linux-kernel,
	linux-mm

Hugh Dickins <hughd@google.com> writes:

> On Thu, 16 Feb 2017, Huang, Ying wrote:
>
>> Hi, Minchan,
>> 
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > Hi Huang,
>> >
>> > With changing from bit lock to spinlock of swap_cluster_info, my zram
>> > test failed with below message. It seems nested lock problem so need to
>> > play with lockdep.
>> 
>> Sorry, I could not reproduce the warning in my tests.  Could you try the
>> patches as below?   And could you share your test case?
>> 
>> Best Regards,
>> Huang, Ying
>> 
>> ------------------------------------------------------------->
>> From 2b9e2f78a6e389442f308c4f9e8d5ac40fe6aa2f Mon Sep 17 00:00:00 2001
>> From: Huang Ying <ying.huang@intel.com>
>> Date: Thu, 16 Feb 2017 16:38:17 +0800
>> Subject: [PATCH] mm, swap: Annotate nested locking for cluster lock
>> 
>> There is a nested locking in cluster_list_add_tail() for cluster lock,
>> which caused lockdep to complain as below.  The nested locking is safe
>> because both cluster locks are only acquired when we held the
>> swap_info_struct->lock.  Annotated the nested locking via
>> spin_lock_nested() to fix the complain of lockdep.
>> 
>> =============================================
>> [ INFO: possible recursive locking detected ]
>> 4.10.0-rc8-next-20170214-zram #24 Not tainted
>> ---------------------------------------------
>> as/6557 is trying to acquire lock:
>>  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811ddd03>] cluster_list_add_tail.part.31+0x33/0x70
>> 
>> but task is already holding lock:
>>  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
>> 
>> other info that might help us debug this:
>>  Possible unsafe locking scenario:
>> 
>>        CPU0
>>        ----
>>   lock(&(&((cluster_info + ci)->lock))->rlock);
>>   lock(&(&((cluster_info + ci)->lock))->rlock);
>> 
>>  *** DEADLOCK ***
>> 
>>  May be due to missing lock nesting notation
>> 
>> 3 locks held by as/6557:
>>  #0:  (&(&cache->free_lock)->rlock){......}, at: [<ffffffff811c206b>] free_swap_slot+0x8b/0x110
>>  #1:  (&(&p->lock)->rlock){+.+.-.}, at: [<ffffffff811df295>] swapcache_free_entries+0x75/0x330
>>  #2:  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
>> 
>> stack backtrace:
>> CPU: 3 PID: 6557 Comm: as Not tainted 4.10.0-rc8-next-20170214-zram #24
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> Call Trace:
>>  dump_stack+0x85/0xc2
>>  __lock_acquire+0x15ea/0x1640
>>  lock_acquire+0x100/0x1f0
>>  ? cluster_list_add_tail.part.31+0x33/0x70
>>  _raw_spin_lock+0x38/0x50
>>  ? cluster_list_add_tail.part.31+0x33/0x70
>>  cluster_list_add_tail.part.31+0x33/0x70
>>  swapcache_free_entries+0x2f9/0x330
>>  free_swap_slot+0xf8/0x110
>>  swapcache_free+0x36/0x40
>>  delete_from_swap_cache+0x5f/0xa0
>>  try_to_free_swap+0x6e/0xa0
>>  free_pages_and_swap_cache+0x7d/0xb0
>>  tlb_flush_mmu_free+0x36/0x60
>>  tlb_finish_mmu+0x1c/0x50
>>  exit_mmap+0xc7/0x150
>>  mmput+0x51/0x110
>>  do_exit+0x2b2/0xc30
>>  ? trace_hardirqs_on_caller+0x129/0x1b0
>>  do_group_exit+0x50/0xd0
>>  SyS_exit_group+0x14/0x20
>>  entry_SYSCALL_64_fastpath+0x23/0xc6
>> RIP: 0033:0x2b9a2dbdf309
>> RSP: 002b:00007ffe71887528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00002b9a2dbdf309
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>> RBP: 00002b9a2ded8858 R08: 000000000000003c R09: 00000000000000e7
>> R10: ffffffffffffff60 R11: 0000000000000246 R12: 00002b9a2ded8858
>> R13: 00002b9a2dedde80 R14: 000000000255f770 R15: 0000000000000001
>> 
>> Reported-by: Minchan Kim <minchan@kernel.org>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  include/linux/swap.h | 6 ++++++
>>  mm/swapfile.c        | 8 +++++++-
>>  2 files changed, 13 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 4d12b381821f..ef044ea8fe79 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -166,6 +166,12 @@ enum {
>>  #define COUNT_CONTINUED	0x80	/* See swap_map continuation for full count */
>>  #define SWAP_MAP_SHMEM	0xbf	/* Owned by shmem/tmpfs, in first swap_map */
>>  
>> +enum swap_cluster_lock_class
>> +{
>> +	SWAP_CLUSTER_LOCK_NORMAL,  /* implicitly used by plain spin_lock() APIs. */
>> +	SWAP_CLUSTER_LOCK_NESTED,
>> +};
>> +
>>  /*
>>   * We use this to track usage of a cluster. A cluster is a block of swap disk
>>   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 5ac2cb40dbd3..0a52e9b2f843 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -263,6 +263,12 @@ static inline void __lock_cluster(struct swap_cluster_info *ci)
>>  	spin_lock(&ci->lock);
>>  }
>>  
>> +static inline void __lock_cluster_nested(struct swap_cluster_info *ci,
>> +					 unsigned subclass)
>> +{
>> +	spin_lock_nested(&ci->lock, subclass);
>> +}
>> +
>>  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
>>  						     unsigned long offset)
>>  {
>> @@ -336,7 +342,7 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
>>  		 * only acquired when we held swap_info_struct->lock
>>  		 */
>>  		ci_tail = ci + tail;
>> -		__lock_cluster(ci_tail);
>> +		__lock_cluster_nested(ci_tail, SWAP_CLUSTER_LOCK_NESTED);
>>  		cluster_set_next(ci_tail, idx);
>>  		unlock_cluster(ci_tail);
>>  		cluster_set_next_flag(&list->tail, idx, 0);
>> -- 
>> 2.11.0
>
> I do not understand your zest for putting wrappers around every little
> thing, making it all harder to follow than it need be.  Here's the patch
> I've been running with (but you have a leak somewhere, and I don't have
> time to search out and fix it: please try sustained swapping and swapoff).

Thanks for your patch.  The cluster lock was a bit_spinlock before, and the
wrapper made it easier to convert it to a normal spinlock.  But, especially
after splitting the function into two variants, the wrapper looks purely
redundant.  Thanks for fixing that too.

Best Regards,
Huang, Ying

> [PATCH] mm, swap: Annotate nested locking for cluster lock
>
> Fix swap cluster lockdep warnings.
>
> Reported-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>
>  mm/swapfile.c |    9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> --- 4.10-rc7-mm1/mm/swapfile.c	2017-02-08 10:56:23.359358518 -0800
> +++ linux/mm/swapfile.c	2017-02-08 11:25:55.513241067 -0800
> @@ -258,11 +258,6 @@ static inline void cluster_set_null(stru
>  	info->data = 0;
>  }
>  
> -static inline void __lock_cluster(struct swap_cluster_info *ci)
> -{
> -	spin_lock(&ci->lock);
> -}
> -
>  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
>  						     unsigned long offset)
>  {
> @@ -271,7 +266,7 @@ static inline struct swap_cluster_info *
>  	ci = si->cluster_info;
>  	if (ci) {
>  		ci += offset / SWAPFILE_CLUSTER;
> -		__lock_cluster(ci);
> +		spin_lock(&ci->lock);
>  	}
>  	return ci;
>  }
> @@ -336,7 +331,7 @@ static void cluster_list_add_tail(struct
>  		 * only acquired when we held swap_info_struct->lock
>  		 */
>  		ci_tail = ci + tail;
> -		__lock_cluster(ci_tail);
> +		spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
>  		cluster_set_next(ci_tail, idx);
>  		unlock_cluster(ci_tail);
>  		cluster_set_next_flag(&list->tail, idx, 0);

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: swap_cluster_info lockdep splat
@ 2017-02-17  0:38       ` Huang, Ying
  0 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-02-17  0:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Huang, Ying, Minchan Kim, Andrew Morton, Tim Chen, linux-kernel,
	linux-mm

Hugh Dickins <hughd@google.com> writes:

> On Thu, 16 Feb 2017, Huang, Ying wrote:
>
>> Hi, Minchan,
>> 
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > Hi Huang,
>> >
>> > With changing from bit lock to spinlock of swap_cluster_info, my zram
>> > test failed with below message. It seems nested lock problem so need to
>> > play with lockdep.
>> 
>> Sorry, I could not reproduce the warning in my tests.  Could you try the
>> patches as below?   And could you share your test case?
>> 
>> Best Regards,
>> Huang, Ying
>> 
>> ------------------------------------------------------------->
>> From 2b9e2f78a6e389442f308c4f9e8d5ac40fe6aa2f Mon Sep 17 00:00:00 2001
>> From: Huang Ying <ying.huang@intel.com>
>> Date: Thu, 16 Feb 2017 16:38:17 +0800
>> Subject: [PATCH] mm, swap: Annotate nested locking for cluster lock
>> 
>> There is a nested locking in cluster_list_add_tail() for cluster lock,
>> which caused lockdep to complain as below.  The nested locking is safe
>> because both cluster locks are only acquired when we held the
>> swap_info_struct->lock.  Annotated the nested locking via
>> spin_lock_nested() to fix the complain of lockdep.
>> 
>> =============================================
>> [ INFO: possible recursive locking detected ]
>> 4.10.0-rc8-next-20170214-zram #24 Not tainted
>> ---------------------------------------------
>> as/6557 is trying to acquire lock:
>>  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811ddd03>] cluster_list_add_tail.part.31+0x33/0x70
>> 
>> but task is already holding lock:
>>  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
>> 
>> other info that might help us debug this:
>>  Possible unsafe locking scenario:
>> 
>>        CPU0
>>        ----
>>   lock(&(&((cluster_info + ci)->lock))->rlock);
>>   lock(&(&((cluster_info + ci)->lock))->rlock);
>> 
>>  *** DEADLOCK ***
>> 
>>  May be due to missing lock nesting notation
>> 
>> 3 locks held by as/6557:
>>  #0:  (&(&cache->free_lock)->rlock){......}, at: [<ffffffff811c206b>] free_swap_slot+0x8b/0x110
>>  #1:  (&(&p->lock)->rlock){+.+.-.}, at: [<ffffffff811df295>] swapcache_free_entries+0x75/0x330
>>  #2:  (&(&((cluster_info + ci)->lock))->rlock){+.+.-.}, at: [<ffffffff811df2bb>] swapcache_free_entries+0x9b/0x330
>> 
>> stack backtrace:
>> CPU: 3 PID: 6557 Comm: as Not tainted 4.10.0-rc8-next-20170214-zram #24
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> Call Trace:
>>  dump_stack+0x85/0xc2
>>  __lock_acquire+0x15ea/0x1640
>>  lock_acquire+0x100/0x1f0
>>  ? cluster_list_add_tail.part.31+0x33/0x70
>>  _raw_spin_lock+0x38/0x50
>>  ? cluster_list_add_tail.part.31+0x33/0x70
>>  cluster_list_add_tail.part.31+0x33/0x70
>>  swapcache_free_entries+0x2f9/0x330
>>  free_swap_slot+0xf8/0x110
>>  swapcache_free+0x36/0x40
>>  delete_from_swap_cache+0x5f/0xa0
>>  try_to_free_swap+0x6e/0xa0
>>  free_pages_and_swap_cache+0x7d/0xb0
>>  tlb_flush_mmu_free+0x36/0x60
>>  tlb_finish_mmu+0x1c/0x50
>>  exit_mmap+0xc7/0x150
>>  mmput+0x51/0x110
>>  do_exit+0x2b2/0xc30
>>  ? trace_hardirqs_on_caller+0x129/0x1b0
>>  do_group_exit+0x50/0xd0
>>  SyS_exit_group+0x14/0x20
>>  entry_SYSCALL_64_fastpath+0x23/0xc6
>> RIP: 0033:0x2b9a2dbdf309
>> RSP: 002b:00007ffe71887528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00002b9a2dbdf309
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>> RBP: 00002b9a2ded8858 R08: 000000000000003c R09: 00000000000000e7
>> R10: ffffffffffffff60 R11: 0000000000000246 R12: 00002b9a2ded8858
>> R13: 00002b9a2dedde80 R14: 000000000255f770 R15: 0000000000000001
>> 
>> Reported-by: Minchan Kim <minchan@kernel.org>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  include/linux/swap.h | 6 ++++++
>>  mm/swapfile.c        | 8 +++++++-
>>  2 files changed, 13 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 4d12b381821f..ef044ea8fe79 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -166,6 +166,12 @@ enum {
>>  #define COUNT_CONTINUED	0x80	/* See swap_map continuation for full count */
>>  #define SWAP_MAP_SHMEM	0xbf	/* Owned by shmem/tmpfs, in first swap_map */
>>  
>> +enum swap_cluster_lock_class
>> +{
>> +	SWAP_CLUSTER_LOCK_NORMAL,  /* implicitly used by plain spin_lock() APIs. */
>> +	SWAP_CLUSTER_LOCK_NESTED,
>> +};
>> +
>>  /*
>>   * We use this to track usage of a cluster. A cluster is a block of swap disk
>>   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 5ac2cb40dbd3..0a52e9b2f843 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -263,6 +263,12 @@ static inline void __lock_cluster(struct swap_cluster_info *ci)
>>  	spin_lock(&ci->lock);
>>  }
>>  
>> +static inline void __lock_cluster_nested(struct swap_cluster_info *ci,
>> +					 unsigned subclass)
>> +{
>> +	spin_lock_nested(&ci->lock, subclass);
>> +}
>> +
>>  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
>>  						     unsigned long offset)
>>  {
>> @@ -336,7 +342,7 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
>>  		 * only acquired when we held swap_info_struct->lock
>>  		 */
>>  		ci_tail = ci + tail;
>> -		__lock_cluster(ci_tail);
>> +		__lock_cluster_nested(ci_tail, SWAP_CLUSTER_LOCK_NESTED);
>>  		cluster_set_next(ci_tail, idx);
>>  		unlock_cluster(ci_tail);
>>  		cluster_set_next_flag(&list->tail, idx, 0);
>> -- 
>> 2.11.0
>
> I do not understand your zest for putting wrappers around every little
> thing, making it all harder to follow than it need be.  Here's the patch
> I've been running with (but you have a leak somewhere, and I don't have
> time to search out and fix it: please try sustained swapping and swapoff).

Thanks for your patch.  The cluster lock was a bit_spinlock before, and
the wrapper made it easier to convert to a normal spinlock.  But now,
especially after splitting the function into 2 variants, the wrapper
looks purely redundant.  Thanks for fixing that too.

Best Regards,
Huang, Ying

> [PATCH] mm, swap: Annotate nested locking for cluster lock
>
> Fix swap cluster lockdep warnings.
>
> Reported-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>
>  mm/swapfile.c |    9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> --- 4.10-rc7-mm1/mm/swapfile.c	2017-02-08 10:56:23.359358518 -0800
> +++ linux/mm/swapfile.c	2017-02-08 11:25:55.513241067 -0800
> @@ -258,11 +258,6 @@ static inline void cluster_set_null(stru
>  	info->data = 0;
>  }
>  
> -static inline void __lock_cluster(struct swap_cluster_info *ci)
> -{
> -	spin_lock(&ci->lock);
> -}
> -
>  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
>  						     unsigned long offset)
>  {
> @@ -271,7 +266,7 @@ static inline struct swap_cluster_info *
>  	ci = si->cluster_info;
>  	if (ci) {
>  		ci += offset / SWAPFILE_CLUSTER;
> -		__lock_cluster(ci);
> +		spin_lock(&ci->lock);
>  	}
>  	return ci;
>  }
> @@ -336,7 +331,7 @@ static void cluster_list_add_tail(struct
>  		 * only acquired when we held swap_info_struct->lock
>  		 */
>  		ci_tail = ci + tail;
> -		__lock_cluster(ci_tail);
> +		spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
>  		cluster_set_next(ci_tail, idx);
>  		unlock_cluster(ci_tail);
>  		cluster_set_next_flag(&list->tail, idx, 0);
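
As a rough illustration of why the nesting annotation silences the
report: lockdep tracks lock classes rather than individual locks, and
all the cluster locks appear to share one class, so taking a second
cluster lock while holding the first looks like recursion unless the
second acquisition carries a different subclass.  The sketch below is
a toy userspace model of that rule, not lockdep's actual
implementation; task_model, model_acquire() and model_release() are
hypothetical names invented for the example.  Subclass 0 stands for a
plain spin_lock() and subclass 1 for SINGLE_DEPTH_NESTING.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are kernel APIs. */
#define MAX_HELD 8

struct held_lock {
	const void *class_key;	/* shared by every lock of one class */
	unsigned int subclass;	/* 0 = plain lock, 1 = one nested level */
};

struct task_model {
	struct held_lock held[MAX_HELD];
	size_t depth;		/* number of locks currently held */
};

/*
 * Record an acquisition.  Returns false (the model's "splat") when the
 * task already holds a lock with the same class and the same subclass;
 * in that case nothing is recorded.
 */
bool model_acquire(struct task_model *t, const void *class_key,
		   unsigned int subclass)
{
	size_t i;

	assert(t->depth < MAX_HELD);
	for (i = 0; i < t->depth; i++) {
		if (t->held[i].class_key == class_key &&
		    t->held[i].subclass == subclass)
			return false;
	}
	t->held[t->depth].class_key = class_key;
	t->held[t->depth].subclass = subclass;
	t->depth++;
	return true;
}

/* Drop the most recently recorded lock. */
void model_release(struct task_model *t)
{
	assert(t->depth > 0);
	t->depth--;
}
```

In this model, acquiring the same class twice with subclass 0 is
rejected, while the second acquisition with subclass 1 is accepted.
The annotation only tells the checker that the nesting is intended;
the real ordering guarantee still has to come from the code, here
from holding swap_info_struct->lock as the comment in the patch says.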

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: swap_cluster_info lockdep splat
  2017-02-16 19:34       ` Tim Chen
  (?)
@ 2017-02-17  1:46       ` Hugh Dickins
  2017-02-17  2:07           ` Huang, Ying
  2017-02-17  7:32           ` Huang, Ying
  -1 siblings, 2 replies; 23+ messages in thread
From: Hugh Dickins @ 2017-02-17  1:46 UTC (permalink / raw)
  To: Tim Chen
  Cc: Hugh Dickins, Huang, Ying, Minchan Kim, Andrew Morton,
	linux-kernel, linux-mm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1798 bytes --]

On Thu, 16 Feb 2017, Tim Chen wrote:
> 
> > I do not understand your zest for putting wrappers around every little
> > thing, making it all harder to follow than it need be.  Here's the patch
> > I've been running with (but you have a leak somewhere, and I don't have
> > time to search out and fix it: please try sustained swapping and swapoff).
> > 
> 
> Hugh, trying to duplicate your test case.  So you were doing swapping,
> then swap off, swap on the swap device and restart swapping?

Repeated pair of make -j20 kernel builds in 700M RAM, 1.5G swap on SSD,
8 cpus; one of the builds in tmpfs, other in ext4 on loop on tmpfs file;
sizes tuned for plenty of swapping but no OOMing (it's an ancient 2.6.24
kernel I build, modern one needing a lot more space with a lot less in use).

How much of that is relevant I don't know: hopefully none of it, it's
hard to get the tunings right from scratch.  To answer your specific
question: yes, I'm not doing concurrent swapoffs in this test showing
the leak, just waiting for each of the pair of builds to complete,
then tearing down the trees, doing swapoff followed by swapon, and
starting a new pair of builds.

Sometimes it's the swapoff that fails with ENOMEM, more often it's a
fork during build that fails with ENOMEM: after 6 or 7 hours of load
(but timings show it getting slower leading up to that).  /proc/meminfo
did not give me an immediate clue, Slab didn't look surprising but
I may not have studied close enough.

I quilt-bisected it as far as the mm-swap series, good before, bad
after, but didn't manage to narrow it down further because of hitting
a presumably different issue inside the series, where swapoff ENOMEMed
much sooner (after 25 mins one time, during first iteration the next).

Hugh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: swap_cluster_info lockdep splat
  2017-02-17  1:46       ` Hugh Dickins
@ 2017-02-17  2:07           ` Huang, Ying
  2017-02-17  7:32           ` Huang, Ying
  1 sibling, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-02-17  2:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tim Chen, Huang, Ying, Minchan Kim, Andrew Morton, linux-kernel,
	linux-mm

Hi, Hugh,

Hugh Dickins <hughd@google.com> writes:

> On Thu, 16 Feb 2017, Tim Chen wrote:
>> 
>> > I do not understand your zest for putting wrappers around every little
>> > thing, making it all harder to follow than it need be.  Here's the patch
>> > I've been running with (but you have a leak somewhere, and I don't have
>> > time to search out and fix it: please try sustained swapping and swapoff).
>> > 
>> 
>> Hugh, trying to duplicate your test case.  So you were doing swapping,
>> then swap off, swap on the swap device and restart swapping?
>
> Repeated pair of make -j20 kernel builds in 700M RAM, 1.5G swap on SSD,
> 8 cpus; one of the builds in tmpfs, other in ext4 on loop on tmpfs file;
> sizes tuned for plenty of swapping but no OOMing (it's an ancient 2.6.24
> kernel I build, modern one needing a lot more space with a lot less in use).
>
> How much of that is relevant I don't know: hopefully none of it, it's
> hard to get the tunings right from scratch.  To answer your specific
> question: yes, I'm not doing concurrent swapoffs in this test showing
> the leak, just waiting for each of the pair of builds to complete,
> then tearing down the trees, doing swapoff followed by swapon, and
> starting a new pair of builds.
>
> Sometimes it's the swapoff that fails with ENOMEM, more often it's a
> fork during build that fails with ENOMEM: after 6 or 7 hours of load
> (but timings show it getting slower leading up to that).  /proc/meminfo
> did not give me an immediate clue, Slab didn't look surprising but
> I may not have studied close enough.

Thanks for your information!

Memory newly allocated in the mm-swap series is allocated via vmalloc;
could you find anything unusual for vmalloc in /proc/meminfo?
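
For anyone repeating this check, a minimal sketch of pulling the
Vmalloc* fields out of /proc/meminfo follows.  The helper names
meminfo_field_kb() and print_vmalloc_fields() are made up for this
example, and it assumes the usual "Name:   value kB" line layout.
Some kernel versions may report VmallocUsed as 0, so
/proc/vmallocinfo can be worth a look as well.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one meminfo-style line such as "VmallocUsed:   12345 kB".
 * Returns 1 and stores the number in *kb when the line starts with
 * "<key>:", otherwise returns 0.
 */
int meminfo_field_kb(const char *line, const char *key, unsigned long *kb)
{
	size_t len = strlen(key);

	assert(kb != NULL);
	if (strncmp(line, key, len) != 0 || line[len] != ':')
		return 0;
	return sscanf(line + len + 1, "%lu", kb) == 1;
}

/*
 * Print every line beginning with "Vmalloc" from the given file
 * (normally /proc/meminfo).  Returns -1 if the file cannot be opened.
 */
int print_vmalloc_fields(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "Vmalloc", 7) == 0)
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}
```

Sampling print_vmalloc_fields("/proc/meminfo") before and after each
swapoff/swapon cycle would show whether the numbers keep growing.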

Best Regards,
Huang, Ying

> I quilt-bisected it as far as the mm-swap series, good before, bad
> after, but didn't manage to narrow it down further because of hitting
> a presumably different issue inside the series, where swapoff ENOMEMed
> much sooner (after 25 mins one time, during first iteration the next).
>
> Hugh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: swap_cluster_info lockdep splat
  2017-02-17  2:07           ` Huang, Ying
@ 2017-02-17  2:37             ` Huang, Ying
  -1 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-02-17  2:37 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Hugh Dickins, Tim Chen, Minchan Kim, Andrew Morton, linux-kernel,
	linux-mm

"Huang, Ying" <ying.huang@intel.com> writes:

> Hi, Hugh,
>
> Hugh Dickins <hughd@google.com> writes:
>
>> On Thu, 16 Feb 2017, Tim Chen wrote:
>>> 
>>> > I do not understand your zest for putting wrappers around every little
>>> > thing, making it all harder to follow than it need be.  Here's the patch
>>> > I've been running with (but you have a leak somewhere, and I don't have
>>> > time to search out and fix it: please try sustained swapping and swapoff).
>>> > 
>>> 
>>> Hugh, trying to duplicate your test case.  So you were doing swapping,
>>> then swap off, swap on the swap device and restart swapping?
>>
>> Repeated pair of make -j20 kernel builds in 700M RAM, 1.5G swap on SSD,
>> 8 cpus; one of the builds in tmpfs, other in ext4 on loop on tmpfs file;
>> sizes tuned for plenty of swapping but no OOMing (it's an ancient 2.6.24
>> kernel I build, modern one needing a lot more space with a lot less in use).
>>
>> How much of that is relevant I don't know: hopefully none of it, it's
>> hard to get the tunings right from scratch.  To answer your specific
>> question: yes, I'm not doing concurrent swapoffs in this test showing
>> the leak, just waiting for each of the pair of builds to complete,
>> then tearing down the trees, doing swapoff followed by swapon, and
>> starting a new pair of builds.
>>
>> Sometimes it's the swapoff that fails with ENOMEM, more often it's a
>> fork during build that fails with ENOMEM: after 6 or 7 hours of load
>> (but timings show it getting slower leading up to that).  /proc/meminfo
>> did not give me an immediate clue, Slab didn't look surprising but
>> I may not have studied close enough.
>
> Thanks for your information!
>
> Memory newly allocated in the mm-swap series is allocated via vmalloc;
> could you find anything unusual for vmalloc in /proc/meminfo?

I found a potential issue in the mm-swap series; could you try the
patch below?

Best Regards,
Huang, Ying

----------------------------------------------------->
From 943494339bd5bc321b8f36f286bc143ac437719b Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Fri, 17 Feb 2017 10:31:37 +0800
Subject: [PATCH] Debug memory leak

---
 mm/swap_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2126e9ba23b2..473b71e052a8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -333,7 +333,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * else swap_off will be aborted if we return NULL.
 		 */
 		if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
-			return NULL;
+			break;
 
 		/*
 		 * Get a new page to read into from swap.
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: swap_cluster_info lockdep splat
  2017-02-17  1:46       ` Hugh Dickins
@ 2017-02-17  7:32           ` Huang, Ying
  2017-02-17  7:32           ` Huang, Ying
  1 sibling, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-02-17  7:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tim Chen, Huang, Ying, Minchan Kim, Andrew Morton, linux-kernel,
	linux-mm

Hi, Hugh,

Hugh Dickins <hughd@google.com> writes:

> On Thu, 16 Feb 2017, Tim Chen wrote:
>> 
>> > I do not understand your zest for putting wrappers around every little
>> > thing, making it all harder to follow than it need be.  Here's the patch
>> > I've been running with (but you have a leak somewhere, and I don't have
>> > time to search out and fix it: please try sustained swapping and swapoff).
>> > 
>> 
>> Hugh, trying to duplicate your test case.  So you were doing swapping,
>> then swap off, swap on the swap device and restart swapping?
>
> Repeated pair of make -j20 kernel builds in 700M RAM, 1.5G swap on SSD,
> 8 cpus; one of the builds in tmpfs, other in ext4 on loop on tmpfs file;
> sizes tuned for plenty of swapping but no OOMing (it's an ancient 2.6.24
> kernel I build, modern one needing a lot more space with a lot less in use).
>
> How much of that is relevant I don't know: hopefully none of it, it's
> hard to get the tunings right from scratch.  To answer your specific
> question: yes, I'm not doing concurrent swapoffs in this test showing
> the leak, just waiting for each of the pair of builds to complete,
> then tearing down the trees, doing swapoff followed by swapon, and
> starting a new pair of builds.
>
> Sometimes it's the swapoff that fails with ENOMEM, more often it's a
> fork during build that fails with ENOMEM: after 6 or 7 hours of load
> (but timings show it getting slower leading up to that).  /proc/meminfo
> did not give me an immediate clue, Slab didn't look surprising but
> I may not have studied close enough.
>
> I quilt-bisected it as far as the mm-swap series, good before, bad
> after, but didn't manage to narrow it down further because of hitting
> a presumably different issue inside the series, where swapoff ENOMEMed
> much sooner (after 25 mins one time, during first iteration the next).

I found a memory leak in __read_swap_cache_async() introduced by the
mm-swap series, and confirmed it via testing.  Could you verify whether
the patch below fixes your case?  Thanks a lot for reporting.

Best Regards,
Huang, Ying

------------------------------------------------------------------------->
From 4b96423796ab7435104eb2cb4dcf5d525b9e0800 Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Fri, 17 Feb 2017 10:31:37 +0800
Subject: [PATCH] mm, swap: Fix memory leak in __read_swap_cache_async()

Memory may be leaked in __read_swap_cache_async(), for example in the
case below:

CPU 0						CPU 1
-----						-----

find_get_page() == NULL
__swp_swapcount() != 0
new_page = alloc_page_vma()
radix_tree_maybe_preload()
						swap in swap slot
swapcache_prepare() == -EEXIST
cond_resched()
						reclaim the swap slot
find_get_page() == NULL
__swp_swapcount() == 0
return NULL				<- new_page leaked here !!!

The memory leak has been confirmed by checking the value of new_page
when returning inside the loop in __read_swap_cache_async().

This is fixed by replacing the return with a break inside the loop in
__read_swap_cache_async(), so that new_page gets the chance to be
checked and freed.

Reported-by: Hugh Dickins <hughd@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swap_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2126e9ba23b2..473b71e052a8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -333,7 +333,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * else swap_off will be aborted if we return NULL.
 		 */
 		if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
-			return NULL;
+			break;
 
 		/*
 		 * Get a new page to read into from swap.
-- 
2.11.0
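
The return-versus-break pattern behind this fix can be shown in a
small standalone sketch.  It is not the kernel function: the buffer
pool, live_buffers, and lookup_with_retry() are hypothetical stand-ins,
and the race is simulated by a pass counter.  The point is only that
a return inside the retry loop skips the cleanup that sits after the
loop, while a break reaches it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Buffers handed out by buf_alloc() and not yet given back. */
int live_buffers;

static char pool[64];	/* single fake buffer; no real allocation */

void *buf_alloc(void)
{
	live_buffers++;
	return pool;
}

void buf_free(void *p)
{
	assert(p == pool);
	live_buffers--;
}

/*
 * Hypothetical stand-in for the retry loop.  Pass 0 allocates a buffer
 * and then "loses the race", so the loop retries.  Pass 1 finds that
 * the entry is gone.  With use_break == false the function returns
 * from inside the loop and the buffer is never freed; with
 * use_break == true control falls through to the cleanup.
 */
void *lookup_with_retry(bool use_break)
{
	void *found = NULL;
	void *new_buf = NULL;
	int pass = 0;

	do {
		bool entry_gone = (pass == 1);	/* simulated race outcome */

		if (entry_gone) {
			if (!use_break)
				return NULL;	/* buggy: skips the cleanup */
			break;			/* fixed: reaches the cleanup */
		}
		if (!new_buf)
			new_buf = buf_alloc();
		pass++;		/* pretend the insert hit -EEXIST; retry */
	} while (1);

	if (new_buf)
		buf_free(new_buf);
	return found;
}
```

Calling lookup_with_retry(true) leaves live_buffers at 0, whereas
lookup_with_retry(false) leaves one buffer outstanding, which is the
same shape as the new_page leak described in the changelog above.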

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: swap_cluster_info lockdep splat
  2017-02-17  7:32           ` Huang, Ying
@ 2017-02-17 18:42             ` Hugh Dickins
  -1 siblings, 0 replies; 23+ messages in thread
From: Hugh Dickins @ 2017-02-17 18:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Hugh Dickins, Tim Chen, Minchan Kim, Andrew Morton, linux-kernel,
	linux-mm

On Fri, 17 Feb 2017, Huang, Ying wrote:
> 
> I found a memory leak in __read_swap_cache_async() introduced by the
> mm-swap series, and confirmed it via testing.  Could you verify whether
> the patch below fixes your case?  Thanks a lot for reporting.

Well caught!  That indeed fixes the leak I've been seeing: my load has
now passed the 7 hour danger mark, with no indication of slowing down.
I'll keep it running until I need to try something else on that machine,
but all good for now.

You could add
Tested-by: Hugh Dickins <hughd@google.com>
but don't bother: I'm sure Andrew will simply fold this fix into the
fixed patch later on.

Thanks,
Hugh

> 
> Best Regards,
> Huang, Ying
> 
> ------------------------------------------------------------------------->
> From 4b96423796ab7435104eb2cb4dcf5d525b9e0800 Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Fri, 17 Feb 2017 10:31:37 +0800
> Subject: [PATCH] mm, swap: Fix memory leak in __read_swap_cache_async()
> 
> Memory may be leaked in __read_swap_cache_async(), for example in the
> case below:
> 
> CPU 0						CPU 1
> -----						-----
> 
> find_get_page() == NULL
> __swp_swapcount() != 0
> new_page = alloc_page_vma()
> radix_tree_maybe_preload()
> 						swap in swap slot
> swapcache_prepare() == -EEXIST
> cond_resched()
> 						reclaim the swap slot
> find_get_page() == NULL
> __swp_swapcount() == 0
> return NULL				<- new_page leaked here !!!
> 
> The memory leak has been confirmed by checking the value of new_page
> when returning inside the loop in __read_swap_cache_async().
> 
> This is fixed by replacing the return with a break inside the loop in
> __read_swap_cache_async(), so that new_page gets the chance to be
> checked and freed.
> 
> Reported-by: Hugh Dickins <hughd@google.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  mm/swap_state.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 2126e9ba23b2..473b71e052a8 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -333,7 +333,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 * else swap_off will be aborted if we return NULL.
>  		 */
>  		if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
> -			return NULL;
> +			break;
>  
>  		/*
>  		 * Get a new page to read into from swap.
> -- 
> 2.11.0
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-02-17 18:42 UTC | newest]

Thread overview: 23+ messages
2017-02-16  5:22 swap_cluster_info lockdep splat Minchan Kim
2017-02-16  5:22 ` Minchan Kim
2017-02-16  7:13 ` Huang, Ying
2017-02-16  7:13   ` Huang, Ying
2017-02-16  8:44 ` Huang, Ying
2017-02-16  8:44   ` Huang, Ying
2017-02-16 19:00   ` Hugh Dickins
2017-02-16 19:00     ` Hugh Dickins
2017-02-16 19:34     ` Tim Chen
2017-02-16 19:34       ` Tim Chen
2017-02-17  1:46       ` Hugh Dickins
2017-02-17  2:07         ` Huang, Ying
2017-02-17  2:07           ` Huang, Ying
2017-02-17  2:37           ` Huang, Ying
2017-02-17  2:37             ` Huang, Ying
2017-02-17  7:32         ` Huang, Ying
2017-02-17  7:32           ` Huang, Ying
2017-02-17 18:42           ` Hugh Dickins
2017-02-17 18:42             ` Hugh Dickins
2017-02-16 23:45     ` Minchan Kim
2017-02-16 23:45       ` Minchan Kim
2017-02-17  0:38     ` Huang, Ying
2017-02-17  0:38       ` Huang, Ying
