* [PATCH] mm: fix list corruptions on shmem shrinklist
@ 2017-08-03  5:46 ` Cong Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Cong Wang @ 2017-08-03  5:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Cong Wang, linux-mm, stable, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Kirill A . Shutemov

We saw many list corruption warnings on shmem shrinklist:

 [45480.300911] ------------[ cut here ]------------
 [45480.305558] WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
 [45480.313622] list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
 [45480.322435] Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
 [45480.357776] CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
 [45480.365679] Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
 [45480.373416]  ffffb13c03ccbaf8 ffffffff9e36bc87 ffffb13c03ccbb48 0000000000000000
 [45480.380940]  ffffb13c03ccbb38 ffffffff9e08511b 0000003b7fffc000 0000000000000002
 [45480.388392]  ffff9ae5699ba960 ffffb13c03ccbbe8 ffffb13c03ccbbf8 ffff9ae5694b82d8
 [45480.395893] Call Trace:
 [45480.398214]  [<ffffffff9e36bc87>] dump_stack+0x4d/0x66
 [45480.403481]  [<ffffffff9e08511b>] __warn+0xcb/0xf0
 [45480.408322]  [<ffffffff9e08518f>] warn_slowpath_fmt+0x4f/0x60
 [45480.414095]  [<ffffffff9e38a6fe>] __list_del_entry+0x9e/0xc0
 [45480.419831]  [<ffffffff9e1a33aa>] shmem_unused_huge_shrink+0xfa/0x2e0
 [45480.426269]  [<ffffffff9e1a35b0>] shmem_unused_huge_scan+0x20/0x30
 [45480.432382]  [<ffffffff9e20a0d3>] super_cache_scan+0x193/0x1a0
 [45480.438238]  [<ffffffff9e19a9c3>] shrink_slab.part.41+0x1e3/0x3f0
 [45480.444370]  [<ffffffff9e19abf9>] shrink_slab+0x29/0x30
 [45480.449610]  [<ffffffff9e19ed39>] shrink_node+0xf9/0x2f0
 [45480.454858]  [<ffffffff9e19fbd8>] kswapd+0x2d8/0x6c0
 [45480.459896]  [<ffffffff9e19f900>] ? mem_cgroup_shrink_node+0x140/0x140
 [45480.466337]  [<ffffffff9e0a3b87>] kthread+0xd7/0xf0
 [45480.471231]  [<ffffffff9e0b519e>] ? vtime_account_idle+0xe/0x50
 [45480.477282]  [<ffffffff9e0a3ab0>] ? kthread_park+0x60/0x60
 [45480.482820]  [<ffffffff9e6d4c52>] ret_from_fork+0x22/0x30
 [45480.488234] ---[ end trace 66841eda03a967a0 ]---
 [45480.492834] ------------[ cut here ]------------
 [45480.497432] WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
 [45480.505020] list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
 [45480.516716] Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
 [45480.551020] CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G        W       4.9.34-t3.el7.twitter.x86_64 #1
 [45480.560706] Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
 [45480.568299]  ffffb13c04913b30 ffffffff9e36bc87 ffffb13c04913b80 0000000000000000
 [45480.575628]  ffffb13c04913b70 ffffffff9e08511b 00000021699ba900 ffff9ae5694b82d8
 [45480.583080]  ffff9ae5694b82d8 ffff9ae5699ba960 ffff9ae5699ba900 0000000000000000
 [45480.590560] Call Trace:
 [45480.592937]  [<ffffffff9e36bc87>] dump_stack+0x4d/0x66
 [45480.598144]  [<ffffffff9e08511b>] __warn+0xcb/0xf0
 [45480.602978]  [<ffffffff9e08518f>] warn_slowpath_fmt+0x4f/0x60
 [45480.608718]  [<ffffffff9e38a639>] __list_add+0x89/0xb0
 [45480.613785]  [<ffffffff9e1a55d4>] shmem_setattr+0x204/0x230
 [45480.619340]  [<ffffffff9e2232ef>] notify_change+0x2ef/0x440
 [45480.624929]  [<ffffffff9e203bad>] do_truncate+0x5d/0x90
 [45480.630184]  [<ffffffff9e20393a>] ? do_dentry_open+0x27a/0x310
 [45480.635974]  [<ffffffff9e214101>] path_openat+0x331/0x1190
 [45480.641549]  [<ffffffff9e21680e>] do_filp_open+0x7e/0xe0
 [45480.646791]  [<ffffffff9e1bb3f4>] ? handle_mm_fault+0xa54/0x1340
 [45480.652888]  [<ffffffff9e1e6703>] ? kmem_cache_alloc+0xd3/0x1a0
 [45480.658778]  [<ffffffff9e215927>] ? getname_flags+0x37/0x190
 [45480.664527]  [<ffffffff9e22423f>] ? __alloc_fd+0x3f/0x170
 [45480.669918]  [<ffffffff9e205023>] do_sys_open+0x123/0x200
 [45480.675339]  [<ffffffff9e20511e>] SyS_open+0x1e/0x20
 [45480.680216]  [<ffffffff9e002aa1>] do_syscall_64+0x61/0x170
 [45480.685805]  [<ffffffff9e6d4ac6>] entry_SYSCALL64_slow_path+0x25/0x25
 [45480.692255] ---[ end trace 66841eda03a967a1 ]---
 [45480.696823] ------------[ cut here ]------------

The problem is that shmem_unused_huge_shrink() moves entries
from the global sbinfo->shrinklist to its local lists and then
releases the spinlock. However, a parallel shmem_setattr()
can still access one of these entries directly and, with the
spinlock held, add it back to the global shrinklist after it
has been removed.

The logic itself looks solid, since at any moment an entry is
either on a local list or on the global list, and it is removed
from whichever list holds it by list_del_init(). So the race
condition is probably that one CPU is in the middle of
INIT_LIST_HEAD() while the other CPU calls list_empty(), which
returns true too early, and the following list_add_tail() then
sees a corrupted entry.

list_empty_careful() is designed to fix this situation.

Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
Cc: linux-mm@kvack.org
Cc: stable@kernel.org
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 mm/shmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index b0aa6075d164..9c16b62ec6c9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1022,7 +1022,7 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 			 */
 			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
 				spin_lock(&sbinfo->shrinklist_lock);
-				if (list_empty(&info->shrinklist)) {
+				if (list_empty_careful(&info->shrinklist)) {
 					list_add_tail(&info->shrinklist,
 							&sbinfo->shrinklist);
 					sbinfo->shrinklist_len++;
@@ -1817,7 +1817,7 @@ alloc_nohuge:		page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
 			 * to shrink under memory pressure.
 			 */
 			spin_lock(&sbinfo->shrinklist_lock);
-			if (list_empty(&info->shrinklist)) {
+			if (list_empty_careful(&info->shrinklist)) {
 				list_add_tail(&info->shrinklist,
 						&sbinfo->shrinklist);
 				sbinfo->shrinklist_len++;
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm: fix list corruptions on shmem shrinklist
  2017-08-03  5:46 ` Cong Wang
@ 2017-08-03 23:11   ` Andrew Morton
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2017-08-03 23:11 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, linux-mm, stable, Hugh Dickins, Linus Torvalds,
	Kirill A . Shutemov

On Wed,  2 Aug 2017 22:46:30 -0700 Cong Wang <xiyou.wangcong@gmail.com> wrote:

> We saw many list corruption warnings on shmem shrinklist:
> 
> ...
> 
> The problem is that shmem_unused_huge_shrink() moves entries
> from the global sbinfo->shrinklist to its local lists and then
> releases the spinlock. However, a parallel shmem_setattr()
> can still access one of these entries directly and, with the
> spinlock held, add it back to the global shrinklist after it
> has been removed.
> 
> The logic itself looks solid, since at any moment an entry is
> either on a local list or on the global list, and it is removed
> from whichever list holds it by list_del_init(). So the race
> condition is probably that one CPU is in the middle of INIT_LIST_HEAD()

Where is this INIT_LIST_HEAD()?

> while the other CPU calls list_empty(), which returns true too
> early, and the following list_add_tail() then sees a corrupted
> entry.
> 
> list_empty_careful() is designed to fix this situation.
> 

I'm not sure I'm understanding this.  AFAICT all the list operations to
which you refer are synchronized under spin_lock(&sbinfo->shrinklist_lock)?

* Re: [PATCH] mm: fix list corruptions on shmem shrinklist
  2017-08-03 23:11   ` Andrew Morton
@ 2017-08-03 23:25     ` Linus Torvalds
  -1 siblings, 0 replies; 14+ messages in thread
From: Linus Torvalds @ 2017-08-03 23:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cong Wang, Linux Kernel Mailing List, linux-mm, # .39.x,
	Hugh Dickins, Kirill A . Shutemov

On Thu, Aug 3, 2017 at 4:11 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Where is this INIT_LIST_HEAD()?

I think it's this one:

        list_del_init(&info->shrinklist);

in shmem_unused_huge_shrink().

> I'm not sure I'm understanding this.  AFAICT all the list operations to
> which you refer are synchronized under spin_lock(&sbinfo->shrinklist_lock)?

No, notice how shmem_unused_huge_shrink() does the

        list_move(&info->shrinklist, &to_remove);

and

        list_move(&info->shrinklist, &list);

to move to (two different) private lists under the shrinklist_lock,
but once it is on that private "list/to_remove" list, it is then
accessed outside the locked region.

Honestly, I don't love this situation, or the patch, but I think the
patch is likely the right thing to do.

              Linus

* Re: [PATCH] mm: fix list corruptions on shmem shrinklist
  2017-08-03 23:25     ` Linus Torvalds
@ 2017-08-03 23:53       ` Andrew Morton
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2017-08-03 23:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Cong Wang, Linux Kernel Mailing List, linux-mm, # .39.x,
	Hugh Dickins, Kirill A . Shutemov

On Thu, 3 Aug 2017 16:25:46 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Aug 3, 2017 at 4:11 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Where is this INIT_LIST_HEAD()?
> 
> I think it's this one:
> 
>         list_del_init(&info->shrinklist);
> 
> in shmem_unused_huge_shrink().

OK.

> > I'm not sure I'm understanding this.  AFAICT all the list operations to
> > which you refer are synchronized under spin_lock(&sbinfo->shrinklist_lock)?
> 
> No, notice how shmem_unused_huge_shrink() does the
> 
>         list_move(&info->shrinklist, &to_remove);
> 
> and
> 
>         list_move(&info->shrinklist, &list);
> 
> to move to (two different) private lists under the shrinklist_lock,
> but once it is on that private "list/to_remove" list, it is then
> accessed outside the locked region.

So the code is using sbinfo->shrinklist_lock to protect
sbinfo->shrinklist AND to protect all the per-inode info->shrinklist's.
Except it didn't get the coverage complete.

Presumably it's too expensive to extend sbinfo->shrinklist_lock
coverage in shmem_unused_huge_shrink() (or is it?  - this is huge
pages).  An alternative would be to add a new
shmem_inode_info.shrinklist_lock whose mandate is to protect
shmem_inode_info.shrinklist.

> Honestly, I don't love this situation, or the patch, but I think the
> patch is likely the right thing to do.

Well, we could view the premature droppage of sbinfo->shrinklist_lock
in shmem_unused_huge_shrink() to be a performance optimization and put
some big fat comments in there explaining what's going on.  But it's
tricky and it's not known that such an optimization is warranted.

* Re: [PATCH] mm: fix list corruptions on shmem shrinklist
  2017-08-03 23:53       ` Andrew Morton
@ 2017-08-04 14:42         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2017-08-04 14:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Cong Wang, Linux Kernel Mailing List, linux-mm,
	# .39.x, Hugh Dickins, Kirill A . Shutemov

On Thu, Aug 03, 2017 at 04:53:07PM -0700, Andrew Morton wrote:
> On Thu, 3 Aug 2017 16:25:46 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Thu, Aug 3, 2017 at 4:11 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > Where is this INIT_LIST_HEAD()?
> > 
> > I think it's this one:
> > 
> >         list_del_init(&info->shrinklist);
> > 
> > in shmem_unused_huge_shrink().
> 
> OK.
> 
> > > I'm not sure I'm understanding this.  AFAICT all the list operations to
> > > which you refer are synchronized under spin_lock(&sbinfo->shrinklist_lock)?
> > 
> > No, notice how shmem_unused_huge_shrink() does the
> > 
> >         list_move(&info->shrinklist, &to_remove);
> > 
> > and
> > 
> >         list_move(&info->shrinklist, &list);
> > 
> > to move to (two different) private lists under the shrinklist_lock,
> > but once it is on that private "list/to_remove" list, it is then
> > accessed outside the locked region.
> 
> So the code is using sbinfo->shrinklist_lock to protect
> sbinfo->shrinklist AND to protect all the per-inode info->shrinklist's.
> Except it didn't get the coverage complete.
> 
> Presumably it's too expensive to extend sbinfo->shrinklist_lock
> coverage in shmem_unused_huge_shrink() (or is it?  - this is huge
> pages).  An alternative would be to add a new
> shmem_inode_info.shrinklist_lock whose mandate is to protect
> shmem_inode_info.shrinklist.
> 
> > Honestly, I don't love this situation, or the patch, but I think the
> > patch is likely the right thing to do.
> 
> Well, we could view the premature droppage of sbinfo->shrinklist_lock
> in shmem_unused_huge_shrink() to be a performance optimization and put
> some big fat comments in there explaining what's going on.  But it's
> tricky and it's not known that such an optimization is warranted.

The reason we need to drop shrinklist_lock is that we need to perform
sleeping operations on both of the lists:

 - 'to_remove' was added to get iput() running outside the spinlock.
   See 253fd0f02040 ("shmem: fix sleeping from atomic context")

 - when handling 'list' we need to take lock_page() and also call iput().

The fix looks fine to me.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

* Re: [PATCH] mm: fix list corruptions on shmem shrinklist
  2017-08-03 23:53       ` Andrew Morton
@ 2017-08-04 17:58         ` Cong Wang
  -1 siblings, 0 replies; 14+ messages in thread
From: Cong Wang @ 2017-08-04 17:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm, # .39.x,
	Hugh Dickins, Kirill A . Shutemov

On Thu, Aug 3, 2017 at 4:53 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 3 Aug 2017 16:25:46 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Thu, Aug 3, 2017 at 4:11 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> >
>> > Where is this INIT_LIST_HEAD()?
>>
>> I think it's this one:
>>
>>         list_del_init(&info->shrinklist);
>>
>> in shmem_unused_huge_shrink().
>
> OK.
>
>> > I'm not sure I'm understanding this.  AFAICT all the list operations to
>> > which you refer are synchronized under spin_lock(&sbinfo->shrinklist_lock)?
>>
>> No, notice how shmem_unused_huge_shrink() does the
>>
>>         list_move(&info->shrinklist, &to_remove);
>>
>> and
>>
>>         list_move(&info->shrinklist, &list);
>>
>> to move to (two different) private lists under the shrinklist_lock,
>> but once it is on that private "list/to_remove" list, it is then
>> accessed outside the locked region.
>
> So the code is using sbinfo->shrinklist_lock to protect
> sbinfo->shrinklist AND to protect all the per-inode info->shrinklist's.
> Except it didn't get the coverage complete.


Normally, once we move list entries from a global list to a private
one, they are no longer visible to others; in this case, however,
they can still be accessed via the setattr() path.

>
> Presumably it's too expensive to extend sbinfo->shrinklist_lock
> coverage in shmem_unused_huge_shrink() (or is it?  - this is huge
> pages).  An alternative would be to add a new
> shmem_inode_info.shrinklist_lock whose mandate is to protect
> shmem_inode_info.shrinklist.

Both find_lock_page() and iput() can sleep; I think this is why
we have to defer these two calls until after releasing the spinlock.
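Why the two calls must run outside the lock can be sketched in userspace C
(a pthread mutex standing in for the spinlock, and a hypothetical
might_sleep_op() standing in for find_lock_page()/iput() -- none of this is
the real shmem code): entries are collected onto a private list while the
lock is held, and the possibly-sleeping work runs only after it is dropped.

```c
#include <assert.h>
#include <pthread.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void __list_del_entry(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

static void list_move(struct list_head *e, struct list_head *head)
{
    __list_del_entry(e);
    list_add(e, head);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

static struct list_head shrinklist = { &shrinklist, &shrinklist };
static pthread_mutex_t shrinklist_lock = PTHREAD_MUTEX_INITIALIZER;
static int processed;

/* Stand-in for find_lock_page()/iput(): may sleep, so it must not run
 * with the (spin)lock held. */
static void might_sleep_op(struct list_head *e) { (void)e; processed++; }

static int shrink(void)
{
    struct list_head priv = { &priv, &priv };

    pthread_mutex_lock(&shrinklist_lock);
    while (!list_empty(&shrinklist))           /* traverse under the lock */
        list_move(shrinklist.next, &priv);
    pthread_mutex_unlock(&shrinklist_lock);

    while (!list_empty(&priv)) {               /* lock dropped: safe to sleep */
        struct list_head *e = priv.next;
        __list_del_entry(e);
        INIT_LIST_HEAD(e);                     /* i.e. list_del_init() */
        might_sleep_op(e);
    }
    return processed;
}

static int demo_defer(void)
{
    struct list_head e1, e2, e3;

    pthread_mutex_lock(&shrinklist_lock);
    list_add(&e1, &shrinklist);
    list_add(&e2, &shrinklist);
    list_add(&e3, &shrinklist);
    pthread_mutex_unlock(&shrinklist_lock);

    return shrink();   /* all three processed with the lock dropped */
}
```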

>
>> Honestly, I don't love this situation, or the patch, but I think the
>> patch is likely the right thing to do.
>
> Well, we could view the premature droppage of sbinfo->shrinklist_lock
> in shmem_unused_huge_shrink() to be a performance optimization and put
> some big fat comments in there explaining what's going on.  But it's
> tricky and it's not known that such an optimization is warranted.

It is not a performance optimization, because we still traverse
the list with the spinlock held. A typical optimization is to
list_splice() the entries onto a private list under the spinlock and
then traverse them without it; this pattern is used in a few places
in the networking subsystem.
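The list_splice() pattern being contrasted here can be sketched as follows
(a toy userspace model, with the locking shown only as comments; the
helpers are illustrative, not the kernel's list.h): the whole list is
detached in O(1) while the lock is held, and the traversal itself happens
with the lock dropped.

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void __list_del_entry(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

static int list_empty(const struct list_head *h) { return h->next == h; }

/* O(1) transfer of a whole list, like the kernel's list_splice_init(). */
static void list_splice_init(struct list_head *src, struct list_head *dst)
{
    if (!list_empty(src)) {
        struct list_head *first = src->next, *last = src->prev;

        first->prev = dst;
        last->next = dst->next;
        dst->next->prev = last;
        dst->next = first;
        INIT_LIST_HEAD(src);
    }
}

static int demo_splice(void)
{
    struct list_head global, priv, e1, e2;
    int n = 0;

    INIT_LIST_HEAD(&global);
    INIT_LIST_HEAD(&priv);
    list_add(&e1, &global);
    list_add(&e2, &global);

    /* lock(); */
    list_splice_init(&global, &priv);  /* O(1) while the lock is held */
    /* unlock(); */

    while (!list_empty(&priv)) {       /* traversal needs no lock, provided
                                        * nothing else can reach the entries
                                        * -- the very property the shmem
                                        * setattr() path violated */
        __list_del_entry(priv.next);
        n++;
    }
    return n;
}
```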

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread


* Re: [PATCH] mm: fix list corruptions on shmem shrinklist
  2017-08-03 23:25     ` Linus Torvalds
@ 2017-08-04 18:06       ` Cong Wang
  -1 siblings, 0 replies; 14+ messages in thread
From: Cong Wang @ 2017-08-04 18:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, # .39.x,
	Hugh Dickins, Kirill A . Shutemov

On Thu, Aug 3, 2017 at 4:25 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Aug 3, 2017 at 4:11 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> Where is this INIT_LIST_HEAD()?
>
> I think it's this one:
>
>         list_del_init(&info->shrinklist);
>
> in shmem_unused_huge_shrink().


Yes, this is correct. Sorry about the confusion.
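For readers following along: the list_del_init() in question differs from
a plain list_del() in a way that matters here. A toy userspace sketch of
the kernel helpers (illustrative only, not the list.h implementation):

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void __list_del_entry(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Unlike plain list_del(), this leaves the entry self-linked, so it is
 * recognisably "not on any list" and safe to re-add later. */
static void list_del_init(struct list_head *e)
{
    __list_del_entry(e);
    INIT_LIST_HEAD(e);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

static int demo_del_init(void)
{
    struct list_head head, entry;

    INIT_LIST_HEAD(&head);
    INIT_LIST_HEAD(&entry);

    list_add(&entry, &head);
    list_del_init(&entry);

    /* After list_del_init() the entry points at itself, so callers can
     * test list_empty(&entry) and re-add it without corruption; a plain
     * list_del() would leave stale (in the kernel: poisoned) pointers. */
    return list_empty(&entry) && list_empty(&head);
}
```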


>
>> I'm not sure I'm understanding this.  AFAICT all the list operations to
>> which you refer are synchronized under spin_lock(&sbinfo->shrinklist_lock)?
>
> No, notice how shmem_unused_huge_shrink() does the
>
>         list_move(&info->shrinklist, &to_remove);
>
> and
>
>         list_move(&info->shrinklist, &list);
>
> to move to (two different) private lists under the shrinklist_lock,
> but once it is on that private "list/to_remove" list, it is then
> accessed outside the locked region.
>
> Honestly, I don't love this situation, or the patch, but I think the
> patch is likely the right thing to do.
>

Me neither. This is probably the quickest fix we could have;
other possible changes would need much more work.

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread


end of thread, other threads:[~2017-08-04 18:06 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-03  5:46 [PATCH] mm: fix list corruptions on shmem shrinklist Cong Wang
2017-08-03  5:46 ` Cong Wang
2017-08-03 23:11 ` Andrew Morton
2017-08-03 23:11   ` Andrew Morton
2017-08-03 23:25   ` Linus Torvalds
2017-08-03 23:25     ` Linus Torvalds
2017-08-03 23:53     ` Andrew Morton
2017-08-03 23:53       ` Andrew Morton
2017-08-04 14:42       ` Kirill A. Shutemov
2017-08-04 14:42         ` Kirill A. Shutemov
2017-08-04 17:58       ` Cong Wang
2017-08-04 17:58         ` Cong Wang
2017-08-04 18:06     ` Cong Wang
2017-08-04 18:06       ` Cong Wang
