* mm: shm: hang in shmem_fallocate
From: Sasha Levin @ 2013-12-16  4:01 UTC
  To: Hugh Dickins; +Cc: Andrew Morton, linux-mm, LKML

Hi all,

While fuzzing with trinity inside a KVM tools guest running latest -next,
I've noticed that a hang quite often occurs inside shmem_fallocate. Several
processes get stuck trying to acquire inode->i_mutex (for more than 2
minutes), while the process that holds it has the following stack trace:

[ 2059.561282] Call Trace:
[ 2059.561557]  [<ffffffff81175588>] ? sched_clock_cpu+0x108/0x120
[ 2059.562444]  [<ffffffff8118e1fa>] ? get_lock_stats+0x2a/0x60
[ 2059.563247]  [<ffffffff8118e23e>] ? put_lock_stats+0xe/0x30
[ 2059.563930]  [<ffffffff8118e1fa>] ? get_lock_stats+0x2a/0x60
[ 2059.564646]  [<ffffffff810adc13>] ? x2apic_send_IPI_mask+0x13/0x20
[ 2059.565431]  [<ffffffff811b3224>] ? __rcu_read_unlock+0x44/0xb0
[ 2059.566161]  [<ffffffff811d48d5>] ? generic_exec_single+0x55/0x80
[ 2059.566992]  [<ffffffff8128bd45>] ? page_remove_rmap+0x295/0x320
[ 2059.567782]  [<ffffffff843afe8c>] ? _raw_spin_lock+0x6c/0x80
[ 2059.568390]  [<ffffffff8127c6cc>] ? zap_pte_range+0xec/0x590
[ 2059.569157]  [<ffffffff8127c8d0>] ? zap_pte_range+0x2f0/0x590
[ 2059.569907]  [<ffffffff810c6560>] ? flush_tlb_mm_range+0x360/0x360
[ 2059.570855]  [<ffffffff843aa863>] ? preempt_schedule+0x53/0x80
[ 2059.571613]  [<ffffffff81077086>] ? ___preempt_schedule+0x56/0xb0
[ 2059.572526]  [<ffffffff810c6536>] ? flush_tlb_mm_range+0x336/0x360
[ 2059.573368]  [<ffffffff8127a6eb>] ? tlb_flush_mmu+0x3b/0x90
[ 2059.574152]  [<ffffffff8127a754>] ? tlb_finish_mmu+0x14/0x40
[ 2059.574951]  [<ffffffff8127d276>] ? zap_page_range_single+0x146/0x160
[ 2059.575797]  [<ffffffff81193768>] ? trace_hardirqs_on+0x8/0x10
[ 2059.576629]  [<ffffffff8127d303>] ? unmap_mapping_range+0x73/0x180
[ 2059.577362]  [<ffffffff8127d38e>] ? unmap_mapping_range+0xfe/0x180
[ 2059.578194]  [<ffffffff8125eeb7>] ? truncate_inode_page+0x37/0x90
[ 2059.579013]  [<ffffffff8126bc61>] ? shmem_undo_range+0x711/0x830
[ 2059.579807]  [<ffffffff8127d3f8>] ? unmap_mapping_range+0x168/0x180
[ 2059.580729]  [<ffffffff8126bd98>] ? shmem_truncate_range+0x18/0x40
[ 2059.581598]  [<ffffffff8126c0a9>] ? shmem_fallocate+0x99/0x2f0
[ 2059.582325]  [<ffffffff81278eae>] ? madvise_vma+0xde/0x1c0
[ 2059.583049]  [<ffffffff8119555a>] ? __lock_release+0x1da/0x1f0
[ 2059.583816]  [<ffffffff812d0cb6>] ? do_fallocate+0x126/0x170
[ 2059.584581]  [<ffffffff81278ec4>] ? madvise_vma+0xf4/0x1c0
[ 2059.585302]  [<ffffffff81279118>] ? SyS_madvise+0x188/0x250
[ 2059.586012]  [<ffffffff843ba5d0>] ? tracesys+0xdd/0xe2
[ 2059.586689]  ffff880f39bc3db8 0000000000000002 ffff880fce4b0000 ffff880fce4b0000
[ 2059.587768]  ffff880f39bc2010 00000000001d78c0 00000000001d78c0 00000000001d78c0
[ 2059.588840]  ffff880fce6a0000 ffff880fce4b0000 ffff880fe5bd6d40 ffff880fa88e8ab0


Thanks,
Sasha


* Re: mm: shm: hang in shmem_fallocate
From: Sasha Levin @ 2014-02-08 19:46 UTC
  To: Hugh Dickins; +Cc: Andrew Morton, linux-mm, LKML

On 12/15/2013 11:01 PM, Sasha Levin wrote:
> Hi all,
>
> While fuzzing with trinity inside a KVM tools guest running latest -next, I've noticed that
> quite often there's a hang happening inside shmem_fallocate. There are several processes stuck
> trying to acquire inode->i_mutex (for more than 2 minutes), while the process that holds it has
> the following stack trace:

[snip]

This still happens. For the record, here's a better trace:

[  507.124903] CPU: 60 PID: 10864 Comm: trinity-c173 Tainted: G        W    3.14.0-rc1-next-20140207-sasha-00007-g03959f6-dirty #2
[  507.124903] task: ffff8801f1e38000 ti: ffff8801f1e40000 task.ti: ffff8801f1e40000
[  507.124903] RIP: 0010:[<ffffffff81ae924f>]  [<ffffffff81ae924f>] __delay+0xf/0x20
[  507.124903] RSP: 0000:ffff8801f1e418a8  EFLAGS: 00000202
[  507.124903] RAX: 0000000000000001 RBX: ffff880524cf9f40 RCX: 00000000e9adc2c3
[  507.124903] RDX: 000000000000010f RSI: ffffffff8129813c RDI: 00000000ffffffff
[  507.124903] RBP: ffff8801f1e418a8 R08: 0000000000000000 R09: 0000000000000000
[  507.124903] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000affe0
[  507.124903] R13: 0000000086c42710 R14: ffff8801f1e41998 R15: ffff8801f1e41ac8
[  507.124903] FS:  00007ff708073700(0000) GS:ffff88052b400000(0000) knlGS:0000000000000000
[  507.124903] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  507.124903] CR2: 000000000089d010 CR3: 00000001f1e2c000 CR4: 00000000000006e0
[  507.124903] DR0: 0000000000696000 DR1: 0000000000000000 DR2: 0000000000000000
[  507.124903] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[  507.124903] Stack:
[  507.124903]  ffff8801f1e418d8 ffffffff811af053 ffff880524cf9f40 ffff880524cf9f40
[  507.124903]  ffff880524cf9f58 ffff8807275b1000 ffff8801f1e41908 ffffffff84447580
[  507.124903]  ffffffff8129813c ffffffff811ea882 00003ffffffff000 00007ff705eb2000
[  507.124903] Call Trace:
[  507.124903]  [<ffffffff811af053>] do_raw_spin_lock+0xe3/0x170
[  507.124903]  [<ffffffff84447580>] _raw_spin_lock+0x60/0x80
[  507.124903]  [<ffffffff8129813c>] ? zap_pte_range+0xec/0x580
[  507.124903]  [<ffffffff811ea882>] ? smp_call_function_single+0x242/0x270
[  507.124903]  [<ffffffff8129813c>] zap_pte_range+0xec/0x580
[  507.124903]  [<ffffffff810ca710>] ? flush_tlb_mm_range+0x280/0x280
[  507.124903]  [<ffffffff81adbd67>] ? cpumask_next_and+0xa7/0xd0
[  507.124903]  [<ffffffff810ca710>] ? flush_tlb_mm_range+0x280/0x280
[  507.124903]  [<ffffffff812989ce>] unmap_page_range+0x3fe/0x410
[  507.124903]  [<ffffffff81298ae1>] unmap_single_vma+0x101/0x120
[  507.124903]  [<ffffffff81298cb9>] zap_page_range_single+0x119/0x160
[  507.124903]  [<ffffffff811a87b8>] ? trace_hardirqs_on+0x8/0x10
[  507.124903]  [<ffffffff812ddb8a>] ? memcg_check_events+0x7a/0x170
[  507.124903]  [<ffffffff81298d73>] ? unmap_mapping_range+0x73/0x180
[  507.124903]  [<ffffffff81298dfe>] unmap_mapping_range+0xfe/0x180
[  507.124903]  [<ffffffff812790c7>] truncate_inode_page+0x37/0x90
[  507.124903]  [<ffffffff812861aa>] shmem_undo_range+0x6aa/0x770
[  507.124903]  [<ffffffff81298e68>] ? unmap_mapping_range+0x168/0x180
[  507.124903]  [<ffffffff81286288>] shmem_truncate_range+0x18/0x40
[  507.124903]  [<ffffffff81286599>] shmem_fallocate+0x99/0x2f0
[  507.124903]  [<ffffffff8129487e>] ? madvise_vma+0xde/0x1c0
[  507.124903]  [<ffffffff811aa5d2>] ? __lock_release+0x1e2/0x200
[  507.124903]  [<ffffffff812ee006>] do_fallocate+0x126/0x170
[  507.124903]  [<ffffffff81294894>] madvise_vma+0xf4/0x1c0
[  507.124903]  [<ffffffff81294ae8>] SyS_madvise+0x188/0x250
[  507.124903]  [<ffffffff84452450>] tracesys+0xdd/0xe2
[  507.124903] Code: 66 66 66 66 90 48 c7 05 a4 66 04 05 e0 92 ae 81 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 ff 15 89 66 04 05 <c9> c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 8d 04

I'm still trying to figure it out. It seems that a series of calls to
shmem_truncate_range() takes long enough in aggregate that one of the
waiting tasks triggers the hung-task detector; we don't actually spend
too long in any single shmem_truncate_range() call, though.


Thanks,
Sasha


* Re: mm: shm: hang in shmem_fallocate
From: Hugh Dickins @ 2014-02-09  3:25 UTC
  To: Sasha Levin; +Cc: Hugh Dickins, Andrew Morton, linux-mm, linux-fsdevel, LKML

On Sat, 8 Feb 2014, Sasha Levin wrote:
> On 12/15/2013 11:01 PM, Sasha Levin wrote:
> > Hi all,
> > 
> > While fuzzing with trinity inside a KVM tools guest running latest -next,
> > I've noticed that a hang quite often occurs inside shmem_fallocate.
> > Several processes get stuck trying to acquire inode->i_mutex (for more
> > than 2 minutes), while the process that holds it has the following
> > stack trace:
> 
> [snip]
> 
> This still happens. For the record, here's a better trace:

Thanks for the reminder, and for the better trace: I don't find those
traces where _every_ line is a "? " very useful (and whenever I puzzle
over one of those, I wonder if it's inevitable, or something we got
just slightly wrong in working out the frames... another time).

> 
> [snip]
> 
> I'm still trying to figure it out. It seems that a series of calls to
> shmem_truncate_range() takes long enough in aggregate that one of the
> waiting tasks triggers the hung-task detector; we don't actually spend
> too long in any single shmem_truncate_range() call, though.

Okay, we're doing a FALLOC_FL_PUNCH_HOLE on tmpfs (via MADV_REMOVE).
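
For reference, the user-visible operation amounts to something like the
sketch below (illustrative only: error handling omitted, the shm name is
made up, and older glibc needs -lrt).

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/falloc.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* shm_open() gives us a tmpfs-backed file */
		int fd = shm_open("/punch-demo", O_CREAT | O_RDWR, 0600);
		ftruncate(fd, 16 << 20);

		char *map = mmap(NULL, 16 << 20, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);

		/*
		 * MADV_REMOVE on a shared tmpfs mapping ends up in
		 * shmem_fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE),
		 * the very path shown in the trace above.
		 */
		madvise(map, 16 << 20, MADV_REMOVE);

		/* the direct route to the same path */
		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  0, 16 << 20);
		return 0;
	}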

This trace shows clearly that unmap_mapping_range is being called from
truncate_inode_page: that's supposed to be a rare, inefficient fallback,
since normally all the mapped pages have already been unmapped by the
prior call to unmap_mapping_range in shmem_fallocate itself.

Now it's conceivable that there's some kind of off-by-one wrap-around
case which doesn't behave as intended, but I was fairly careful there:
you have to be because the different functions involved have different
calling conventions and needs.  It would be interesting to know the
arguments to madvise() and to shmem_fallocate() to rule that out,
but my guess is that's not the problem.

Would trinity be likely to have a thread or process repeatedly faulting
in pages from the hole while it is being punched?

That's what it looks like to me, but I'm not sure what to do about it,
if anything.  It's a pity that shmem_fallocate is holding i_mutex over
this holepunch that never completes, and that locks out others wanting
i_mutex; but whether it's a serious matter in the scale of denials of
service, I'm not so sure.
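
If so, a sketch along these lines ought to show it (my guess at what the
fuzzer effectively does, not trinity's actual code; names made up, build
with -pthread):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define SIZE (64 << 20)
	static char *map;

	static void *faulter(void *arg)
	{
		(void)arg;
		for (;;)	/* keep faulting pages back into the hole */
			for (size_t i = 0; i < SIZE; i += 4096)
				map[i] = 1;
	}

	int main(void)
	{
		int fd = shm_open("/punch-vs-fault", O_CREAT | O_RDWR, 0600);
		ftruncate(fd, SIZE);
		map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);

		pthread_t t;
		pthread_create(&t, NULL, faulter, NULL);

		/*
		 * Each punch holds i_mutex until shmem_truncate_range()
		 * has seen the whole range empty; the faulter keeps
		 * repopulating it, so that may never happen.
		 */
		for (;;)
			madvise(map, SIZE, MADV_REMOVE);
	}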

Note that straight truncation does not suffer from the same problem,
because there the faulters get a SIGBUS when they try to access beyond
the end of file, and need the i_mutex to re-extend the file.

Does this happen with other holepunch filesystems?  If it does not,
I'd suppose it's because the tmpfs fault-in-newly-created-page path
is lighter than a consistent disk-based filesystem's has to be.
But we don't want to make the tmpfs path heavier to match them.

My old commit, d0823576bf4b "mm: pincer in truncate_inode_pages_range",
subsequently copied into shmem_undo_range, can be blamed.  It seemed a
nice idea at the time, to guarantee an instant during the holepunch when
the entire hole is empty, whatever userspace does afterwards; but perhaps
we should revert to sweeping out the pages without looking back.
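
For reference, that pincer goes roughly like this (pseudocode paraphrase
from memory, not the exact kernel code):

	index = start;
	for (;;) {
		look up the next batch of pages in [index, end);
		if (none found) {
			if (index == start)
				break;		/* hole seen empty: done */
			index = start;		/* repopulated: start over */
			continue;
		}
		truncate each page found;	/* unmapping it if mapped */
		index = one past the last page truncated;
	}

A faulter that repopulates the range faster than we sweep it keeps the
loop restarting from the top, and the holepunch never completes.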

I don't want to make that change (and I don't want to make it in
shmem_undo_range without doing the same in truncate_inode_pages_range),
but it might be the right thing to do: linux-fsdevel Cc'ed for views.

Hugh


* Re: mm: shm: hang in shmem_fallocate
From: Sasha Levin @ 2014-02-10  1:41 UTC
  To: Hugh Dickins, Dave Jones; +Cc: Andrew Morton, linux-mm, linux-fsdevel, LKML

On 02/08/2014 10:25 PM, Hugh Dickins wrote:
 > Would trinity be likely to have a thread or process repeatedly faulting
 > in pages from the hole while it is being punched?

I can see how trinity would do that, but just to be certain - Cc davej.

On 02/08/2014 10:25 PM, Hugh Dickins wrote:
 > Does this happen with other holepunch filesystems?  If it does not,
 > I'd suppose it's because the tmpfs fault-in-newly-created-page path
 > is lighter than a consistent disk-based filesystem's has to be.
 > But we don't want to make the tmpfs path heavier to match them.

No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
punching in other filesystems and I make sure to get a bunch of those
mounted before starting testing.


Thanks,
Sasha


* Re: mm: shm: hang in shmem_fallocate
From: Sasha Levin @ 2014-06-12 20:38 UTC
  To: Hugh Dickins, Dave Jones; +Cc: Andrew Morton, linux-mm, linux-fsdevel, LKML

On 02/09/2014 08:41 PM, Sasha Levin wrote:
> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>> Would trinity be likely to have a thread or process repeatedly faulting
>> in pages from the hole while it is being punched?
> 
> I can see how trinity would do that, but just to be certain - Cc davej.
> 
> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>> Does this happen with other holepunch filesystems?  If it does not,
>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>> is lighter than a consistent disk-based filesystem's has to be.
>> But we don't want to make the tmpfs path heavier to match them.
> 
> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
> punching in other filesystems and I make sure to get a bunch of those
> mounted before starting testing.

Just pinging this one again. I still see hangs in -next where the hang
location looks the same as before:


[ 3602.443529] CPU: 6 PID: 1153 Comm: trinity-c35 Not tainted 3.15.0-next-20140612-sasha-00022-g5e4db85-dirty #645
[ 3602.443529] task: ffff8801b45eb000 ti: ffff8801a0b90000 task.ti: ffff8801a0b90000
[ 3602.443529] RIP: vtime_account_system (include/linux/seqlock.h:229 include/linux/seqlock.h:234 include/linux/seqlock.h:301 kernel/sched/cputime.c:664)
[ 3602.443529] RSP: 0018:ffff8801b4e03ef8  EFLAGS: 00000046
[ 3602.443529] RAX: ffffffffb31a83b8 RBX: ffff8801b45eb000 RCX: 0000000000000001
[ 3602.443529] RDX: ffffffffb31a80bb RSI: ffffffffb7915a75 RDI: 0000000000000082
[ 3602.443529] RBP: ffff8801b4e03f28 R08: 0000000000000001 R09: 0000000000000000
[ 3602.443529] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801b45eb968
[ 3602.443529] R13: ffff8801b45eb938 R14: 0000000000000282 R15: ffff8801b45ebda0
[ 3602.443529] FS:  00007f93ac8ec700(0000) GS:ffff8801b4e00000(0000) knlGS:0000000000000000
[ 3602.443529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3602.443529] CR2: 00007f93a8854c9f CR3: 000000018a189000 CR4: 00000000000006a0
[ 3602.443529] Stack:
[ 3602.443529]  ffff8801b45eb000 00000000001d7800 ffffffffb32bd749 ffff8801b45eb000
[ 3602.443529]  00000000001d7800 ffffffffb32bd749 ffff8801b4e03f48 ffffffffb31a83b8
[ 3602.443529]  ffff8801b4e03f48 ffff8801b45eb000 ffff8801b4e03f68 ffffffffb31666a0
[ 3602.443529] Call Trace:
[ 3602.443529]  <IRQ>
[ 3602.443529] vtime_common_account_irq_enter (kernel/sched/cputime.c:430)
[ 3602.443529] irq_enter (include/linux/vtime.h:63 include/linux/vtime.h:115 kernel/softirq.c:334)
[ 3602.443529] scheduler_ipi (kernel/sched/core.c:1589 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/tick.h:168 include/linux/tick.h:199 kernel/sched/core.c:1590)
[ 3602.443529] smp_reschedule_interrupt (arch/x86/kernel/smp.c:266)
[ 3602.443529] reschedule_interrupt (arch/x86/kernel/entry_64.S:1046)
[ 3602.443529]  <EOI>
[ 3602.443529] _raw_spin_unlock (include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:183)
[ 3602.443529] zap_pte_range (mm/memory.c:1218)
[ 3602.443529] unmap_single_vma (mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1302 mm/memory.c:1348)
[ 3602.443529] zap_page_range_single (include/linux/mmu_notifier.h:234 mm/memory.c:1429)
[ 3602.443529] unmap_mapping_range (mm/memory.c:2316 mm/memory.c:2392)
[ 3602.443529] truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
[ 3602.443529] shmem_undo_range (mm/shmem.c:429)
[ 3602.443529] shmem_truncate_range (mm/shmem.c:527)
[ 3602.443529] shmem_fallocate (mm/shmem.c:1740)
[ 3602.443529] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
[ 3602.443529] SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[ 3602.443529] tracesys (arch/x86/kernel/entry_64.S:542)
[ 3602.443529] Code: 09 00 00 48 89 5d e8 48 89 fb 4c 89 e7 4c 89 6d f8 e8 25 69 3b 03 83 83 30 09 00 00 01 48 8b 45 08 4c 8d ab 38 09 00 00 45 31 c9 <41> b8 01 00 00 00 31 c9 31 d2 31 f6 4c 89 ef 48 89 04 24 e8 e8
All code
========
   0:   09 00                   or     %eax,(%rax)
   2:   00 48 89                add    %cl,-0x77(%rax)
   5:   5d                      pop    %rbp
   6:   e8 48 89 fb 4c          callq  0x4cfb8953
   b:   89 e7                   mov    %esp,%edi
   d:   4c 89 6d f8             mov    %r13,-0x8(%rbp)
  11:   e8 25 69 3b 03          callq  0x33b693b
  16:   83 83 30 09 00 00 01    addl   $0x1,0x930(%rbx)
  1d:   48 8b 45 08             mov    0x8(%rbp),%rax
  21:   4c 8d ab 38 09 00 00    lea    0x938(%rbx),%r13
  28:   45 31 c9                xor    %r9d,%r9d
  2b:*  41 b8 01 00 00 00       mov    $0x1,%r8d                <-- trapping instruction
  31:   31 c9                   xor    %ecx,%ecx
  33:   31 d2                   xor    %edx,%edx
  35:   31 f6                   xor    %esi,%esi
  37:   4c 89 ef                mov    %r13,%rdi
  3a:   48 89 04 24             mov    %rax,(%rsp)
  3e:   e8                      .byte 0xe8
  3f:   e8                      .byte 0xe8
        ...

Code starting with the faulting instruction
===========================================
   0:   41 b8 01 00 00 00       mov    $0x1,%r8d
   6:   31 c9                   xor    %ecx,%ecx
   8:   31 d2                   xor    %edx,%edx
   a:   31 f6                   xor    %esi,%esi
   c:   4c 89 ef                mov    %r13,%rdi
   f:   48 89 04 24             mov    %rax,(%rsp)
  13:   e8                      .byte 0xe8
  14:   e8                      .byte 0xe8


Thanks,
Sasha


* Re: mm: shm: hang in shmem_fallocate
From: Hugh Dickins @ 2014-06-16  2:29 UTC
  To: Sasha Levin
  Cc: Hugh Dickins, Dave Jones, Andrew Morton, linux-mm, linux-fsdevel, LKML

On Thu, 12 Jun 2014, Sasha Levin wrote:
> On 02/09/2014 08:41 PM, Sasha Levin wrote:
> > On 02/08/2014 10:25 PM, Hugh Dickins wrote:
> >> Would trinity be likely to have a thread or process repeatedly faulting
> >> in pages from the hole while it is being punched?
> > 
> > I can see how trinity would do that, but just to be certain - Cc davej.
> > 
> > On 02/08/2014 10:25 PM, Hugh Dickins wrote:
> >> Does this happen with other holepunch filesystems?  If it does not,
> >> I'd suppose it's because the tmpfs fault-in-newly-created-page path
> >> is lighter than a consistent disk-based filesystem's has to be.
> >> But we don't want to make the tmpfs path heavier to match them.
> > 
> > No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
> > punching in other filesystems and I make sure to get a bunch of those
> > mounted before starting testing.
> 
> Just pinging this one again. I still see hangs in -next where the hang
> location looks the same as before:
> 

Please give this patch a try.  It fixes what I can reproduce, but given
your unexplained page_mapped() BUG in this area, we know there's more
yet to be understood, so perhaps this patch won't do enough for you.


[PATCH] shmem: fix faulting into a hole while it's punched

Trinity finds that mmap access to a hole while it's punched from shmem
can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
from completing, until the reader chooses to stop; with the puncher's
hold on i_mutex locking out all other writers until it can complete.

It appears that the tmpfs fault path is too light in comparison with
its hole-punching path, lacking an i_data_sem to obstruct it; but we
don't want to slow down the common case.

Extend shmem_fallocate()'s existing range notification mechanism, so
shmem_fault() can refrain from faulting pages into the hole while it's
punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
faulting when not).

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/shmem.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)

--- 3.16-rc1/mm/shmem.c	2014-06-12 11:20:43.200001098 -0700
+++ linux/mm/shmem.c	2014-06-15 18:32:00.049039969 -0700
@@ -80,11 +80,12 @@ static struct vfsmount *shm_mnt;
 #define SHORT_SYMLINK_LEN 128
 
 /*
- * shmem_fallocate and shmem_writepage communicate via inode->i_private
- * (with i_mutex making sure that it has only one user at a time):
- * we would prefer not to enlarge the shmem inode just for that.
+ * shmem_fallocate communicates with shmem_fault or shmem_writepage via
+ * inode->i_private (with i_mutex making sure that it has only one user at
+ * a time): we would prefer not to enlarge the shmem inode just for that.
  */
 struct shmem_falloc {
+	int	mode;		/* FALLOC_FL mode currently operating */
 	pgoff_t start;		/* start of range currently being fallocated */
 	pgoff_t next;		/* the next page offset to be fallocated */
 	pgoff_t nr_falloced;	/* how many new pages have been fallocated */
@@ -759,6 +760,7 @@ static int shmem_writepage(struct page *
 			spin_lock(&inode->i_lock);
 			shmem_falloc = inode->i_private;
 			if (shmem_falloc &&
+			    !shmem_falloc->mode &&
 			    index >= shmem_falloc->start &&
 			    index < shmem_falloc->next)
 				shmem_falloc->nr_unswapped++;
@@ -1233,6 +1235,43 @@ static int shmem_fault(struct vm_area_st
 	int error;
 	int ret = VM_FAULT_LOCKED;
 
+	/*
+	 * Trinity finds that probing a hole which tmpfs is punching can
+	 * prevent the hole-punch from ever completing: which in turn
+	 * locks writers out with its hold on i_mutex.  So refrain from
+	 * faulting pages into the hole while it's being punched, and
+	 * wait on i_mutex to be released if vmf->flags permits.
+	 */
+	if (unlikely(inode->i_private)) {
+		struct shmem_falloc *shmem_falloc;
+		spin_lock(&inode->i_lock);
+		shmem_falloc = inode->i_private;
+		if (!shmem_falloc ||
+		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
+		    vmf->pgoff < shmem_falloc->start ||
+		    vmf->pgoff >= shmem_falloc->next)
+			shmem_falloc = NULL;
+		spin_unlock(&inode->i_lock);
+		/*
+		 * i_lock has protected us from taking shmem_falloc seriously
+		 * once return from shmem_fallocate() went back up that stack.
+		 * i_lock does not serialize with i_mutex at all, but it does
+		 * not matter if sometimes we wait unnecessarily, or sometimes
+		 * miss out on waiting: we just need to make those cases rare.
+		 */
+		if (shmem_falloc) {
+			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
+			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
+				up_read(&vma->vm_mm->mmap_sem);
+				mutex_lock(&inode->i_mutex);
+				mutex_unlock(&inode->i_mutex);
+				return VM_FAULT_RETRY;
+			}
+			/* cond_resched? Leave that to GUP or return to user */
+			return VM_FAULT_NOPAGE;
+		}
+	}
+
 	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
 	if (error)
 		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
@@ -1726,18 +1765,26 @@ static long shmem_fallocate(struct file
 
 	mutex_lock(&inode->i_mutex);
 
+	shmem_falloc.mode = mode & ~FALLOC_FL_KEEP_SIZE;
+
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		struct address_space *mapping = file->f_mapping;
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 
+		shmem_falloc.start = unmap_start >> PAGE_SHIFT;
+		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
+		spin_lock(&inode->i_lock);
+		inode->i_private = &shmem_falloc;
+		spin_unlock(&inode->i_lock);
+
 		if ((u64)unmap_end > (u64)unmap_start)
 			unmap_mapping_range(mapping, unmap_start,
 					    1 + unmap_end - unmap_start, 0);
 		shmem_truncate_range(inode, offset, offset + len - 1);
 		/* No need to unmap again: hole-punching leaves COWed pages */
 		error = 0;
-		goto out;
+		goto undone;
 	}
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */


* Re: mm: shm: hang in shmem_fallocate
From: Sasha Levin @ 2014-06-17 20:32 UTC
  To: Hugh Dickins; +Cc: Dave Jones, Andrew Morton, linux-mm, linux-fsdevel, LKML

On 06/15/2014 10:29 PM, Hugh Dickins wrote:
> On Thu, 12 Jun 2014, Sasha Levin wrote:
>> On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>> Would trinity be likely to have a thread or process repeatedly faulting
>>>> in pages from the hole while it is being punched?
>>>
>>> I can see how trinity would do that, but just to be certain - Cc davej.
>>>
>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>> Does this happen with other holepunch filesystems?  If it does not,
>>>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>>>> is lighter than a consistent disk-based filesystem's has to be.
>>>> But we don't want to make the tmpfs path heavier to match them.
>>>
>>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>> punching in other filesystems and I make sure to get a bunch of those
>>> mounted before starting testing.
>>
>> Just pinging this one again. I still see hangs in -next where the hang
>> location looks same as before:
>>
> Please give this patch a try.  It fixes what I can reproduce, but given
> your unexplained page_mapped() BUG in this area, we know there's more
> yet to be understood, so perhaps this patch won't do enough for you.
> 
> 
> [PATCH] shmem: fix faulting into a hole while it's punched
> 
> Trinity finds that mmap access to a hole while it's punched from shmem
> can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
> from completing, until the reader chooses to stop; with the puncher's
> hold on i_mutex locking out all other writers until it can complete.
> 
> It appears that the tmpfs fault path is too light in comparison with
> its hole-punching path, lacking an i_data_sem to obstruct it; but we
> don't want to slow down the common case.
> 
> Extend shmem_fallocate()'s existing range notification mechanism, so
> shmem_fault() can refrain from faulting pages into the hole while it's
> punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
> faulting when not).
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

No shmem_fallocate issues observed in the past day, works for me. Thanks Hugh!


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 47+ messages in thread


* Re: mm: shm: hang in shmem_fallocate
  2014-06-16  2:29           ` Hugh Dickins
@ 2014-06-24 16:31             ` Vlastimil Babka
  -1 siblings, 0 replies; 47+ messages in thread
From: Vlastimil Babka @ 2014-06-24 16:31 UTC (permalink / raw)
  To: Hugh Dickins, Sasha Levin
  Cc: Dave Jones, Andrew Morton, linux-mm, linux-fsdevel, LKML

On 06/16/2014 04:29 AM, Hugh Dickins wrote:
> On Thu, 12 Jun 2014, Sasha Levin wrote:
>> On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>> Would trinity be likely to have a thread or process repeatedly faulting
>>>> in pages from the hole while it is being punched?
>>>
>>> I can see how trinity would do that, but just to be certain - Cc davej.
>>>
>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>> Does this happen with other holepunch filesystems?  If it does not,
>>>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>>>> is lighter than a consistent disk-based filesystem's has to be.
>>>> But we don't want to make the tmpfs path heavier to match them.
>>>
>>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>> punching in other filesystems and I make sure to get a bunch of those
>>> mounted before starting testing.
>>
>> Just pinging this one again. I still see hangs in -next where the hang
>> location looks same as before:
>>
> 
> Please give this patch a try.  It fixes what I can reproduce, but given
> your unexplained page_mapped() BUG in this area, we know there's more
> yet to be understood, so perhaps this patch won't do enough for you.
> 

Hi,

Since this got a CVE, I've been looking at a backport to an older kernel where
fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
range notification mechanism yet. There's just madvise(MADV_REMOVE), and since
it doesn't guarantee anything, it seems simpler just to give up retrying to
truncate really everything. Then I realized that maybe it would work for the
current kernel as well, without having to add any checks in the page fault
path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
the old data from the range, it's fine from any information-leak point of view.
If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
page before the parallel truncate has ended, or right after it has ended.
So I'm posting it here as an RFC. I haven't thought about the
i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy
the new "lstart < inode->i_size" condition, but I don't know if it's
"vulnerable" to the problem.

-----8<-----
From: Vlastimil Babka <vbabka@suse.cz>
Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching

---
 mm/shmem.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index f484c27..6d6005c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		if (!pvec.nr) {
 			if (index == start || unfalloc)
 				break;
+                        /* 
+                         * When this condition is true, it means we were
+                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
+                         * To prevent a livelock when someone else is faulting
+                         * pages back, we are content with single pass and do
+                         * not retry with index = start. It's important that
+                         * previous page content has been discarded, and
+                         * faulter(s) got new zeroed pages.
+                         *
+                         * The other callsites are shmem_setattr (for
+                         * truncation) and shmem_evict_inode, which set i_size
+                         * to truncated size or 0, respectively, and then call
+                         * us with lstart == inode->i_size. There we do want to
+                         * retry, and livelock cannot happen for other reasons.
+                         *
+                         * XXX what about i915_gem_object_truncate?
+                         */
+                        if (lstart < inode->i_size)
+                                break;
 			index = start;
 			continue;
 		}
-- 
1.8.4.5
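
For context, the one-liner above lands in the find_get_entries() retry loop
of shmem_undo_range(); its shape, heavily condensed (not a verbatim excerpt),
is roughly:

	index = start;
	for ( ; ; ) {
		cond_resched();
		pvec.nr = find_get_entries(mapping, index,
				min(end - index, (pgoff_t)PAGEVEC_SIZE),
				pvec.pages, indices);
		if (!pvec.nr) {
			if (index == start || unfalloc)
				break;
			/* the RFC patch would also break here when punching */
			index = start;	/* otherwise retry from the top */
			continue;
		}
		/* ... truncate the entries found, advancing index ... */
	}

The endless retry from the top is what guarantees a fully empty hole on
return, and it is also what racing faulters can turn into a livelock.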





^ permalink raw reply related	[flat|nested] 47+ messages in thread


* Re: mm: shm: hang in shmem_fallocate
  2014-06-24 16:31             ` Vlastimil Babka
@ 2014-06-25 22:36               ` Hugh Dickins
  -1 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2014-06-25 22:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Konstantin Khlebnikov, Hugh Dickins, Sasha Levin, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

On Tue, 24 Jun 2014, Vlastimil Babka wrote:
> On 06/16/2014 04:29 AM, Hugh Dickins wrote:
> > On Thu, 12 Jun 2014, Sasha Levin wrote:
> >> On 02/09/2014 08:41 PM, Sasha Levin wrote:
> >>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
> >>>> Would trinity be likely to have a thread or process repeatedly faulting
> >>>> in pages from the hole while it is being punched?
> >>>
> >>> I can see how trinity would do that, but just to be certain - Cc davej.
> >>>
> >>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
> >>>> Does this happen with other holepunch filesystems?  If it does not,
> >>>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
> >>>> is lighter than a consistent disk-based filesystem's has to be.
> >>>> But we don't want to make the tmpfs path heavier to match them.
> >>>
> >>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
> >>> punching in other filesystems and I make sure to get a bunch of those
> >>> mounted before starting testing.
> >>
> >> Just pinging this one again. I still see hangs in -next where the hang
> >> location looks same as before:
> >>
> > 
> > Please give this patch a try.  It fixes what I can reproduce, but given
> > your unexplained page_mapped() BUG in this area, we know there's more
> > yet to be understood, so perhaps this patch won't do enough for you.
> > 
> 
> Hi,

Sorry for the slow response: I have got confused, learnt more, and
changed my mind, several times in the course of replying to you.
I think this reply will be stable... though not final.

> 
> since this got a CVE,

Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.
Looks overrated to me (and amusing to see my pompous words about a
"range notification mechanism" taken too seriously), but of course
we do need to address it.

> I've been looking at backport to an older kernel where

Thanks a lot for looking into it.  I didn't think it was worth a
Cc: stable@vger.kernel.org myself, but admit to being both naive
and inconsistent about that.

> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
> range notification mechanism yet. There's just madvise(MADV_REMOVE) and since

Yes, that mechanism could be ported back pre-v3.5,
but I agree with your preference not to.

> it doesn't guarantee anything, it seems simpler just to give up retrying to

Right, I don't think we have formally documented the instant of "full hole"
that I strove for there, and it's probably not externally verifiable, nor
guaranteed by other filesystems.  I just thought it a good QoS aim, but
it has given us this problem.

> truncate really everything. Then I realized that maybe it would work for
> current kernel as well, without having to add any checks in the page fault
> path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
> from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
> the old data from the range, it's fine from any information leak point of view.
> If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
> page before the parallel truncate has ended, or right after it has ended.

Yes.  I disagree with your actual patch, for more than one reason,
but it's in the right area; and I found myself growing to agree with
you, that's it's better to have one kind of fix for all these releases,
than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
v3.0 too, I'm sceptical about that, but haven't tried it as yet.)

If I'd realized that we were going to have to backport, I'd have spent
longer looking for a patch like yours originally.  So my inclination
now is to go your route, make a new patch for v3.16 and backports,
and revert the f00cdc6df7d7 that has already gone in.

> So I'm posting it here as a RFC. I haven't thought about the
> i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy

My understanding is that i915_gem_object_truncate() is not a problem,
that i915's dev->struct_mutex serializes all its relevant transitions,
plus the object wouldn't even be interestingly accessible to the user.

> the new "lstart < inode->i_size" condition, but I don't know if it's "vulnerable"
> to the problem.

I don't think i915 is vulnerable, but if it is, that condition would
be fine for it, as would be the patch I'm now thinking of.

> 
> -----8<-----
> From: Vlastimil Babka <vbabka@suse.cz>
> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching
> 
> ---
>  mm/shmem.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f484c27..6d6005c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  		if (!pvec.nr) {
>  			if (index == start || unfalloc)
>  				break;
> +                        /* 
> +                         * When this condition is true, it means we were
> +                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
> +                         * To prevent a livelock when someone else is faulting
> +                         * pages back, we are content with single pass and do
> +                         * not retry with index = start. It's important that
> +                         * previous page content has been discarded, and
> +                         * faulter(s) got new zeroed pages.
> +                         *
> +                         * The other callsites are shmem_setattr (for
> +                         * truncation) and shmem_evict_inode, which set i_size
> +                         * to truncated size or 0, respectively, and then call
> +                         * us with lstart == inode->i_size. There we do want to
> +                         * retry, and livelock cannot happen for other reasons.
> +                         *
> +                         * XXX what about i915_gem_object_truncate?
> +                         */

I doubt you have ever faced such a criticism before, but I'm going
to speak my mind and say that comment is too long!  A comment of that
length is okay above or just inside or at a natural break in a function,
but here it distracts too much from what the code is actually doing.

In particular, the words "this condition" are so much closer to the
condition above than the condition below, that it's rather confusing.

/* Single pass when hole-punching to not livelock on racing faults */
would have been enough (yes, I've cheated, that would be 2 or 4 lines).

> +                        if (lstart < inode->i_size)

For a long time I was going to suggest that you leave i_size out of it,
and use "lend > 0" instead.  Then suddenly I realized that this is the
wrong place for the test.  And then that it's not your fault, it's mine,
in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
Wow, that really pessimized the hole-punch case!

When is pvec.nr 0?  When we've reached the end of the file.  Why should
we go to the end of the file, when punching a hole at the start?  Ughh!

> +                                break;
>  			index = start;
>  			continue;
>  		}
> -- 
> 1.8.4.5

But there is another problem.  We cannot break out after one pass on
shmem, because there's a possiblilty that a swap entry in the radix_tree
got swizzled into a page just as it was about to be removed - your patch
might then leave that data behind in the hole.

As it happens, Konstantin Khlebnikov suggested a patch for that a few
weeks ago, before noticing that it's already handled by the endless loop.
If we make that loop no longer endless, we need to add in Konstantin's
"if (shmem_free_swap) goto retry" patch.

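Konstantin's suggestion would slot into shmem_undo_range()'s handling of
exceptional (swap) entries roughly like this (a sketch reconstructed from
the description above, not his actual patch; it assumes shmem_free_swap()
is changed to report whether the swap entry had already been replaced):

	if (radix_tree_exceptional_entry(page)) {
		if (unfalloc)
			continue;
		if (shmem_free_swap(mapping, index, page)) {
			/* swap was swizzled back into a page: retry */
			index--;
			break;
		}
		nr_swaps_freed++;
		continue;
	}
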
Right now I'm thinking that my idiocy in d0823576bf4b may actually
be the whole of Trinity's problem: patch below.  If we waste time
traversing the radix_tree to end-of-file, no wonder that concurrent
faults have time to put something in the hole every time.

Sasha, may I trespass on your time, and ask you to revert the previous
patch from your tree, and give this patch below a try?  I am very
interested to learn if in fact it fixes it for you (as it did for me).

However, I am wasting your time, in that I think we shall decide that
it's too unsafe to rely solely upon the patch below (what happens if
1024 cpus are all faulting on it while we try to punch a 4MB hole at
end of file? if we care).  I think we shall end up with the optimization
below (or some such: it can be written in various ways), plus reverting
d0823576bf4b's "index == start && " pincer, plus Konstantin's
shmem_free_swap handling, rolled into a single patch; and a similar
patch (without the swap part) for several functions in truncate.c.

Hugh

--- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
+++ linux/mm/shmem.c	2014-06-25 10:28:47.063967052 -0700
@@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
 	for ( ; ; ) {
 		cond_resched();
 
+		index = min(index, end);
 		pvec.nr = find_get_entries(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
 				pvec.pages, indices);
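
Why one line helps: index and end are pgoff_t, i.e. unsigned long, and the
loop can set index beyond end from the indices it found.  Once that happens,
the length argument end - index wraps around to a huge value, so the min()
no longer confines the lookup to the hole, and find_get_entries() marches on
toward end-of-file.  A tiny userspace illustration of the wraparound
(assuming PAGEVEC_SIZE of 14, its value at the time):

	#include <stdio.h>

	#define PAGEVEC_SIZE 14UL
	#define min(a, b) ((a) < (b) ? (a) : (b))

	int main(void)
	{
		unsigned long end = 100, index = 101; /* stepped just past end */

		printf("end - index = %lu\n", end - index);	/* huge: wrapped */
		printf("count = %lu\n",
		       min(end - index, PAGEVEC_SIZE));	/* still 14, not 0 */

		index = min(index, end);	/* the one-line fix above */
		printf("after clamp: count = %lu\n",
		       min(end - index, PAGEVEC_SIZE));	/* 0: loop can stop */
		return 0;
	}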

^ permalink raw reply	[flat|nested] 47+ messages in thread


* Re: mm: shm: hang in shmem_fallocate
  2014-06-25 22:36               ` Hugh Dickins
@ 2014-06-26  9:14                 ` Vlastimil Babka
  -1 siblings, 0 replies; 47+ messages in thread
From: Vlastimil Babka @ 2014-06-26  9:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Sasha Levin, Dave Jones, Andrew Morton,
	linux-mm, linux-fsdevel, LKML

On 06/26/2014 12:36 AM, Hugh Dickins wrote:
> On Tue, 24 Jun 2014, Vlastimil Babka wrote:
>> On 06/16/2014 04:29 AM, Hugh Dickins wrote:
>>> On Thu, 12 Jun 2014, Sasha Levin wrote:
>>>> On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>> Would trinity be likely to have a thread or process repeatedly faulting
>>>>>> in pages from the hole while it is being punched?
>>>>>
>>>>> I can see how trinity would do that, but just to be certain - Cc davej.
>>>>>
>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>> Does this happen with other holepunch filesystems?  If it does not,
>>>>>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>>>>>> is lighter than a consistent disk-based filesystem's has to be.
>>>>>> But we don't want to make the tmpfs path heavier to match them.
>>>>>
>>>>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>>>> punching in other filesystems and I make sure to get a bunch of those
>>>>> mounted before starting testing.
>>>>
>>>> Just pinging this one again. I still see hangs in -next where the hang
>>>> location looks same as before:
>>>>
>>>
>>> Please give this patch a try.  It fixes what I can reproduce, but given
>>> your unexplained page_mapped() BUG in this area, we know there's more
>>> yet to be understood, so perhaps this patch won't do enough for you.
>>>
>>
>> Hi,
>
> Sorry for the slow response: I have got confused, learnt more, and
> changed my mind, several times in the course of replying to you.
> I think this reply will be stable... though not final.

Thanks a lot for looking into it!

>>
>> since this got a CVE,
>
> Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.

Sorry, I should have mentioned it explicitly.

> Looks overrated to me

I'd bet it would pass unnoticed if you didn't use the sentence "but 
whether it's a serious matter in the scale of denials of service, I'm 
not so sure" in your first reply to Sasha's report :) I wouldn't be 
surprised if people grep for this.

> (and amusing to see my pompous words about a
> "range notification mechanism" taken too seriously), but of course
> we do need to address it.
>
>> I've been looking at backport to an older kernel where
>
> Thanks a lot for looking into it.  I didn't think it was worth a
> Cc: stable@vger.kernel.org myself, but admit to being both naive
> and inconsistent about that.
>
>> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
>> range notification mechanism yet. There's just madvise(MADV_REMOVE) and since
>
> Yes, that mechanism could be ported back pre-v3.5,
> but I agree with your preference not to.
>
>> it doesn't guarantee anything, it seems simpler just to give up retrying to
>
> Right, I don't think we have formally documented the instant of "full hole"
> that I strove for there, and it's probably not externally verifiable, nor
> guaranteed by other filesystems.  I just thought it a good QoS aim, but
> it has given us this problem.
>
>> truncate really everything. Then I realized that maybe it would work for
>> current kernel as well, without having to add any checks in the page fault
>> path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
>> from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
>> the old data from the range, it's fine from any information leak point of view.
>> If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
>> page before the parallel truncate has ended, or right after it has ended.
>
> Yes.  I disagree with your actual patch, for more than one reason,
> but it's in the right area; and I found myself growing to agree with
> you, that it's better to have one kind of fix for all these releases,
> than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
> v3.0 too, I'm sceptical about that, but haven't tried it as yet.)

I was looking at our 3.0 based kernel, but it could be due to backported 
patches on top.

> If I'd realized that we were going to have to backport, I'd have spent
> longer looking for a patch like yours originally.  So my inclination
> now is to go your route, make a new patch for v3.16 and backports,
> and revert the f00cdc6df7d7 that has already gone in.
>
>> So I'm posting it here as a RFC. I haven't thought about the
>> i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy
>
> My understanding is that i915_gem_object_truncate() is not a problem,
> that i915's dev->struct_mutex serializes all its relevant transitions,
> plus the object wouldn't even be interestingly accessible to the user.
>
>> the new "lstart < inode->i_size" condition, but I don't know if it's "vulnerable"
>> to the problem.
>
> I don't think i915 is vulnerable, but if it is, that condition would
> be fine for it, as would be the patch I'm now thinking of.
>
>>
>> -----8<-----
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching
>>
>> ---
>>   mm/shmem.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index f484c27..6d6005c 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   		if (!pvec.nr) {
>>   			if (index == start || unfalloc)
>>   				break;
>> +                        /*
>> +                         * When this condition is true, it means we were
>> +                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
>> +                         * To prevent a livelock when someone else is faulting
>> +                         * pages back, we are content with single pass and do
>> +                         * not retry with index = start. It's important that
>> +                         * previous page content has been discarded, and
>> +                         * faulter(s) got new zeroed pages.
>> +                         *
>> +                         * The other callsites are shmem_setattr (for
>> +                         * truncation) and shmem_evict_inode, which set i_size
>> +                         * to truncated size or 0, respectively, and then call
>> +                         * us with lstart == inode->i_size. There we do want to
>> +                         * retry, and livelock cannot happen for other reasons.
>> +                         *
>> +                         * XXX what about i915_gem_object_truncate?
>> +                         */
>
> I doubt you have ever faced such a criticism before, but I'm going
> to speak my mind and say that comment is too long!  A comment of that
> length is okay above or just inside or at a natural break in a function,
> but here it distracts too much from what the code is actually doing.

Fair enough. The reasoning should have gone into commit log, not comment.

> In particular, the words "this condition" are so much closer to the
> condition above than the condition below, that it's rather confusing.
>
> /* Single pass when hole-punching to not livelock on racing faults */
> would have been enough (yes, I've cheated, that would be 2 or 4 lines).
>
>> +                        if (lstart < inode->i_size)
>
> For a long time I was going to suggest that you leave i_size out of it,
> and use "lend > 0" instead.  Then suddenly I realized that this is the
> wrong place for the test.

Well, my first idea was to just add a flag saying how persistent the
truncation should be, and set it false for the punch-hole case. Then I
wondered if there's already some bit that distinguishes it, but that makes
it more subtle.

> And then that it's not your fault, it's mine,
> in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
> Wow, that really pessimized the hole-punch case!
>
> When is pvec.nr 0?  When we've reached the end of the file.  Why should
> we go to the end of the file, when punching a hole at the start?  Ughh!

Ah, I see (I think). But I managed to reproduce this problem when there
was only one extra page between lend and the end of file, so I doubt this
is the only problem. AFAIU it's enough to try punching a large enough
hole: the loop can then only do a single pagevec worth of pages per
iteration, which gives somebody enough time to fault pages back?
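
Back-of-the-envelope, assuming a 4KB page size and PAGEVEC_SIZE of 14 (its
value at the time):

	4MB hole / 4KB pages      = 1024 pages to truncate
	1024 pages / 14 per batch = ~74 find_get_entries() batches per pass

so every pass over the hole offers dozens of windows (each with a
cond_resched()) in which the sweeping readers can fault pages back in
behind the truncation.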

>> +                                break;
>>   			index = start;
>>   			continue;
>>   		}
>> --
>> 1.8.4.5
>
> But there is another problem.  We cannot break out after one pass on
> shmem, because there's a possibility that a swap entry in the radix_tree
> got swizzled into a page just as it was about to be removed - your patch
> might then leave that data behind in the hole.

Thanks, I didn't notice that. Do I understand correctly that this could
mean an info leak for the punch-hole call, but wouldn't be a problem for
madvise? (In any case, that means the solution is not general enough for
all kernels, so I'm asking just to be sure.)

> As it happens, Konstantin Khlebnikov suggested a patch for that a few
> weeks ago, before noticing that it's already handled by the endless loop.
> If we make that loop no longer endless, we need to add in Konstantin's
> "if (shmem_free_swap) goto retry" patch.
>
> Right now I'm thinking that my idiocy in d0823576bf4b may actually
> be the whole of Trinity's problem: patch below.  If we waste time
> traversing the radix_tree to end-of-file, no wonder that concurrent
> faults have time to put something in the hole every time.
>
> Sasha, may I trespass on your time, and ask you to revert the previous
> patch from your tree, and give this patch below a try?  I am very
> interested to learn if in fact it fixes it for you (as it did for me).

I will try this, but as I explained above, I doubt that alone will help.

> However, I am wasting your time, in that I think we shall decide that
> it's too unsafe to rely solely upon the patch below (what happens if
> 1024 cpus are all faulting on it while we try to punch a 4MB hole at

My reproducer is a 4MB file, where the puncher tries punching everything
except the first and last page. And there are 8 other threads (as I have 8
logical CPUs) that just repeatedly sweep the same range, reading only
the first byte of each page.
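
Something like this (a minimal sketch matching the description above, not
the exact test program; page size, file path, thread count and all error
handling are simplified, and it intentionally runs forever):

	/* punch-vs-fault: hole-punch a tmpfs file while threads fault it back */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define SIZE (4UL << 20)	/* 4MB file on tmpfs */

	static volatile char *map;

	static void *faulter(void *arg)
	{
		(void)arg;
		for (;;)	/* sweep: read the first byte of each page */
			for (unsigned long off = 0; off < SIZE; off += 4096)
				(void)map[off];
		return NULL;
	}

	int main(void)
	{
		int fd = open("/dev/shm/punchtest", O_RDWR|O_CREAT|O_TRUNC, 0600);
		pthread_t tid;

		ftruncate(fd, SIZE);
		map = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
		for (int i = 0; i < 8; i++)	/* one faulter per logical CPU */
			pthread_create(&tid, NULL, faulter, NULL);
		for (;;)	/* punch everything but the first and last page */
			fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				  4096, SIZE - 2 * 4096);
	}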

> end of file? if we care).  I think we shall end up with the optimization
> below (or some such: it can be written in various ways), plus reverting
> d0823576bf4b's "index == start && " pincer, plus Konstantin's
> shmem_free_swap handling, rolled into a single patch; and a similar

So that means no retry in any case (except the swap thing)? Can all callers
handle that? I guess shmem_evict_inode would be ok, as nobody else can be
accessing that inode. But what about shmem_setattr (i.e. straight
truncation)? As you said earlier, faulters will get a SIGBUS (which AFAIU
is due to i_size being updated before we enter shmem_undo_range). But could
a faulter possibly have already passed the i_size test, and then proceed
with the fault only when we are already in shmem_undo_range and have passed
the page in question?
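
As a timeline, the race being asked about would look something like this
(a hypothetical interleaving, reconstructed from the question above):

	faulter				truncator
	-------				---------
	shmem_fault(pgoff)
	  pgoff < i_size: OK
	  (scheduled out)		i_size_write(inode, newsize)
					shmem_undo_range(newsize, ...)
					  ... walks past pgoff ...
	  resumes, instantiates a
	  page beyond the new EOF?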

> patch (without the swap part) for several functions in truncate.c.
>
> Hugh
>
> --- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
> +++ linux/mm/shmem.c	2014-06-25 10:28:47.063967052 -0700
> @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
>   	for ( ; ; ) {
>   		cond_resched();
>
> +		index = min(index, end);
>   		pvec.nr = find_get_entries(mapping, index,
>   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>   				pvec.pages, indices);
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
@ 2014-06-26  9:14                 ` Vlastimil Babka
  0 siblings, 0 replies; 47+ messages in thread
From: Vlastimil Babka @ 2014-06-26  9:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Sasha Levin, Dave Jones, Andrew Morton,
	linux-mm, linux-fsdevel, LKML

On 06/26/2014 12:36 AM, Hugh Dickins wrote:
> On Tue, 24 Jun 2014, Vlastimil Babka wrote:
>> On 06/16/2014 04:29 AM, Hugh Dickins wrote:
>>> On Thu, 12 Jun 2014, Sasha Levin wrote:
>>>> On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>> Would trinity be likely to have a thread or process repeatedly faulting
>>>>>> in pages from the hole while it is being punched?
>>>>>
>>>>> I can see how trinity would do that, but just to be certain - Cc davej.
>>>>>
>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>> Does this happen with other holepunch filesystems?  If it does not,
>>>>>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>>>>>> is lighter than a consistent disk-based filesystem's has to be.
>>>>>> But we don't want to make the tmpfs path heavier to match them.
>>>>>
>>>>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>>>> punching in other filesystems and I make sure to get a bunch of those
>>>>> mounted before starting testing.
>>>>
>>>> Just pinging this one again. I still see hangs in -next where the hang
>>>> location looks same as before:
>>>>
>>>
>>> Please give this patch a try.  It fixes what I can reproduce, but given
>>> your unexplained page_mapped() BUG in this area, we know there's more
>>> yet to be understood, so perhaps this patch won't do enough for you.
>>>
>>
>> Hi,
>
> Sorry for the slow response: I have got confused, learnt more, and
> changed my mind, several times in the course of replying to you.
> I think this reply will be stable... though not final.

Thanks a lot for looking into it!

>>
>> since this got a CVE,
>
> Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.

Sorry, I should have mentioned it explicitly.

> Looks overrated to me

I'd bet it would pass unnoticed if you didn't use the sentence "but 
whether it's a serious matter in the scale of denials of service, I'm 
not so sure" in your first reply to Sasha's report :) I wouldn't be 
surprised if people grep for this.

> (and amusing to see my pompous words about a
> "range notification mechanism" taken too seriously), but of course
> we do need to address it.
>
>> I've been looking at backport to an older kernel where
>
> Thanks a lot for looking into it.  I didn't think it was worth a
> Cc: stable@vger.kernel.org myself, but admit to being both naive
> and inconsistent about that.
>
>> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
>> range notification mechanism yet. There's just madvise(MADV_REMOVE) and since
>
> Yes, that mechanism could be ported back pre-v3.5,
> but I agree with your preference not to.
>
>> it doesn't guarantee anything, it seems simpler just to give up retrying to
>
> Right, I don't think we have formally documented the instant of "full hole"
> that I strove for there, and it's probably not externally verifiable, nor
> guaranteed by other filesystems.  I just thought it a good QoS aim, but
> it has given us this problem.
>
>> truncate really everything. Then I realized that maybe it would work for
>> current kernel as well, without having to add any checks in the page fault
>> path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
>> from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
>> the old data from the range, it's fine from any information leak point of view.
>> If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
>> page before the parallel truncate has ended, or right after it has ended.
>
> Yes.  I disagree with your actual patch, for more than one reason,
> but it's in the right area; and I found myself growing to agree with
> you, that's it's better to have one kind of fix for all these releases,
> than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
> v3.0 too, I'm sceptical about that, but haven't tried it as yet.)

I was looking at our 3.0 based kernel, but it could be due to backported 
patches on top.

> If I'd realized that we were going to have to backport, I'd have spent
> longer looking for a patch like yours originally.  So my inclination
> now is to go your route, make a new patch for v3.16 and backports,
> and revert the f00cdc6df7d7 that has already gone in.
>
>> So I'm posting it here as a RFC. I haven't thought about the
>> i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy
>
> My understanding is that i915_gem_object_truncate() is not a problem,
> that i915's dev->struct_mutex serializes all its relevant transitions,
> plus the object woudn't even be interestingly accessible to the user.
>
>> the new "lstart < inode->i_size" condition, but I don't know if it's "vulnerable"
>> to the problem.
>
> I don't think i915 is vulnerable, but if it is, that condition would
> be fine for it, as would be the patch I'm now thinking of.
>
>>
>> -----8<-----
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching
>>
>> ---
>>   mm/shmem.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index f484c27..6d6005c 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   		if (!pvec.nr) {
>>   			if (index == start || unfalloc)
>>   				break;
>> +                        /*
>> +                         * When this condition is true, it means we were
>> +                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
>> +                         * To prevent a livelock when someone else is faulting
>> +                         * pages back, we are content with single pass and do
>> +                         * not retry with index = start. It's important that
>> +                         * previous page content has been discarded, and
>> +                         * faulter(s) got new zeroed pages.
>> +                         *
>> +                         * The other callsites are shmem_setattr (for
>> +                         * truncation) and shmem_evict_inode, which set i_size
>> +                         * to truncated size or 0, respectively, and then call
>> +                         * us with lstart == inode->i_size. There we do want to
>> +                         * retry, and livelock cannot happen for other reasons.
>> +                         *
>> +                         * XXX what about i915_gem_object_truncate?
>> +                         */
>
> I doubt you have ever faced such a criticism before, but I'm going
> to speak my mind and say that comment is too long!  A comment of that
> length is okay above or just inside or at a natural break in a function,
> but here it distracts too much from what the code is actually doing.

Fair enough. The reasoning should have gone into commit log, not comment.

> In particular, the words "this condition" are so much closer to the
> condition above than the condition below, that it's rather confusing.
>
> /* Single pass when hole-punching to not livelock on racing faults */
> would have been enough (yes, I've cheated, that would be 2 or 4 lines).
>
>> +                        if (lstart < inode->i_size)
>
> For a long time I was going to suggest that you leave i_size out of it,
> and use "lend > 0" instead.  Then suddenly I realized that this is the
> wrong place for the test.

Well my first idea was to just add a flag about how persistent it should 
be. And set it false for the punch hole case. Then I wondered if there's 
already some bit that distinguishes it. But it makes it more subtle.

> And then that it's not your fault, it's mine,
> in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
> Wow, that really pessimized the hole-punch case!
>
> When is pvec.nr 0?  When we've reached the end of the file.  Why should
> we go to the end of the file, when punching a hole at the start?  Ughh!

Ah, I see (I think). But I managed to reproduce this problem when there 
was only an extra page between lend and the end of file, so I doubt this 
is the only problem. AFAIU it's enough to try punching a large enough 
hole, then the loop can only do a single pagevec worth of pages per 
iteration, which gives enough time for somebody faulting pages back?

>> +                                break;
>>   			index = start;
>>   			continue;
>>   		}
>> --
>> 1.8.4.5
>
> But there is another problem.  We cannot break out after one pass on
> shmem, because there's a possiblilty that a swap entry in the radix_tree
> got swizzled into a page just as it was about to be removed - your patch
> might then leave that data behind in the hole.

Thanks, I didn't notice that. Do I understand correctly that this could 
mean info leak for the punch hole call, but wouldn't be a problem for 
madvise? (In any case, that means the solution is not general enough for 
all kernels, so I'm asking just to be sure).

> As it happens, Konstantin Khlebnikov suggested a patch for that a few
> weeks ago, before noticing that it's already handled by the endless loop.
> If we make that loop no longer endless, we need to add in Konstantin's
> "if (shmem_free_swap) goto retry" patch.
>
> Right now I'm thinking that my idiocy in d0823576bf4b may actually
> be the whole of Trinity's problem: patch below.  If we waste time
> traversing the radix_tree to end-of-file, no wonder that concurrent
> faults have time to put something in the hole every time.
>
> Sasha, may I trespass on your time, and ask you to revert the previous
> patch from your tree, and give this patch below a try?  I am very
> interested to learn if in fact it fixes it for you (as it did for me).

I will try this, but as I explained above, I doubt that alone will help.

> However, I am wasting your time, in that I think we shall decide that
> it's too unsafe to rely solely upon the patch below (what happens if
> 1024 cpus are all faulting on it while we try to punch a 4MB hole at

My reproducer is 4MB file, where the puncher tries punching everything 
except first and last page. And there are 8 other threads (as I have 8 
logical CPU's) that just repeatedly sweep the same range, reading only 
the first byte of each page.

> end of file? if we care).  I think we shall end up with the optimization
> below (or some such: it can be written in various ways), plus reverting
> d0823576bf4b's "index == start && " pincer, plus Konstantin's
> shmem_free_swap handling, rolled into a single patch; and a similar

So that means no retry in any case (except the swap thing)? All callers 
can handle that? I guess shmem_evict_inode would be ok, as nobody else
can be accessing that inode. But what about shmem_setattr? (i.e. 
straight truncation) As you said earlier, faulters will get a SIGBUS 
(which AFAIU is due to i_size being updated before we enter 
shmem_undo_range). But could a faulter possibly already pass the i_size 
test, and proceed with the fault only when we are already in 
shmem_undo_range and have passed the page in question?
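
Just to illustrate the SIGBUS part, a small demo of the plain
truncate-then-fault behaviour (the tmpfs path is an assumption of this
sketch, and of course it cannot exhibit the racy window I'm asking about):

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static const char msg[] = "SIGBUS: faulted beyond the new i_size\n";

static void on_sigbus(int sig)
{
	(void)sig;
	write(1, msg, sizeof(msg) - 1);	/* write() is async-signal-safe */
	_exit(0);
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = open("/dev/shm/sigbus-demo", O_CREAT | O_RDWR | O_TRUNC, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, 2 * page) < 0)
		return 1;
	p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	signal(SIGBUS, on_sigbus);
	p[0] = 1;		/* within i_size: succeeds */
	ftruncate(fd, page);	/* shrink i_size to one page */
	p[page] = 1;		/* wholly beyond new i_size: SIGBUS */
	puts("no SIGBUS?");	/* not reached on Linux */
	return 0;
}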

> patch (without the swap part) for several functions in truncate.c.
>
> Hugh
>
> --- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
> +++ linux/mm/shmem.c	2014-06-25 10:28:47.063967052 -0700
> @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
>   	for ( ; ; ) {
>   		cond_resched();
>
> +		index = min(index, end);
>   		pvec.nr = find_get_entries(mapping, index,
>   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>   				pvec.pages, indices);
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
  2014-06-25 22:36               ` Hugh Dickins
@ 2014-06-26 15:11               ` Sasha Levin
  2014-06-27  5:59                   ` Hugh Dickins
  -1 siblings, 1 reply; 47+ messages in thread
From: Sasha Levin @ 2014-06-26 15:11 UTC (permalink / raw)
  To: Hugh Dickins, Vlastimil Babka
  Cc: Konstantin Khlebnikov, Dave Jones, Andrew Morton, linux-mm,
	linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 4767 bytes --]

On 06/25/2014 06:36 PM, Hugh Dickins wrote:
> Sasha, may I trespass on your time, and ask you to revert the previous
> patch from your tree, and give this patch below a try?  I am very
> interested to learn if in fact it fixes it for you (as it did for me).

Hi Hugh,

Happy to help, and as I often do, I will answer with a question.

I've observed two different issues after reverting the original fix and
applying this new patch. Both of them seem semi-related, but I'm not sure.

First, this:

[  681.267487] BUG: unable to handle kernel paging request at ffffea0003480048
[  681.268621] IP: zap_pte_range (mm/memory.c:1132)
[  681.269335] PGD 37fcc067 PUD 37fcb067 PMD 0
[  681.269972] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  681.270952] Dumping ftrace buffer:
[  681.270952]    (ftrace buffer empty)
[  681.270952] Modules linked in:
[  681.270952] CPU: 7 PID: 1952 Comm: trinity-c29 Not tainted 3.16.0-rc2-next-20140625-sasha-00025-g2e02e05-dirty #730
[  681.270952] task: ffff8803e6f58000 ti: ffff8803df050000 task.ti: ffff8803df050000
[  681.270952] RIP: zap_pte_range (mm/memory.c:1132)
[  681.270952] RSP: 0018:ffff8803df053c58  EFLAGS: 00010246
[  681.270952] RAX: ffffea0003480040 RBX: ffff8803edae7a70 RCX: 0000000003480040
[  681.270952] RDX: 00000000d2001730 RSI: 0000000000000000 RDI: 00000000d2001730
[  681.270952] RBP: ffff8803df053cf8 R08: ffff88000015cc00 R09: 0000000000000000
[  681.270952] R10: 0000000000000001 R11: 0000000000000000 R12: ffffea0003480040
[  681.270952] R13: ffff8803df053de8 R14: 00007fc15014f000 R15: 00007fc15014e000
[  681.270952] FS:  00007fc15031b700(0000) GS:ffff8801ece00000(0000) knlGS:0000000000000000
[  681.270952] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  681.270952] CR2: ffffea0003480048 CR3: 000000001a02e000 CR4: 00000000000006a0
[  681.270952] Stack:
[  681.270952]  ffff8803df053de8 00000000d2001000 00000000d2001fff ffff8803e6f58000
[  681.270952]  0000000000000000 0000000000000001 ffff880404dd8400 ffff8803e6e31900
[  681.270952]  00000000d2001730 ffff88000015cc00 0000000000000000 ffff8804078f8000
[  681.270952] Call Trace:
[  681.270952] unmap_single_vma (mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1301 mm/memory.c:1346)
[  681.270952] unmap_vmas (mm/memory.c:1375 (discriminator 1))
[  681.270952] exit_mmap (mm/mmap.c:2797)
[  681.270952] ? preempt_count_sub (kernel/sched/core.c:2606)
[  681.270952] mmput (kernel/fork.c:638)
[  681.270952] do_exit (kernel/exit.c:744)
[  681.270952] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  681.270952] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
[  681.270952] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
[  681.270952] do_group_exit (kernel/exit.c:884)
[  681.270952] SyS_exit_group (kernel/exit.c:895)
[  681.270952] tracesys (arch/x86/kernel/entry_64.S:542)
[ 681.270952] Code: e8 cf 39 25 03 49 8b 4c 24 10 48 39 c8 74 1c 48 8b 7d b8 48 c1 e1 0c 48 89 da 48 83 c9 40 4c 89 fe e8 e5 db ff ff 0f 1f 44 00 00 <41> f6 44 24 08 01 74 08 83 6d c8 01 eb 33 66 90 f6 45 a0 40 74
All code
========
   0:   e8 cf 39 25 03          callq  0x32539d4
   5:   49 8b 4c 24 10          mov    0x10(%r12),%rcx
   a:   48 39 c8                cmp    %rcx,%rax
   d:   74 1c                   je     0x2b
   f:   48 8b 7d b8             mov    -0x48(%rbp),%rdi
  13:   48 c1 e1 0c             shl    $0xc,%rcx
  17:   48 89 da                mov    %rbx,%rdx
  1a:   48 83 c9 40             or     $0x40,%rcx
  1e:   4c 89 fe                mov    %r15,%rsi
  21:   e8 e5 db ff ff          callq  0xffffffffffffdc0b
  26:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  2b:*  41 f6 44 24 08 01       testb  $0x1,0x8(%r12)           <-- trapping instruction
  31:   74 08                   je     0x3b
  33:   83 6d c8 01             subl   $0x1,-0x38(%rbp)
  37:   eb 33                   jmp    0x6c
  39:   66 90                   xchg   %ax,%ax
  3b:   f6 45 a0 40             testb  $0x40,-0x60(%rbp)
  3f:   74 00                   je     0x41

Code starting with the faulting instruction
===========================================
   0:   41 f6 44 24 08 01       testb  $0x1,0x8(%r12)
   6:   74 08                   je     0x10
   8:   83 6d c8 01             subl   $0x1,-0x38(%rbp)
   c:   eb 33                   jmp    0x41
   e:   66 90                   xchg   %ax,%ax
  10:   f6 45 a0 40             testb  $0x40,-0x60(%rbp)
  14:   74 00                   je     0x16
[  681.270952] RIP zap_pte_range (mm/memory.c:1132)
[  681.270952]  RSP <ffff8803df053c58>
[  681.270952] CR2: ffffea0003480048

And a longer lockup that shows a few shmem_fallocate calls hanging, but they don't seem to be
the main reason for the hang (the log is pretty long, attached).


Thanks,
Sasha

[-- Attachment #2: out.txt.gz --]
[-- Type: application/gzip, Size: 235278 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
  2014-06-26  9:14                 ` Vlastimil Babka
@ 2014-06-26 15:19                   ` Vlastimil Babka
  -1 siblings, 0 replies; 47+ messages in thread
From: Vlastimil Babka @ 2014-06-26 15:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Sasha Levin, Dave Jones, Andrew Morton,
	linux-mm, linux-fsdevel, LKML

On 06/26/2014 11:14 AM, Vlastimil Babka wrote:
> On 06/26/2014 12:36 AM, Hugh Dickins wrote:
>> On Tue, 24 Jun 2014, Vlastimil Babka wrote:
>>> On 06/16/2014 04:29 AM, Hugh Dickins wrote:
>>>> On Thu, 12 Jun 2014, Sasha Levin wrote:
>>>>> On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>>> Would trinity be likely to have a thread or process repeatedly faulting
>>>>>>> in pages from the hole while it is being punched?
>>>>>>
>>>>>> I can see how trinity would do that, but just to be certain - Cc davej.
>>>>>>
>>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>>> Does this happen with other holepunch filesystems?  If it does not,
>>>>>>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>>>>>>> is lighter than a consistent disk-based filesystem's has to be.
>>>>>>> But we don't want to make the tmpfs path heavier to match them.
>>>>>>
>>>>>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>>>>> punching in other filesystems and I make sure to get a bunch of those
>>>>>> mounted before starting testing.
>>>>>
>>>>> Just pinging this one again. I still see hangs in -next where the hang
>>>>> location looks same as before:
>>>>>
>>>>
>>>> Please give this patch a try.  It fixes what I can reproduce, but given
>>>> your unexplained page_mapped() BUG in this area, we know there's more
>>>> yet to be understood, so perhaps this patch won't do enough for you.
>>>>
>>>
>>> Hi,
>>
>> Sorry for the slow response: I have got confused, learnt more, and
>> changed my mind, several times in the course of replying to you.
>> I think this reply will be stable... though not final.
>
> Thanks a lot for looking into it!
>
>>>
>>> since this got a CVE,
>>
>> Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.
>
> Sorry, I should have mentioned it explicitly.
>
>> Looks overrated to me
>
> I'd bet it would pass unnoticed if you didn't use the sentence "but
> whether it's a serious matter in the scale of denials of service, I'm
> not so sure" in your first reply to Sasha's report :) I wouldn't be
> surprised if people grep for this.
>
>> (and amusing to see my pompous words about a
>> "range notification mechanism" taken too seriously), but of course
>> we do need to address it.
>>
>>> I've been looking at backport to an older kernel where
>>
>> Thanks a lot for looking into it.  I didn't think it was worth a
>> Cc: stable@vger.kernel.org myself, but admit to being both naive
>> and inconsistent about that.
>>
>>> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
>>> range notification mechanism yet. There's just madvise(MADV_REMOVE) and since
>>
>> Yes, that mechanism could be ported back pre-v3.5,
>> but I agree with your preference not to.
>>
>>> it doesn't guarantee anything, it seems simpler just to give up retrying to
>>
>> Right, I don't think we have formally documented the instant of "full hole"
>> that I strove for there, and it's probably not externally verifiable, nor
>> guaranteed by other filesystems.  I just thought it a good QoS aim, but
>> it has given us this problem.
>>
>>> truncate really everything. Then I realized that maybe it would work for
>>> current kernel as well, without having to add any checks in the page fault
>>> path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
>>> from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
>>> the old data from the range, it's fine from any information leak point of view.
>>> If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
>>> page before the parallel truncate has ended, or right after it has ended.
>>
>> Yes.  I disagree with your actual patch, for more than one reason,
>> but it's in the right area; and I found myself growing to agree with
>> you, that it's better to have one kind of fix for all these releases,
>> than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
>> v3.0 too, I'm sceptical about that, but haven't tried it as yet.)
>
> I was looking at our 3.0 based kernel, but it could be due to backported
> patches on top.

OK, seems I cannot reproduce this on 3.0.101 vanilla.

>> If I'd realized that we were going to have to backport, I'd have spent
>> longer looking for a patch like yours originally.  So my inclination
>> now is to go your route, make a new patch for v3.16 and backports,
>> and revert the f00cdc6df7d7 that has already gone in.
>>
>>> So I'm posting it here as an RFC. I haven't thought about the
>>> i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy
>>
>> My understanding is that i915_gem_object_truncate() is not a problem,
>> that i915's dev->struct_mutex serializes all its relevant transitions,
>> plus the object wouldn't even be interestingly accessible to the user.
>>
>>> the new "lstart < inode->i_size" condition, but I don't know if it's "vulnerable"
>>> to the problem.
>>
>> I don't think i915 is vulnerable, but if it is, that condition would
>> be fine for it, as would be the patch I'm now thinking of.
>>
>>>
>>> -----8<-----
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching
>>>
>>> ---
>>>    mm/shmem.c | 19 +++++++++++++++++++
>>>    1 file changed, 19 insertions(+)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index f484c27..6d6005c 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>    		if (!pvec.nr) {
>>>    			if (index == start || unfalloc)
>>>    				break;
>>> +                        /*
>>> +                         * When this condition is true, it means we were
>>> +                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
>>> +                         * To prevent a livelock when someone else is faulting
>>> +                         * pages back, we are content with single pass and do
>>> +                         * not retry with index = start. It's important that
>>> +                         * previous page content has been discarded, and
>>> +                         * faulter(s) got new zeroed pages.
>>> +                         *
>>> +                         * The other callsites are shmem_setattr (for
>>> +                         * truncation) and shmem_evict_inode, which set i_size
>>> +                         * to truncated size or 0, respectively, and then call
>>> +                         * us with lstart == inode->i_size. There we do want to
>>> +                         * retry, and livelock cannot happen for other reasons.
>>> +                         *
>>> +                         * XXX what about i915_gem_object_truncate?
>>> +                         */
>>
>> I doubt you have ever faced such a criticism before, but I'm going
>> to speak my mind and say that comment is too long!  A comment of that
>> length is okay above or just inside or at a natural break in a function,
>> but here it distracts too much from what the code is actually doing.
>
> Fair enough. The reasoning should have gone into commit log, not comment.
>
>> In particular, the words "this condition" are so much closer to the
>> condition above than the condition below, that it's rather confusing.
>>
>> /* Single pass when hole-punching to not livelock on racing faults */
>> would have been enough (yes, I've cheated, that would be 2 or 4 lines).
>>
>>> +                        if (lstart < inode->i_size)
>>
>> For a long time I was going to suggest that you leave i_size out of it,
>> and use "lend > 0" instead.  Then suddenly I realized that this is the
>> wrong place for the test.
>
> Well my first idea was to just add a flag about how persistent it should
> be, and set it false for the punch-hole case. Then I wondered if there's
> already some bit that distinguishes it, but that makes it more subtle.
>
>> And then that it's not your fault, it's mine,
>> in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
>> Wow, that really pessimized the hole-punch case!
>>
>> When is pvec.nr 0?  When we've reached the end of the file.  Why should
>> we go to the end of the file, when punching a hole at the start?  Ughh!
>
> Ah, I see (I think). But I managed to reproduce this problem when there
> was only an extra page between lend and the end of file, so I doubt this
> is the only problem. AFAIU it's enough to try punching a large enough
> hole, then the loop can only do a single pagevec worth of pages per
> iteration, which gives somebody enough time to fault pages back?
>
>>> +                                break;
>>>    			index = start;
>>>    			continue;
>>>    		}
>>> --
>>> 1.8.4.5
>>
>> But there is another problem.  We cannot break out after one pass on
> shmem, because there's a possibility that a swap entry in the radix_tree
>> got swizzled into a page just as it was about to be removed - your patch
>> might then leave that data behind in the hole.
>
> Thanks, I didn't notice that. Do I understand correctly that this could
> mean an info leak for the punch-hole call, but wouldn't be a problem for
> madvise? (In any case, that means the solution is not general enough for
> all kernels, so I'm asking just to be sure).
>
>> As it happens, Konstantin Khlebnikov suggested a patch for that a few
>> weeks ago, before noticing that it's already handled by the endless loop.
>> If we make that loop no longer endless, we need to add in Konstantin's
>> "if (shmem_free_swap) goto retry" patch.
>>
>> Right now I'm thinking that my idiocy in d0823576bf4b may actually
>> be the whole of Trinity's problem: patch below.  If we waste time
>> traversing the radix_tree to end-of-file, no wonder that concurrent
>> faults have time to put something in the hole every time.
>>
>> Sasha, may I trespass on your time, and ask you to revert the previous
>> patch from your tree, and give this patch below a try?  I am very
>> interested to learn if in fact it fixes it for you (as it did for me).
>
> I will try this, but as I explained above, I doubt that alone will help.

Yep, it didn't help here.

>> However, I am wasting your time, in that I think we shall decide that
>> it's too unsafe to rely solely upon the patch below (what happens if
>> 1024 cpus are all faulting on it while we try to punch a 4MB hole at
>
> My reproducer is a 4MB file, where the puncher tries punching everything
> except the first and last page. And there are 8 other threads (as I have
> 8 logical CPUs) that just repeatedly sweep the same range, reading only
> the first byte of each page.
>
>> end of file? if we care).  I think we shall end up with the optimization
>> below (or some such: it can be written in various ways), plus reverting
>> d0823576bf4b's "index == start && " pincer, plus Konstantin's
>> shmem_free_swap handling, rolled into a single patch; and a similar
>
> So that means no retry in any case (except the swap thing)? All callers
> can handle that? I guess shmem_evict_inode would be ok, as nobody else
> can be accessing that inode. But what about shmem_setattr? (i.e.
> straight truncation) As you said earlier, faulters will get a SIGBUS
> (which AFAIU is due to i_size being updated before we enter
> shmem_undo_range). But could a faulter possibly already pass the i_size
> test, and proceed with the fault only when we are already in
> shmem_undo_range and have passed the page in question?
>
>> patch (without the swap part) for several functions in truncate.c.
>>
>> Hugh
>>
>> --- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
>> +++ linux/mm/shmem.c	2014-06-25 10:28:47.063967052 -0700
>> @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
>>    	for ( ; ; ) {
>>    		cond_resched();
>>
>> +		index = min(index, end);
>>    		pvec.nr = find_get_entries(mapping, index,
>>    				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>>    				pvec.pages, indices);
>>
>


^ permalink raw reply	[flat|nested] 47+ messages in thread


* Re: mm: shm: hang in shmem_fallocate
  2014-06-26  9:14                 ` Vlastimil Babka
@ 2014-06-27  5:36                   ` Hugh Dickins
  -1 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2014-06-27  5:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Johannes Weiner, Hugh Dickins, Konstantin Khlebnikov,
	Sasha Levin, Dave Jones, Andrew Morton, linux-mm, linux-fsdevel,
	LKML

[Cc Johannes: at the end I have a particular question for you]

On Thu, 26 Jun 2014, Vlastimil Babka wrote:
> On 06/26/2014 12:36 AM, Hugh Dickins wrote:
> > On Tue, 24 Jun 2014, Vlastimil Babka wrote:
> > 
> > Sorry for the slow response: I have got confused, learnt more, and
> > changed my mind, several times in the course of replying to you.
> > I think this reply will be stable... though not final.
> 
> Thanks a lot for looking into it!
> 
> > > 
> > > since this got a CVE,
> > 
> > Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.
> 
> Sorry, I should have mentioned it explicitly.
> 
> > Looks overrated to me
> 
> I'd bet it would pass unnoticed if you didn't use the sentence "but whether
> it's a serious matter in the scale of denials of service, I'm not so sure" in
> your first reply to Sasha's report :) I wouldn't be surprised if people grep
> for this.

Hah, you're probably right,
I'd better choose my words more carefully in future.

> 
> > (and amusing to see my pompous words about a
> > "range notification mechanism" taken too seriously), but of course
> > we do need to address it.
> > 
> > > I've been looking at backport to an older kernel where
> > 
> > Thanks a lot for looking into it.  I didn't think it was worth a
> > Cc: stable@vger.kernel.org myself, but admit to being both naive
> > and inconsistent about that.
> > 
> > > fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
> > > range notification mechanism yet. There's just madvise(MADV_REMOVE) and
> > > since
> > 
> > Yes, that mechanism could be ported back pre-v3.5,
> > but I agree with your preference not to.
> > 
> > > it doesn't guarantee anything, it seems simpler just to give up retrying
> > > to
> > 
> > Right, I don't think we have formally documented the instant of "full hole"
> > that I strove for there, and it's probably not externally verifiable, nor
> > guaranteed by other filesystems.  I just thought it a good QoS aim, but
> > it has given us this problem.
> > 
> > > truncate really everything. Then I realized that maybe it would work for
> > > current kernel as well, without having to add any checks in the page
> > > fault
> > > path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look
> > > different
> > > from madvise(MADV_REMOVE), but it seems to me that as long as it does
> > > discard
> > > the old data from the range, it's fine from any information leak point of
> > > view.
> > > If someone races page faulting, it IMHO doesn't matter if he gets a new
> > > zeroed
> > > page before the parallel truncate has ended, or right after it has ended.
> > 
> > Yes.  I disagree with your actual patch, for more than one reason,
> > but it's in the right area; and I found myself growing to agree with
> > you, that it's better to have one kind of fix for all these releases,
> > than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
> > v3.0 too, I'm sceptical about that, but haven't tried it as yet.)
> 
> I was looking at our 3.0 based kernel, but it could be due to backported
> patches on top.

And later you confirm that 3.0.101 vanilla is okay: thanks, that fits.

> 
> > If I'd realized that we were going to have to backport, I'd have spent
> > longer looking for a patch like yours originally.  So my inclination
> > now is to go your route, make a new patch for v3.16 and backports,
> > and revert the f00cdc6df7d7 that has already gone in.
> > 
> > > So I'm posting it here as an RFC. I haven't thought about the
> > > i915_gem_object_truncate caller yet. I think that this path wouldn't
> > > satisfy
> > 
> > My understanding is that i915_gem_object_truncate() is not a problem,
> > that i915's dev->struct_mutex serializes all its relevant transitions,
> > plus the object wouldn't even be interestingly accessible to the user.
> > 
> > > the new "lstart < inode->i_size" condition, but I don't know if it's
> > > "vulnerable"
> > > to the problem.
> > 
> > I don't think i915 is vulnerable, but if it is, that condition would
> > be fine for it, as would be the patch I'm now thinking of.
> > 
> > > 
> > > -----8<-----
> > > From: Vlastimil Babka <vbabka@suse.cz>
> > > Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole
> > > punching
> > > 
> > > ---
> > >   mm/shmem.c | 19 +++++++++++++++++++
> > >   1 file changed, 19 insertions(+)
> > > 
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index f484c27..6d6005c 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode,
> > > loff_t lstart, loff_t lend,
> > >   		if (!pvec.nr) {
> > >   			if (index == start || unfalloc)
> > >   				break;
> > > +                        /*
> > > +                         * When this condition is true, it means we were
> > > +                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
> > > +                         * To prevent a livelock when someone else is
> > > faulting
> > > +                         * pages back, we are content with single pass
> > > and do
> > > +                         * not retry with index = start. It's important
> > > that
> > > +                         * previous page content has been discarded, and
> > > +                         * faulter(s) got new zeroed pages.
> > > +                         *
> > > +                         * The other callsites are shmem_setattr (for
> > > +                         * truncation) and shmem_evict_inode, which set
> > > i_size
> > > +                         * to truncated size or 0, respectively, and
> > > then call
> > > +                         * us with lstart == inode->i_size. There we do
> > > want to
> > > +                         * retry, and livelock cannot happen for other
> > > reasons.
> > > +                         *
> > > +                         * XXX what about i915_gem_object_truncate?
> > > +                         */
> > 
> > I doubt you have ever faced such a criticism before, but I'm going
> > to speak my mind and say that comment is too long!  A comment of that
> > length is okay above or just inside or at a natural break in a function,
> > but here it distracts too much from what the code is actually doing.
> 
> Fair enough. The reasoning should have gone into commit log, not comment.
> 
> > In particular, the words "this condition" are so much closer to the
> > condition above than the condition below, that it's rather confusing.
> > 
> > /* Single pass when hole-punching to not livelock on racing faults */
> > would have been enough (yes, I've cheated, that would be 2 or 4 lines).
> > 
> > > +                        if (lstart < inode->i_size)
> > 
> > For a long time I was going to suggest that you leave i_size out of it,
> > and use "lend > 0" instead.  Then suddenly I realized that this is the
> > wrong place for the test.
> 
> Well my first idea was to just add a flag about how persistent it should be,
> and set it false for the punch-hole case. Then I wondered if there's already
> some bit that distinguishes it, but that makes it more subtle.
> 
> > And then that it's not your fault, it's mine,
> > in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
> > Wow, that really pessimized the hole-punch case!
> > 
> > When is pvec.nr 0?  When we've reached the end of the file.  Why should
> > we go to the end of the file, when punching a hole at the start?  Ughh!
> 
> Ah, I see (I think). But I managed to reproduce this problem when there was
> only an extra page between lend and the end of file, so I doubt this is the
> only problem. AFAIU it's enough to try punching a large enough hole, then the
> loop can only do a single pagevec worth of pages per iteration, which gives
> somebody enough time to fault pages back?

That's useful info, thank you: I just wasn't trying hard enough then;
and you didn't even need 1024 cpus to show it either.  Right, we have
to revert my pincer, certainly on shmem.  And I think I'd better do the
same change on generic filesystems too (nobody has bothered to implement
hole-punch on ramfs, but if they did, they would hit the same problem):
though that part of it doesn't need a backport to -stable.

> 
> > > +                                break;
> > >   			index = start;
> > >   			continue;
> > >   		}
> > > --
> > > 1.8.4.5
> > 
> > But there is another problem.  We cannot break out after one pass on
> > shmem, because there's a possibility that a swap entry in the radix_tree
> > got swizzled into a page just as it was about to be removed - your patch
> > might then leave that data behind in the hole.
> 
> Thanks, I didn't notice that. Do I understand correctly that this could mean
> an info leak for the punch-hole call, but wouldn't be a problem for madvise? (In
> any case, that means the solution is not general enough for all kernels, so
> I'm asking just to be sure).

It's exactly the same issue for the madvise as for the fallocate:
data that is promised to have been punched out would still be there.

Very hard case to trigger, though, I think: since by the time we get
to this loop, we have already made one pass down the hole, getting rid
of everything that wasn't page-locked at the time, so the chance of
catching any swap in this loop is lower.

> 
> > As it happens, Konstantin Khlebnikov suggested a patch for that a few
> > weeks ago, before noticing that it's already handled by the endless loop.
> > If we make that loop no longer endless, we need to add in Konstantin's
> > "if (shmem_free_swap) goto retry" patch.
> > 
> > Right now I'm thinking that my idiocy in d0823576bf4b may actually
> > be the whole of Trinity's problem: patch below.  If we waste time
> > traversing the radix_tree to end-of-file, no wonder that concurrent
> > faults have time to put something in the hole every time.
> > 
> > Sasha, may I trespass on your time, and ask you to revert the previous
> > patch from your tree, and give this patch below a try?  I am very
> > interested to learn if in fact it fixes it for you (as it did for me).
> 
> I will try this, but as I explained above, I doubt that alone will help.

And afterwards you confirmed, thank you.

> 
> > However, I am wasting your time, in that I think we shall decide that
> > it's too unsafe to rely solely upon the patch below (what happens if
> > 1024 cpus are all faulting on it while we try to punch a 4MB hole at
> 
> My reproducer is a 4MB file, where the puncher tries punching everything except
> the first and last page. And there are 8 other threads (as I have 8 logical
> CPUs) that just repeatedly sweep the same range, reading only the first byte
> of each page.
> 
> > end of file? if we care).  I think we shall end up with the optimization
> > below (or some such: it can be written in various ways), plus reverting
> > d0823576bf4b's "index == start && " pincer, plus Konstantin's
> > shmem_free_swap handling, rolled into a single patch; and a similar
> 
> So that means no retry in any case (except the swap thing)? All callers can
> handle that? I guess shmem_evict_inode would be ok, as nobody else
> can be accessing that inode. But what about shmem_setattr? (i.e. straight
> truncation) As you said earlier, faulters will get a SIGBUS (which AFAIU is
> due to i_size being updated before we enter shmem_undo_range). But could
> a faulter possibly already pass the i_size test, and proceed with the fault
> only when we are already in shmem_undo_range and have passed the page in
> question?

We still have to retry indefinitely in the truncation case, as you
rightly guess.  SIGBUS beyond i_size makes it a much easier case to
handle, and there's no danger of "indefinitely" becoming "infinitely"
as in the punch-hole case.  But, depending on how the filesystem
handles its end, there is still some possibility of a race with faulting,
which some filesystems may require pagecache truncation to resolve.

Does shmem truncation itself require that?  Er, er, it would take me
too long to work out the definitive answer: perhaps it doesn't, but for
safety I certainly assume that it does require that - that is, I never
even considered removing the indefinite loop from the truncation case.

> 
> > patch (without the swap part) for several functions in truncate.c.
> > 
> > Hugh
> > 
> > --- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
> > +++ linux/mm/shmem.c	2014-06-25 10:28:47.063967052 -0700
> > @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
> >   	for ( ; ; ) {
> >   		cond_resched();
> > 
> > +		index = min(index, end);
> >   		pvec.nr = find_get_entries(mapping, index,
> >   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
> >   				pvec.pages, indices);

So let's all forget that patch, although it does help to highlight my
mistake in d0823576bf4b.  (Oh, hey, let's all forget my mistake too!)

Here's the 3.16-rc2 patch that I've now settled on (which will also
require a revert of current git's f00cdc6df7d7; well, not require the
revert, but this makes that redundant, and cannot be tested with it in).

I've not yet had time to write up the patch description, nor to test
it fully; but thought I should get the patch itself into the open for
review and testing before then.

I've checked against v3.1 to see how it works out there: certainly
wouldn't apply cleanly (and beware: prior to v3.5's shmem_undo_range,
"end" was included in the range, not excluded), but the same
principles apply.  Haven't checked the intermediates yet, will
probably leave those until each stable wants them - but if you've a
particular release in mind, please ask, or ask me to check your port.

I've included the mm/truncate.c part of it here, but that will be a
separate (not for -stable) patch when I post the finalized version.

Hannes, a question for you please, I just could not make up my mind.
In mm/truncate.c truncate_inode_pages_range(), what should be done
with a failed clear_exceptional_entry() in the case of hole-punch?
Is that case currently depending on the rescan loop (that I'm about
to revert) to remove a new page, so I would need to add a retry for
that rather like the shmem_free_swap() one?  Or is it irrelevant,
and can stay unchanged as below?  I've veered back and forth,
thinking first one and then the other.

Thanks,
Hugh

---

 mm/shmem.c    |   19 ++++++++++---------
 mm/truncate.c |   14 +++++---------
 2 files changed, 15 insertions(+), 18 deletions(-)

--- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
+++ linux/mm/shmem.c	2014-06-26 15:41:52.704362962 -0700
@@ -467,23 +467,20 @@ static void shmem_undo_range(struct inod
 		return;
 
 	index = start;
-	for ( ; ; ) {
+	while (index < end) {
 		cond_resched();
 
 		pvec.nr = find_get_entries(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
 				pvec.pages, indices);
 		if (!pvec.nr) {
-			if (index == start || unfalloc)
+			/* If all gone or hole-punch or unfalloc, we're done */
+			if (index == start || end != -1)
 				break;
+			/* But if truncating, restart to make sure all gone */
 			index = start;
 			continue;
 		}
-		if ((index == start || unfalloc) && indices[0] >= end) {
-			pagevec_remove_exceptionals(&pvec);
-			pagevec_release(&pvec);
-			break;
-		}
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -495,8 +492,12 @@ static void shmem_undo_range(struct inod
 			if (radix_tree_exceptional_entry(page)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += !shmem_free_swap(mapping,
-								index, page);
+				if (shmem_free_swap(mapping, index, page)) {
+					/* Swap was replaced by page: retry */
+					index--;
+					break;
+				}
+				nr_swaps_freed++;
 				continue;
 			}
 
--- 3.16-rc2/mm/truncate.c	2014-06-08 11:19:54.000000000 -0700
+++ linux/mm/truncate.c	2014-06-26 16:31:35.932433863 -0700
@@ -352,21 +352,17 @@ void truncate_inode_pages_range(struct a
 		return;
 
 	index = start;
-	for ( ; ; ) {
+	while (index < end) {
 		cond_resched();
 		if (!pagevec_lookup_entries(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE),
-			indices)) {
-			if (index == start)
+			min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
+			/* If all gone or hole-punch, we're done */
+			if (index == start || end != -1)
 				break;
+			/* But if truncating, restart to make sure all gone */
 			index = start;
 			continue;
 		}
-		if (index == start && indices[0] >= end) {
-			pagevec_remove_exceptionals(&pvec);
-			pagevec_release(&pvec);
-			break;
-		}
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
@ 2014-06-27  5:36                   ` Hugh Dickins
  0 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2014-06-27  5:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Johannes Weiner, Hugh Dickins, Konstantin Khlebnikov,
	Sasha Levin, Dave Jones, Andrew Morton, linux-mm, linux-fsdevel,
	LKML

[Cc Johannes: at the end I have a particular question for you]

On Thu, 26 Jun 2014, Vlastimil Babka wrote:
> On 06/26/2014 12:36 AM, Hugh Dickins wrote:
> > On Tue, 24 Jun 2014, Vlastimil Babka wrote:
> > 
> > Sorry for the slow response: I have got confused, learnt more, and
> > changed my mind, several times in the course of replying to you.
> > I think this reply will be stable... though not final.
> 
> Thanks a lot for looking into it!
> 
> > > 
> > > since this got a CVE,
> > 
> > Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.
> 
> Sorry, I should have mentioned it explicitly.
> 
> > Looks overrated to me
> 
> I'd bet it would pass unnoticed if you didn't use the sentence "but whether
> it's a serious matter in the scale of denials of service, I'm not so sure" in
> your first reply to Sasha's report :) I wouldn't be surprised if people grep
> for this.

Hah, you're probably right,
I better choose my words more carefully in future.

> 
> > (and amusing to see my pompous words about a
> > "range notification mechanism" taken too seriously), but of course
> > we do need to address it.
> > 
> > > I've been looking at backport to an older kernel where
> > 
> > Thanks a lot for looking into it.  I didn't think it was worth a
> > Cc: stable@vger.kernel.org myself, but admit to being both naive
> > and inconsistent about that.
> > 
> > > fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
> > > range notification mechanism yet. There's just madvise(MADV_REMOVE) and
> > > since
> > 
> > Yes, that mechanism could be ported back pre-v3.5,
> > but I agree with your preference not to.
> > 
> > > it doesn't guarantee anything, it seems simpler just to give up retrying
> > > to
> > 
> > Right, I don't think we have formally documented the instant of "full hole"
> > that I strove for there, and it's probably not externally verifiable, nor
> > guaranteed by other filesystems.  I just thought it a good QoS aim, but
> > it has given us this problem.
> > 
> > > truncate really everything. Then I realized that maybe it would work for
> > > current kernel as well, without having to add any checks in the page
> > > fault
> > > path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look
> > > different
> > > from madvise(MADV_REMOVE), but it seems to me that as long as it does
> > > discard
> > > the old data from the range, it's fine from any information leak point of
> > > view.
> > > If someone races page faulting, it IMHO doesn't matter if he gets a new
> > > zeroed
> > > page before the parallel truncate has ended, or right after it has ended.
> > 
> > Yes.  I disagree with your actual patch, for more than one reason,
> > but it's in the right area; and I found myself growing to agree with
> > you, that's it's better to have one kind of fix for all these releases,
> > than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
> > v3.0 too, I'm sceptical about that, but haven't tried it as yet.)
> 
> I was looking at our 3.0 based kernel, but it could be due to backported
> patches on top.

And later you confirm that 3.0.101 vanilla is okay: thanks, that fits.

> 
> > If I'd realized that we were going to have to backport, I'd have spent
> > longer looking for a patch like yours originally.  So my inclination
> > now is to go your route, make a new patch for v3.16 and backports,
> > and revert the f00cdc6df7d7 that has already gone in.
> > 
> > > So I'm posting it here as a RFC. I haven't thought about the
> > > i915_gem_object_truncate caller yet. I think that this path wouldn't
> > > satisfy
> > 
> > My understanding is that i915_gem_object_truncate() is not a problem,
> > that i915's dev->struct_mutex serializes all its relevant transitions,
> > plus the object woudn't even be interestingly accessible to the user.
> > 
> > > the new "lstart < inode->i_size" condition, but I don't know if it's
> > > "vulnerable"
> > > to the problem.
> > 
> > I don't think i915 is vulnerable, but if it is, that condition would
> > be fine for it, as would be the patch I'm now thinking of.
> > 
> > > 
> > > -----8<-----
> > > From: Vlastimil Babka <vbabka@suse.cz>
> > > Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole
> > > punching
> > > 
> > > ---
> > >   mm/shmem.c | 19 +++++++++++++++++++
> > >   1 file changed, 19 insertions(+)
> > > 
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index f484c27..6d6005c 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode,
> > > loff_t lstart, loff_t lend,
> > >   		if (!pvec.nr) {
> > >   			if (index == start || unfalloc)
> > >   				break;
> > > +                        /*
> > > +                         * When this condition is true, it means we were
> > > +                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
> > > +                         * To prevent a livelock when someone else is
> > > faulting
> > > +                         * pages back, we are content with single pass
> > > and do
> > > +                         * not retry with index = start. It's important
> > > that
> > > +                         * previous page content has been discarded, and
> > > +                         * faulter(s) got new zeroed pages.
> > > +                         *
> > > +                         * The other callsites are shmem_setattr (for
> > > +                         * truncation) and shmem_evict_inode, which set
> > > i_size
> > > +                         * to truncated size or 0, respectively, and
> > > then call
> > > +                         * us with lstart == inode->i_size. There we do
> > > want to
> > > +                         * retry, and livelock cannot happen for other
> > > reasons.
> > > +                         *
> > > +                         * XXX what about i915_gem_object_truncate?
> > > +                         */
> > 
> > I doubt you have ever faced such a criticism before, but I'm going
> > to speak my mind and say that comment is too long!  A comment of that
> > length is okay above or just inside or at a natural break in a function,
> > but here it distracts too much from what the code is actually doing.
> 
> Fair enough. The reasoning should have gone into commit log, not comment.
> 
> > In particular, the words "this condition" are so much closer to the
> > condition above than the condition below, that it's rather confusing.
> > 
> > /* Single pass when hole-punching to not livelock on racing faults */
> > would have been enough (yes, I've cheated, that would be 2 or 4 lines).
> > 
> > > +                        if (lstart < inode->i_size)
> > 
> > For a long time I was going to suggest that you leave i_size out of it,
> > and use "lend > 0" instead.  Then suddenly I realized that this is the
> > wrong place for the test.
> 
> Well my first idea was to just add a flag about how persistent it should be.
> And set it false for the punch hole case. Then I wondered if there's already
> some bit that distinguishes it. But it makes it more subtle.
> 
> > And then that it's not your fault, it's mine,
> > in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
> > Wow, that really pessimized the hole-punch case!
> > 
> > When is pvec.nr 0?  When we've reached the end of the file.  Why should
> > we go to the end of the file, when punching a hole at the start?  Ughh!
> 
> Ah, I see (I think). But I managed to reproduce this problem when there was
> only an extra page between lend and the end of file, so I doubt this is the
> only problem. AFAIU it's enough to try punching a large enough hole, then the
> loop can only do a single pagevec worth of pages per iteration, which gives
> enough time for somebody faulting pages back?

That's useful info, thank you: I just wasn't trying hard enough then;
and you didn't even need 1024 cpus to show it either.  Right, we have
to revert my pincer, certainly on shmem.  And I think I'd better do the
same change on generic filesystems too (nobody has bothered to implement
hole-punch on ramfs, but if they did, they would hit the same problem):
though that part of it doesn't need a backport to -stable.
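
To spell out the pessimization with a sketch (abbreviated from the
3.16-rc2 loop quoted further down; annotations mine): once the inner
loop has stepped index past end, find_get_entries() goes on returning
entries from beyond the hole, so pvec.nr only drops to 0 at
end-of-file; and there the "index == start" test fails, sending us
back to start for another full pass - ample time for faulters to
refill the hole.

	for ( ; ; ) {
		cond_resched();
		pvec.nr = find_get_entries(mapping, index,
				min(end - index, (pgoff_t)PAGEVEC_SIZE),
				pvec.pages, indices);
		if (!pvec.nr) {			/* 0 only at EOF */
			if (index == start || unfalloc)
				break;
			index = start;		/* the pincer: rescan */
			continue;
		}
		/* entries with index >= end are skipped, not removed */
		...
	}

Hence the min(index, end) clamp in the patch quoted below.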

> 
> > > +                                break;
> > >   			index = start;
> > >   			continue;
> > >   		}
> > > --
> > > 1.8.4.5
> > 
> > But there is another problem.  We cannot break out after one pass on
> > shmem, because there's a possibility that a swap entry in the radix_tree
> > got swizzled into a page just as it was about to be removed - your patch
> > might then leave that data behind in the hole.
> 
> Thanks, I didn't notice that. Do I understand correctly that this could mean
> info leak for the punch hole call, but wouldn't be a problem for madvise? (In
> any case, that means the solution is not general enough for all kernels, so
> I'm asking just to be sure).

It's exactly the same issue for the madvise as for the fallocate:
data that is promised to have been punched out would still be there.

Very hard case to trigger, though, I think: since by the time we get
to this loop, we have already made one pass down the hole, getting rid
of everything that wasn't page-locked at the time, so the chance of
catching any swap in this loop is lower.
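
To picture the window (a sketch of the interleaving, not code from
anywhere):

	hole-puncher				faulter
	------------				-------
	find_get_entries() returns a
	swap entry for index i
						fault swaps it back in:
						radix_tree slot i now
						holds a page again
	shmem_free_swap(i) fails,
	the entry is already gone
	single pass ends: old data at i
	survives inside the "hole"

Hence Konstantin's retry on a failed shmem_free_swap(), added below.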

> 
> > As it happens, Konstantin Khlebnikov suggested a patch for that a few
> > weeks ago, before noticing that it's already handled by the endless loop.
> > If we make that loop no longer endless, we need to add in Konstantin's
> > "if (shmem_free_swap) goto retry" patch.
> > 
> > Right now I'm thinking that my idiocy in d0823576bf4b may actually
> > be the whole of Trinity's problem: patch below.  If we waste time
> > traversing the radix_tree to end-of-file, no wonder that concurrent
> > faults have time to put something in the hole every time.
> > 
> > Sasha, may I trespass on your time, and ask you to revert the previous
> > patch from your tree, and give this patch below a try?  I am very
> > interested to learn if in fact it fixes it for you (as it did for me).
> 
> I will try this, but as I explained above, I doubt that alone will help.

And afterwards you confirmed, thank you.

> 
> > However, I am wasting your time, in that I think we shall decide that
> > it's too unsafe to rely solely upon the patch below (what happens if
> > 1024 cpus are all faulting on it while we try to punch a 4MB hole at
> 
> My reproducer is a 4MB file, where the puncher tries punching everything except
> first and last page. And there are 8 other threads (as I have 8 logical
> CPU's) that just repeatedly sweep the same range, reading only the first byte
> of each page.
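
(For the archives, that description translates to something like the
sketch below: my reconstruction, not Vlastimil's actual program - file
name and loop details invented, error checking omitted.)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/falloc.h>
	#include <pthread.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define SIZE	(4 << 20)
	#define PAGE	4096

	static char *map;

	static void *reader(void *arg)
	{
		volatile char c;

		for (;;)	/* sweep the hole, first byte of each page */
			for (size_t off = PAGE; off < SIZE - PAGE; off += PAGE)
				c = map[off];
		return NULL;
	}

	int main(void)
	{
		int fd = open("/dev/shm/repro", O_RDWR | O_CREAT, 0600);
		pthread_t th[8];
		int i;

		ftruncate(fd, SIZE);
		map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		for (i = 0; i < 8; i++)
			pthread_create(&th[i], NULL, reader, NULL);
		for (;;)	/* punch all but the first and last page */
			fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				  PAGE, SIZE - 2 * PAGE);
	}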
> 
> > end of file? if we care).  I think we shall end up with the optimization
> > below (or some such: it can be written in various ways), plus reverting
> > d0823576bf4b's "index == start && " pincer, plus Konstantin's
> > shmem_free_swap handling, rolled into a single patch; and a similar
> 
> So that means no retry in any case (except the swap thing)? All callers can
> handle that? I guess shmem_evict_inode would be ok, as nobody else
> can be accessing that inode. But what about shmem_setattr? (i.e. straight
> truncation) As you said earlier, faulters will get a SIGBUS (which AFAIU is
> due to i_size being updated before we enter shmem_undo_range). But could
> possibly a faulter already pass the i_size test, and proceed with the fault
> only when we are already in shmem_undo_range and have passed the page in
> question?

We still have to retry indefinitely in the truncation case, as you
rightly guess.  SIGBUS beyond i_size makes it a much easier case to
handle, and there's no danger of "indefinitely" becoming "infinitely"
as in the punch-hole case.  But, depending on how the filesystem
handles its end, there is still some possibility of a race with faulting,
which some filesystems may require pagecache truncation to resolve.

Does shmem truncation itself require that?  Er, er, it would take me
too long to work out the definitive answer: perhaps it doesn't, but for
safety I certainly assume that it does require that - that is, I never
even considered removing the indefinite loop from the truncation case.
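
To make the SIGBUS point concrete (the check as I remember it in
shmem_getpage_gfp(), so treat this as a sketch rather than a quote
from the tree):

	if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode))
		return -EINVAL;	/* shmem_fault() turns this into SIGBUS */

shmem_setattr() lowers i_size before calling shmem_truncate_range(),
so faulters beyond the new end cannot instantiate fresh pages, and the
indefinite truncation loop is bound to terminate.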

> 
> > patch (without the swap part) for several functions in truncate.c.
> > 
> > Hugh
> > 
> > --- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
> > +++ linux/mm/shmem.c	2014-06-25 10:28:47.063967052 -0700
> > @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
> >   	for ( ; ; ) {
> >   		cond_resched();
> > 
> > +		index = min(index, end);
> >   		pvec.nr = find_get_entries(mapping, index,
> >   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
> >   				pvec.pages, indices);

So let's all forget that patch, although it does help to highlight my
mistake in d0823576bf4b.  (Oh, hey, let's all forget my mistake too!)

Here's the 3.16-rc2 patch that I've now settled on (which will also
require a revert of current git's f00cdc6df7d7; well, not require the
revert, but this makes that redundant, and cannot be tested with it in).

I've not yet had time to write up the patch description, nor to test
it fully; but thought I should get the patch itself into the open for
review and testing before then.

I've checked against v3.1 to see how it works out there: certainly
wouldn't apply cleanly (and beware: prior to v3.5's shmem_undo_range,
"end" was included in the range, not excluded), but the same
principles apply.  Haven't checked the intermediates yet, will
probably leave those until each stable wants them - but if you've a
particular release in mind, please ask, or ask me to check your port.

I've included the mm/truncate.c part of it here, but that will be a
separate (not for -stable) patch when I post the finalized version.

Hannes, a question for you please, I just could not make up my mind.
In mm/truncate.c truncate_inode_pages_range(), what should be done
with a failed clear_exceptional_entry() in the case of hole-punch?
Is that case currently depending on the rescan loop (that I'm about
to revert) to remove a new page, so I would need to add a retry for
that rather like the shmem_free_swap() one?  Or is it irrelevant,
and can stay unchanged as below?  I've veered back and forth,
thinking first one and then the other.

Thanks,
Hugh

---

 mm/shmem.c    |   19 ++++++++++---------
 mm/truncate.c |   14 +++++---------
 2 files changed, 15 insertions(+), 18 deletions(-)

--- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
+++ linux/mm/shmem.c	2014-06-26 15:41:52.704362962 -0700
@@ -467,23 +467,20 @@ static void shmem_undo_range(struct inod
 		return;
 
 	index = start;
-	for ( ; ; ) {
+	while (index < end) {
 		cond_resched();
 
 		pvec.nr = find_get_entries(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
 				pvec.pages, indices);
 		if (!pvec.nr) {
-			if (index == start || unfalloc)
+			/* If all gone or hole-punch or unfalloc, we're done */
+			if (index == start || end != -1)
 				break;
+			/* But if truncating, restart to make sure all gone */
 			index = start;
 			continue;
 		}
-		if ((index == start || unfalloc) && indices[0] >= end) {
-			pagevec_remove_exceptionals(&pvec);
-			pagevec_release(&pvec);
-			break;
-		}
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -495,8 +492,12 @@ static void shmem_undo_range(struct inod
 			if (radix_tree_exceptional_entry(page)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += !shmem_free_swap(mapping,
-								index, page);
+				if (shmem_free_swap(mapping, index, page)) {
+					/* Swap was replaced by page: retry */
+					index--;
+					break;
+				}
+				nr_swaps_freed++;
 				continue;
 			}
 
--- 3.16-rc2/mm/truncate.c	2014-06-08 11:19:54.000000000 -0700
+++ linux/mm/truncate.c	2014-06-26 16:31:35.932433863 -0700
@@ -352,21 +352,17 @@ void truncate_inode_pages_range(struct a
 		return;
 
 	index = start;
-	for ( ; ; ) {
+	while (index < end) {
 		cond_resched();
 		if (!pagevec_lookup_entries(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE),
-			indices)) {
-			if (index == start)
+			min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
+			/* If all gone or hole-punch, we're done */
+			if (index == start || end != -1)
 				break;
+			/* But if truncating, restart to make sure all gone */
 			index = start;
 			continue;
 		}
-		if (index == start && indices[0] >= end) {
-			pagevec_remove_exceptionals(&pvec);
-			pagevec_release(&pvec);
-			break;
-		}
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
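
(One subtlety for reviewers, since I've not yet written the
description: the "index--; break" on a failed shmem_free_swap() in the
shmem.c hunk works together with the inner loop's index = indices[i]
and the outer loop's index++, neither visible in the hunk - roughly:

	index = indices[i];	/* inner loop, just above the hunk */
	...
	if (shmem_free_swap(mapping, index, page)) {
		index--;	/* swap swizzled to page ... */
		break;
	}
	...
	index++;		/* ... so we come back to this slot */

so the slot where a page replaced the swap entry gets rescanned.)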


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
  2014-06-26 15:11               ` Sasha Levin
@ 2014-06-27  5:59                   ` Hugh Dickins
  0 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2014-06-27  5:59 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Vlastimil Babka, Konstantin Khlebnikov, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

On Thu, 26 Jun 2014, Sasha Levin wrote:
> On 06/25/2014 06:36 PM, Hugh Dickins wrote:
> > Sasha, may I trespass on your time, and ask you to revert the previous
> > patch from your tree, and give this patch below a try?  I am very
> > interested to learn if in fact it fixes it for you (as it did for me).
> 
> Hi Hugh,
> 
> Happy to help,

Thank you!  Though Vlastimil has made it clear that we cannot go
forward with that one-liner patch, so I've just proposed another.

> and as I often do I will answer with a question.
> 
> I've observed two different issues after reverting the original fix and
> applying this new patch. Both of them seem semi-related, but I'm not sure.

I've rather concentrated on putting the new patch together, and just
haven't had time to give these two much thought - nor shall tomorrow,
I'm afraid.

> 
> First, this:
> 
> [  681.267487] BUG: unable to handle kernel paging request at ffffea0003480048
> [  681.268621] IP: zap_pte_range (mm/memory.c:1132)

Weird, I don't think we've seen anything like that before, have we?
I'm pretty sure it's not a consequence of my "index = min(index, end)",
but what it portends I don't know.  Please confirm mm/memory.c:1132 -
that's the "if (PageAnon(page))" line, isn't it?  Which indeed matches
the code below.  So accessing page->mapping is causing an oops...
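
For reference, the test being executed (the 3.16-era definition from
include/linux/mm.h, quoted from memory):

	static inline int PageAnon(struct page *page)
	{
		return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
	}

which fits your disassembly: "testb $0x1,0x8(%r12)" is bit 0 of
page->mapping, with %r12 = ffffea0003480040 as the struct page pointer
in the vmemmap, and its ->mapping at +0x8 giving the faulting address
ffffea0003480048.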

> [  681.269335] PGD 37fcc067 PUD 37fcb067 PMD 0
> [  681.269972] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  681.270952] Dumping ftrace buffer:
> [  681.270952]    (ftrace buffer empty)
> [  681.270952] Modules linked in:
> [  681.270952] CPU: 7 PID: 1952 Comm: trinity-c29 Not tainted 3.16.0-rc2-next-20140625-sasha-00025-g2e02e05-dirty #730
> [  681.270952] task: ffff8803e6f58000 ti: ffff8803df050000 task.ti: ffff8803df050000
> [  681.270952] RIP: zap_pte_range (mm/memory.c:1132)
> [  681.270952] RSP: 0018:ffff8803df053c58  EFLAGS: 00010246
> [  681.270952] RAX: ffffea0003480040 RBX: ffff8803edae7a70 RCX: 0000000003480040
> [  681.270952] RDX: 00000000d2001730 RSI: 0000000000000000 RDI: 00000000d2001730
> [  681.270952] RBP: ffff8803df053cf8 R08: ffff88000015cc00 R09: 0000000000000000
> [  681.270952] R10: 0000000000000001 R11: 0000000000000000 R12: ffffea0003480040
> [  681.270952] R13: ffff8803df053de8 R14: 00007fc15014f000 R15: 00007fc15014e000
> [  681.270952] FS:  00007fc15031b700(0000) GS:ffff8801ece00000(0000) knlGS:0000000000000000
> [  681.270952] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  681.270952] CR2: ffffea0003480048 CR3: 000000001a02e000 CR4: 00000000000006a0
> [  681.270952] Stack:
> [  681.270952]  ffff8803df053de8 00000000d2001000 00000000d2001fff ffff8803e6f58000
> [  681.270952]  0000000000000000 0000000000000001 ffff880404dd8400 ffff8803e6e31900
> [  681.270952]  00000000d2001730 ffff88000015cc00 0000000000000000 ffff8804078f8000
> [  681.270952] Call Trace:
> [  681.270952] unmap_single_vma (mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1301 mm/memory.c:1346)
> [  681.270952] unmap_vmas (mm/memory.c:1375 (discriminator 1))
> [  681.270952] exit_mmap (mm/mmap.c:2797)
> [  681.270952] ? preempt_count_sub (kernel/sched/core.c:2606)
> [  681.270952] mmput (kernel/fork.c:638)
> [  681.270952] do_exit (kernel/exit.c:744)
> [  681.270952] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [  681.270952] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
> [  681.270952] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
> [  681.270952] do_group_exit (kernel/exit.c:884)
> [  681.270952] SyS_exit_group (kernel/exit.c:895)
> [  681.270952] tracesys (arch/x86/kernel/entry_64.S:542)
> [ 681.270952] Code: e8 cf 39 25 03 49 8b 4c 24 10 48 39 c8 74 1c 48 8b 7d b8 48 c1 e1 0c 48 89 da 48 83 c9 40 4c 89 fe e8 e5 db ff ff 0f 1f 44 00 00 <41> f6 44 24 08 01 74 08 83 6d c8 01 eb 33 66 90 f6 45 a0 40 74
> All code
> ========
>    0:   e8 cf 39 25 03          callq  0x32539d4
>    5:   49 8b 4c 24 10          mov    0x10(%r12),%rcx
>    a:   48 39 c8                cmp    %rcx,%rax
>    d:   74 1c                   je     0x2b
>    f:   48 8b 7d b8             mov    -0x48(%rbp),%rdi
>   13:   48 c1 e1 0c             shl    $0xc,%rcx
>   17:   48 89 da                mov    %rbx,%rdx
>   1a:   48 83 c9 40             or     $0x40,%rcx
>   1e:   4c 89 fe                mov    %r15,%rsi
>   21:   e8 e5 db ff ff          callq  0xffffffffffffdc0b
>   26:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   2b:*  41 f6 44 24 08 01       testb  $0x1,0x8(%r12)           <-- trapping instruction
>   31:   74 08                   je     0x3b
>   33:   83 6d c8 01             subl   $0x1,-0x38(%rbp)
>   37:   eb 33                   jmp    0x6c
>   39:   66 90                   xchg   %ax,%ax
>   3b:   f6 45 a0 40             testb  $0x40,-0x60(%rbp)
>   3f:   74 00                   je     0x41
> 
> Code starting with the faulting instruction
> ===========================================
>    0:   41 f6 44 24 08 01       testb  $0x1,0x8(%r12)
>    6:   74 08                   je     0x10
>    8:   83 6d c8 01             subl   $0x1,-0x38(%rbp)
>    c:   eb 33                   jmp    0x41
>    e:   66 90                   xchg   %ax,%ax
>   10:   f6 45 a0 40             testb  $0x40,-0x60(%rbp)
>   14:   74 00                   je     0x16
> [  681.270952] RIP zap_pte_range (mm/memory.c:1132)
> [  681.270952]  RSP <ffff8803df053c58>
> [  681.270952] CR2: ffffea0003480048
> 
> And a longer lockup that shows a few shmem_fallocate calls hanging, but they don't
> seem to be the main reason for the hang (log is pretty long, attached).

I wandered through the log you attached, but didn't have much idea of
what I should look for, and set it aside to get on with other things.

I think it was just confirming what Vlastimil indicated, that the
one-liner patch is not good enough: lots of hole-punches waiting
their turn for i_mutex, though I didn't see as many corresponding
faults into those holes as perhaps I expected.

Your mm/memory.c:1132 is more intriguing, but right now I can't
afford to get very intrigued by it!

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
  2014-06-27  5:59                   ` Hugh Dickins
@ 2014-06-27 14:50                     ` Sasha Levin
  -1 siblings, 0 replies; 47+ messages in thread
From: Sasha Levin @ 2014-06-27 14:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Konstantin Khlebnikov, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

On 06/27/2014 01:59 AM, Hugh Dickins wrote:
>> > First, this:
>> > 
>> > [  681.267487] BUG: unable to handle kernel paging request at ffffea0003480048
>> > [  681.268621] IP: zap_pte_range (mm/memory.c:1132)
> Weird, I don't think we've seen anything like that before, have we?
> I'm pretty sure it's not a consequence of my "index = min(index, end)",
> but what it portends I don't know.  Please confirm mm/memory.c:1132 -
> that's the "if (PageAnon(page))" line, isn't it?  Which indeed matches
> the code below.  So accessing page->mapping is causing an oops...

Right, that's the correct line.

At this point I'm pretty sure that it's somehow related to that one line
patch since it reproduced fairly quickly after applying it, and when I
removed it I didn't see it happening again during the overnight fuzzing.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
  2014-06-27 14:50                     ` Sasha Levin
@ 2014-06-27 18:03                       ` Hugh Dickins
  -1 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2014-06-27 18:03 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Vlastimil Babka, Konstantin Khlebnikov, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

On Fri, 27 Jun 2014, Sasha Levin wrote:
> On 06/27/2014 01:59 AM, Hugh Dickins wrote:
> >> > First, this:
> >> > 
> >> > [  681.267487] BUG: unable to handle kernel paging request at ffffea0003480048
> >> > [  681.268621] IP: zap_pte_range (mm/memory.c:1132)
> > Weird, I don't think we've seen anything like that before, have we?
> > I'm pretty sure it's not a consequence of my "index = min(index, end)",
> > but what it portends I don't know.  Please confirm mm/memory.c:1132 -
> > that's the "if (PageAnon(page))" line, isn't it?  Which indeed matches
> > the code below.  So accessing page->mapping is causing an oops...
> 
> Right, that's the correct line.
> 
> At this point I'm pretty sure that it's somehow related to that one line
> patch since it reproduced fairly quickly after applying it, and when I
> removed it I didn't see it happening again during the overnight fuzzing.

Oh, I assumed it was a one-off: you're saying that you saw it more than
once with the min(index, end) patch in?  But not since removing it (did
you replace that by the newer patch? or by the older? or by nothing?).

I want to exclaim "That makes no sense!", but bugs don't make sense
anyway.  It's going to be a challenge to work out a connection though.
I think I want to ask for more attempts to reproduce, with and without
the min(index, end) patch (if you have enough time - there must be a
limit to the amount of time you can give me on this).

I rather hoped that the oops on PageAnon might shed light from another
direction on the outstanding page_mapped bug: both seem like page table
corruption of some kind (though I've not seen a plausible path to either).

And regarding the page_mapped bug: we've heard nothing since Dave
Hansen suggested a VM_BUG_ON_PAGE for that - has it gone away now?
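
(The sort of assertion Dave proposed, from memory and with placement
hypothetical - his actual patch is not reproduced here:

	VM_BUG_ON_PAGE(page_mapped(page), page);

so the offending struct page gets dumped the moment the impossible
state is seen.)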

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
  2014-06-27 18:03                       ` Hugh Dickins
@ 2014-06-28 21:41                         ` Sasha Levin
  -1 siblings, 0 replies; 47+ messages in thread
From: Sasha Levin @ 2014-06-28 21:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Konstantin Khlebnikov, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

On 06/27/2014 02:03 PM, Hugh Dickins wrote:
> On Fri, 27 Jun 2014, Sasha Levin wrote:
>> On 06/27/2014 01:59 AM, Hugh Dickins wrote:
>>>>> First, this:
>>>>>
>>>>> [  681.267487] BUG: unable to handle kernel paging request at ffffea0003480048
>>>>> [  681.268621] IP: zap_pte_range (mm/memory.c:1132)
>>> Weird, I don't think we've seen anything like that before, have we?
>>> I'm pretty sure it's not a consequence of my "index = min(index, end)",
>>> but what it portends I don't know.  Please confirm mm/memory.c:1132 -
>>> that's the "if (PageAnon(page))" line, isn't it?  Which indeed matches
>>> the code below.  So accessing page->mapping is causing an oops...
>>
>> Right, that's the correct line.
>>
>> At this point I'm pretty sure that it's somehow related to that one line
>> patch since it reproduced fairly quickly after applying it, and when I
>> removed it I didn't see it happening again during the overnight fuzzing.
> 
> Oh, I assumed it was a one-off: you're saying that you saw it more than
> once with the min(index, end) patch in?  But not since removing it (did
> you replace that by the newer patch? or by the older? or by nothing?).

It reproduced exactly twice, can't say it happens too often.

What I did was revert your original fix for the issue and apply the one-liner.

I've spent most of yesterday chasing a different bug with a "clean" -next
tree (without the revert and the one-line patch) and didn't see any mm/
issues.

However, about 2 hours after doing the revert and applying the one-line
patch I've encountered the following:

[ 3686.797859] BUG: unable to handle kernel paging request at ffff88028a488f98
[ 3686.805732] IP: do_read_fault.isra.40 (mm/memory.c:2856 mm/memory.c:2889)
[ 3686.805732] PGD 12b82067 PUD 704d49067 PMD 704cf6067 PTE 800000028a488060
[ 3686.805732] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3686.815852] Dumping ftrace buffer:
[ 3686.815852]    (ftrace buffer empty)
[ 3686.815852] Modules linked in:
[ 3686.815852] CPU: 10 PID: 8890 Comm: modprobe Not tainted 3.16.0-rc2-next-20140627-sasha-00024-ga284b83-dirty #753
[ 3686.815852] task: ffff8801d1c20000 ti: ffff8801c6a08000 task.ti: ffff8801c6a08000
[ 3686.826134] RIP: do_read_fault.isra.40 (mm/memory.c:2856 mm/memory.c:2889)
[ 3686.826134] RSP: 0000:ffff8801c6a0bc78  EFLAGS: 00010297
[ 3686.826134] RAX: 0000000000000000 RBX: ffff880288531200 RCX: 000000000000001f
[ 3686.826134] RDX: 0000000000000014 RSI: 00007f22949f3000 RDI: ffff88028a488f98
[ 3686.826134] RBP: ffff8801c6a0bd18 R08: 00007f2294a13000 R09: 000000000000000c
[ 3686.826134] R10: 0000000000000000 R11: 00000000000000a8 R12: 00007f2294a07c50
[ 3686.826134] R13: ffff880279fec4b0 R14: 00007f22949f3000 R15: ffff88028ebbc528
[ 3686.826134] FS:  0000000000000000(0000) GS:ffff880292e00000(0000) knlGS:0000000000000000
[ 3686.826134] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3686.826134] CR2: ffff88028a488f98 CR3: 000000026b766000 CR4: 00000000000006a0
[ 3686.826134] Stack:
[ 3686.826134]  ffff8801c6a0bc98 0000000000000001 ffff8802000000a8 0000000000000014
[ 3686.826134]  ffff88028a489038 0000000000000000 ffff88026d40d000 ffff88028eafaee0
[ 3686.826134]  ffff8801c6a0bcd8 ffffffff8e572715 ffffea000a292240 000000000028a489
[ 3686.826134] Call Trace:
[ 3686.826134] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 3686.826134] ? __pte_alloc (mm/memory.c:598 mm/memory.c:593)
[ 3686.826134] __handle_mm_fault (mm/memory.c:3037 mm/memory.c:3198 mm/memory.c:3322)
[ 3686.826134] handle_mm_fault (include/linux/memcontrol.h:124 mm/memory.c:3348)
[ 3686.826134] ? __do_page_fault (arch/x86/mm/fault.c:1163)
[ 3686.826134] __do_page_fault (arch/x86/mm/fault.c:1230)
[ 3686.826134] ? vtime_account_user (kernel/sched/cputime.c:687)
[ 3686.826134] ? get_parent_ip (kernel/sched/core.c:2550)
[ 3686.826134] ? context_tracking_user_exit (include/linux/vtime.h:89 include/linux/jump_label.h:115 include/trace/events/context_tracking.h:47 kernel/context_tracking.c:180)
[ 3686.826134] ? preempt_count_sub (kernel/sched/core.c:2606)
[ 3686.826134] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 3686.826134] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 3686.826134] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
[ 3686.826134] trace_do_page_fault (arch/x86/mm/fault.c:1313 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
[ 3686.826134] do_async_page_fault (arch/x86/kernel/kvm.c:264)
[ 3686.826134] async_page_fault (arch/x86/kernel/entry_64.S:1322)
[ 3686.826134] Code: 89 c0 4c 8b 43 08 48 8d 4c 08 ff 49 01 c1 49 39 c9 4c 0f 47 c9 4c 89 c1 4c 29 f1 48 c1 e9 0c 49 8d 4c 0a ff 49 39 c9 4c 0f 47 c9 <48> 83 3f 00 74 3c 48 83 c0 01 4c 39 c8 77 74 48 81 c6 00 10 00
All code
========
   0:	89 c0                	mov    %eax,%eax
   2:	4c 8b 43 08          	mov    0x8(%rbx),%r8
   6:	48 8d 4c 08 ff       	lea    -0x1(%rax,%rcx,1),%rcx
   b:	49 01 c1             	add    %rax,%r9
   e:	49 39 c9             	cmp    %rcx,%r9
  11:	4c 0f 47 c9          	cmova  %rcx,%r9
  15:	4c 89 c1             	mov    %r8,%rcx
  18:	4c 29 f1             	sub    %r14,%rcx
  1b:	48 c1 e9 0c          	shr    $0xc,%rcx
  1f:	49 8d 4c 0a ff       	lea    -0x1(%r10,%rcx,1),%rcx
  24:	49 39 c9             	cmp    %rcx,%r9
  27:	4c 0f 47 c9          	cmova  %rcx,%r9
  2b:*	48 83 3f 00          	cmpq   $0x0,(%rdi)		<-- trapping instruction
  2f:	74 3c                	je     0x6d
  31:	48 83 c0 01          	add    $0x1,%rax
  35:	4c 39 c8             	cmp    %r9,%rax
  38:	77 74                	ja     0xae
  3a:	48 81 c6 00 10 00 00 	add    $0x1000,%rsi

Code starting with the faulting instruction
===========================================
   0:	48 83 3f 00          	cmpq   $0x0,(%rdi)
   4:	74 3c                	je     0x42
   6:	48 83 c0 01          	add    $0x1,%rax
   a:	4c 39 c8             	cmp    %r9,%rax
   d:	77 74                	ja     0x83
   f:	48 81 c6 00 10 00 00 	add    $0x1000,%rsi
[ 3686.826134] RIP do_read_fault.isra.40 (mm/memory.c:2856 mm/memory.c:2889)
[ 3686.826134]  RSP <ffff8801c6a0bc78>
[ 3686.826134] CR2: ffff88028a488f98

Association is not causation but this is pretty suspicious...

> I want to exclaim "That makes no sense!", but bugs don't make sense
> anyway.  It's going to be a challenge to work out a connection though.
> I think I want to ask for more attempts to reproduce, with and without
> the min(index, end) patch (if you have enough time - there must be a
> limit to the amount of time you can give me on this).
> 
> I rather hoped that the oops on PageAnon might shed light from another
> direction on the outstanding page_mapped bug: both seem like page table
> corruption of some kind (though I've not seen a plausible path to either).
> 
> And regarding the page_mapped bug: we've heard nothing since Dave
> Hansen suggested a VM_BUG_ON_PAGE for that - has it gone away now?

Seems like it. I'm carrying Dave's patch still, but haven't seen it
triggering.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
  2014-06-27  5:36                   ` Hugh Dickins
@ 2014-07-01 11:52                     ` Vlastimil Babka
  -1 siblings, 0 replies; 47+ messages in thread
From: Vlastimil Babka @ 2014-07-01 11:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Konstantin Khlebnikov, Sasha Levin, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

On 06/27/2014 07:36 AM, Hugh Dickins wrote:
> [Cc Johannes: at the end I have a particular question for you]
>
> On Thu, 26 Jun 2014, Vlastimil Babka wrote:
>> On 06/26/2014 12:36 AM, Hugh Dickins wrote:
>>> On Tue, 24 Jun 2014, Vlastimil Babka wrote:
>>>
>>> Sorry for the slow response: I have got confused, learnt more, and
>>> changed my mind, several times in the course of replying to you.
>>> I think this reply will be stable... though not final.
>>
>> Thanks a lot for looking into it!
>>
>>>>
>>>> since this got a CVE,
>>>
>>> Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.
>>
>> Sorry, I should have mentioned it explicitly.
>>
>>> Looks overrated to me
>>
>> I'd bet it would pass unnoticed if you didn't use the sentence "but whether
>> it's a serious matter in the scale of denials of service, I'm not so sure" in
>> your first reply to Sasha's report :) I wouldn't be surprised if people grep
>> for this.
>
> Hah, you're probably right,
> I better choose my words more carefully in future.
>
>>
>>> (and amusing to see my pompous words about a
>>> "range notification mechanism" taken too seriously), but of course
>>> we do need to address it.
>>>
>>>> I've been looking at a backport to an older kernel where
>>>
>>> Thanks a lot for looking into it.  I didn't think it was worth a
>>> Cc: stable@vger.kernel.org myself, but admit to being both naive
>>> and inconsistent about that.
>>>
>>>> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
>>>> range notification mechanism yet. There's just madvise(MADV_REMOVE) and since
>>>
>>> Yes, that mechanism could be ported back pre-v3.5,
>>> but I agree with your preference not to.
>>>
>>>> it doesn't guarantee anything, it seems simpler just to give up retrying to
>>>
>>> Right, I don't think we have formally documented the instant of "full hole"
>>> that I strove for there, and it's probably not externally verifiable, nor
>>> guaranteed by other filesystems.  I just thought it a good QoS aim, but
>>> it has given us this problem.
>>>
>>>> truncate really everything. Then I realized that maybe it would work for
>>>> current kernel as well, without having to add any checks in the page fault
>>>> path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
>>>> from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
>>>> the old data from the range, it's fine from any information leak point of view.
>>>> If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
>>>> page before the parallel truncate has ended, or right after it has ended.
>>>
>>> Yes.  I disagree with your actual patch, for more than one reason,
>>> but it's in the right area; and I found myself growing to agree with
>>> you, that it's better to have one kind of fix for all these releases,
>>> than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
>>> v3.0 too, I'm sceptical about that, but haven't tried it as yet.)
>>
>> I was looking at our 3.0 based kernel, but it could be due to backported
>> patches on top.
>
> And later you confirm that 3.0.101 vanilla is okay: thanks, that fits.
>
>>
>>> If I'd realized that we were going to have to backport, I'd have spent
>>> longer looking for a patch like yours originally.  So my inclination
>>> now is to go your route, make a new patch for v3.16 and backports,
>>> and revert the f00cdc6df7d7 that has already gone in.
>>>
>>>> So I'm posting it here as an RFC. I haven't thought about the
>>>> i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy
>>>
>>> My understanding is that i915_gem_object_truncate() is not a problem,
>>> that i915's dev->struct_mutex serializes all its relevant transitions,
>>> plus the object wouldn't even be interestingly accessible to the user.
>>>
>>>> the new "lstart < inode->i_size" condition, but I don't know if it's
>>>> "vulnerable" to the problem.
>>>
>>> I don't think i915 is vulnerable, but if it is, that condition would
>>> be fine for it, as would be the patch I'm now thinking of.
>>>
>>>>
>>>> -----8<-----
>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching
>>>>
>>>> ---
>>>>    mm/shmem.c | 19 +++++++++++++++++++
>>>>    1 file changed, 19 insertions(+)
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index f484c27..6d6005c 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>>    		if (!pvec.nr) {
>>>>    			if (index == start || unfalloc)
>>>>    				break;
>>>> +                        /*
>>>> +                         * When this condition is true, it means we were
>>>> +                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
>>>> +                         * To prevent a livelock when someone else is faulting
>>>> +                         * pages back, we are content with single pass and do
>>>> +                         * not retry with index = start. It's important that
>>>> +                         * previous page content has been discarded, and
>>>> +                         * faulter(s) got new zeroed pages.
>>>> +                         *
>>>> +                         * The other callsites are shmem_setattr (for
>>>> +                         * truncation) and shmem_evict_inode, which set i_size
>>>> +                         * to truncated size or 0, respectively, and then call
>>>> +                         * us with lstart == inode->i_size. There we do want to
>>>> +                         * retry, and livelock cannot happen for other reasons.
>>>> +                         *
>>>> +                         * XXX what about i915_gem_object_truncate?
>>>> +                         */
>>>
>>> I doubt you have ever faced such a criticism before, but I'm going
>>> to speak my mind and say that comment is too long!  A comment of that
>>> length is okay above or just inside or at a natural break in a function,
>>> but here it distracts too much from what the code is actually doing.
>>
>> Fair enough. The reasoning should have gone into commit log, not comment.
>>
>>> In particular, the words "this condition" are so much closer to the
>>> condition above than the condition below, that it's rather confusing.
>>>
>>> /* Single pass when hole-punching to not livelock on racing faults */
>>> would have been enough (yes, I've cheated, that would be 2 or 4 lines).
>>>
>>>> +                        if (lstart < inode->i_size)
>>>
>>> For a long time I was going to suggest that you leave i_size out of it,
>>> and use "lend > 0" instead.  Then suddenly I realized that this is the
>>> wrong place for the test.
>>
>> Well my first idea was to just add a flag about how persistent it should be.
>> And set it false for the punch hole case. Then I wondered if there's already
>> some bit that distinguishes it. But it makes it more subtle.
>>
>>> And then that it's not your fault, it's mine,
>>> in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
>>> Wow, that really pessimized the hole-punch case!
>>>
>>> When is pvec.nr 0?  When we've reached the end of the file.  Why should
>>> we go to the end of the file, when punching a hole at the start?  Ughh!
>>
>> Ah, I see (I think). But I managed to reproduce this problem when there was
>> only an extra page between lend and the end of file, so I doubt this is the
>> only problem. AFAIU it's enough to try punching a large enough hole: then the
>> loop can only do a single pagevec worth of pages per iteration, which gives
>> enough time for somebody to fault pages back?
>
> That's useful info, thank you: I just wasn't trying hard enough then;
> and you didn't even need 1024 cpus to show it either.  Right, we have
> to revert my pincer, certainly on shmem.  And I think I'd better do the
> same change on generic filesystems too (nobody has bothered to implement
> hole-punch on ramfs, but if they did, they would hit the same problem):
> though that part of it doesn't need a backport to -stable.
>
>>
>>>> +                                break;
>>>>    			index = start;
>>>>    			continue;
>>>>    		}
>>>> --
>>>> 1.8.4.5
>>>
>>> But there is another problem.  We cannot break out after one pass on
>>> shmem, because there's a possibility that a swap entry in the radix_tree
>>> got swizzled into a page just as it was about to be removed - your patch
>>> might then leave that data behind in the hole.
>>
>> Thanks, I didn't notice that. Do I understand correctly that this could mean
>> info leak for the punch hole call, but wouldn't be a problem for madvise? (In
>> any case, that means the solution is not general enough for all kernels, so
>> I'm asking just to be sure).
>
> It's exactly the same issue for the madvise as for the fallocate:
> data that is promised to have been punched out would still be there.

AFAIK madvise doesn't promise anything. But nevermind.

> Very hard case to trigger, though, I think: since by the time we get
> to this loop, we have already made one pass down the hole, getting rid
> of everything that wasn't page-locked at the time, so the chance of
> catching any swap in this loop is lower.
>
>>
>>> As it happens, Konstantin Khlebnikov suggested a patch for that a few
>>> weeks ago, before noticing that it's already handled by the endless loop.
>>> If we make that loop no longer endless, we need to add in Konstantin's
>>> "if (shmem_free_swap) goto retry" patch.
>>>
>>> Right now I'm thinking that my idiocy in d0823576bf4b may actually
>>> be the whole of Trinity's problem: patch below.  If we waste time
>>> traversing the radix_tree to end-of-file, no wonder that concurrent
>>> faults have time to put something in the hole every time.
>>>
>>> Sasha, may I trespass on your time, and ask you to revert the previous
>>> patch from your tree, and give this patch below a try?  I am very
>>> interested to learn if in fact it fixes it for you (as it did for me).
>>
>> I will try this, but as I explained above, I doubt that alone will help.
>
> And afterwards you confirmed, thank you.
>
>>
>>> However, I am wasting your time, in that I think we shall decide that
>>> it's too unsafe to rely solely upon the patch below (what happens if
>>> 1024 cpus are all faulting on it while we try to punch a 4MB hole at
>>
>> My reproducer is a 4MB file, where the puncher tries punching everything except
>> first and last page. And there are 8 other threads (as I have 8 logical
>> CPU's) that just repeatedly sweep the same range, reading only the first byte
>> of each page.
>>
>>> end of file? if we care).  I think we shall end up with the optimization
>>> below (or some such: it can be written in various ways), plus reverting
>>> d0823576bf4b's "index == start && " pincer, plus Konstantin's
>>> shmem_free_swap handling, rolled into a single patch; and a similar
>>
>> So that means no retry in any case (except the swap thing)? All callers can
>> handle that? I guess shmem_evict_inode would be ok, as nobody else
>> can be accessing that inode. But what about shmem_setattr? (i.e. straight
>> truncation) As you said earlier, faulters will get a SIGBUS (which AFAIU is
>> due to i_size being updated before we enter shmem_undo_range). But could a
>> faulter possibly pass the i_size test already, and then proceed with the
>> fault only once we are already in shmem_undo_range and have passed the page
>> in question?
>
> We still have to retry indefinitely in the truncation case, as you
> rightly guess.  SIGBUS beyond i_size makes it a much easier case to
> handle, and there's no danger of "indefinitely" becoming "infinitely"
> as in the punch-hole case.  But, depending on how the filesystem
> handles its end, there is still some possibility of a race with faulting,
> which some filesystems may require pagecache truncation to resolve.
>
> Does shmem truncation itself require that?  Er, er, it would take me
> too long to work out the definitive answer: perhaps it doesn't, but for
> safety I certainly assume that it does require that - that is, I never
> even considered removing the indefinite loop from the truncation case.
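
As background for the SIGBUS ordering discussed above, this is the rough
shape of the 3.16-era shmem_setattr() shrink path (paraphrased, not
verbatim): i_size is written before the pagecache is unmapped and truncated,
so late faulters beyond the new end fail the i_size check and get SIGBUS:

        /* Paraphrased shape of shmem_setattr()'s shrink path, circa 3.16: */
        if (newsize != oldsize) {
                i_size_write(inode, newsize);   /* faults past here now SIGBUS */
                inode->i_ctime = inode->i_mtime = CURRENT_TIME;
        }
        if (newsize < oldsize) {
                loff_t holebegin = round_up(newsize, PAGE_SIZE);

                unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
                shmem_truncate_range(inode, newsize, (loff_t)-1);
                /* unmap again to remove racily COWed private pages */
                unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
        }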
>
>>
>>> patch (without the swap part) for several functions in truncate.c.
>>>
>>> Hugh
>>>
>>> --- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
>>> +++ linux/mm/shmem.c	2014-06-25 10:28:47.063967052 -0700
>>> @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
>>>    	for ( ; ; ) {
>>>    		cond_resched();
>>>
>>> +		index = min(index, end);
>>>    		pvec.nr = find_get_entries(mapping, index,
>>>    				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>>>    				pvec.pages, indices);
>
> So let's all forget that patch, although it does help to highlight my
> mistake in d0823576bf4b.  (Oh, hey, let's all forget my mistake too!)

What patch? What mistake? :)

> Here's the 3.16-rc2 patch that I've now settled on (which will also
> require a revert of current git's f00cdc6df7d7; well, not require the
> revert, but this makes that redundant, and cannot be tested with it in).
>
> I've not yet had time to write up the patch description, nor to test
> it fully; but thought I should get the patch itself into the open for
> review and testing before then.

It seems to work here (tested on 3.16-rc1, which didn't have f00cdc6df7d7 yet).
Checking for end != -1 is indeed a much more elegant solution than checking i_size.
Thanks. So you can add my Tested-by.
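
(For anyone following the end != -1 test: shmem_undo_range() works on an
exclusive page index derived from lend, and the truncation callers pass
lend == -1, which is mapped to the largest pgoff_t. Paraphrasing the
3.16-era index setup, not quoting it:)

        pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
        pgoff_t end = (lend + 1) >> PAGE_CACHE_SHIFT;

        if (lend == -1)
                end = -1;       /* unsigned, so effectively "very big" */

        /* hence in the loop, "end != -1" holds exactly for hole-punch
         * (and unfalloc), never for truncation to end-of-file */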

> I've checked against v3.1 to see how it works out there: certainly
> wouldn't apply cleanly (and beware: prior to v3.5's shmem_undo_range,
> "end" was included in the range, not excluded), but the same
> principles apply.  Haven't checked the intermediates yet, will
> probably leave those until each stable wants them - but if you've a
> particular release in mind, please ask, or ask me to check your port.

I will try, thanks.

> I've included the mm/truncate.c part of it here, but that will be a
> separate (not for -stable) patch when I post the finalized version.
>
> Hannes, a question for you please, I just could not make up my mind.
> In mm/truncate.c truncate_inode_pages_range(), what should be done
> with a failed clear_exceptional_entry() in the case of hole-punch?
> Is that case currently depending on the rescan loop (that I'm about
> to revert) to remove a new page, so I would need to add a retry for
> that rather like the shmem_free_swap() one?  Or is it irrelevant,
> and can stay unchanged as below?  I've veered back and forth,
> thinking first one and then the other.
>
> Thanks,
> Hugh
>
> ---
>
>   mm/shmem.c    |   19 ++++++++++---------
>   mm/truncate.c |   14 +++++---------
>   2 files changed, 15 insertions(+), 18 deletions(-)
>
> --- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
> +++ linux/mm/shmem.c	2014-06-26 15:41:52.704362962 -0700
> @@ -467,23 +467,20 @@ static void shmem_undo_range(struct inod
>   		return;
>
>   	index = start;
> -	for ( ; ; ) {
> +	while (index < end) {
>   		cond_resched();
>
>   		pvec.nr = find_get_entries(mapping, index,
>   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>   				pvec.pages, indices);
>   		if (!pvec.nr) {
> -			if (index == start || unfalloc)
> +			/* If all gone or hole-punch or unfalloc, we're done */
> +			if (index == start || end != -1)
>   				break;
> +			/* But if truncating, restart to make sure all gone */
>   			index = start;
>   			continue;
>   		}
> -		if ((index == start || unfalloc) && indices[0] >= end) {
> -			pagevec_remove_exceptionals(&pvec);
> -			pagevec_release(&pvec);
> -			break;
> -		}
>   		mem_cgroup_uncharge_start();
>   		for (i = 0; i < pagevec_count(&pvec); i++) {
>   			struct page *page = pvec.pages[i];
> @@ -495,8 +492,12 @@ static void shmem_undo_range(struct inod
>   			if (radix_tree_exceptional_entry(page)) {
>   				if (unfalloc)
>   					continue;
> -				nr_swaps_freed += !shmem_free_swap(mapping,
> -								index, page);
> +				if (shmem_free_swap(mapping, index, page)) {
> +					/* Swap was replaced by page: retry */
> +					index--;
> +					break;
> +				}
> +				nr_swaps_freed++;
>   				continue;
>   			}
>
> --- 3.16-rc2/mm/truncate.c	2014-06-08 11:19:54.000000000 -0700
> +++ linux/mm/truncate.c	2014-06-26 16:31:35.932433863 -0700
> @@ -352,21 +352,17 @@ void truncate_inode_pages_range(struct a
>   		return;
>
>   	index = start;
> -	for ( ; ; ) {
> +	while (index < end) {
>   		cond_resched();
>   		if (!pagevec_lookup_entries(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> -			indices)) {
> -			if (index == start)
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
> +			/* If all gone or hole-punch, we're done */
> +			if (index == start || end != -1)
>   				break;
> +			/* But if truncating, restart to make sure all gone */
>   			index = start;
>   			continue;
>   		}
> -		if (index == start && indices[0] >= end) {
> -			pagevec_remove_exceptionals(&pvec);
> -			pagevec_release(&pvec);
> -			break;
> -		}
>   		mem_cgroup_uncharge_start();
>   		for (i = 0; i < pagevec_count(&pvec); i++) {
>   			struct page *page = pvec.pages[i];
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* mm: shmem: hang in shmem_fault (WAS: mm: shm: hang in shmem_fallocate)
  2014-06-27 18:03                       ` Hugh Dickins
@ 2014-07-01 22:37                         ` Sasha Levin
  -1 siblings, 0 replies; 47+ messages in thread
From: Sasha Levin @ 2014-07-01 22:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Konstantin Khlebnikov, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

Hi Hugh,

I've been observing a very nonspecific hang involving some mutexes from fs/ but
without any lockdep output or a concrete way to track it down.

It seems that today was my lucky day, and after enough tinkering I've managed
to get output out of lockdep, which pointed me to shmem:

[ 1871.989131] =============================================
[ 1871.990028] [ INFO: possible recursive locking detected ]
[ 1871.992591] 3.16.0-rc3-next-20140630-sasha-00023-g44434d4-dirty #758 Tainted: G        W
[ 1871.992591] ---------------------------------------------
[ 1871.992591] trinity-c84/27757 is trying to acquire lock:
[ 1871.992591] (&sb->s_type->i_mutex_key#17){+.+.+.}, at: shmem_fault (mm/shmem.c:1289)
[ 1871.992591]
[ 1871.992591] but task is already holding lock:
[ 1871.992591] (&sb->s_type->i_mutex_key#17){+.+.+.}, at: generic_file_write_iter (mm/filemap.c:2633)
[ 1871.992591]
[ 1871.992591] other info that might help us debug this:
[ 1871.992591]  Possible unsafe locking scenario:
[ 1871.992591]
[ 1871.992591]        CPU0
[ 1871.992591]        ----
[ 1871.992591]   lock(&sb->s_type->i_mutex_key#17);
[ 1871.992591]   lock(&sb->s_type->i_mutex_key#17);
[ 1872.013889]
[ 1872.013889]  *** DEADLOCK ***
[ 1872.013889]
[ 1872.013889]  May be due to missing lock nesting notation
[ 1872.013889]
[ 1872.013889] 3 locks held by trinity-c84/27757:
[ 1872.013889] #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
[ 1872.030221] #1: (sb_writers#13){.+.+.+}, at: do_readv_writev (include/linux/fs.h:2264 fs/read_write.c:830)
[ 1872.030221] #2: (&sb->s_type->i_mutex_key#17){+.+.+.}, at: generic_file_write_iter (mm/filemap.c:2633)
[ 1872.030221]
[ 1872.030221] stack backtrace:
[ 1872.030221] CPU: 6 PID: 27757 Comm: trinity-c84 Tainted: G        W      3.16.0-rc3-next-20140630-sasha-00023-g44434d4-dirty #758
[ 1872.030221]  ffffffff9fc112b0 ffff8803c844f5d8 ffffffff9c531022 0000000000000002
[ 1872.030221]  ffffffff9fc112b0 ffff8803c844f6d8 ffffffff991d1a8d ffff8803c5da3000
[ 1872.030221]  ffff8803c5da3d70 ffff880300000001 ffff8803c5da3000 ffff8803c5da3da8
[ 1872.030221] Call Trace:
[ 1872.030221] dump_stack (lib/dump_stack.c:52)
[ 1872.030221] __lock_acquire (kernel/locking/lockdep.c:3034 kernel/locking/lockdep.c:3180)
[ 1872.030221] lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
[ 1872.030221] mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
[ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
[ 1872.030221] ? shmem_fault (mm/shmem.c:1288)
[ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
[ 1872.030221] shmem_fault (mm/shmem.c:1289)
[ 1872.030221] __do_fault (mm/memory.c:2705)
[ 1872.030221] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 1872.030221] do_read_fault.isra.40 (mm/memory.c:2896)
[ 1872.030221] ? get_parent_ip (kernel/sched/core.c:2550)
[ 1872.030221] __handle_mm_fault (mm/memory.c:3037 mm/memory.c:3198 mm/memory.c:3322)
[ 1872.030221] handle_mm_fault (mm/memory.c:3345)
[ 1872.030221] __do_page_fault (arch/x86/mm/fault.c:1230)
[ 1872.030221] ? retint_restore_args (arch/x86/kernel/entry_64.S:829)
[ 1872.030221] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1872.030221] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
[ 1872.030221] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 1872.030221] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1872.030221] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
[ 1872.030221] trace_do_page_fault (arch/x86/mm/fault.c:1313 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
[ 1872.030221] do_async_page_fault (arch/x86/kernel/kvm.c:264)
[ 1872.030221] async_page_fault (arch/x86/kernel/entry_64.S:1322)
[ 1872.030221] ? iov_iter_fault_in_readable (include/linux/pagemap.h:598 mm/iov_iter.c:267)
[ 1872.030221] generic_perform_write (mm/filemap.c:2461)
[ 1872.030221] ? __mnt_drop_write (./arch/x86/include/asm/preempt.h:98 fs/namespace.c:455)
[ 1872.030221] __generic_file_write_iter (mm/filemap.c:2608)
[ 1872.030221] ? generic_file_llseek (fs/read_write.c:467)
[ 1872.030221] generic_file_write_iter (mm/filemap.c:2634)
[ 1872.030221] do_iter_readv_writev (fs/read_write.c:666)
[ 1872.030221] do_readv_writev (fs/read_write.c:834)
[ 1872.030221] ? __generic_file_write_iter (mm/filemap.c:2627)
[ 1872.030221] ? __generic_file_write_iter (mm/filemap.c:2627)
[ 1872.030221] ? mutex_lock_nested (./arch/x86/include/asm/preempt.h:98 kernel/locking/mutex.c:570 kernel/locking/mutex.c:587)
[ 1872.030221] ? __fdget_pos (fs/file.c:714)
[ 1872.030221] ? __fdget_pos (fs/file.c:714)
[ 1872.030221] ? __fget_light (include/linux/rcupdate.h:402 include/linux/fdtable.h:80 fs/file.c:684)
[ 1872.101905] vfs_writev (fs/read_write.c:879)
[ 1872.101905] SyS_writev (fs/read_write.c:912 fs/read_write.c:904)
[ 1872.101905] tracesys (arch/x86/kernel/entry_64.S:542)

It seems like it was introduced by your fix to the shmem_fallocate hang, and is
triggered in shmem_fault():

+               if (shmem_falloc) {
+                       if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
+                          !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
+                               up_read(&vma->vm_mm->mmap_sem);
+                               mutex_lock(&inode->i_mutex);		<=== HERE
+                               mutex_unlock(&inode->i_mutex);
+                               return VM_FAULT_RETRY;
+                       }
+                       /* cond_resched? Leave that to GUP or return to user */
+                       return VM_FAULT_NOPAGE;


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shmem: hang in shmem_fault (WAS: mm: shm: hang in shmem_fallocate)
  2014-07-01 22:37                         ` Sasha Levin
@ 2014-07-02  0:17                           ` Hugh Dickins
  -1 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2014-07-02  0:17 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Vlastimil Babka, Konstantin Khlebnikov, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

On Tue, 1 Jul 2014, Sasha Levin wrote:

> Hi Hugh,
> 
> I've been observing a very nonspecific hang involving some mutexes from fs/ but
> without any lockdep output or a concrete way to track it down.
> 
> It seems that today was my lucky day, and after enough tinkering I've managed
> to get output out of lockdep, which pointed me to shmem:
> 
> [ 1871.989131] =============================================
> [ 1871.990028] [ INFO: possible recursive locking detected ]
> [ 1871.992591] 3.16.0-rc3-next-20140630-sasha-00023-g44434d4-dirty #758 Tainted: G        W
> [ 1871.992591] ---------------------------------------------
> [ 1871.992591] trinity-c84/27757 is trying to acquire lock:
> [ 1871.992591] (&sb->s_type->i_mutex_key#17){+.+.+.}, at: shmem_fault (mm/shmem.c:1289)
> [ 1871.992591]
> [ 1871.992591] but task is already holding lock:
> [ 1871.992591] (&sb->s_type->i_mutex_key#17){+.+.+.}, at: generic_file_write_iter (mm/filemap.c:2633)
> [ 1871.992591]
> [ 1871.992591] other info that might help us debug this:
> [ 1871.992591]  Possible unsafe locking scenario:
> [ 1871.992591]
> [ 1871.992591]        CPU0
> [ 1871.992591]        ----
> [ 1871.992591]   lock(&sb->s_type->i_mutex_key#17);
> [ 1871.992591]   lock(&sb->s_type->i_mutex_key#17);
> [ 1872.013889]
> [ 1872.013889]  *** DEADLOCK ***
> [ 1872.013889]
> [ 1872.013889]  May be due to missing lock nesting notation
> [ 1872.013889]
> [ 1872.013889] 3 locks held by trinity-c84/27757:
> [ 1872.013889] #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
> [ 1872.030221] #1: (sb_writers#13){.+.+.+}, at: do_readv_writev (include/linux/fs.h:2264 fs/read_write.c:830)
> [ 1872.030221] #2: (&sb->s_type->i_mutex_key#17){+.+.+.}, at: generic_file_write_iter (mm/filemap.c:2633)
> [ 1872.030221]
> [ 1872.030221] stack backtrace:
> [ 1872.030221] CPU: 6 PID: 27757 Comm: trinity-c84 Tainted: G        W      3.16.0-rc3-next-20140630-sasha-00023-g44434d4-dirty #758
> [ 1872.030221]  ffffffff9fc112b0 ffff8803c844f5d8 ffffffff9c531022 0000000000000002
> [ 1872.030221]  ffffffff9fc112b0 ffff8803c844f6d8 ffffffff991d1a8d ffff8803c5da3000
> [ 1872.030221]  ffff8803c5da3d70 ffff880300000001 ffff8803c5da3000 ffff8803c5da3da8
> [ 1872.030221] Call Trace:
> [ 1872.030221] dump_stack (lib/dump_stack.c:52)
> [ 1872.030221] __lock_acquire (kernel/locking/lockdep.c:3034 kernel/locking/lockdep.c:3180)
> [ 1872.030221] lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
> [ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
> [ 1872.030221] mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
> [ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
> [ 1872.030221] ? shmem_fault (mm/shmem.c:1288)
> [ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
> [ 1872.030221] shmem_fault (mm/shmem.c:1289)
> [ 1872.030221] __do_fault (mm/memory.c:2705)
> [ 1872.030221] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [ 1872.030221] do_read_fault.isra.40 (mm/memory.c:2896)
> [ 1872.030221] ? get_parent_ip (kernel/sched/core.c:2550)
> [ 1872.030221] __handle_mm_fault (mm/memory.c:3037 mm/memory.c:3198 mm/memory.c:3322)
> [ 1872.030221] handle_mm_fault (mm/memory.c:3345)
> [ 1872.030221] __do_page_fault (arch/x86/mm/fault.c:1230)
> [ 1872.030221] ? retint_restore_args (arch/x86/kernel/entry_64.S:829)
> [ 1872.030221] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 1872.030221] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
> [ 1872.030221] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [ 1872.030221] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 1872.030221] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
> [ 1872.030221] trace_do_page_fault (arch/x86/mm/fault.c:1313 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
> [ 1872.030221] do_async_page_fault (arch/x86/kernel/kvm.c:264)
> [ 1872.030221] async_page_fault (arch/x86/kernel/entry_64.S:1322)
> [ 1872.030221] ? iov_iter_fault_in_readable (include/linux/pagemap.h:598 mm/iov_iter.c:267)
> [ 1872.030221] generic_perform_write (mm/filemap.c:2461)
> [ 1872.030221] ? __mnt_drop_write (./arch/x86/include/asm/preempt.h:98 fs/namespace.c:455)
> [ 1872.030221] __generic_file_write_iter (mm/filemap.c:2608)
> [ 1872.030221] ? generic_file_llseek (fs/read_write.c:467)
> [ 1872.030221] generic_file_write_iter (mm/filemap.c:2634)
> [ 1872.030221] do_iter_readv_writev (fs/read_write.c:666)
> [ 1872.030221] do_readv_writev (fs/read_write.c:834)
> [ 1872.030221] ? __generic_file_write_iter (mm/filemap.c:2627)
> [ 1872.030221] ? __generic_file_write_iter (mm/filemap.c:2627)
> [ 1872.030221] ? mutex_lock_nested (./arch/x86/include/asm/preempt.h:98 kernel/locking/mutex.c:570 kernel/locking/mutex.c:587)
> [ 1872.030221] ? __fdget_pos (fs/file.c:714)
> [ 1872.030221] ? __fdget_pos (fs/file.c:714)
> [ 1872.030221] ? __fget_light (include/linux/rcupdate.h:402 include/linux/fdtable.h:80 fs/file.c:684)
> [ 1872.101905] vfs_writev (fs/read_write.c:879)
> [ 1872.101905] SyS_writev (fs/read_write.c:912 fs/read_write.c:904)
> [ 1872.101905] tracesys (arch/x86/kernel/entry_64.S:542)
> 
> It seems like it was introduced by your fix to the shmem_fallocate hang, and is
> triggered in shmem_fault():
> 
> +               if (shmem_falloc) {
> +                       if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
> +                          !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +                               up_read(&vma->vm_mm->mmap_sem);
> +                               mutex_lock(&inode->i_mutex);		<=== HERE
> +                               mutex_unlock(&inode->i_mutex);
> +                               return VM_FAULT_RETRY;
> +                       }
> +                       /* cond_resched? Leave that to GUP or return to user */
> +                       return VM_FAULT_NOPAGE;

That is very very helpful: many thanks, Sasha.

Yes, of course, it's a standard pattern, that the write syscall from
userspace has to fault in a page of the buffer from kernel mode, while
holding i_mutex.  Danger of deadlock if I take any i_mutex down there
in the fault.
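
A hypothetical userspace sketch of that pattern (not the trinity test case;
the file name is illustrative): the write() source buffer maps the very same
tmpfs file, so the kernel-mode fault runs under the file's own i_mutex, and
with a concurrent hole-punch in flight the reverted fix would try to take
i_mutex a second time:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/shm/recursion-test", O_RDWR | O_CREAT, 0600);
        char *buf;

        ftruncate(fd, 2 * 4096);
        buf = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        lseek(fd, 4096, SEEK_SET);
        /* elsewhere, racing: fallocate(fd, FALLOC_FL_PUNCH_HOLE |
         *                    FALLOC_FL_KEEP_SIZE, 0, 4096); */
        write(fd, buf, 4096);   /* faulting buf re-enters shmem_fault() */
        return 0;
}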

Shame on me for forgetting that one, and you've saved me from some egg
on my face.  Though I'll give myself a little pat for holding this one
back from rushing into stable.

And how convenient to have a really good strong reason to revert this
"fix", when we wanted to revert it anyway, to meet Vlastimil's backport
concerns.  I'll get on with that, and give an update in that thread.

Thanks again,
Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
  2014-07-01 11:52                     ` Vlastimil Babka
@ 2014-07-02  1:49                       ` Hugh Dickins
  -1 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2014-07-02  1:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Johannes Weiner, Konstantin Khlebnikov,
	Sasha Levin, Dave Jones, Andrew Morton, linux-mm, linux-fsdevel,
	LKML

On Tue, 1 Jul 2014, Vlastimil Babka wrote:
> On 06/27/2014 07:36 AM, Hugh Dickins wrote:
> > [Cc Johannes: at the end I have a particular question for you]
> > On Thu, 26 Jun 2014, Vlastimil Babka wrote:
> > > 
> > > Thanks, I didn't notice that. Do I understand correctly that this could
> > > mean an info leak for the punch hole call, but wouldn't be a problem for
> > > madvise? (In any case, that means the solution is not general enough for
> > > all kernels, so I'm asking just to be sure).
> > 
> > It's exactly the same issue for the madvise as for the fallocate:
> > data that is promised to have been punched out would still be there.
> 
> AFAIK madvise doesn't promise anything. But nevermind.

Good point.  I was looking at it from an implementation point of
view, that the implementation is the same for both, so therefore the
same issue for both.  But you are right, madvise makes no promise,
so we can therefore excuse it.  You'd make a fine lawyer :)

> > 
> > So let's all forget that patch, although it does help to highlight my
> > mistake in d0823576bf4b.  (Oh, hey, let's all forget my mistake too!)
> 
> What patch? What mistake? :)

Yes, which of my increasingly many? :(

> 
> > Here's the 3.16-rc2 patch that I've now settled on (which will also
> > require a revert of current git's f00cdc6df7d7; well, not require the
> > revert, but this makes that redundant, and cannot be tested with it in).
> > 
> > I've not yet had time to write up the patch description, nor to test
> > it fully; but thought I should get the patch itself into the open for
> > review and testing before then.
> 
> It seems to work here (tested on 3.16-rc1, which didn't have f00cdc6df7d7 yet).
> Checking for end != -1 is indeed a much more elegant solution than checking i_size.
> Thanks. So you can add my Tested-by.

Thanks a lot for the testing, Vlastimil.

Though I'm happy with the new shmem.c patch, I've thought more about my
truncate.c patch meanwhile, and grown unhappy with it for two reasons.

One was remembering that XFS still uses lend -1 even when punching a
hole: it writes out dirty pages, then throws away the pagecache from
start of hole to end of file; not brilliant (and a violation of mlock),
but that's how it is, and I'm not about to become an XFS hacker to fix
it (I did long ago send a patch I thought fixed it, but it never went
in, and I could easily have overlooked all kinds of XFS subtleties).

So although the end -1 test is more satisfying in tmpfs, and I don't
particularly like making assumptions in truncate_inode_pages_range()
about what i_size will show at that point, XFS would probably push
me back to using your original i_size test in truncate.c.

That is, if we are to stop the endless pincer in truncate.c at all, as we now do in shmem.c.

But the other reason I'm unhappy with it, is really a generalization
of that.  Starting from the question I asked Hannes below, I came to
realize that truncate_inode_pages_range() is serving many filesystems,
and I don't know what all their assumptions are; and even if I spent
days researching what each requires of truncate_inode_pages_range(),
chances are that I wouldn't get the right answer on all of them.
Maybe there is a filesystem which now depends upon it to clean out
that hole completely: obviously not before I made the change, but
perhaps in the years since.

So, although I dislike tmpfs behaviour diverging from the others here,
we do have Sasha's assurance that tmpfs was the only one to show the
problem, and no intention of implementing hole-punch on ramfs: so I
think the safest course is for me not to interfere with the other
filesystems, just fix the pessimization I introduced back then.

And now that we have hard evidence that my "fix" there in -rc3
must be reverted, I should move forward with the alternative.

Hugh

> 
> > I've checked against v3.1 to see how it works out there: certainly
> > wouldn't apply cleanly (and beware: prior to v3.5's shmem_undo_range,
> > "end" was included in the range, not excluded), but the same
> > principles apply.  Haven't checked the intermediates yet, will
> > probably leave those until each stable wants them - but if you've a
> > particular release in mind, please ask, or ask me to check your port.
> 
> I will try, thanks.
> 
> > I've included the mm/truncate.c part of it here, but that will be a
> > separate (not for -stable) patch when I post the finalized version.
> > 
> > Hannes, a question for you please, I just could not make up my mind.
> > In mm/truncate.c truncate_inode_pages_range(), what should be done
> > with a failed clear_exceptional_entry() in the case of hole-punch?
> > Is that case currently depending on the rescan loop (that I'm about
> > to revert) to remove a new page, so I would need to add a retry for
> > that rather like the shmem_free_swap() one?  Or is it irrelevant,
> > and can stay unchanged as below?  I've veered back and forth,
> > thinking first one and then the other.
> > 
> > Thanks,
> > Hugh
> > 
> > ---
> > 
> >   mm/shmem.c    |   19 ++++++++++---------
> >   mm/truncate.c |   14 +++++---------
> >   2 files changed, 15 insertions(+), 18 deletions(-)
> > 
> > --- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
> > +++ linux/mm/shmem.c	2014-06-26 15:41:52.704362962 -0700
> > @@ -467,23 +467,20 @@ static void shmem_undo_range(struct inod
> >   		return;
> > 
> >   	index = start;
> > -	for ( ; ; ) {
> > +	while (index < end) {
> >   		cond_resched();
> > 
> >   		pvec.nr = find_get_entries(mapping, index,
> >   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
> >   				pvec.pages, indices);
> >   		if (!pvec.nr) {
> > -			if (index == start || unfalloc)
> > +			/* If all gone or hole-punch or unfalloc, we're done */
> > +			if (index == start || end != -1)
> >   				break;
> > +			/* But if truncating, restart to make sure all gone */
> >   			index = start;
> >   			continue;
> >   		}
> > -		if ((index == start || unfalloc) && indices[0] >= end) {
> > -			pagevec_remove_exceptionals(&pvec);
> > -			pagevec_release(&pvec);
> > -			break;
> > -		}
> >   		mem_cgroup_uncharge_start();
> >   		for (i = 0; i < pagevec_count(&pvec); i++) {
> >   			struct page *page = pvec.pages[i];
> > @@ -495,8 +492,12 @@ static void shmem_undo_range(struct inod
> >   			if (radix_tree_exceptional_entry(page)) {
> >   				if (unfalloc)
> >   					continue;
> > -				nr_swaps_freed += !shmem_free_swap(mapping,
> > -								index, page);
> > +				if (shmem_free_swap(mapping, index, page)) {
> > +					/* Swap was replaced by page: retry */
> > +					index--;
> > +					break;
> > +				}
> > +				nr_swaps_freed++;
> >   				continue;
> >   			}
> > 
> > --- 3.16-rc2/mm/truncate.c	2014-06-08 11:19:54.000000000 -0700
> > +++ linux/mm/truncate.c	2014-06-26 16:31:35.932433863 -0700
> > @@ -352,21 +352,17 @@ void truncate_inode_pages_range(struct a
> >   		return;
> > 
> >   	index = start;
> > -	for ( ; ; ) {
> > +	while (index < end) {
> >   		cond_resched();
> >   		if (!pagevec_lookup_entries(&pvec, mapping, index,
> > -			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> > -			indices)) {
> > -			if (index == start)
> > +			min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
> > +			/* If all gone or hole-punch, we're done */
> > +			if (index == start || end != -1)
> >   				break;
> > +			/* But if truncating, restart to make sure all gone */
> >   			index = start;
> >   			continue;
> >   		}
> > -		if (index == start && indices[0] >= end) {
> > -			pagevec_remove_exceptionals(&pvec);
> > -			pagevec_release(&pvec);
> > -			break;
> > -		}
> >   		mem_cgroup_uncharge_start();
> >   		for (i = 0; i < pagevec_count(&pvec); i++) {
> >   			struct page *page = pvec.pages[i];

^ permalink raw reply	[flat|nested] 47+ messages in thread
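
For concreteness, a minimal userspace sketch of the two hole-punch entry
points under discussion: on tmpfs, both fallocate() with
FALLOC_FL_PUNCH_HOLE and madvise(MADV_REMOVE) on a shared mapping end up
in shmem_fallocate() and hence shmem_truncate_range().  The file path,
sizes and offsets below are illustrative assumptions, not taken from the
thread.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 1 << 20;	/* 1MB file on a tmpfs mount */
	int fd = open("/dev/shm/punch-demo", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, len))
		exit(1);

	char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		exit(1);
	memset(map, 0xaa, len);		/* populate the pages */

	/* Route 1: punch out the middle half via fallocate(). */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      len / 4, len / 2))
		perror("fallocate");

	/* Route 2: punch the first quarter via madvise(). */
	if (madvise(map, len / 4, MADV_REMOVE))
		perror("madvise");

	munmap(map, len);
	close(fd);
	unlink("/dev/shm/punch-demo");
	return 0;
}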

* Re: mm: shm: hang in shmem_fallocate
  2014-06-27  5:36                   ` Hugh Dickins
@ 2014-07-09 21:59                     ` Johannes Weiner
  -1 siblings, 0 replies; 47+ messages in thread
From: Johannes Weiner @ 2014-07-09 21:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Konstantin Khlebnikov, Sasha Levin, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

Hi Hugh,

On Thu, Jun 26, 2014 at 10:36:20PM -0700, Hugh Dickins wrote:
> Hannes, a question for you please, I just could not make up my mind.
> In mm/truncate.c truncate_inode_pages_range(), what should be done
> with a failed clear_exceptional_entry() in the case of hole-punch?
> Is that case currently depending on the rescan loop (that I'm about
> to revert) to remove a new page, so I would need to add a retry for
> that rather like the shmem_free_swap() one?  Or is it irrelevant,
> and can stay unchanged as below?  I've veered back and forth,
> thinking first one and then the other.

I realize you have given up on changing truncate.c in the meantime,
but I'm still asking myself about the swap retry case: why retry for
swap-to-page changes, yet not for page-to-page changes?

In case faults are disabled through i_size, concurrent swapin could
still turn swap entries into pages, so I can see the need to retry.
There is no equivalent for shadow entries, though, and they can only
be turned through page faults, so no retry necessary in that case.

However, you explicitly mentioned the hole-punch case above: if that
can't guarantee the hole will be reliably cleared under concurrent
faults, I'm not sure why it would put in more effort to free it of
swap (or shadow) entries than to free it of pages.

What am I missing?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mm: shm: hang in shmem_fallocate
  2014-07-09 21:59                     ` Johannes Weiner
@ 2014-07-09 22:48                       ` Hugh Dickins
  -1 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2014-07-09 22:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Vlastimil Babka, Konstantin Khlebnikov,
	Sasha Levin, Dave Jones, Andrew Morton, linux-mm, linux-fsdevel,
	LKML

On Wed, 9 Jul 2014, Johannes Weiner wrote:
> On Thu, Jun 26, 2014 at 10:36:20PM -0700, Hugh Dickins wrote:
> > Hannes, a question for you please, I just could not make up my mind.
> > In mm/truncate.c truncate_inode_pages_range(), what should be done
> > with a failed clear_exceptional_entry() in the case of hole-punch?
> > Is that case currently depending on the rescan loop (that I'm about
> > to revert) to remove a new page, so I would need to add a retry for
> > that rather like the shmem_free_swap() one?  Or is it irrelevant,
> > and can stay unchanged as below?  I've veered back and forth,
> > thinking first one and then the other.
> 
> I realize you have given up on changing truncate.c in the meantime,
> but I'm still asking myself about the swap retry case: why retry for
> swap-to-page changes, yet not for page-to-page changes?
> 
> In case faults are disabled through i_size, concurrent swapin could
> still turn swap entries into pages, so I can see the need to retry.
> There is no equivalent for shadow entries, though, and they can only
> be turned through page faults, so no retry necessary in that case.
> 
> However, you explicitly mentioned the hole-punch case above: if that
> can't guarantee the hole will be reliably cleared under concurrent
> faults, I'm not sure why it would put in more effort to free it of
> swap (or shadow) entries than to free it of pages.
> 
> What am I missing?

In dropping the pincer effect, I am conceding that data written (via
mmap) racily into the hole, behind the punching cursor, between the
starting and the ending of the punch operation, may be allowed to
remain.  It will not often happen (given the two loops), but it might.

But I insist that all data in the hole at the starting of the punch
operation must be removed by the ending of the punch operation (though
of course, given the paragraph above, identical data might be written
in its place concurrently, via mmap, if the application chooses).

I think you probably agree with both of those propositions.

As the punching cursor moves along the radix_tree, it gathers page
pointers and swap entries (the emply slots are already skipped at
the level below; and tmpfs takes care that there is no instant in
switching between page and swap when the slot appears empty).

Dealing with the page pointers is easy: a reference is already held,
then shmem_undo_range takes the page lock which prevents swizzling
to swap, then truncates that page out of the tree.

But dealing with swap entries is slippery: there is no reference
held, and no lock to prevent swizzling to page (outside of the
tree_lock taken in shmem_free_swap).

So, as I see it, the page lock ensures that any pages present at
the starting of the punch operation will be removed, without any
need to go back and retry.  But a swap entry present at the starting
of the punch operation might be swizzled back to page (and, if we
imagine massive preemption, even back to swap again, and to page
again, etc) at the wrong moment: so for swap we do need to retry.

(What I said there is not quite correct: that swap would actually
have to be a locked page at the time when the first loop meets it.)

Does that make sense?

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread
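
To make that asymmetry concrete, here is a simplified sketch of the loop
shape just described, condensed from the shmem_undo_range() patch earlier
in this thread (an illustration of the logic, not verbatim kernel code):

	for (i = 0; i < pagevec_count(&pvec); i++) {
		struct page *page = pvec.pages[i];

		index = indices[i];
		if (radix_tree_exceptional_entry(page)) {
			/*
			 * Swap entry: no reference held, no lock; if
			 * shmem_free_swap() reports the slot swizzled
			 * back to a page, step back and retry it.
			 */
			if (shmem_free_swap(mapping, index, page)) {
				index--;
				break;
			}
			nr_swaps_freed++;
			continue;
		}
		/*
		 * Page: a reference is already held, and the page
		 * lock prevents swizzling to swap while we truncate.
		 */
		lock_page(page);
		if (page->mapping == mapping)
			truncate_inode_page(mapping, page);
		unlock_page(page);
	}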

* Re: mm: shm: hang in shmem_fallocate
  2014-07-09 22:48                       ` Hugh Dickins
@ 2014-07-10  0:51                         ` Hugh Dickins
  -1 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2014-07-10  0:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Konstantin Khlebnikov, Sasha Levin, Dave Jones,
	Andrew Morton, linux-mm, linux-fsdevel, LKML

On Wed, 9 Jul 2014, Hugh Dickins wrote:
> On Wed, 9 Jul 2014, Johannes Weiner wrote:
> > On Thu, Jun 26, 2014 at 10:36:20PM -0700, Hugh Dickins wrote:
> > > Hannes, a question for you please, I just could not make up my mind.
> > > In mm/truncate.c truncate_inode_pages_range(), what should be done
> > > with a failed clear_exceptional_entry() in the case of hole-punch?
> > > Is that case currently depending on the rescan loop (that I'm about
> > > to revert) to remove a new page, so I would need to add a retry for
> > > that rather like the shmem_free_swap() one?  Or is it irrelevant,
> > > and can stay unchanged as below?  I've veered back and forth,
> > > thinking first one and then the other.
> > 
> > I realize you have given up on changing truncate.c in the meantime,
> > but I'm still asking myself about the swap retry case: why retry for
> > swap-to-page changes, yet not for page-to-page changes?
> > 
> > In case faults are disabled through i_size, concurrent swapin could
> > still turn swap entries into pages, so I can see the need to retry.
> > There is no equivalent for shadow entries, though, and they can only
> > be turned through page faults, so no retry necessary in that case.
> > 
> > However, you explicitly mentioned the hole-punch case above: if that
> > can't guarantee the hole will be reliably cleared under concurrent
> > faults, I'm not sure why it would put in more effort to free it of
> > swap (or shadow) entries than to free it of pages.
> > 
> > What am I missing?
> 
> In dropping the pincer effect, I am conceding that data written (via
> mmap) racily into the hole, behind the punching cursor, between the
> starting and the ending of the punch operation, may be allowed to
> remain.  It will not often happen (given the two loops), but it might.
> 
> But I insist that all data in the hole at the starting of the punch
> operation must be removed by the ending of the punch operation (though
> of course, given the paragraph above, identical data might be written
> in its place concurrently, via mmap, if the application chooses).
> 
> I think you probably agree with both of those propositions.
> 
> As the punching cursor moves along the radix_tree, it gathers page
> pointers and swap entries (the empty slots are already skipped at
> the level below; and tmpfs takes care that there is no instant in
> switching between page and swap when the slot appears empty).
> 
> Dealing with the page pointers is easy: a reference is already held,
> then shmem_undo_range takes the page lock which prevents swizzling
> to swap, then truncates that page out of the tree.
> 
> But dealing with swap entries is slippery: there is no reference
> held, and no lock to prevent swizzling to page (outside of the
> tree_lock taken in shmem_free_swap).
> 
> So, as I see it, the page lock ensures that any pages present at
> the starting of the punch operation will be removed, without any
> need to go back and retry.  But a swap entry present at the starting
> of the punch operation might be swizzled back to page (and, if we
> imagine massive preemption, even back to swap again, and to page
> again, etc) at the wrong moment: so for swap we do need to retry.
> 
> (What I said there is not quite correct: that swap would actually
> have to be a locked page at the time when the first loop meets it.)
> 
> Does that make sense?

Allow me to disagree with myself: no, I now think you were right
to press me on this, and we also need the patch below.  Do you
agree, or do you see something more is needed?

Note to onlookers: this would have no bearing on the shmem_fallocate
hang which Sasha reported seeing on -next yesterday (in another thread),
but is nonetheless a correction to the patch we have there - I think.

Hugh

--- 3.16-rc3-mm1/mm/shmem.c	2014-07-02 15:32:22.220311543 -0700
+++ linux/mm/shmem.c	2014-07-09 17:38:49.972818635 -0700
@@ -516,6 +516,11 @@ static void shmem_undo_range(struct inod
 				if (page->mapping == mapping) {
 					VM_BUG_ON_PAGE(PageWriteback(page), page);
 					truncate_inode_page(mapping, page);
+				} else {
+					/* Page was replaced by swap: retry */
+					unlock_page(page);
+					index--;
+					break;
 				}
 			}
 			unlock_page(page);

^ permalink raw reply	[flat|nested] 47+ messages in thread
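
For anyone wanting to exercise these paths from userspace, a hedged
reproducer sketch in the spirit of the fuzz testing discussed in this
thread: one thread repeatedly punches the whole range while another keeps
faulting data back in through a shared mapping.  The path, sizes and
iteration counts are arbitrary assumptions; build with -pthread.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (8UL << 20)			/* 8MB file on tmpfs */
#define LOOPS 100000

static char *map;
static int fd;

static void *puncher(void *arg)
{
	for (int i = 0; i < LOOPS; i++)
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE |
			      FALLOC_FL_KEEP_SIZE, 0, LEN))
			break;		/* punch unsupported or failed */
	return arg;
}

static void *faulter(void *arg)
{
	for (int i = 0; i < LOOPS; i++)
		memset(map, i, LEN);	/* write faults refill the hole */
	return arg;
}

int main(void)
{
	pthread_t t1, t2;

	fd = open("/dev/shm/punch-race", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, LEN))
		exit(1);
	map = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		exit(1);

	pthread_create(&t1, NULL, puncher, NULL);
	pthread_create(&t2, NULL, faulter, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	munmap(map, LEN);
	close(fd);
	unlink("/dev/shm/punch-race");
	return 0;
}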

end of thread

Thread overview: 47+ messages
2013-12-16  4:01 mm: shm: hang in shmem_fallocate Sasha Levin
2014-02-08 19:46 ` Sasha Levin
2014-02-09  3:25   ` Hugh Dickins
2014-02-10  1:41     ` Sasha Levin
2014-06-12 20:38       ` Sasha Levin
2014-06-16  2:29         ` Hugh Dickins
2014-06-17 20:32           ` Sasha Levin
2014-06-24 16:31           ` Vlastimil Babka
2014-06-25 22:36             ` Hugh Dickins
2014-06-26  9:14               ` Vlastimil Babka
2014-06-26 15:19                 ` Vlastimil Babka
2014-06-27  5:36                 ` Hugh Dickins
2014-07-01 11:52                   ` Vlastimil Babka
2014-07-02  1:49                     ` Hugh Dickins
2014-07-09 21:59                   ` Johannes Weiner
2014-07-09 22:48                     ` Hugh Dickins
2014-07-10  0:51                       ` Hugh Dickins
2014-06-26 15:11               ` Sasha Levin
2014-06-27  5:59                 ` Hugh Dickins
2014-06-27 14:50                   ` Sasha Levin
2014-06-27 18:03                     ` Hugh Dickins
2014-06-28 21:41                       ` Sasha Levin
2014-07-01 22:37                       ` mm: shmem: hang in shmem_fault (WAS: mm: shm: hang in shmem_fallocate) Sasha Levin
2014-07-02  0:17                         ` Hugh Dickins
