linux-kernel.vger.kernel.org archive mirror
* [PATCH] tmpfs: fix race between umount and writepage
@ 2011-04-05 10:34 Konstantin Khlebnikov
  2011-04-08 12:27 ` Konstantin Khlebnikov
  2011-04-20 20:04 ` Andrew Morton
  0 siblings, 2 replies; 17+ messages in thread
From: Konstantin Khlebnikov @ 2011-04-05 10:34 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm, linux-kernel

shmem_writepage() calls igrab() on the inode for the page that came from the
reclaimer, so that the inode can later be added to shmem_swaplist for the
swap-unuse operation.

This igrab() can race with the super-block deactivation process:

shrink_inactive_list()		deactivate_super()
pageout()			tmpfs_fs_type->kill_sb()
shmem_writepage()		kill_litter_super()
				generic_shutdown_super()
				 evict_inodes()
 igrab()
				  atomic_read(&inode->i_count)
				   skip-inode
 iput()
				 if (!list_empty(&sb->s_inodes))
					printk("VFS: Busy inodes after...

To avoid this race, with this patch shmem_writepage() also tries to grab sb->s_active.

If sb->s_active == 0, adding the inode to shmem_swaplist is not required, because
super-block deactivation is in progress and the swap entries will be released soon.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 mm/shmem.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 58da7c1..1f49c03 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1038,11 +1038,13 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	struct address_space *mapping;
 	unsigned long index;
 	struct inode *inode;
+	struct super_block *sb;
 
 	BUG_ON(!PageLocked(page));
 	mapping = page->mapping;
 	index = page->index;
 	inode = mapping->host;
+	sb = inode->i_sb;
 	info = SHMEM_I(inode);
 	if (info->flags & VM_LOCKED)
 		goto redirty;
@@ -1083,7 +1085,10 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		delete_from_page_cache(page);
 		shmem_swp_set(info, entry, swap.val);
 		shmem_swp_unmap(entry);
-		if (list_empty(&info->swaplist))
+		if (!list_empty(&info->swaplist) ||
+				!atomic_inc_not_zero(&sb->s_active))
+			sb = NULL;
+		if (sb)
 			inode = igrab(inode);
 		else
 			inode = NULL;
@@ -1098,6 +1103,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 			mutex_unlock(&shmem_swaplist_mutex);
 			iput(inode);
 		}
+		if (sb)
+			deactivate_super(sb);
 		return 0;
 	}
 


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] tmpfs: fix race between umount and writepage
  2011-04-05 10:34 [PATCH] tmpfs: fix race between umount and writepage Konstantin Khlebnikov
@ 2011-04-08 12:27 ` Konstantin Khlebnikov
  2011-04-20 20:04 ` Andrew Morton
  1 sibling, 0 replies; 17+ messages in thread
From: Konstantin Khlebnikov @ 2011-04-08 12:27 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm, linux-kernel

The bug is easily reproduced by this script:

for i in {1..300} ; do
	mkdir $i
	while true ; do
		mount -t tmpfs none $i
		dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100)) status=noxfer
		umount $i
	done &
done

On a 6-CPU node with 8GB of RAM. The kernel is very unstable after this accident. =)
A kernel with this patch has been working fine for at least an hour.

Kernel log:

[  584.544461] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds.  Have a nice day...
[  585.409221] ------------[ cut here ]------------
[  585.409268] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
[  585.409331] Hardware name: System Product Name
[  585.409372] list_del corruption. prev->next should be ffff880222fdaac8, but was           (null)
[  585.409928] Modules linked in: [last unloaded: scsi_wait_scan]
[  585.410279] Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4
[  585.410540] Call Trace:
[  585.410819]  [<ffffffff8103b710>] warn_slowpath_common+0x80/0x98
[  585.411113]  [<ffffffff8103b7bc>] warn_slowpath_fmt+0x41/0x43
[  585.411377]  [<ffffffff81227145>] __list_del_entry+0x8d/0x98
[  585.411649]  [<ffffffff810f68af>] evict+0x50/0x113
[  585.411919]  [<ffffffff810f6ce6>] iput+0x138/0x141
[  585.412187]  [<ffffffff810bc2e7>] shmem_writepage+0x18b/0x1dc
[  585.412434]  [<ffffffff810b6f4a>] pageout+0x13c/0x24c
[  585.412677]  [<ffffffff810b7479>] shrink_page_list+0x28e/0x4be
[  585.412922]  [<ffffffff810b78c8>] shrink_inactive_list+0x21f/0x382
[  585.413173]  [<ffffffff810b7d85>] shrink_zone+0x35a/0x447
[  585.413417]  [<ffffffff810b7f50>] do_try_to_free_pages+0xde/0x3bc
[  585.413661]  [<ffffffff810b8373>] try_to_free_pages+0x9d/0xe2
[  585.413906]  [<ffffffff810b091a>] __alloc_pages_nodemask+0x45a/0x72b
[  585.414157]  [<ffffffff810d9005>] alloc_pages_current+0xaa/0xcd
[  585.414402]  [<ffffffff81028745>] pte_alloc_one+0x15/0x38
[  585.414647]  [<ffffffff810c4480>] __pte_alloc+0x1b/0xa0
[  585.414890]  [<ffffffff810c6dcf>] handle_mm_fault+0x11c/0x173
[  585.415138]  [<ffffffff81591cd1>] do_page_fault+0x348/0x36a
[  585.415383]  [<ffffffff8158f35f>] page_fault+0x1f/0x30
[  585.415627]  [<ffffffff812231ed>] ? __put_user_4+0x1d/0x30
[  585.415869]  [<ffffffff81038fb1>] ? schedule_tail+0x5c/0x60
[  585.416117]  [<ffffffff81595003>] ret_from_fork+0x13/0x80
[  585.416428] ---[ end trace 39cf2c656ee772fe ]---
[  585.416690] BUG: unable to handle kernel paging request at ffffffffffffffff
[  585.417001] IP: [<ffffffff810b946a>] shmem_free_blocks+0x18/0x4c
[  585.417001] PGD 1805067 PUD 1806067 PMD 0
[  585.417001] Oops: 0000 [#1] SMP
[  585.417839] last sysfs file: /sys/kernel/kexec_crash_size
[  585.418156] CPU 1
[  585.418156] Modules linked in: [last unloaded: scsi_wait_scan]
[  585.418851]
[  585.418851] Pid: 10422, comm: dd Tainted: G        W   2.6.39-rc2+ #4 System manufacturer System Product Name/Crosshair IV Formula
[  585.419541] RIP: 0010:[<ffffffff810b946a>]  [<ffffffff810b946a>] shmem_free_blocks+0x18/0x4c
[  585.419857] RSP: 0018:ffff880163e9f4b8  EFLAGS: 00010206
[  585.419857] RAX: ffff88021b513400 RBX: ffff880222fdaa40 RCX: 0000000000000020
[  585.419857] RDX: ffffffffffffffe0 RSI: 000000000000000e RDI: ffffffffffffffff
[  585.419857] RBP: ffff880163e9f4c8 R08: ffffea000653b090 R09: 0000000000014df0
[  585.419857] R10: 0000000000000028 R11: 000000000000002a R12: 000000000000000e
[  585.419857] R13: 000000000003cc76 R14: ffff880222fda970 R15: ffff880202b5d588
[  585.419857] FS:  00007f1c5b0cb700(0000) GS:ffff88024fc40000(0000) knlGS:0000000000000000
[  585.419857] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  585.419857] CR2: ffffffffffffffff CR3: 0000000187431000 CR4: 00000000000006e0
[  585.419857] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  585.419857] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  585.419857] Process dd (pid: 10422, threadinfo ffff880163e9e000, task ffff880098f65700)
[  585.419857] Stack:
[  585.419857]  ffff880222fdaa40 000000000000000e ffff880163e9f4e8 ffffffff810bac88
[  585.419857]  ffff880222fdaa40 ffffea000653b068 ffff880163e9f538 ffffffff810bc216
[  585.419857]  0000000000000000 ffff880163e9f548 0000000000000000 ffffea000653b068
[  585.419857] Call Trace:
[  585.419857]  [<ffffffff810bac88>] shmem_recalc_inode+0x61/0x66
[  585.419857]  [<ffffffff810bc216>] shmem_writepage+0xba/0x1dc
[  585.419857]  [<ffffffff810b6f4a>] pageout+0x13c/0x24c
[  585.419857]  [<ffffffff810b7479>] shrink_page_list+0x28e/0x4be
[  585.419857]  [<ffffffff810b78c8>] shrink_inactive_list+0x21f/0x382
[  585.419857]  [<ffffffff810b7d85>] shrink_zone+0x35a/0x447
[  585.419857]  [<ffffffff810b7f50>] do_try_to_free_pages+0xde/0x3bc
[  585.419857]  [<ffffffff810b8373>] try_to_free_pages+0x9d/0xe2
[  585.419857]  [<ffffffff810b091a>] __alloc_pages_nodemask+0x45a/0x72b
[  585.419857]  [<ffffffff810d9594>] alloc_pages_vma+0x117/0x133
[  585.419857]  [<ffffffff810b9448>] shmem_alloc_page+0x50/0x5a
[  585.419857]  [<ffffffff810317bf>] ? __dequeue_entity+0x2e/0x33
[  585.419857]  [<ffffffff810317e7>] ? set_next_entity+0x23/0x73
[  585.419857]  [<ffffffff810aa496>] ? find_get_page+0x23/0x69
[  585.419857]  [<ffffffff810aba4f>] ? find_lock_page+0x1e/0x5b
[  585.419857]  [<ffffffff810bad90>] shmem_getpage+0x103/0x644
[  585.419857]  [<ffffffff810bb35a>] shmem_write_begin+0x28/0x2a
[  585.419857]  [<ffffffff810a9cd0>] generic_file_buffered_write+0x100/0x24e
[  585.419857]  [<ffffffff810aaf2e>] __generic_file_aio_write+0x23a/0x26f
[  585.419857]  [<ffffffff810b5d41>] ? page_evictable+0x12/0x7d
[  585.419857]  [<ffffffff810b32c2>] ? lru_cache_add_lru+0x50/0x5e
[  585.419857]  [<ffffffff810aafc1>] generic_file_aio_write+0x5e/0xb3
[  585.419857]  [<ffffffff810e32c8>] do_sync_write+0xc6/0x103
[  585.419857]  [<ffffffff811e1a44>] ? security_file_permission+0x29/0x2e
[  585.419857]  [<ffffffff810e3c58>] vfs_write+0xa9/0x105
[  585.419857]  [<ffffffff810e3d6d>] sys_write+0x45/0x6c
[  585.419857]  [<ffffffff815950fb>] system_call_fastpath+0x16/0x1b
[  585.419857] Code: 88 e8 00 00 e8 35 00 02 00 48 81 c4 c8 00 00 00 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53 48 8b 47 18 48 89 fb 48 8b b8 88 02 00 00
[  585.419857]  83 3f 00 74 29 8b 15 ba 8a 81 00 48 f7 de 48 83 c7 08 49 c1
[  585.419857] RIP  [<ffffffff810b946a>] shmem_free_blocks+0x18/0x4c
[  585.419857]  RSP <ffff880163e9f4b8>
[  585.419857] CR2: ffffffffffffffff
[  585.419857] ---[ end trace 39cf2c656ee772ff ]---
[  585.419857] ------------[ cut here ]------------
[  585.419857] WARNING: at kernel/exit.c:911 do_exit+0x71/0x794()
[  585.419857] Hardware name: System Product Name
[  585.419857] Modules linked in: [last unloaded: scsi_wait_scan]
[  585.419857] Pid: 10422, comm: dd Tainted: G      D W   2.6.39-rc2+ #4
[  585.419857] Call Trace:
[  585.419857]  [<ffffffff8103b710>] warn_slowpath_common+0x80/0x98
[  585.419857]  [<ffffffff8103b73d>] warn_slowpath_null+0x15/0x17
[  585.419857]  [<ffffffff8103ec3c>] do_exit+0x71/0x794
[  585.419857]  [<ffffffff8103bbf6>] ? kmsg_dump+0x44/0xea
[  585.419857]  [<ffffffff8158fde9>] oops_end+0xb1/0xb9
[  585.419857]  [<ffffffff810256f9>] no_context+0x1f7/0x206
[  585.419857]  [<ffffffff8102588a>] __bad_area_nosemaphore+0x182/0x1a5
[  585.419857]  [<ffffffff810258bb>] bad_area_nosemaphore+0xe/0x10
[  585.419857]  [<ffffffff81591b33>] do_page_fault+0x1aa/0x36a
[  585.419857]  [<ffffffff8120ab4a>] ? drive_stat_acct+0x105/0x140
[  585.419857]  [<ffffffff8120c3d5>] ? __make_request+0x23a/0x27c
[  585.419857]  [<ffffffff8120b007>] ? generic_make_request+0x2b6/0x31f
[  585.419857]  [<ffffffff8158f35f>] page_fault+0x1f/0x30
[  585.419857]  [<ffffffff810b946a>] ? shmem_free_blocks+0x18/0x4c
[  585.419857]  [<ffffffff810bac88>] shmem_recalc_inode+0x61/0x66
[  585.419857]  [<ffffffff810bc216>] shmem_writepage+0xba/0x1dc
[  585.419857]  [<ffffffff810b6f4a>] pageout+0x13c/0x24c
[  585.419857]  [<ffffffff810b7479>] shrink_page_list+0x28e/0x4be
[  585.419857]  [<ffffffff810b78c8>] shrink_inactive_list+0x21f/0x382
[  585.419857]  [<ffffffff810b7d85>] shrink_zone+0x35a/0x447
[  585.419857]  [<ffffffff810b7f50>] do_try_to_free_pages+0xde/0x3bc
[  585.419857]  [<ffffffff810b8373>] try_to_free_pages+0x9d/0xe2
[  585.419857]  [<ffffffff810b091a>] __alloc_pages_nodemask+0x45a/0x72b
[  585.419857]  [<ffffffff810d9594>] alloc_pages_vma+0x117/0x133
[  585.419857]  [<ffffffff810b9448>] shmem_alloc_page+0x50/0x5a
[  585.419857]  [<ffffffff810317bf>] ? __dequeue_entity+0x2e/0x33
[  585.419857]  [<ffffffff810317e7>] ? set_next_entity+0x23/0x73
[  585.419857]  [<ffffffff810aa496>] ? find_get_page+0x23/0x69
[  585.419857]  [<ffffffff810aba4f>] ? find_lock_page+0x1e/0x5b
[  585.419857]  [<ffffffff810bad90>] shmem_getpage+0x103/0x644
[  585.419857]  [<ffffffff810bb35a>] shmem_write_begin+0x28/0x2a
[  585.419857]  [<ffffffff810a9cd0>] generic_file_buffered_write+0x100/0x24e
[  585.419857]  [<ffffffff810aaf2e>] __generic_file_aio_write+0x23a/0x26f
[  585.419857]  [<ffffffff810b5d41>] ? page_evictable+0x12/0x7d
[  585.419857]  [<ffffffff810b32c2>] ? lru_cache_add_lru+0x50/0x5e
[  585.419857]  [<ffffffff810aafc1>] generic_file_aio_write+0x5e/0xb3
[  585.419857]  [<ffffffff810e32c8>] do_sync_write+0xc6/0x103
[  585.419857]  [<ffffffff811e1a44>] ? security_file_permission+0x29/0x2e
[  585.419857]  [<ffffffff810e3c58>] vfs_write+0xa9/0x105
[  585.419857]  [<ffffffff810e3d6d>] sys_write+0x45/0x6c
[  585.419857]  [<ffffffff815950fb>] system_call_fastpath+0x16/0x1b
[  585.419857] ---[ end trace 39cf2c656ee77300 ]---
[  585.419857] note: dd[10422] exited with preempt_count 1
[  585.419857] BUG: scheduling while atomic: dd/10422/0x10000001
[  585.419857] Modules linked in: [last unloaded: scsi_wait_scan]
[  585.419857] Pid: 10422, comm: dd Tainted: G      D W   2.6.39-rc2+ #4
[  585.419857] Call Trace:
[  585.419857]  [<ffffffff81035c4d>] __schedule_bug+0x57/0x5c
[  585.419857]  [<ffffffff8158d4c6>] schedule+0x95/0x655
[  585.419857]  [<ffffffff8158d19e>] ? printk+0x3c/0x3e
[  585.419857]  [<ffffffff81038d64>] __cond_resched+0x25/0x30
[  585.419857]  [<ffffffff8158dccc>] _cond_resched+0x27/0x32
[  585.419857]  [<ffffffff8158e73b>] down_read+0x11/0x23
[  585.419857]  [<ffffffff8107169f>] acct_collect+0x3f/0x177
[  585.419857]  [<ffffffff8103edf8>] do_exit+0x22d/0x794
[  585.419857]  [<ffffffff8103bbf6>] ? kmsg_dump+0x44/0xea
[  585.419857]  [<ffffffff8158fde9>] oops_end+0xb1/0xb9
[  585.419857]  [<ffffffff810256f9>] no_context+0x1f7/0x206
[  585.419857]  [<ffffffff8102588a>] __bad_area_nosemaphore+0x182/0x1a5
[  585.419857]  [<ffffffff810258bb>] bad_area_nosemaphore+0xe/0x10
[  585.419857]  [<ffffffff81591b33>] do_page_fault+0x1aa/0x36a
[  585.419857]  [<ffffffff8120ab4a>] ? drive_stat_acct+0x105/0x140
[  585.419857]  [<ffffffff8120c3d5>] ? __make_request+0x23a/0x27c
[  585.419857]  [<ffffffff8120b007>] ? generic_make_request+0x2b6/0x31f
[  585.419857]  [<ffffffff8158f35f>] page_fault+0x1f/0x30
[  585.419857]  [<ffffffff810b946a>] ? shmem_free_blocks+0x18/0x4c
[  585.419857]  [<ffffffff810bac88>] shmem_recalc_inode+0x61/0x66
[  585.419857]  [<ffffffff810bc216>] shmem_writepage+0xba/0x1dc
[  585.419857]  [<ffffffff810b6f4a>] pageout+0x13c/0x24c
[  585.419857]  [<ffffffff810b7479>] shrink_page_list+0x28e/0x4be
[  585.419857]  [<ffffffff810b78c8>] shrink_inactive_list+0x21f/0x382
[  585.419857]  [<ffffffff810b7d85>] shrink_zone+0x35a/0x447
[  585.419857]  [<ffffffff810b7f50>] do_try_to_free_pages+0xde/0x3bc
[  585.419857]  [<ffffffff810b8373>] try_to_free_pages+0x9d/0xe2
[  585.419857]  [<ffffffff810b091a>] __alloc_pages_nodemask+0x45a/0x72b
[  585.419857]  [<ffffffff810d9594>] alloc_pages_vma+0x117/0x133
[  585.419857]  [<ffffffff810b9448>] shmem_alloc_page+0x50/0x5a
[  585.419857]  [<ffffffff810317bf>] ? __dequeue_entity+0x2e/0x33
[  585.419857]  [<ffffffff810317e7>] ? set_next_entity+0x23/0x73
[  585.419857]  [<ffffffff810aa496>] ? find_get_page+0x23/0x69
[  585.419857]  [<ffffffff810aba4f>] ? find_lock_page+0x1e/0x5b
[  585.419857]  [<ffffffff810bad90>] shmem_getpage+0x103/0x644
[  585.419857]  [<ffffffff810bb35a>] shmem_write_begin+0x28/0x2a
[  585.419857]  [<ffffffff810a9cd0>] generic_file_buffered_write+0x100/0x24e
[  585.419857]  [<ffffffff810aaf2e>] __generic_file_aio_write+0x23a/0x26f
[  585.419857]  [<ffffffff810b5d41>] ? page_evictable+0x12/0x7d
[  585.419857]  [<ffffffff810b32c2>] ? lru_cache_add_lru+0x50/0x5e
[  585.419857]  [<ffffffff810aafc1>] generic_file_aio_write+0x5e/0xb3
[  585.419857]  [<ffffffff810e32c8>] do_sync_write+0xc6/0x103
[  585.419857]  [<ffffffff811e1a44>] ? security_file_permission+0x29/0x2e
[  585.419857]  [<ffffffff810e3c58>] vfs_write+0xa9/0x105
[  585.419857]  [<ffffffff810e3d6d>] sys_write+0x45/0x6c
[  585.419857]  [<ffffffff815950fb>] system_call_fastpath+0x16/0x1b
[  585.468858] dd used greatest stack depth: 2360 bytes left
[  585.650873] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds.  Have a nice day...
[  585.940034] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds.  Have a nice day...
[  586.336870] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds.  Have a nice day...

Konstantin Khlebnikov wrote:
> shmem_writepage() calls igrab() on the inode for the page that came from the
> reclaimer, so that the inode can later be added to shmem_swaplist for the
> swap-unuse operation.
>
> This igrab() can race with the super-block deactivation process:
>
> shrink_inactive_list()		deactivate_super()
> pageout()			tmpfs_fs_type->kill_sb()
> shmem_writepage()		kill_litter_super()
> 				generic_shutdown_super()
> 				 evict_inodes()
>   igrab()
> 				  atomic_read(&inode->i_count)
> 				   skip-inode
>   iput()
> 				 if (!list_empty(&sb->s_inodes))
> 					printk("VFS: Busy inodes after...
>
> To avoid this race, with this patch shmem_writepage() also tries to grab sb->s_active.
>
> If sb->s_active == 0, adding the inode to shmem_swaplist is not required, because
> super-block deactivation is in progress and the swap entries will be released soon.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> ---
>   mm/shmem.c |    9 ++++++++-
>   1 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 58da7c1..1f49c03 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1038,11 +1038,13 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>   	struct address_space *mapping;
>   	unsigned long index;
>   	struct inode *inode;
> +	struct super_block *sb;
>
>   	BUG_ON(!PageLocked(page));
>   	mapping = page->mapping;
>   	index = page->index;
>   	inode = mapping->host;
> +	sb = inode->i_sb;
>   	info = SHMEM_I(inode);
>   	if (info->flags & VM_LOCKED)
>   		goto redirty;
> @@ -1083,7 +1085,10 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>   		delete_from_page_cache(page);
>   		shmem_swp_set(info, entry, swap.val);
>   		shmem_swp_unmap(entry);
> -		if (list_empty(&info->swaplist))
> +		if (!list_empty(&info->swaplist) ||
> +				!atomic_inc_not_zero(&sb->s_active))
> +			sb = NULL;
> +		if (sb)
>   			inode = igrab(inode);
>   		else
>   			inode = NULL;
> @@ -1098,6 +1103,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>   			mutex_unlock(&shmem_swaplist_mutex);
>   			iput(inode);
>   		}
> +		if (sb)
> +			deactivate_super(sb);
>   		return 0;
>   	}
>
>



* Re: [PATCH] tmpfs: fix race between umount and writepage
  2011-04-05 10:34 [PATCH] tmpfs: fix race between umount and writepage Konstantin Khlebnikov
  2011-04-08 12:27 ` Konstantin Khlebnikov
@ 2011-04-20 20:04 ` Andrew Morton
  2011-04-21  6:37   ` Konstantin Khlebnikov
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2011-04-20 20:04 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Hugh Dickins, linux-mm, linux-kernel

On Tue, 5 Apr 2011 14:34:52 +0400
Konstantin Khlebnikov <khlebnikov@openvz.org> wrote:

> shmem_writepage() calls igrab() on the inode for the page that came from the
> reclaimer, so that the inode can later be added to shmem_swaplist for the
> swap-unuse operation.
> 
> This igrab() can race with the super-block deactivation process:
> 
> shrink_inactive_list()		deactivate_super()
> pageout()			tmpfs_fs_type->kill_sb()
> shmem_writepage()		kill_litter_super()
> 				generic_shutdown_super()
> 				 evict_inodes()
>  igrab()
> 				  atomic_read(&inode->i_count)
> 				   skip-inode
>  iput()
> 				 if (!list_empty(&sb->s_inodes))
> 					printk("VFS: Busy inodes after...

Generally, ->writepage implementations shouldn't play with the inode,
for the reasons you've discovered.  A more common race is
writepage-versus-reclaim, where writepage is playing with the inode
when a concurrent reclaim frees the inode (and hence the
address_space).

It is safe to play with the inode while the passed-in page is locked
because nobody will free an inode which has an attached locked page. 
But once the page is unlocked, nothing pins the inode.  Typically,
tmpfs goes and breaks this rule.


Question is: why is shmem_writepage() doing the igrab/iput?

Read 1b1b32f2c6f6bb3253 and weep.

That changelog is a little incorrect:

: Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation
: is unsafe: though still hold page lock, which would hold off inode
: deletion if the page were i pagecache, it doesn't hold off once it's in
: swapcache (free_swap_and_cache doesn't wait on locked pages).  Hmm: we
: could put the the inode on swaplist earlier, but then shmem_unuse_inode
: could never prune unswapped inodes.

We don't actually hold the page lock when altering the swaplist:
swap_writepage() unlocks the page.  Doesn't seem to matter.


I think we should get the igrab/iput out of there and come up with a
different way of pinning the inode in ->writepage().

Can we do it in this order?

	mutex_lock(&shmem_swaplist_mutex);
	list_move_tail(&info->swaplist, &shmem_swaplist);
	delete_from_page_cache(page);
	shmem_swp_set(info, entry, swap.val);
	shmem_swp_unmap(entry);
	mutex_unlock(&shmem_swaplist_mutex);
	swap_writepage(page, wbc);



* Re: [PATCH] tmpfs: fix race between umount and writepage
  2011-04-20 20:04 ` Andrew Morton
@ 2011-04-21  6:37   ` Konstantin Khlebnikov
  2011-04-21  6:41     ` [PATCH v2] " Konstantin Khlebnikov
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2011-04-21  6:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-mm, linux-kernel

Andrew Morton wrote:
> On Tue, 5 Apr 2011 14:34:52 +0400
> Konstantin Khlebnikov <khlebnikov@openvz.org> wrote:
>
>> shmem_writepage() calls igrab() on the inode for the page that came from the
>> reclaimer, so that the inode can later be added to shmem_swaplist for the
>> swap-unuse operation.
>>
>> This igrab() can race with the super-block deactivation process:
>>
>> shrink_inactive_list()		deactivate_super()
>> pageout()			tmpfs_fs_type->kill_sb()
>> shmem_writepage()		kill_litter_super()
>> 				generic_shutdown_super()
>> 				 evict_inodes()
>>   igrab()
>> 				  atomic_read(&inode->i_count)
>> 				   skip-inode
>>   iput()
>> 				 if (!list_empty(&sb->s_inodes))
>> 					printk("VFS: Busy inodes after...
>
> Generally, ->writepage implementations shouldn't play with the inode,
> for the reasons you've discovered.  A more common race is
> writepage-versus-reclaim, where writepage is playing with the inode
> when a concurrent reclaim frees the inode (and hence the
> address_space).
>
> It is safe to play with the inode while the passed-in page is locked
> because nobody will free an inode which has an attached locked page.
> But once the page is unlocked, nothing pins the inode.  Typically,
> tmpfs goes and breakes this rule.
>
>
> Question is: why is shmem_writepage() doing the igrab/iput?
>
> Read 1b1b32f2c6f6bb3253 and weep.
>
> That changelog is a little incorrect:
>
> : Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation
> : is unsafe: though still hold page lock, which would hold off inode
> : deletion if the page were i pagecache, it doesn't hold off once it's in
> : swapcache (free_swap_and_cache doesn't wait on locked pages).  Hmm: we
> : could put the the inode on swaplist earlier, but then shmem_unuse_inode
> : could never prune unswapped inodes.
>
> We don't actually hold the page lock when altering the swaplist:
> swap_writepage() unlocks the page.  Doesn't seem to matter.
>
>
> I think we should get the igrab/iput out of there and come up with a
> different way of pinning the inode in ->writepage().
>
> Can we do it in this order?
>
> 	mutex_lock(&shmem_swaplist_mutex);
> 	list_move_tail(&info->swaplist, &shmem_swaplist);
> 	delete_from_page_cache(page);
> 	shmem_swp_set(info, entry, swap.val);
> 	shmem_swp_unmap(entry);
> 	mutex_unlock(&shmem_swaplist_mutex);
> 	swap_writepage(page, wbc);									
>

Yes, we can, but of course without locking shmem_swaplist_mutex if the inode is already in shmem_swaplist.

I saw that igrab() redundancy, but I was confused by the lock nesting and the
shmem_swaplist spinlock-to-mutex conversion.
It seems shmem_swaplist_mutex is already nested inside the page lock, so all is OK.

We can simply revert the last hunk from that commit; patch follows.


* [PATCH v2] tmpfs: fix race between umount and writepage
  2011-04-21  6:37   ` Konstantin Khlebnikov
@ 2011-04-21  6:41     ` Konstantin Khlebnikov
  2011-04-21 19:44       ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2011-04-21  6:41 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins; +Cc: linux-mm, linux-kernel

shmem_writepage() calls igrab() on the inode for the page that came from the
reclaimer, so that the inode can later be added to shmem_swaplist for the
swap-unuse operation.

This igrab() can race with the super-block deactivation process:

shrink_inactive_list()		deactivate_super()
pageout()			tmpfs_fs_type->kill_sb()
shmem_writepage()		kill_litter_super()
				generic_shutdown_super()
				 evict_inodes()
 igrab()
				  atomic_read(&inode->i_count)
				   skip-inode
 iput()
				 if (!list_empty(&sb->s_inodes))
					printk("VFS: Busy inodes after...

This igrab()-iput() pair was added in commit 1b1b32f2c6f6bb3253
based on incorrect assumptions:

: Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation
: is unsafe: though still hold page lock, which would hold off inode
: deletion if the page were i pagecache, it doesn't hold off once it's in
: swapcache (free_swap_and_cache doesn't wait on locked pages).  Hmm: we
: could put the the inode on swaplist earlier, but then shmem_unuse_inode
: could never prune unswapped inodes.

An attached locked page actually protects the inode from deletion, because
truncate_inode_pages_range() will sleep on its page lock, so igrab() is not
required. This patch effectively reverts the last hunk from that commit.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 mm/shmem.c |   13 ++++---------
 1 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 8fa27e4..ea55704 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1084,21 +1084,16 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		delete_from_page_cache(page);
 		shmem_swp_set(info, entry, swap.val);
 		shmem_swp_unmap(entry);
-		if (list_empty(&info->swaplist))
-			inode = igrab(inode);
-		else
-			inode = NULL;
 		spin_unlock(&info->lock);
-		swap_shmem_alloc(swap);
-		BUG_ON(page_mapped(page));
-		swap_writepage(page, wbc);
-		if (inode) {
+		if (list_empty(&info->swaplist)) {
 			mutex_lock(&shmem_swaplist_mutex);
 			/* move instead of add in case we're racing */
 			list_move_tail(&info->swaplist, &shmem_swaplist);
 			mutex_unlock(&shmem_swaplist_mutex);
-			iput(inode);
 		}
+		swap_shmem_alloc(swap);
+		BUG_ON(page_mapped(page));
+		swap_writepage(page, wbc);
 		return 0;
 	}
 



* Re: [PATCH v2] tmpfs: fix race between umount and writepage
  2011-04-21  6:41     ` [PATCH v2] " Konstantin Khlebnikov
@ 2011-04-21 19:44       ` Andrew Morton
  2011-04-22  4:05         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2011-04-21 19:44 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Hugh Dickins, linux-mm, linux-kernel

On Thu, 21 Apr 2011 10:41:50 +0400
Konstantin Khlebnikov <khlebnikov@openvz.org> wrote:

> shmem_writepage() calls igrab() on the inode for the page that came from the
> reclaimer, so that the inode can later be added to shmem_swaplist for the
> swap-unuse operation.
> 
> This igrab() can race with the super-block deactivation process:
> 
> shrink_inactive_list()		deactivate_super()
> pageout()			tmpfs_fs_type->kill_sb()
> shmem_writepage()		kill_litter_super()
> 				generic_shutdown_super()
> 				 evict_inodes()
>  igrab()
> 				  atomic_read(&inode->i_count)
> 				   skip-inode
>  iput()
> 				 if (!list_empty(&sb->s_inodes))
> 					printk("VFS: Busy inodes after...
> 
> This igrab()-iput() pair was added in commit 1b1b32f2c6f6bb3253
> based on incorrect assumptions:
> 
> : Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation
> : is unsafe: though still hold page lock, which would hold off inode
> : deletion if the page were i pagecache, it doesn't hold off once it's in
> : swapcache (free_swap_and_cache doesn't wait on locked pages).  Hmm: we
> : could put the the inode on swaplist earlier, but then shmem_unuse_inode
> : could never prune unswapped inodes.
> 
> An attached locked page actually protects the inode from deletion, because
> truncate_inode_pages_range() will sleep on its page lock, so igrab() is not
> required. This patch effectively reverts the last hunk from that commit.
> 

hm, is that last paragraph true?  Let's look at the resulting code.


: 	if (swap.val && add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
: 		delete_from_page_cache(page);

Here, the page is removed from inode->i_mapping.  So
truncate_inode_pages() won't see that page and will not block on its
lock.

: 		shmem_swp_set(info, entry, swap.val);
: 		shmem_swp_unmap(entry);
: 		spin_unlock(&info->lock);
: 		if (list_empty(&info->swaplist)) {
: 			mutex_lock(&shmem_swaplist_mutex);
: 			/* move instead of add in case we're racing */
: 			list_move_tail(&info->swaplist, &shmem_swaplist);
: 			mutex_unlock(&shmem_swaplist_mutex);
: 		}

Here, the code plays with `info', which points at storage which is
embedded within the inode's filesystem-private part.

But because the inode now has no attached locked page, a concurrent
umount can free the inode while this code is using it.

: 		swap_shmem_alloc(swap);
: 		BUG_ON(page_mapped(page));
: 		swap_writepage(page, wbc);
: 		return 0;
: 	}

However, I assume that you reran your testcase with the v2 patch and
that things ran OK.  How come?  Either my analysis is wrong or the
testcase doesn't trigger races in this code path?



* Re: [PATCH v2] tmpfs: fix race between umount and writepage
  2011-04-21 19:44       ` Andrew Morton
@ 2011-04-22  4:05         ` Konstantin Khlebnikov
  2011-05-03 20:06           ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2011-04-22  4:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-mm, linux-kernel

Andrew Morton wrote:
> On Thu, 21 Apr 2011 10:41:50 +0400
> Konstantin Khlebnikov <khlebnikov@openvz.org> wrote:
>
>> shmem_writepage() calls igrab() on the inode for the page that came from the
>> reclaimer, so that the inode can later be added to shmem_swaplist for the
>> swap-unuse operation.
>>
>> This igrab() can race with the super-block deactivation process:
>>
>> shrink_inactive_list()		deactivate_super()
>> pageout()			tmpfs_fs_type->kill_sb()
>> shmem_writepage()		kill_litter_super()
>> 				generic_shutdown_super()
>> 				 evict_inodes()
>>   igrab()
>> 				  atomic_read(&inode->i_count)
>> 				   skip-inode
>>   iput()
>> 				 if (!list_empty(&sb->s_inodes))
>> 					printk("VFS: Busy inodes after...
>>
>> This igrab()-iput() pair was added in commit 1b1b32f2c6f6bb3253
>> based on incorrect assumptions:
>>
>> : Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation
>> : is unsafe: though still hold page lock, which would hold off inode
>> : deletion if the page were i pagecache, it doesn't hold off once it's in
>> : swapcache (free_swap_and_cache doesn't wait on locked pages).  Hmm: we
>> : could put the the inode on swaplist earlier, but then shmem_unuse_inode
>> : could never prune unswapped inodes.
>>
>> An attached locked page actually protects the inode from deletion, because
>> truncate_inode_pages_range() will sleep on its page lock, so igrab() is not
>> required. This patch effectively reverts the last hunk from that commit.
>>
>
> hm, is that last paragraph true?  Let's look at the resulting code.
>
>
> : 	if (swap.val && add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
> : 		delete_from_page_cache(page);
>
> Here, the page is removed from inode->i_mapping.  So
> truncate_inode_pages() won't see that page and will not block on its
> lock.

Oops, right. Sorry. It produces a use-after-free race, but it is quiet and small.
My test uses too few files to catch it in a reasonable time,
and I ran it without slab poisoning.

So, v1 patch is correct but little ugly, while v2 -- broken.

>
> : 		shmem_swp_set(info, entry, swap.val);
> : 		shmem_swp_unmap(entry);
> : 		spin_unlock(&info->lock);
> : 		if (list_empty(&info->swaplist)) {
> : 			mutex_lock(&shmem_swaplist_mutex);
> : 			/* move instead of add in case we're racing */
> : 			list_move_tail(&info->swaplist, &shmem_swaplist);
> : 			mutex_unlock(&shmem_swaplist_mutex);
> : 		}
>
> Here, the code plays with `info', which points at storage which is
> embedded within the inode's filesystem-private part.
>
> But because the inode now has no attached locked page, a concurrent
> umount can free the inode while this code is using it.

I guess we could try moving delete_from_page_cache(page) to right before
swap_writepage, but that would move it outside info->lock...

>
> : 		swap_shmem_alloc(swap);
> : 		BUG_ON(page_mapped(page));
> : 		swap_writepage(page, wbc);
> : 		return 0;
> : 	}
>
> However, I assume that you reran your testcase with the v2 patch and
> that things ran OK.  How come?  Either my analysis is wrong or the
> testcase doesn't trigger races in this code path?
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] tmpfs: fix race between umount and writepage
  2011-04-22  4:05         ` Konstantin Khlebnikov
@ 2011-05-03 20:06           ` Hugh Dickins
  2011-05-07  5:33             ` Konstantin Khlebnikov
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2011-05-03 20:06 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Andrew Morton, linux-mm, linux-kernel

On Fri, 22 Apr 2011, Konstantin Khlebnikov wrote:
> Andrew Morton wrote:
> > On Thu, 21 Apr 2011 10:41:50 +0400
> > Konstantin Khlebnikov <khlebnikov@openvz.org> wrote:
> > 
> > > shmem_writepage() call igrab() on the inode for the page which is came
> > > from
> > > reclaimer to add it later into shmem_swaplist for swap-unuse operation.
> > > 
> > > This igrab() can race with super-block deactivating process:
> > > 
> > > shrink_inactive_list()		deactivate_super()
> > > pageout()			tmpfs_fs_type->kill_sb()
> > > shmem_writepage()		kill_litter_super()
> > > 				generic_shutdown_super()
> > > 				 evict_inodes()
> > >   igrab()
> > > 				  atomic_read(&inode->i_count)
> > > 				   skip-inode
> > >   iput()
> > > 				 if (!list_empty(&sb->s_inodes))
> > > 					printk("VFS: Busy inodes after...
> > > 
> > > This igrab-iput pair was added in commit 1b1b32f2c6f6bb3253
> > > based on incorrect assumptions:

Konstantin, many thanks for discovering this issue, and please accept
my apology for being so slow to respond.  I have to find enough "cool"
time to think my way back into all the races which may occur here.

I am disappointed with igrab!  It appeared to be the tool I needed
when I added that in, when I was concerned with deletion racing with
writepage and swapoff.  Clearly I didn't research igrab enough: its
interface seemed right, and I never imagined it gave no equivalent
protection against unmount - well, it does protect the inode, but
you have to pay for that with a "Self-destruct in 5 seconds" message.

I'm surprised that nothing else has such a problem with igrab, but
must accept I got it wrong, and I'm using it where it's not usable.
And it got more obviously unusable with 2.6.37's 63997e98a3be which
removed the second "safety" call to invalidate/evict_inodes() from
generic_shutdown_super().
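
Hugh's point about igrab() above can be sketched in userspace: igrab() does pin the inode, but evict_inodes() simply skips any inode with a raised i_count, so the unmount proceeds and leaves the inode stranded. What follows is an illustrative model only, not kernel code: the struct and both functions are simplified stand-ins for inode->i_count, the sb->s_inodes list, and evict_inodes().

```c
/*
 * Userspace model of the igrab()/evict_inodes() race.  Illustrative
 * only: fake_inode, fake_igrab() and fake_evict_inodes() are hypothetical
 * stand-ins for the real kernel structures and functions.
 */
struct fake_inode {
	int i_count;	/* stand-in for inode->i_count */
	int on_sb_list;	/* stand-in for membership of sb->s_inodes */
};

/* igrab(): raise the refcount, as shmem_writepage() used to do */
static void fake_igrab(struct fake_inode *inode)
{
	inode->i_count++;
}

/* evict_inodes(): skip any inode whose refcount is raised (simplified) */
static void fake_evict_inodes(struct fake_inode *inode)
{
	if (inode->i_count == 0)
		inode->on_sb_list = 0;	/* evicted, removed from sb list */
}

/*
 * The racy interleaving from the thread: igrab() lands just before the
 * eviction scan reads i_count, so the inode is skipped; by the time
 * iput() drops the count again, the scan has already finished, and the
 * inode is left behind: "VFS: Busy inodes after unmount of tmpfs".
 */
static int busy_inodes_after_umount(void)
{
	struct fake_inode inode = { 0, 1 };

	fake_igrab(&inode);		/* reclaim: shmem_writepage()  */
	fake_evict_inodes(&inode);	/* umount: sees i_count == 1   */
	inode.i_count--;		/* reclaim: iput(), too late   */

	return inode.on_sb_list;	/* 1 == busy inode left behind */
}
```

The model shows why raising i_count alone is no defence against unmount: nothing makes generic_shutdown_super() wait for the reference to be dropped.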

> > > 
> > > : Ah, I'd never suspected it, but shmem_writepage's swaplist
> > > manipulation
> > > : is unsafe: though still hold page lock, which would hold off inode
> > > : deletion if the page were i pagecache, it doesn't hold off once it's
> > > in
> > > : swapcache (free_swap_and_cache doesn't wait on locked pages).  Hmm: we
> > > : could put the the inode on swaplist earlier, but then
> > > shmem_unuse_inode
> > > : could never prune unswapped inodes.
> > > 
> > > Attached locked page actually protect inode from deletion because
> > > truncate_inode_pages_range() will sleep on this, so igrab not required.
> > > This patch actually revert last hunk from that commit.
> > > 
> > 
> > hm, is that last paragraph true?  Let's look at the resulting code.
> > 
> > 
> > : 	if (swap.val && add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
> > : 		delete_from_page_cache(page);
> > 
> > Here, the page is removed from inode->i_mapping.  So
> > truncate_inode_pages() won't see that page and will not block on its
> > lock.
> 
> Oops, right. Sorry. It produces a use-after-free race, but it is quiet and
> small.
> My test uses too few files to catch it in a reasonable time,
> and I ran it without slab poisoning.
> 
> So, the v1 patch is correct but a little ugly, while v2 is broken.

Yes.  But even if the v1 patch is correct so far as it goes (and I've
not spent much time considering it, being disapppointed enough with
igrab that I'd rather get away from it and solve the races internally
than rely further upon VFS constructs here), it must be incomplete
since the same problem will apply to the igrab in shmem_unuse_inode too.

> 
> > 
> > : 		shmem_swp_set(info, entry, swap.val);
> > : 		shmem_swp_unmap(entry);
> > : 		spin_unlock(&info->lock);
> > : 		if (list_empty(&info->swaplist)) {
> > : 			mutex_lock(&shmem_swaplist_mutex);
> > : 			/* move instead of add in case we're racing */
> > : 			list_move_tail(&info->swaplist, &shmem_swaplist);
> > : 			mutex_unlock(&shmem_swaplist_mutex);
> > : 		}
> > 
> > Here, the code plays with `info', which points at storage which is
> > embedded within the inode's filesystem-private part.
> > 
> > But because the inode now has no attached locked page, a concurrent
> > umount can free the inode while this code is using it.
> 
> I guess we could try moving delete_from_page_cache(page) to right before
> swap_writepage,
> but that would move it outside info->lock...

You're right to be wary, we do need to do all the swizzling within
info->lock.

> 
> > 
> > : 		swap_shmem_alloc(swap);
> > : 		BUG_ON(page_mapped(page));
> > : 		swap_writepage(page, wbc);
> > : 		return 0;
> > : 	}
> > 
> > However, I assume that you reran your testcase with the v2 patch and
> > that things ran OK.  How come?  Either my analysis is wrong or the
> > testcase doesn't trigger races in this code path?

Here's the patch I was testing last night, but I do want to test it
some more (I've not even tried your unmounting case yet), and I do want
to make some changes to it (some comments, and see if I can move the
mem_cgroup_cache_charge outside of the mutex, making it GFP_KERNEL
rather than GFP_NOFS - at the time that mem_cgroup charging went in,
we did not know here if it was actually a shmem swap page, whereas
nowadays we can be sure, since that's noted in the swap_map).

In shmem_unuse_inode I'm widening the shmem_swaplist_mutex to protect
against shmem_evict_inode; and in shmem_writepage adding to the list
earlier, while holding lock on page still in pagecache to protect it.

But testing last night showed corruption on this laptop (no problem
on other machines): I'm guessing it's unrelated, but I can't be sure
of that without more extended testing.

Hugh

--- 2.6.39-rc5/mm/shmem.c	2011-04-28 09:52:49.066135001 -0700
+++ linux/mm/shmem.c	2011-05-02 21:02:21.745633214 -0700
@@ -852,7 +852,7 @@ static inline int shmem_find_swp(swp_ent
 
 static int shmem_unuse_inode(struct shmem_inode_info *info, swp_entry_t entry, struct page *page)
 {
-	struct inode *inode;
+	struct address_space *mapping;
 	unsigned long idx;
 	unsigned long size;
 	unsigned long limit;
@@ -928,7 +928,7 @@ lost2:
 	return 0;
 found:
 	idx += offset;
-	inode = igrab(&info->vfs_inode);
+	mapping = info->vfs_inode.i_mapping;
 	spin_unlock(&info->lock);
 
 	/*
@@ -940,20 +940,16 @@ found:
 	 */
 	if (shmem_swaplist.next != &info->swaplist)
 		list_move_tail(&shmem_swaplist, &info->swaplist);
-	mutex_unlock(&shmem_swaplist_mutex);
 
-	error = 1;
-	if (!inode)
-		goto out;
 	/*
-	 * Charge page using GFP_KERNEL while we can wait.
+	 * Charge page using GFP_NOFS while we can wait.
 	 * Charged back to the user(not to caller) when swap account is used.
 	 * add_to_page_cache() will be called with GFP_NOWAIT.
 	 */
-	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+	error = mem_cgroup_cache_charge(page, current->mm, GFP_NOFS);
 	if (error)
 		goto out;
-	error = radix_tree_preload(GFP_KERNEL);
+	error = radix_tree_preload(GFP_NOFS);
 	if (error) {
 		mem_cgroup_uncharge_cache_page(page);
 		goto out;
@@ -963,14 +959,14 @@ found:
 	spin_lock(&info->lock);
 	ptr = shmem_swp_entry(info, idx, NULL);
 	if (ptr && ptr->val == entry.val) {
-		error = add_to_page_cache_locked(page, inode->i_mapping,
+		error = add_to_page_cache_locked(page, mapping,
 						idx, GFP_NOWAIT);
 		/* does mem_cgroup_uncharge_cache_page on error */
 	} else	/* we must compensate for our precharge above */
 		mem_cgroup_uncharge_cache_page(page);
 
 	if (error == -EEXIST) {
-		struct page *filepage = find_get_page(inode->i_mapping, idx);
+		struct page *filepage = find_get_page(mapping, idx);
 		error = 1;
 		if (filepage) {
 			/*
@@ -995,9 +991,6 @@ found:
 	spin_unlock(&info->lock);
 	radix_tree_preload_end();
 out:
-	unlock_page(page);
-	page_cache_release(page);
-	iput(inode);		/* allows for NULL */
 	return error;
 }
 
@@ -1016,7 +1009,7 @@ int shmem_unuse(swp_entry_t entry, struc
 		found = shmem_unuse_inode(info, entry, page);
 		cond_resched();
 		if (found)
-			goto out;
+			break;
 	}
 	mutex_unlock(&shmem_swaplist_mutex);
 	/*
@@ -1025,7 +1018,6 @@ int shmem_unuse(swp_entry_t entry, struc
 	 */
 	unlock_page(page);
 	page_cache_release(page);
-out:
 	return (found < 0) ? found : 0;
 }
 
@@ -1039,6 +1031,7 @@ static int shmem_writepage(struct page *
 	struct address_space *mapping;
 	unsigned long index;
 	struct inode *inode;
+	bool unlock_mutex = false;
 
 	BUG_ON(!PageLocked(page));
 	mapping = page->mapping;
@@ -1064,7 +1057,17 @@ static int shmem_writepage(struct page *
 	else
 		swap.val = 0;
 
+	if (swap.val && list_empty(&info->swaplist)) {
+		mutex_lock(&shmem_swaplist_mutex);
+		/* move instead of add in case we're racing */
+		list_move_tail(&info->swaplist, &shmem_swaplist);
+		unlock_mutex = true;
+	}
+
 	spin_lock(&info->lock);
+	if (unlock_mutex)
+		mutex_unlock(&shmem_swaplist_mutex);
+
 	if (index >= info->next_index) {
 		BUG_ON(!(info->flags & SHMEM_TRUNCATE));
 		goto unlock;
@@ -1084,21 +1087,10 @@ static int shmem_writepage(struct page *
 		delete_from_page_cache(page);
 		shmem_swp_set(info, entry, swap.val);
 		shmem_swp_unmap(entry);
-		if (list_empty(&info->swaplist))
-			inode = igrab(inode);
-		else
-			inode = NULL;
 		spin_unlock(&info->lock);
 		swap_shmem_alloc(swap);
 		BUG_ON(page_mapped(page));
 		swap_writepage(page, wbc);
-		if (inode) {
-			mutex_lock(&shmem_swaplist_mutex);
-			/* move instead of add in case we're racing */
-			list_move_tail(&info->swaplist, &shmem_swaplist);
-			mutex_unlock(&shmem_swaplist_mutex);
-			iput(inode);
-		}
 		return 0;
 	}
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] tmpfs: fix race between umount and writepage
  2011-05-03 20:06           ` Hugh Dickins
@ 2011-05-07  5:33             ` Konstantin Khlebnikov
  2011-05-07 23:56               ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2011-05-07  5:33 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-mm, linux-kernel

Hugh Dickins wrote:

<cut>

>
> Here's the patch I was testing last night, but I do want to test it
> some more (I've not even tried your unmounting case yet), and I do want
> to make some changes to it (some comments, and see if I can move the
> mem_cgroup_cache_charge outside of the mutex, making it GFP_KERNEL
> rather than GFP_NOFS - at the time that mem_cgroup charging went in,
> we did not know here if it was actually a shmem swap page, whereas
> nowadays we can be sure, since that's noted in the swap_map).
>
> In shmem_unuse_inode I'm widening the shmem_swaplist_mutex to protect
> against shmem_evict_inode; and in shmem_writepage adding to the list
> earlier, while holding lock on page still in pagecache to protect it.
>
> But testing last night showed corruption on this laptop (no problem
> on other machines): I'm guessing it's unrelated, but I can't be sure
> of that without more extended testing.
>
> Hugh

This patch fixed my problem; I didn't catch any crashes with my swapout-unmount test case.

>
> --- 2.6.39-rc5/mm/shmem.c	2011-04-28 09:52:49.066135001 -0700
> +++ linux/mm/shmem.c	2011-05-02 21:02:21.745633214 -0700
> @@ -852,7 +852,7 @@ static inline int shmem_find_swp(swp_ent
>
>   static int shmem_unuse_inode(struct shmem_inode_info *info, swp_entry_t entry, struct page *page)
>   {
> -	struct inode *inode;
> +	struct address_space *mapping;
>   	unsigned long idx;
>   	unsigned long size;
>   	unsigned long limit;
> @@ -928,7 +928,7 @@ lost2:
>   	return 0;
>   found:
>   	idx += offset;
> -	inode = igrab(&info->vfs_inode);
> +	mapping = info->vfs_inode.i_mapping;
>   	spin_unlock(&info->lock);
>
>   	/*
> @@ -940,20 +940,16 @@ found:
>   	 */
>   	if (shmem_swaplist.next != &info->swaplist)
>   		list_move_tail(&shmem_swaplist, &info->swaplist);
> -	mutex_unlock(&shmem_swaplist_mutex);
>
> -	error = 1;
> -	if (!inode)
> -		goto out;
>   	/*
> -	 * Charge page using GFP_KERNEL while we can wait.
> +	 * Charge page using GFP_NOFS while we can wait.
>   	 * Charged back to the user(not to caller) when swap account is used.
>   	 * add_to_page_cache() will be called with GFP_NOWAIT.
>   	 */
> -	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
> +	error = mem_cgroup_cache_charge(page, current->mm, GFP_NOFS);
>   	if (error)
>   		goto out;
> -	error = radix_tree_preload(GFP_KERNEL);
> +	error = radix_tree_preload(GFP_NOFS);
>   	if (error) {
>   		mem_cgroup_uncharge_cache_page(page);
>   		goto out;
> @@ -963,14 +959,14 @@ found:
>   	spin_lock(&info->lock);
>   	ptr = shmem_swp_entry(info, idx, NULL);
>   	if (ptr && ptr->val == entry.val) {
> -		error = add_to_page_cache_locked(page, inode->i_mapping,
> +		error = add_to_page_cache_locked(page, mapping,
>   						idx, GFP_NOWAIT);
>   		/* does mem_cgroup_uncharge_cache_page on error */
>   	} else	/* we must compensate for our precharge above */
>   		mem_cgroup_uncharge_cache_page(page);
>
>   	if (error == -EEXIST) {
> -		struct page *filepage = find_get_page(inode->i_mapping, idx);
> +		struct page *filepage = find_get_page(mapping, idx);
>   		error = 1;
>   		if (filepage) {
>   			/*
> @@ -995,9 +991,6 @@ found:
>   	spin_unlock(&info->lock);
>   	radix_tree_preload_end();
>   out:
> -	unlock_page(page);
> -	page_cache_release(page);
> -	iput(inode);		/* allows for NULL */
>   	return error;
>   }
>
> @@ -1016,7 +1009,7 @@ int shmem_unuse(swp_entry_t entry, struc
>   		found = shmem_unuse_inode(info, entry, page);
>   		cond_resched();
>   		if (found)
> -			goto out;
> +			break;
>   	}
>   	mutex_unlock(&shmem_swaplist_mutex);
>   	/*
> @@ -1025,7 +1018,6 @@ int shmem_unuse(swp_entry_t entry, struc
>   	 */
>   	unlock_page(page);
>   	page_cache_release(page);
> -out:
>   	return (found < 0) ? found : 0;
>   }
>
> @@ -1039,6 +1031,7 @@ static int shmem_writepage(struct page *
>   	struct address_space *mapping;
>   	unsigned long index;
>   	struct inode *inode;
> +	bool unlock_mutex = false;
>
>   	BUG_ON(!PageLocked(page));
>   	mapping = page->mapping;
> @@ -1064,7 +1057,17 @@ static int shmem_writepage(struct page *
>   	else
>   		swap.val = 0;
>
> +	if (swap.val && list_empty(&info->swaplist)) {
> +		mutex_lock(&shmem_swaplist_mutex);
> +		/* move instead of add in case we're racing */
> +		list_move_tail(&info->swaplist, &shmem_swaplist);
> +		unlock_mutex = true;
> +	}
> +
>   	spin_lock(&info->lock);
> +	if (unlock_mutex)
> +		mutex_unlock(&shmem_swaplist_mutex);
> +
>   	if (index >= info->next_index) {
>   		BUG_ON(!(info->flags & SHMEM_TRUNCATE));
>   		goto unlock;
> @@ -1084,21 +1087,10 @@ static int shmem_writepage(struct page *
>   		delete_from_page_cache(page);
>   		shmem_swp_set(info, entry, swap.val);
>   		shmem_swp_unmap(entry);
> -		if (list_empty(&info->swaplist))
> -			inode = igrab(inode);
> -		else
> -			inode = NULL;
>   		spin_unlock(&info->lock);
>   		swap_shmem_alloc(swap);
>   		BUG_ON(page_mapped(page));
>   		swap_writepage(page, wbc);
> -		if (inode) {
> -			mutex_lock(&shmem_swaplist_mutex);
> -			/* move instead of add in case we're racing */
> -		list_move_tail(&info->swaplist, &shmem_swaplist);
> -			mutex_unlock(&shmem_swaplist_mutex);
> -			iput(inode);
> -		}
>   		return 0;
>   	}
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] tmpfs: fix race between umount and writepage
  2011-05-07  5:33             ` Konstantin Khlebnikov
@ 2011-05-07 23:56               ` Hugh Dickins
  2011-05-08 12:51                 ` Konstantin Khlebnikov
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2011-05-07 23:56 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Andrew Morton, linux-mm, linux-kernel

On Sat, 7 May 2011, Konstantin Khlebnikov wrote:
> Hugh Dickins wrote:
> 
> > Here's the patch I was testing last night, but I do want to test it
> > some more (I've not even tried your unmounting case yet), and I do want
> > to make some changes to it (some comments, and see if I can move the
> > mem_cgroup_cache_charge outside of the mutex, making it GFP_KERNEL
> > rather than GFP_NOFS - at the time that mem_cgroup charging went in,
> > we did not know here if it was actually a shmem swap page, whereas
> > nowadays we can be sure, since that's noted in the swap_map).
> > 
> > In shmem_unuse_inode I'm widening the shmem_swaplist_mutex to protect
> > against shmem_evict_inode; and in shmem_writepage adding to the list
> > earlier, while holding lock on page still in pagecache to protect it.
> > 
> > But testing last night showed corruption on this laptop (no problem
> > on other machines): I'm guessing it's unrelated, but I can't be sure
> > of that without more extended testing.
> 
> This patch fixed my problem; I didn't catch any crashes with my
> swapout-unmount test case.

Thank you, Konstantin, for testing that and reporting back.

I tried using your script on Thursday, but couldn't get the tuning right
for this machine: with numbers too big everything would go OOM, with
numbers too small it wouldn't even go to swap, with numbers on the edge
it would soon settle into a steady state with almost nothing in swap.

Just once, without the patch, I did get to "Self-destruct in 5 seconds",
but that was not reproducible enough for me to test that the patch would
be fixing anything.

I was going to try today on other machines with more cpus and more memory,
though not as much as yours; but now I'll let your report save me the time,
and just add your Tested-by.  Big thank you for that!

Besides adding comments, I have changed the patch around since then, at
the shmem_unuse_inode end: to avoid any memory allocation while holding
the mutex (and then we no longer need to drop and retake info->lock,
so it gets a little simpler).  It would be dishonest of me to claim your
Tested-by for the changed code (and your mount/write/umount loop would
not have been testing swapoff): since it is an independent fix with a
different justification, I'll split that part off into a 2/3.

3/3 being the fix to the "corruption" I noticed while testing, corruption
being on the filesystem I had on /dev/loop0, over a tmpfs file filling its
filesystem: when I wrote, I'd missed the "I/O" errors in /var/log/messages.

It was another case of a long-standing but largely theoretical race,
now made easily reproducible by recent changes (the preallocation in
between find_lock_page and taking info->lock): when the filesystem is
full, you could get ENOSPC from a race in bringing back a previously
allocated page from swap.

I'll write these three up now and send off to Andrew.

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] tmpfs: fix race between umount and writepage
  2011-05-07 23:56               ` Hugh Dickins
@ 2011-05-08 12:51                 ` Konstantin Khlebnikov
  2011-05-08 19:36                   ` Hugh Dickins
  2011-05-08 19:41                   ` [PATCH 1/3] " Hugh Dickins
  0 siblings, 2 replies; 17+ messages in thread
From: Konstantin Khlebnikov @ 2011-05-08 12:51 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-mm, linux-kernel

Hugh Dickins wrote:
> On Sat, 7 May 2011, Konstantin Khlebnikov wrote:
>> Hugh Dickins wrote:
>>
>>> Here's the patch I was testing last night, but I do want to test it
>>> some more (I've not even tried your unmounting case yet), and I do want
>>> to make some changes to it (some comments, and see if I can move the
>>> mem_cgroup_cache_charge outside of the mutex, making it GFP_KERNEL
>>> rather than GFP_NOFS - at the time that mem_cgroup charging went in,
>>> we did not know here if it was actually a shmem swap page, whereas
>>> nowadays we can be sure, since that's noted in the swap_map).
>>>
>>> In shmem_unuse_inode I'm widening the shmem_swaplist_mutex to protect
>>> against shmem_evict_inode; and in shmem_writepage adding to the list
>>> earlier, while holding lock on page still in pagecache to protect it.
>>>
>>> But testing last night showed corruption on this laptop (no problem
>>> on other machines): I'm guessing it's unrelated, but I can't be sure
>>> of that without more extended testing.
>>
>> This patch fixed my problem, I didn't catch any crashes on my test-case:
>> swapout-unmount.
>
> Thank you, Konstantin, for testing that and reporting back.
>
> I tried using your script on Thursday, but couldn't get the tuning right
> for this machine: with numbers too big everything would go OOM, with
> numbers too small it wouldn't even go to swap, with numbers on the edge
> it would soon settle into a steady state with almost nothing in swap.
>
> Just once, without the patch, I did get to "Self-destruct in 5 seconds",
> but that was not reproducible enough for me to test that the patch would
> be fixing anything.
>
> I was going to try today on other machines with more cpus and more memory,
> though not as much as yours; but now I'll let your report save me the time,
> and just add your Tested-by.  Big thank you for that!
>
> Besides adding comments, I have changed the patch around since then, at
> the shmem_unuse_inode end: to avoid any memory allocation while holding
> the mutex (and then we no longer need to drop and retake info->lock,
> so it gets a little simpler).  It would be dishonest of me to claim your
> Tested-by for the changed code (and your mount/write/umount loop would
> not have been testing swapoff): since it is an independent fix with a
> different justification, I'll split that part off into a 2/3.

Ok, I can test the final patch set next week.
Also I can try to add some swapoff test cases.

>
> 3/3 being the fix to the "corruption" I noticed while testing, corruption
> being on the filesystem I had on /dev/loop0, over a tmpfs file filling its
> filesystem: when I wrote, I'd missed the "I/O" errors in /var/log/messages.
>
> It was another case of a long-standing but largely theoretical race,
> now made easily reproducible by recent changes (the preallocation in
> between find_lock_page and taking info->lock): when the filesystem is
> full, you could get ENOSPC from a race in bringing back a previously
> allocated page from swap.
>
> I'll write these three up now and send off to Andrew.
>
> Hugh


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] tmpfs: fix race between umount and writepage
  2011-05-08 12:51                 ` Konstantin Khlebnikov
@ 2011-05-08 19:36                   ` Hugh Dickins
  2011-05-10  9:52                     ` Konstantin Khlebnikov
  2011-05-08 19:41                   ` [PATCH 1/3] " Hugh Dickins
  1 sibling, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2011-05-08 19:36 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Andrew Morton, linux-mm, linux-kernel

On Sun, 8 May 2011, Konstantin Khlebnikov wrote:
> 
> Ok, I can test the final patch set next week.
> Also I can try to add some swapoff test cases.

That would be helpful if you have the time: thank you.

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3] tmpfs: fix race between umount and writepage
  2011-05-08 12:51                 ` Konstantin Khlebnikov
  2011-05-08 19:36                   ` Hugh Dickins
@ 2011-05-08 19:41                   ` Hugh Dickins
  2011-05-08 19:43                     ` [PATCH 2/3] tmpfs: fix race between umount and swapoff Hugh Dickins
  2011-05-08 19:45                     ` [PATCH 3/3] tmpfs: fix spurious ENOSPC when racing with unswap Hugh Dickins
  1 sibling, 2 replies; 17+ messages in thread
From: Hugh Dickins @ 2011-05-08 19:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Konstantin Khlebnikov, linux-mm, linux-kernel

Konstantin Khlebnikov reports that a dangerous race between umount and
shmem_writepage can be reproduced by this script:

for i in {1..300} ; do
	mkdir $i
	while true ; do
		mount -t tmpfs none $i
		dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100))
		umount $i
	done &
done

on a 6-CPU node with 8GB of RAM: the kernel is very unstable after this accident. =)

Kernel log:

VFS: Busy inodes after unmount of tmpfs.
               Self-destruct in 5 seconds.  Have a nice day...

WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
list_del corruption. prev->next should be ffff880222fdaac8, but was (null)
Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4
Call Trace:
 warn_slowpath_common+0x80/0x98
 warn_slowpath_fmt+0x41/0x43
 __list_del_entry+0x8d/0x98
 evict+0x50/0x113
 iput+0x138/0x141
...
BUG: unable to handle kernel paging request at ffffffffffffffff
IP: shmem_free_blocks+0x18/0x4c
Pid: 10422, comm: dd Tainted: G        W   2.6.39-rc2+ #4
Call Trace:
 shmem_recalc_inode+0x61/0x66
 shmem_writepage+0xba/0x1dc
 pageout+0x13c/0x24c
 shrink_page_list+0x28e/0x4be
 shrink_inactive_list+0x21f/0x382
...

shmem_writepage() calls igrab() on the inode for the page which came from
page reclaim, to add it later into shmem_swaplist for swapoff operation.

This igrab() can race with super-block deactivating process:

shrink_inactive_list()		deactivate_super()
pageout()			tmpfs_fs_type->kill_sb()
shmem_writepage()		kill_litter_super()
				generic_shutdown_super()
				 evict_inodes()
 igrab()
				  atomic_read(&inode->i_count)
				   skip-inode
 iput()
				 if (!list_empty(&sb->s_inodes))
					printk("VFS: Busy inodes after...

This igrab-iput pair was added in commit 1b1b32f2c6f6
"tmpfs: fix shmem_swaplist races" based on incorrect assumptions:
igrab() protects the inode from concurrent eviction by deletion, but
it does nothing to protect it from concurrent unmounting, which goes
ahead despite the raised i_count.

So this use of igrab() was wrong all along, but the race was made much
worse in 2.6.37 when commit 63997e98a3be "split invalidate_inodes()"
replaced two attempts at invalidate_inodes() by a single evict_inodes().

Konstantin posted a plausible patch, raising sb->s_active too:
I'm unsure whether it was correct or not; but burnt once by igrab(),
I am sure that we don't want to rely more deeply upon externals here.

Fix it by adding the inode to shmem_swaplist earlier, while the page
lock on a page still in page cache secures the inode against eviction,
without artificially raising i_count.  It was originally added later
because shmem_unuse_inode() is liable to remove an inode from the
list while it's unswapped; but we can guard against that by taking
spinlock before dropping mutex.
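
The "taking spinlock before dropping mutex" handoff can be sketched in userspace. This is a hedged model, not the patch itself: both locks are plain pthread mutexes standing in for shmem_swaplist_mutex and the info->lock spinlock, and list membership and swap state are collapsed into two flags.

```c
/*
 * Userspace sketch of the lock handoff in the fixed shmem_writepage().
 * All names are illustrative; real kernel code uses a mutex and a
 * spinlock, modelled here as two pthread mutexes.
 */
#include <pthread.h>

static pthread_mutex_t swaplist_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t info_lock = PTHREAD_MUTEX_INITIALIZER;

static int on_swaplist;	/* models !list_empty(&info->swaplist) */
static int swapped;	/* models the page having gone to swap */

/* writepage side: add to the list first, then take info->lock, and only
 * then drop the mutex, so the pruner can never see a half-done state */
static void writepage_path(void)
{
	pthread_mutex_lock(&swaplist_mutex);
	on_swaplist = 1;			/* list_move_tail(...)  */
	pthread_mutex_lock(&info_lock);
	pthread_mutex_unlock(&swaplist_mutex);	/* safe: info_lock held */
	swapped = 1;				/* shmem_swp_set(...)   */
	pthread_mutex_unlock(&info_lock);
}

/* unuse side: prune a !swapped inode only under BOTH locks, which is
 * what makes the early unlock in writepage_path() safe */
static int prune_if_unswapped(void)
{
	int pruned = 0;

	pthread_mutex_lock(&swaplist_mutex);
	pthread_mutex_lock(&info_lock);
	if (on_swaplist && !swapped) {
		on_swaplist = 0;
		pruned = 1;
	}
	pthread_mutex_unlock(&info_lock);
	pthread_mutex_unlock(&swaplist_mutex);
	return pruned;
}
```

Because the pruner needs both locks, any inode the writepage side has put on the list is either still protected by the mutex or already marked swapped by the time the locks are released.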

Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: stable@kernel.org
---

 mm/shmem.c |   31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

--- 2.6.39-rc6/mm/shmem.c	2011-04-28 09:52:49.066135001 -0700
+++ tmpfs1/mm/shmem.c	2011-05-07 17:38:00.648660817 -0700
@@ -1039,6 +1039,7 @@ static int shmem_writepage(struct page *
 	struct address_space *mapping;
 	unsigned long index;
 	struct inode *inode;
+	bool unlock_mutex = false;
 
 	BUG_ON(!PageLocked(page));
 	mapping = page->mapping;
@@ -1064,7 +1065,26 @@ static int shmem_writepage(struct page *
 	else
 		swap.val = 0;
 
+	/*
+	 * Add inode to shmem_unuse()'s list of swapped-out inodes,
+	 * if it's not already there.  Do it now because we cannot take
+	 * mutex while holding spinlock, and must do so before the page
+	 * is moved to swap cache, when its pagelock no longer protects
+	 * the inode from eviction.  But don't unlock the mutex until
+	 * we've taken the spinlock, because shmem_unuse_inode() will
+	 * prune a !swapped inode from the swaplist under both locks.
+	 */
+	if (swap.val && list_empty(&info->swaplist)) {
+		mutex_lock(&shmem_swaplist_mutex);
+		/* move instead of add in case we're racing */
+		list_move_tail(&info->swaplist, &shmem_swaplist);
+		unlock_mutex = true;
+	}
+
 	spin_lock(&info->lock);
+	if (unlock_mutex)
+		mutex_unlock(&shmem_swaplist_mutex);
+
 	if (index >= info->next_index) {
 		BUG_ON(!(info->flags & SHMEM_TRUNCATE));
 		goto unlock;
@@ -1084,21 +1104,10 @@ static int shmem_writepage(struct page *
 		delete_from_page_cache(page);
 		shmem_swp_set(info, entry, swap.val);
 		shmem_swp_unmap(entry);
-		if (list_empty(&info->swaplist))
-			inode = igrab(inode);
-		else
-			inode = NULL;
 		spin_unlock(&info->lock);
 		swap_shmem_alloc(swap);
 		BUG_ON(page_mapped(page));
 		swap_writepage(page, wbc);
-		if (inode) {
-			mutex_lock(&shmem_swaplist_mutex);
-			/* move instead of add in case we're racing */
-			list_move_tail(&info->swaplist, &shmem_swaplist);
-			mutex_unlock(&shmem_swaplist_mutex);
-			iput(inode);
-		}
 		return 0;
 	}
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/3] tmpfs: fix race between umount and swapoff
  2011-05-08 19:41                   ` [PATCH 1/3] " Hugh Dickins
@ 2011-05-08 19:43                     ` Hugh Dickins
  2011-05-08 19:45                     ` [PATCH 3/3] tmpfs: fix spurious ENOSPC when racing with unswap Hugh Dickins
  1 sibling, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2011-05-08 19:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Konstantin Khlebnikov, linux-mm, linux-kernel

The use of igrab() in swapoff's shmem_unuse_inode() is just as vulnerable
to umount as that in shmem_writepage().

Fix this instance by extending the protection of shmem_swaplist_mutex
right across shmem_unuse_inode(): while it's on the list, the inode
cannot be evicted (and the filesystem cannot be unmounted) without
shmem_evict_inode() taking that mutex to remove it from the list.

But since shmem_writepage() might take that mutex, we should avoid
making memory allocations or memcg charges while holding it: prepare
them at the outer level in shmem_unuse().  When mem_cgroup_cache_charge()
was originally placed, we didn't know until that point that the page from
swap was actually a shmem page; but nowadays it's noted in the swap_map,
so we're safe to charge upfront.  For the radix_tree, do as is done in
shmem_getpage(): preload upfront, but don't pin to the cpu; so we make
a habit of refreshing the node pool, but might dip into GFP_NOWAIT
reserves on occasion if subsequently preempted.
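
The preload-upfront pattern described above can be modelled in userspace. This is an illustrative sketch under stated assumptions, not the radix-tree API: a plain counter stands in for the per-CPU preload pool, and the GFP_NOWAIT path simply fails when the pool is empty rather than dipping into atomic reserves.

```c
/*
 * Userspace model of "preload upfront, allocate under locks".
 * All names are hypothetical stand-ins for radix_tree_preload() and
 * the GFP_NOWAIT allocation path.
 */
#define POOL_MAX 4

static int node_pool;	/* preallocated nodes available */

/* may sleep (GFP_KERNEL): top up the pool before taking any locks */
static void preload_nodes(void)
{
	node_pool = POOL_MAX;
}

/* never sleeps (GFP_NOWAIT): take a node from the pool if one is left */
static int alloc_node_nowait(void)
{
	if (node_pool > 0) {
		node_pool--;
		return 0;	/* got a preallocated node */
	}
	return -1;		/* pool drained, e.g. after preemption */
}

/* the shape of the fixed path: refresh the pool outside the mutex,
 * then allocate without sleeping while the locks are held */
static int insert_entry(void)
{
	preload_nodes();	/* outside shmem_swaplist_mutex */
	/* ... mutex and info->lock would be taken here ... */
	return alloc_node_nowait();
}
```

Not pinning to the CPU means another task may drain the pool between preload and insert; the sketch's failure return corresponds to falling back on GFP_NOWAIT reserves in that rare case.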

With the allocation and charge moved out from shmem_unuse_inode(),
we can also hold index map and info->lock over from finding the entry.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: stable@kernel.org
---

 mm/shmem.c |   88 +++++++++++++++++++++++--------------------------
 1 file changed, 43 insertions(+), 45 deletions(-)

--- tmpfs1/mm/shmem.c	2011-05-07 17:38:00.648660817 -0700
+++ tmpfs2/mm/shmem.c	2011-05-07 17:39:00.656959448 -0700
@@ -852,7 +852,7 @@ static inline int shmem_find_swp(swp_ent
 
 static int shmem_unuse_inode(struct shmem_inode_info *info, swp_entry_t entry, struct page *page)
 {
-	struct inode *inode;
+	struct address_space *mapping;
 	unsigned long idx;
 	unsigned long size;
 	unsigned long limit;
@@ -875,8 +875,10 @@ static int shmem_unuse_inode(struct shme
 	if (size > SHMEM_NR_DIRECT)
 		size = SHMEM_NR_DIRECT;
 	offset = shmem_find_swp(entry, ptr, ptr+size);
-	if (offset >= 0)
+	if (offset >= 0) {
+		shmem_swp_balance_unmap();
 		goto found;
+	}
 	if (!info->i_indirect)
 		goto lost2;
 
@@ -914,11 +916,11 @@ static int shmem_unuse_inode(struct shme
 			if (size > ENTRIES_PER_PAGE)
 				size = ENTRIES_PER_PAGE;
 			offset = shmem_find_swp(entry, ptr, ptr+size);
-			shmem_swp_unmap(ptr);
 			if (offset >= 0) {
 				shmem_dir_unmap(dir);
 				goto found;
 			}
+			shmem_swp_unmap(ptr);
 		}
 	}
 lost1:
@@ -928,8 +930,7 @@ lost2:
 	return 0;
 found:
 	idx += offset;
-	inode = igrab(&info->vfs_inode);
-	spin_unlock(&info->lock);
+	ptr += offset;
 
 	/*
 	 * Move _head_ to start search for next from here.
@@ -940,37 +941,18 @@ found:
 	 */
 	if (shmem_swaplist.next != &info->swaplist)
 		list_move_tail(&shmem_swaplist, &info->swaplist);
-	mutex_unlock(&shmem_swaplist_mutex);
 
-	error = 1;
-	if (!inode)
-		goto out;
 	/*
-	 * Charge page using GFP_KERNEL while we can wait.
-	 * Charged back to the user(not to caller) when swap account is used.
-	 * add_to_page_cache() will be called with GFP_NOWAIT.
+	 * We rely on shmem_swaplist_mutex, not only to protect the swaplist,
+	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
+	 * beneath us (pagelock doesn't help until the page is in pagecache).
 	 */
-	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
-	if (error)
-		goto out;
-	error = radix_tree_preload(GFP_KERNEL);
-	if (error) {
-		mem_cgroup_uncharge_cache_page(page);
-		goto out;
-	}
-	error = 1;
-
-	spin_lock(&info->lock);
-	ptr = shmem_swp_entry(info, idx, NULL);
-	if (ptr && ptr->val == entry.val) {
-		error = add_to_page_cache_locked(page, inode->i_mapping,
-						idx, GFP_NOWAIT);
-		/* does mem_cgroup_uncharge_cache_page on error */
-	} else	/* we must compensate for our precharge above */
-		mem_cgroup_uncharge_cache_page(page);
+	mapping = info->vfs_inode.i_mapping;
+	error = add_to_page_cache_locked(page, mapping, idx, GFP_NOWAIT);
+	/* which does mem_cgroup_uncharge_cache_page on error */
 
 	if (error == -EEXIST) {
-		struct page *filepage = find_get_page(inode->i_mapping, idx);
+		struct page *filepage = find_get_page(mapping, idx);
 		error = 1;
 		if (filepage) {
 			/*
@@ -990,14 +972,8 @@ found:
 		swap_free(entry);
 		error = 1;	/* not an error, but entry was found */
 	}
-	if (ptr)
-		shmem_swp_unmap(ptr);
+	shmem_swp_unmap(ptr);
 	spin_unlock(&info->lock);
-	radix_tree_preload_end();
-out:
-	unlock_page(page);
-	page_cache_release(page);
-	iput(inode);		/* allows for NULL */
 	return error;
 }
 
@@ -1009,6 +985,26 @@ int shmem_unuse(swp_entry_t entry, struc
 	struct list_head *p, *next;
 	struct shmem_inode_info *info;
 	int found = 0;
+	int error;
+
+	/*
+	 * Charge page using GFP_KERNEL while we can wait, before taking
+	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
+	 * Charged back to the user (not to caller) when swap account is used.
+	 * add_to_page_cache() will be called with GFP_NOWAIT.
+	 */
+	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+	if (error)
+		goto out;
+	/*
+	 * Try to preload while we can wait, to not make a habit of
+	 * draining atomic reserves; but don't latch on to this cpu,
+	 * it's okay if sometimes we get rescheduled after this.
+	 */
+	error = radix_tree_preload(GFP_KERNEL);
+	if (error)
+		goto uncharge;
+	radix_tree_preload_end();
 
 	mutex_lock(&shmem_swaplist_mutex);
 	list_for_each_safe(p, next, &shmem_swaplist) {
@@ -1016,17 +1012,19 @@ int shmem_unuse(swp_entry_t entry, struc
 		found = shmem_unuse_inode(info, entry, page);
 		cond_resched();
 		if (found)
-			goto out;
+			break;
 	}
 	mutex_unlock(&shmem_swaplist_mutex);
-	/*
-	 * Can some race bring us here?  We've been holding page lock,
-	 * so I think not; but would rather try again later than BUG()
-	 */
+
+uncharge:
+	if (!found)
+		mem_cgroup_uncharge_cache_page(page);
+	if (found < 0)
+		error = found;
+out:
 	unlock_page(page);
 	page_cache_release(page);
-out:
-	return (found < 0) ? found : 0;
+	return error;
 }
 
 /*


* [PATCH 3/3] tmpfs: fix spurious ENOSPC when racing with unswap
  2011-05-08 19:41                   ` [PATCH 1/3] " Hugh Dickins
  2011-05-08 19:43                     ` [PATCH 2/3] tmpfs: fix race between umount and swapoff Hugh Dickins
@ 2011-05-08 19:45                     ` Hugh Dickins
  1 sibling, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2011-05-08 19:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Konstantin Khlebnikov, linux-mm, linux-kernel

Testing the shmem_swaplist replacements for igrab() revealed another bug:
writes to /dev/loop0 on a tmpfs file which fills its filesystem were
sometimes failing with "Buffer I/O error"s.

These came from ENOSPC failures of shmem_getpage(), when racing with
swapoff: the same could happen when racing with another shmem_getpage(),
pulling the page in from swap in between our find_lock_page() and our
taking the info->lock (though not in the single-threaded loop case).

This is unacceptable, and it is surprising that I've not noticed it before:
it dates back many years, but (presumably) was made much easier to
reproduce in 2.6.36, which sited a page preallocation in the race window.

Fix it by rechecking the page cache before settling on an ENOSPC error.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: stable@kernel.org
---

 mm/shmem.c |   32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

--- tmpfs2/mm/shmem.c	2011-05-07 17:39:00.656959448 -0700
+++ tmpfs3/mm/shmem.c	2011-05-07 17:40:00.665256101 -0700
@@ -1407,20 +1407,14 @@ repeat:
 		if (sbinfo->max_blocks) {
 			if (percpu_counter_compare(&sbinfo->used_blocks,
 						sbinfo->max_blocks) >= 0 ||
-			    shmem_acct_block(info->flags)) {
-				spin_unlock(&info->lock);
-				error = -ENOSPC;
-				goto failed;
-			}
+			    shmem_acct_block(info->flags))
+				goto nospace;
 			percpu_counter_inc(&sbinfo->used_blocks);
 			spin_lock(&inode->i_lock);
 			inode->i_blocks += BLOCKS_PER_PAGE;
 			spin_unlock(&inode->i_lock);
-		} else if (shmem_acct_block(info->flags)) {
-			spin_unlock(&info->lock);
-			error = -ENOSPC;
-			goto failed;
-		}
+		} else if (shmem_acct_block(info->flags))
+			goto nospace;
 
 		if (!filepage) {
 			int ret;
@@ -1500,6 +1494,24 @@ done:
 	error = 0;
 	goto out;
 
+nospace:
+	/*
+	 * Perhaps the page was brought in from swap between find_lock_page
+	 * and taking info->lock?  We allow for that at add_to_page_cache_lru,
+	 * but must also avoid reporting a spurious ENOSPC while working on a
+	 * full tmpfs.  (When filepage has been passed in to shmem_getpage, it
+	 * is already in page cache, which prevents this race from occurring.)
+	 */
+	if (!filepage) {
+		struct page *page = find_get_page(mapping, idx);
+		if (page) {
+			spin_unlock(&info->lock);
+			page_cache_release(page);
+			goto repeat;
+		}
+	}
+	spin_unlock(&info->lock);
+	error = -ENOSPC;
 failed:
 	if (*pagep != filepage) {
 		unlock_page(filepage);


* Re: [PATCH v2] tmpfs: fix race between umount and writepage
  2011-05-08 19:36                   ` Hugh Dickins
@ 2011-05-10  9:52                     ` Konstantin Khlebnikov
  2011-05-10 18:55                       ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2011-05-10  9:52 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-mm, linux-kernel

Hugh Dickins wrote:
> On Sun, 8 May 2011, Konstantin Khlebnikov wrote:
>>
>> Ok, I can test final patch-set on the next week.
>> Also I can try to add some swapoff test-cases.
>
> That would be helpful if you have the time: thank you.

I confirm that patch 1/3 really fixes the race between writepage and umount, as expected.

In patch 2/3: the race window between unlock_page and iput is extremely small.
My test works fine in parallel with thirty random swapon-swapoff cycles,
but it also works without this patch, so I cannot catch this race.

I applied patch 3/3 too, but have not tested that case.


* Re: [PATCH v2] tmpfs: fix race between umount and writepage
  2011-05-10  9:52                     ` Konstantin Khlebnikov
@ 2011-05-10 18:55                       ` Hugh Dickins
  0 siblings, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2011-05-10 18:55 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Andrew Morton, linux-mm, linux-kernel

On Tue, May 10, 2011 at 2:52 AM, Konstantin Khlebnikov
<khlebnikov@parallels.com> wrote:
> Hugh Dickins wrote:
>>
>> On Sun, 8 May 2011, Konstantin Khlebnikov wrote:
>>>
>>> Ok, I can test final patch-set on the next week.
>>> Also I can try to add some swapoff test-cases.
>>
>> That would be helpful if you have the time: thank you.
>
> I confirm that patch 1/3 really fixes the race between writepage and umount,
> as expected.

Good, thank you (but that path was identical to what you'd already tested).

>
> In patch 2/3: the race window between unlock_page and iput is extremely small.

(I should clarify that the main race window is actually much wider
than that.  That page lock is only effective at holding off
shmem_evict_inode() while the page is in the file's pagecache -
between the (old positioning of) mutex_unlock(&shmem_swaplist_mutex)
and the add_to_page_cache_locked(), the page is just in swapcache and
so not recognizably attached to the file: shmem_evict_inode() will
call shmem_truncate_range(), and that would find the swp_entry_t, but
it frees it with a free_swap_and_cache() - which does not wait if it
cannot trylock the page.)

> My test works fine in parallel with thirty random swapon-swapoff cycles,
> but it also works without this patch, so I cannot catch this race.

Thanks for trying.  Given my difficulty in reproducing your umount
case, I'm not at all surprised that you didn't manage to reproduce
this swapoff case.  Indeed, I didn't even try to reproduce it myself:
I just saw the theoretical possibility once you'd warned me of
igrab(), and tested that this igrab-less approach works as well as the
old approach, without risking that race.

>
> I applied patch 3/3 too, but have not tested that case.

Fine, that part I could reproduce fairly easily for myself, and the
fix tested out fine.

Thanks,
Hugh


end of thread, other threads:[~2011-05-10 18:55 UTC | newest]

Thread overview: 17+ messages
2011-04-05 10:34 [PATCH] tmpfs: fix race between umount and writepage Konstantin Khlebnikov
2011-04-08 12:27 ` Konstantin Khlebnikov
2011-04-20 20:04 ` Andrew Morton
2011-04-21  6:37   ` Konstantin Khlebnikov
2011-04-21  6:41     ` [PATCH v2] " Konstantin Khlebnikov
2011-04-21 19:44       ` Andrew Morton
2011-04-22  4:05         ` Konstantin Khlebnikov
2011-05-03 20:06           ` Hugh Dickins
2011-05-07  5:33             ` Konstantin Khlebnikov
2011-05-07 23:56               ` Hugh Dickins
2011-05-08 12:51                 ` Konstantin Khlebnikov
2011-05-08 19:36                   ` Hugh Dickins
2011-05-10  9:52                     ` Konstantin Khlebnikov
2011-05-10 18:55                       ` Hugh Dickins
2011-05-08 19:41                   ` [PATCH 1/3] " Hugh Dickins
2011-05-08 19:43                     ` [PATCH 2/3] tmpfs: fix race between umount and swapoff Hugh Dickins
2011-05-08 19:45                     ` [PATCH 3/3] tmpfs: fix spurious ENOSPC when racing with unswap Hugh Dickins
