On 06/14/2011 11:29 AM, Lino Sanfilippo wrote:
> Hi Eric,
>
> After the patch series "decouple mark lock and marks fsobject lock"
> I sent on Feb. 21 this is another attempt to change the locking order
> used in fsnotify from
>
> mark->lock
> group->mark_lock
> inode/mnt->lock
> to
> group->mark_lock
> mark->lock
> inode/mnt->lock
>
> The former series still contained some flaws:
> 1. inodes/mounts that have marks and are not pinned could disappear while
>    another thread still wants to access a mark's inode/mount
>    (see https://lkml.org/lkml/2011/3/1/373)
> 2. in the newly introduced function remove_mark_locked() a group could be freed
>    while the group's mark mutex is held
>
> Both issues should be fixed with this series.
>
> The reason for changing the locking order is, as you may remember, that there
> are still some races when adding/removing marks to a group
> (see also https://lkml.org/lkml/2010/11/30/189).
>
> So what this series does is change the locking order (PATCH 4) to
>
> group->mark_mutex
> mark->lock
> inode/mnt->lock
>
> Besides this, the group's mark_lock is turned into a mutex (PATCH 6), which
> allows us to call functions that may sleep while this lock is held.
>
> At some places there is only a mark and no group available
> (i.e. clear_marks_by_[inode|mount]()), so we first have to get the group from
> the mark to use the group->mark_mutex (PATCH 7).
>
> Since we can't get a group from a mark and use it without the danger that the
> group is unregistered and freed, reference counting for groups is introduced
> and used (PATCHES 1 to 3) for each mark that is added to the group.
> With this, freeing a group no longer depends on the number of marks; the group
> is simply destroyed when the reference counter hits 0.
>
> Since a group ref is now taken for each mark that is added to the group, we
> first have to explicitly call fsnotify_destroy_group() to remove all marks
> from the group and release all group references held by those marks. This will
> also release the - possibly final - ref of the group held by the caller
> (PATCH 1).
>
> Furthermore, we can now take the mark lock with the group mutex held, so we
> don't need an extra "clear list" any more if we clear all marks by group
> (PATCH 9).
>
> For [add|remove]_mark(), locked versions are introduced (PATCH 8) that can be
> used if a caller already holds the mark lock. Thus we now have the possibility
> to use the group's mark mutex to also synchronize addition/removal of a mark,
> or to replace the dnotify mutex.
> This is not part of these patches. It would be the next step if these patches
> are accepted to be merged.
>
> This is against 2.6.39

I finally built and tested a v3.0 kernel with these patches (I know I'm
SOOOOOO far behind). Not what I hoped for:

> [ 150.937798] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds. Have a nice day...
> [ 150.945290] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
> [ 150.946012] IP: [] shmem_free_inode+0x18/0x50
> [ 150.946012] PGD 2bf9e067 PUD 2bf9f067 PMD 0
> [ 150.946012] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 150.946012] CPU 0
> [ 150.946012] Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ext4 jbd2 crc16 joydev ata_piix i2c_piix4 pcspkr uinput ipv6 autofs4 usbhid [last unloaded: scsi_wait_scan]
> [ 150.946012]
> [ 150.946012] Pid: 2764, comm: syscall_thrash Not tainted 3.0.0+ #1 Red Hat KVM
> [ 150.946012] RIP: 0010:[] [] shmem_free_inode+0x18/0x50
> [ 150.946012] RSP: 0018:ffff88002c2e5df8 EFLAGS: 00010282
> [ 150.946012] RAX: 000000004e370d9f RBX: 0000000000000000 RCX: ffff88003a029438
> [ 150.946012] RDX: 0000000033630a5f RSI: 0000000000000000 RDI: ffff88003491c240
> [ 150.946012] RBP: ffff88002c2e5e08 R08: 0000000000000000 R09: 0000000000000000
> [ 150.946012] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003a029428
> [ 150.946012] R13: ffff88003a029428 R14: ffff88003a029428 R15: ffff88003499a610
> [ 150.946012] FS: 00007f5a05420700(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
> [ 150.946012] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 150.946012] CR2: 0000000000000070 CR3: 000000002a662000 CR4: 00000000000006f0
> [ 150.946012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 150.946012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 150.946012] Process syscall_thrash (pid: 2764, threadinfo ffff88002c2e4000, task ffff88002bfbc760)
> [ 150.946012] Stack:
> [ 150.946012]  ffff88003a029438 ffff88003a029428 ffff88002c2e5e38 ffffffff81102f76
> [ 150.946012]  ffff88003a029438 ffff88003a029598 ffffffff8160f9c0 ffff88002c221250
> [ 150.946012]  ffff88002c2e5e68 ffffffff8115e9be ffff88002c2e5e68 ffff88003a029438
> [ 150.946012] Call Trace:
> [ 150.946012]  [] shmem_evict_inode+0x76/0x130
> [ 150.946012]  [] evict+0x7e/0x170
> [ 150.946012]  [] iput_final+0xd0/0x190
> [ 150.946012]  [] iput+0x33/0x40
> [ 150.946012]  [] fsnotify_destroy_mark_locked+0x145/0x160
> [ 150.946012]  [] fsnotify_destroy_mark+0x36/0x50
> [ 150.946012]  [] sys_inotify_rm_watch+0x77/0xd0
> [ 150.946012]  [] system_call_fastpath+0x16/0x1b
> [ 150.946012] Code: 67 4a 00 b8 e4 ff ff ff eb aa 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 8b 9f 40 05 00 00
> [ 150.946012]  83 7b 70 00 74 1c 4c 8d a3 80 00 00 00 4c 89 e7 e8 d2 5d 4a
> [ 150.946012] RIP  [] shmem_free_inode+0x18/0x50
> [ 150.946012]  RSP
> [ 150.946012] CR2: 0000000000000070

Looks an awful lot like the problem from:
http://www.spinics.net/lists/linux-fsdevel/msg46101.html

Latest stress test program attached.....

-Eric
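As a rough illustration of the scheme described in the cover letter above, here is a
minimal userspace sketch (C11 + pthreads) of the idea: each mark added to a group takes
a reference on the group, the lock order is always the group's mark mutex before the
mark's lock, and tearing the group down is an explicit step that strips the marks and
drops the references they held. All names below (struct group, struct mark, add_mark(),
destroy_mark(), destroy_group(), get_group()/put_group()) are illustrative stand-ins,
not the actual fsnotify symbols or the code from the patch series.

/*
 * Minimal userspace analogue (C11 + pthreads) of the refcounting and lock
 * ordering described above -- not the real kernel code, just its shape.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct mark;

struct group {
    atomic_int      refcnt;     /* one ref per mark, plus the creator's ref */
    pthread_mutex_t mark_mutex; /* protects the group's list of marks */
    struct mark    *marks;      /* singly linked list of attached marks */
};

struct mark {
    pthread_mutex_t lock;       /* protects the mark itself */
    struct group   *group;      /* pinned by the ref taken in add_mark() */
    struct mark    *next;
};

static void get_group(struct group *g) { atomic_fetch_add(&g->refcnt, 1); }

static void put_group(struct group *g)
{
    if (atomic_fetch_sub(&g->refcnt, 1) == 1) {
        /* last ref gone: freeing no longer depends on the mark count */
        pthread_mutex_destroy(&g->mark_mutex);
        free(g);
    }
}

static struct group *alloc_group(void)
{
    struct group *g = calloc(1, sizeof(*g));
    atomic_init(&g->refcnt, 1);               /* the creator's reference */
    pthread_mutex_init(&g->mark_mutex, NULL);
    return g;
}

static struct mark *alloc_mark(void)
{
    struct mark *m = calloc(1, sizeof(*m));
    pthread_mutex_init(&m->lock, NULL);
    return m;
}

/* lock order: group->mark_mutex first, then mark->lock */
static void add_mark(struct group *g, struct mark *m)
{
    pthread_mutex_lock(&g->mark_mutex);
    pthread_mutex_lock(&m->lock);
    m->group = g;
    get_group(g);                             /* the mark pins the group */
    m->next = g->marks;
    g->marks = m;
    pthread_mutex_unlock(&m->lock);
    pthread_mutex_unlock(&g->mark_mutex);
}

static void destroy_mark(struct mark *m)
{
    struct group *g = m->group;
    struct mark **p;

    get_group(g);                             /* pin g across the teardown */
    pthread_mutex_lock(&g->mark_mutex);
    pthread_mutex_lock(&m->lock);
    for (p = &g->marks; *p; p = &(*p)->next) {
        if (*p == m) {
            *p = m->next;                     /* unlink from the group list */
            break;
        }
    }
    m->group = NULL;
    pthread_mutex_unlock(&m->lock);
    pthread_mutex_unlock(&g->mark_mutex);
    put_group(g);                             /* the ref the mark held on g */
    put_group(g);                             /* our pin; may now free g */
}

/* explicit teardown in the spirit of fsnotify_destroy_group(): remove the
 * remaining marks, drop the ref each of them held, then the caller's ref */
static void destroy_group(struct group *g)
{
    pthread_mutex_lock(&g->mark_mutex);
    while (g->marks) {
        struct mark *m = g->marks;
        pthread_mutex_lock(&m->lock);
        g->marks = m->next;
        m->group = NULL;
        pthread_mutex_unlock(&m->lock);
        put_group(g);    /* cannot free g here: the caller's ref is held */
    }
    pthread_mutex_unlock(&g->mark_mutex);
    put_group(g);        /* the caller's ref; frees g once it hits zero */
}

The point of the sketch is that destroy_mark() pins the group with its own reference
before taking any locks, so the group cannot be unregistered and freed underneath it,
and that freeing the group is tied purely to the reference count rather than to the
number of remaining marks, which is why the explicit destroy_group() step is needed to
drop the references the marks hold.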