From: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
To: xen-devel <xen-devel@lists.xen.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	David Vrabel <david.vrabel@citrix.com>
Subject: race condition in xen-gntdev (was: Re: gntdev/gntalloc and fork? - crash in gntdev)
Date: Wed, 17 Jun 2015 21:42:11 +0200	[thread overview]
Message-ID: <20150617194211.GB11083@mail-itl> (raw)
In-Reply-To: <20150527234508.GA14838@mail-itl>


[-- Attachment #1.1: Type: text/plain, Size: 9827 bytes --]

On Thu, May 28, 2015 at 01:45:08AM +0200, Marek Marczykowski-Górecki wrote:
> On Thu, Apr 30, 2015 at 04:47:44PM +0200, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > What is the proper way to handle shared pages (on either side - using
> > gntdev or gntalloc) with regard to fork and a possible exec later? The
> > child process does not need to access those pages in any way, but will
> > map different one(s) using a newly opened FD to the gntdev/gntalloc
> > device. Should it unmap them and close the FD to the device manually
> > right after the fork? Or should a process using gntdev or gntalloc
> > avoid fork altogether?
> > 
> > I'm asking because I get a kernel oops[1] in the context of such a
> > process. This process uses both gntdev and gntalloc. The PID reported
> > there is the child, which maps additional pages (using a newly opened
> > FD to /dev/xen/gnt*), but I'm not sure whether the crash happens
> > before, after or at this second mapping (actually a vchan connection),
> > or maybe even at the cleanup of this second mapping. The parent
> > process keeps its mappings for the whole lifetime of its child. I
> > don't have a 100% reliable way to reproduce this problem, but it
> > happens quite often when I run such operations in a loop.
> 
> Any ideas?
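
To recap, the pattern that triggers this looks roughly like the sketch
below. It is a simplified illustration, not the actual qrexec-agent code;
map_shared_pages() is a hypothetical stand-in for the gntdev mapping /
vchan setup done over the new FD:

	#include <fcntl.h>
	#include <sys/types.h>
	#include <unistd.h>

	/* hypothetical helper: maps some grant pages through the given
	 * gntdev FD (stands in for the real vchan setup) */
	extern void map_shared_pages(int gntdev_fd);

	static void spawn_child_connection(void)
	{
		pid_t pid = fork();

		if (pid == 0) {
			/* child: never touches the mappings inherited from
			 * the parent, only maps different pages through a
			 * freshly opened FD (error handling omitted) */
			int fd = open("/dev/xen/gntdev", O_RDWR);

			map_shared_pages(fd);
			/* ... talk over the new connection ... */
			close(fd);
			_exit(0);	/* child's own gntdev FD is released here */
		}

		/* parent: keeps its own gntdev/gntalloc mappings mapped for
		 * the whole lifetime of the child */
	}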

I've done some further debugging; below is what I found.

> > The kernel is vanilla 3.19.3, running on Xen 4.4.2.
> > 
> > The kernel message:
> > [74376.073464] general protection fault: 0000 [#1] SMP 
> > [74376.073475] Modules linked in: fuse xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr xen_netfront ghash_clmulni_intel nfsd auth_rpcgss nfs_acl lockd grace xenfs xen_privcmd dummy_hcd udc_core xen_gntdev xen_gntalloc xen_blkback sunrpc u2mfn(O) xen_evtchn xen_blkfront
> > [74376.073522] CPU: 1 PID: 9377 Comm: qrexec-agent Tainted: G           O   3.19.3-4.pvops.qubes.x86_64 #1
> > [74376.073528] task: ffff880002442e40 ti: ffff88000032c000 task.ti: ffff88000032c000
> > [74376.073532] RIP: e030:[<ffffffffa00952c5>]  [<ffffffffa00952c5>] unmap_if_in_range+0x15/0xd0 [xen_gntdev]

	static void unmap_if_in_range(struct grant_map *map,
				      unsigned long start, unsigned long end)
	{
		unsigned long mstart, mend;
		int err;

		if (!map->vma)
			return;

The above crash happens at the first access to "map"...

> > [74376.073543] RSP: e02b:ffff88000032fc08  EFLAGS: 00010292
> > [74376.073546] RAX: 0000000000000000 RBX: dead000000100100 RCX: 00007fd8616ea000
> > [74376.073550] RDX: 00007fd8616ea000 RSI: 00007fd8616e9000 RDI: dead000000100100

... which is 0xdead000000100100 (LIST_POISON1).
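
For reference, this is how list_del() poisons the removed entry (quoting
include/linux/list.h and include/linux/poison.h from memory, so treat the
exact definitions as an assumption):

	/* include/linux/poison.h - with
	 * CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000 on x86_64 this
	 * gives exactly the 0xdead000000100100 seen in RDI/RBX above: */
	#define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
	#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
	#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)

	/* include/linux/list.h - list_del() leaves the poison values in
	 * the removed entry, so dereferencing such an entry faults: */
	static inline void list_del(struct list_head *entry)
	{
		__list_del(entry->prev, entry->next);
		entry->next = LIST_POISON1;
		entry->prev = LIST_POISON2;
	}

Since "next" is (if I read gntdev.c right) the first member of struct
grant_map, the list iterator hands us the poison value itself as "map",
and the very first dereference (map->vma) faults.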


> > [74376.073554] RBP: ffff88000032fc48 R08: 0000000000000000 R09: 0000000000000000
> > [74376.073557] R10: ffffea000021bb00 R11: 0000000000000000 R12: 00007fd8616e9000
> > [74376.073561] R13: 00007fd8616ea000 R14: ffff880012702e40 R15: ffff880012702e70
> > [74376.073569] FS:  00007fd8616ca700(0000) GS:ffff880013c80000(0000) knlGS:0000000000000000
> > [74376.073574] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [74376.073577] CR2: 00007fd8616e9458 CR3: 00000000e7af5000 CR4: 0000000000042660
> > [74376.073582] Stack:
> > [74376.073584]  ffff8800188356c0 00000000000000d0 ffff88000032fc68 00000000c64ef797
> > [74376.073590]  0000000000000220 dead000000100100 00007fd8616e9000 00007fd8616ea000
> > [74376.073596]  ffff88000032fc88 ffffffffa00953c6 ffff88000032fcc8 ffff880012702e70
> > [74376.073603] Call Trace:
> > [74376.073610]  [<ffffffffa00953c6>] mn_invl_range_start+0x46/0x90 [xen_gntdev]
> > [74376.073620]  [<ffffffff811e88fb>] __mmu_notifier_invalidate_range_start+0x5b/0x90
> > [74376.073627]  [<ffffffff811c2a59>] do_wp_page+0x769/0x820
> > [74376.074031]  [<ffffffff811c4f5c>] handle_mm_fault+0x7fc/0x10c0
> > [74376.074031]  [<ffffffff813864cd>] ? radix_tree_lookup+0xd/0x10
> > [74376.074031]  [<ffffffff81061e1c>] __do_page_fault+0x1dc/0x5a0
> > [74376.074031]  [<ffffffff817560a6>] ? mutex_lock+0x16/0x37
> > [74376.074031]  [<ffffffffa0008928>] ? evtchn_ioctl+0x118/0x3c0 [xen_evtchn]
> > [74376.074031]  [<ffffffff812209d8>] ? do_vfs_ioctl+0x2f8/0x4f0
> > [74376.074031]  [<ffffffff811cafdf>] ? do_munmap+0x29f/0x3b0
> > [74376.074031]  [<ffffffff81062211>] do_page_fault+0x31/0x70
> > [74376.074031]  [<ffffffff81759e28>] page_fault+0x28/0x30


	static void mn_invl_range_start(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start, unsigned long end)
	{
		struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
		struct grant_map *map;

		spin_lock(&priv->lock);
		list_for_each_entry(map, &priv->maps, next) {
			unmap_if_in_range(map, start, end);
		}

mn_invl_range_start+0x46 is the first call to unmap_if_in_range, so
something is wrong with the priv->maps list.

So I've searched for all the list_del() calls on this list and found one
that is not guarded by the spinlock:

	static int gntdev_release(struct inode *inode, struct file *flip)
	{
		struct gntdev_priv *priv = flip->private_data;
		struct grant_map *map;

		pr_debug("priv %p\n", priv);

		while (!list_empty(&priv->maps)) {
			map = list_entry(priv->maps.next, struct grant_map, next);
			list_del(&map->next);
			gntdev_put_map(NULL /* already removed */, map);
		}
		WARN_ON(!list_empty(&priv->freeable_maps));

		if (use_ptemod)
			mmu_notifier_unregister(&priv->mn, priv->mm);
		kfree(priv);
		return 0;
	}

So I see the sequence of events as:

P1 (parent)                P2 (child)
                           1. gntdev_release() called
                           2. priv->maps entries unlinked (the loop above)

3. page fault occurs, the gntdev mmu notifier is called
4. priv->lock taken
5. iterate over priv->maps
6. crash, since the map entry has already been unlinked (LIST_POISON1)

                           7. mmu_notifier_unregister() calls
                           8.   mn_release(), which tries to take priv->lock
                           9. this process hangs

So I'd guess the fix would be to move the mmu_notifier_unregister() call
before the teardown of the priv->maps list. Could someone more familiar
with this code confirm this?
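
Something like the below (untested, just to illustrate the idea) -
unregister the notifier first, so that no notifier callback can walk
priv->maps while it is being torn down:

	static int gntdev_release(struct inode *inode, struct file *flip)
	{
		struct gntdev_priv *priv = flip->private_data;
		struct grant_map *map;

		pr_debug("priv %p\n", priv);

		/* make sure mn_invl_range_start()/mn_release() can no
		 * longer run against this priv before unlinking the maps */
		if (use_ptemod)
			mmu_notifier_unregister(&priv->mn, priv->mm);

		while (!list_empty(&priv->maps)) {
			map = list_entry(priv->maps.next, struct grant_map, next);
			list_del(&map->next);
			gntdev_put_map(NULL /* already removed */, map);
		}
		WARN_ON(!list_empty(&priv->freeable_maps));

		kfree(priv);
		return 0;
	}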

> > [74376.074031] Code: e9 dd fd ff ff 31 c9 31 db e9 20 fe ff ff 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 89 fb 48 83 ec 28 <48> 8b 47 10 48 85 c0 74 4e 4c 8b 00 49 39 d0 73 46 4c 8b 48 08
> > [74376.074031] RIP  [<ffffffffa00952c5>] unmap_if_in_range+0x15/0xd0 [xen_gntdev]
> > [74376.074031]  RSP <ffff88000032fc08>
> > [74376.091682] ---[ end trace 2b21c5b714eb1071 ]---

And the other process:

> > [74404.069009] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [qrexec-agent:9379]
> > [74404.069009] Modules linked in: fuse xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr xen_netfront ghash_clmulni_intel nfsd auth_rpcgss nfs_acl lockd grace xenfs xen_privcmd dummy_hcd udc_core xen_gntdev xen_gntalloc xen_blkback sunrpc u2mfn(O) xen_evtchn xen_blkfront
> > [74404.069009] CPU: 2 PID: 9379 Comm: qrexec-agent Tainted: G      D O   3.19.3-4.pvops.qubes.x86_64 #1
> > [74404.069009] task: ffff880010e24a00 ti: ffff880002470000 task.ti: ffff880002470000
> > [74404.069009] RIP: e030:[<ffffffff81757b11>]  [<ffffffff81757b11>] _raw_spin_lock+0x21/0x30
> > [74404.069009] RSP: e02b:ffff880002473e18  EFLAGS: 00000297
> > [74404.069009] RAX: 0000000000000040 RBX: ffff880002345c00 RCX: 0000000000018cf8
> > [74404.069009] RDX: 0000000000000041 RSI: ffff880002345c00 RDI: ffff880012702e60
> > [74404.069009] RBP: ffff880002473e18 R08: ffff880012702240 R09: 00000001802a0019
> > [74404.069009] R10: ffffea000049c080 R11: ffffffffa00955bf R12: ffff880012702e70
> > [74404.069009] R13: ffff880012702e40 R14: ffff8800132c6f20 R15: ffff880012b163c0
> > [74404.069009] FS:  00007fd8616ca700(0000) GS:ffff880013d00000(0000) knlGS:0000000000000000
> > [74404.069009] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [74404.069009] CR2: 00007fd8610be098 CR3: 000000000b971000 CR4: 0000000000042660
> > [74404.069009] Stack:
> > [74404.069009]  ffff880002473e48 ffffffffa0095452 ffff880002473e48 ffff880002345c00
> > [74404.069009]  ffff880012702e70 0000000000000000 ffff880002473e78 ffffffff811e8c2e
> > [74404.069009]  ffff880002473e78 ffff880012702e40 ffff880012702e40 ffff880012d123c8
> > [74404.069009] Call Trace:
> > [74404.069009]  [<ffffffffa0095452>] mn_release+0x22/0x130 [xen_gntdev]
> > [74404.069009]  [<ffffffff811e8c2e>] mmu_notifier_unregister+0x4e/0xe0
> > [74404.069009]  [<ffffffffa00957c0>] gntdev_release+0x60/0xa0 [xen_gntdev]
> > [74404.069009]  [<ffffffff8120ec0f>] __fput+0xdf/0x1e0
> > [74404.069009]  [<ffffffff8120ed5e>] ____fput+0xe/0x10
> > [74404.069009]  [<ffffffff810b56df>] task_work_run+0xbf/0x100
> > [74404.069009]  [<ffffffff81014c47>] do_notify_resume+0x97/0xb0
> > [74404.069009]  [<ffffffff81758127>] int_signal+0x12/0x17
> > [74404.069009] Code: 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 b8 00 01 00 00 f0 66 0f c1 07 0f b6 d4 38 c2 75 04 5d c3 f3 90 0f b6 07 <38> d0 75 f7 5d c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

[-- Attachment #1.2: Type: application/pgp-signature, Size: 473 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Thread overview: 14+ messages
2015-04-30 14:47 gntdev/gntalloc and fork Marek Marczykowski-Górecki
2015-05-27 23:45 ` gntdev/gntalloc and fork? - crash in gntdev Marek Marczykowski-Górecki
2015-06-17 19:42   ` Marek Marczykowski-Górecki [this message]
2015-06-22 17:46     ` race condition in xen-gntdev (was: Re: gntdev/gntalloc and fork? - crash in gntdev) Konrad Rzeszutek Wilk
2015-06-22 18:13       ` Marek Marczykowski-Górecki
2015-06-22 18:37         ` Konrad Rzeszutek Wilk
2015-06-22 19:14           ` race condition in xen-gntdev Daniel De Graaf
2015-06-26  1:28             ` Marek Marczykowski-Górecki
2015-06-29 14:39               ` Konrad Rzeszutek Wilk
2015-06-29 14:50                 ` Marek Marczykowski-Górecki
2015-07-22  3:21                   ` Marek Marczykowski-Górecki
2015-07-22 13:58                     ` Konrad Rzeszutek Wilk
2015-06-30 15:26               ` David Vrabel
2015-06-30 15:51                 ` Marek Marczykowski-Górecki
