From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mx2.suse.de ([195.135.220.15]:40660 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758677AbcCVMVu (ORCPT ); Tue, 22 Mar 2016 08:21:50 -0400
Date: Tue, 22 Mar 2016 13:21:19 +0100
From: David Sterba
To: Anand Jain
Cc: dsterba@suse.cz, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 13/13] btrfs: optimize check for stale device
Message-ID: <20160322122119.GJ8095@twin.jikos.cz>
Reply-To: dsterba@suse.cz
References: <1455328900-1476-1-git-send-email-anand.jain@oracle.com>
 <1455328900-1476-14-git-send-email-anand.jain@oracle.com>
 <20160218151348.GY4374@twin.jikos.cz>
 <56C6BFD8.30801@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <56C6BFD8.30801@oracle.com>
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On Fri, Feb 19, 2016 at 03:10:16PM +0800, Anand Jain wrote:
> > I see crashes with btrfs/011 on a non-debugging config
> >
> > [ 641.714363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
> > [ 641.716057] IP: [] scrub_setup_ctx.isra.19+0x1f6/0x260 [btrfs]
> > [ 641.717036] PGD 720c1067 PUD 720c2067 PMD 0
> > [ 641.717749] Oops: 0000 [#1] PREEMPT SMP
> ::
> > [ 641.723163] CPU: 0 PID: 27766 Comm: btrfs Not tainted 4.5.0-rc3-next-20160212-1.g38290f0-vanilla #1
> > [ 641.724420] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
> > [ 641.725723] task: ffff8800742481c0 ti: ffff880071d10000 task.ti: ffff880071d10000
> > [ 641.726954] RIP: 0010:[] [] scrub_setup_ctx.isra.19+0x1f6/0x260 [btrfs]
> > [ 641.728404] RSP: 0018:ffff880071d13ce8 EFLAGS: 00010202
> > [ 641.729413] RAX: ffff88007231e800 RBX: ffff88007231e800 RCX: 0000000000000000
> > [ 641.730610] RDX: ffffffffa0195638 RSI: ffffffffa017c5a8 RDI: ffff88007231ea80
> > [ 641.731832] RBP: ffff880071d13d18 R08: 0000000000000000 R09: ffff88007204ea00
> > [ 641.733085] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000000
> > [ 641.734307] R13: 0000000000000001 R14: ffff88007231e9f8 R15: 000000000000003f
> > [ 641.735544] FS: 00007f03ed36d8c0(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> > [ 641.736883] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 641.738022] CR2: 0000000000000068 CR3: 00000000720c0000 CR4: 00000000000006f0
> > [ 641.739325] Stack:
> > [ 641.740156] ffff8800724d4000 ffff8800724d4000 0000000000000000 ffff8800722ef000
> > [ 641.741735] 0000000000000000 ffff8800724d4fc8 ffff880071d13d98 ffffffffa01566fd
> > [ 641.743163] ffff88007b127000 0000001900000000 ffff8800724d4ce8 0000000000000000
> > [ 641.744599] Call Trace:
> > [ 641.745553] [] btrfs_scrub_dev+0x13d/0x510 [btrfs]
> > [ 641.746894] [] btrfs_dev_replace_start+0x279/0x3f0 [btrfs]
> > [ 641.748282] [] btrfs_ioctl+0x1869/0x2070 [btrfs]
> > [ 641.749587] [] ? pte_alloc_one+0x33/0x40
> > [ 641.750850] [] do_vfs_ioctl+0x96/0x590
> > [ 641.752128] [] ? __do_page_fault+0x181/0x450
> > [ 641.753432] [] SyS_ioctl+0x79/0x90
> > [ 641.754663] [] entry_SYSCALL_64_fastpath+0x1e/0xa8
> > [ 641.756037] Code: 00 48 c7 c2 38 56 19 a0 48 c7 c6 a8 c5 17 a0 e8 21 39 f7 e0 45 85 ed 48 c7 83 68 02 00 00 00 00 00 00 48 89 d8 0f 84 03 ff ff ff <49> 83 7c 24 68 00 74 40 c7 83 78 02 00 00 20 00 00 00 4c 89 a3
> > [ 641.760392] RIP [] scrub_setup_ctx.isra.19+0x1f6/0x260 [btrfs]
> > [ 641.761970] RSP
> > [ 641.763190] CR2: 0000000000000068
> > [ 641.767218] ---[ end trace f46d4e6a90bda310 ]---
> >
> > the dereference happens at offset 0x68 which matches bdev in
> > btrfs_device, so this patch is my best guess at the moment. I'm not able
> > to reproduce it directly so I need to wait for a rebuild and repeat.
> >
> Looks like dev was fine when find_device was called, but
> later it was null when ->bdev was accessed.
>
> I couldn't reproduce here. There are 10 workouts within btrfs/011
> any idea workout caused this? As of now I am guessing..
>
> workout "-m dup -d single" 1 cancel quick
>
> digging more.

I have not been able to reproduce the crash since. Everything is fine on
a physical machine. In a virtual machine under kvm the test runs for a
long time and then the guest freezes (both serial console and ssh). The
kvm process eats 100% cpu, so it's not possible to debug it directly.

The branch stays in my for-next and is on the way to 4.7, we'll see if
we can reproduce it.
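
For readers following the oops quoted above: the argument that a fault at
address 0x68 points at ->bdev can be sketched in C. With a NULL base
pointer, the address touched by a member read is simply the member's
offset. In the dump R12 is 0 and the marked instruction bytes
(<49> 83 7c 24 68 00) appear to compare an 8-byte value at 0x68(%r12)
against zero, which would fit CR2 = 0x68. The struct below is a
hypothetical stand-in, not the real struct btrfs_device; its padding is
chosen only so that bdev lands at 0x68, and the helper name is made up
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct btrfs_device. The real layout depends
 * on the kernel version and config; the padding here is an assumption made
 * only so that 'bdev' sits at offset 0x68, the offset named in the thread.
 */
struct demo_device {
	unsigned char other_fields[0x68];	/* placeholder for earlier members */
	void *bdev;				/* member read by the faulting code */
};

/* Compile-time check that the illustrative layout matches the assumption. */
static_assert(offsetof(struct demo_device, bdev) == 0x68,
	      "bdev must sit at offset 0x68 in this sketch");

/*
 * Address the CPU would touch when reading dev->bdev: base pointer plus
 * member offset. Done with integers so no invalid pointer is dereferenced.
 */
static inline uintptr_t bdev_access_address(uintptr_t dev_base)
{
	return dev_base + offsetof(struct demo_device, bdev);
}
```

Under these assumptions bdev_access_address(0) evaluates to 0x68, the
same value the oops reports in CR2 when the device pointer is NULL. On a
real build, the actual offset would have to be checked against the
matching kernel binary and config.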