* [PATCH] Btrfs: fix scrub race leading to use-after-free
@ 2015-01-27 16:11 Filipe Manana
  2015-01-27 23:06 ` [PATCH v2] " Filipe Manana
  2015-02-09 19:07 ` [PATCH] " Chris Mason
  0 siblings, 2 replies; 3+ messages in thread
From: Filipe Manana @ 2015-01-27 16:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
the following trace:

[68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
[68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
[68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
[68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
[68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
[68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
[68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
[68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
[68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
[68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
[68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
[68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
[68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
[68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
[68127.807663] Stack:
[68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
[68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
[68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
[68127.807663] Call Trace:
[68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
[68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
[68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
[68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
[68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
[68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
[68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
[68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
[68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
[68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
[68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
[68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
[68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
[68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
[68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663]  RSP <ffff8803f097fbb8>
[68127.807663] CR2: ffff8803f8947a50
[68127.807663] ---[ end trace d7045aac00a66cd8 ]---

This is due to a very tiny time window: after wake_up() has woken the task
waiting in btrfs_scrub_dev() but before it releases the wait queue's lock,
the scrub context object (sctx), which embeds the wait queue, is freed by
the main scrub task:

         CPU 1                                                    CPU 2

                                                          btrfs_scrub_dev()
                                                             wait sctx->bios_in_flight == 0
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       __wake_up()
          spin_lock_irqsave(&sctx->list_wait->lock, flags)
          default_wake_function()
              wake up task at CPU 2
                                                             wait sctx->workers_pending == 0
                                                             mutex_lock(&fs_info->scrub_lock)
                                                             (...)
                                                             mutex_unlock(&fs_info->scrub_lock)
                                                             scrub_free_ctx(sctx)
                                                                kfree(sctx)
          spin_unlock_irqrestore(&sctx->list_wait->lock, flags)

Fix this by holding the scrub lock while doing the wakeup.
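
In code terms, the waiter side at the end of btrfs_scrub_dev() is roughly the
following (a simplified sketch reconstructed from the diagram above, with
unrelated details elided):

  wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
  /* ... */
  wait_event(sctx->list_wait, atomic_read(&sctx->workers_pending) == 0);

  mutex_lock(&fs_info->scrub_lock);
  /* ... detach the scrub context from the device, stop the workers ... */
  mutex_unlock(&fs_info->scrub_lock);

  scrub_free_ctx(sctx);  /* frees the struct that embeds list_wait */

Once the waiter is runnable, nothing stops it from reaching scrub_free_ctx()
except the mutex_lock() on fs_info->scrub_lock, which is why holding that
same lock across the wakeup closes the window.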

This isn't a recent regression; the issue has been around since the scrub
feature was added (2011, commit a2de733c78fa7af51ba9670482fa7d392aa67c57).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/scrub.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 8f75b99..8252a5e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -306,8 +306,17 @@ static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
 
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx)
 {
+	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
+
+	/*
+	 * Hold the scrub_lock while doing the wakeup to ensure the
+	 * sctx (and its wait queue list_wait) isn't destroyed/freed
+	 * during the wakeup.
+	 */
+	mutex_lock(&fs_info->scrub_lock);
 	atomic_dec(&sctx->bios_in_flight);
 	wake_up(&sctx->list_wait);
+	mutex_unlock(&fs_info->scrub_lock);
 }
 
 static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
@@ -379,10 +388,15 @@ static void scrub_pending_trans_workers_dec(struct scrub_ctx *sctx)
 	mutex_lock(&fs_info->scrub_lock);
 	atomic_dec(&fs_info->scrubs_running);
 	atomic_dec(&fs_info->scrubs_paused);
-	mutex_unlock(&fs_info->scrub_lock);
 	atomic_dec(&sctx->workers_pending);
 	wake_up(&fs_info->scrub_pause_wait);
+	/*
+	 * Hold the scrub_lock while doing the wakeup to ensure the
+	 * sctx (and its wait queue list_wait) isn't destroyed/freed
+	 * during the wakeup.
+	 */
 	wake_up(&sctx->list_wait);
+	mutex_unlock(&fs_info->scrub_lock);
 }
 
 static void scrub_free_csums(struct scrub_ctx *sctx)
-- 
2.1.3



* [PATCH v2] Btrfs: fix scrub race leading to use-after-free
  2015-01-27 16:11 [PATCH] Btrfs: fix scrub race leading to use-after-free Filipe Manana
@ 2015-01-27 23:06 ` Filipe Manana
  2015-02-09 19:07 ` [PATCH] " Chris Mason
  1 sibling, 0 replies; 3+ messages in thread
From: Filipe Manana @ 2015-01-27 23:06 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
the following trace:

[68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
[68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
[68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
[68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
[68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
[68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
[68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
[68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
[68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
[68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
[68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
[68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
[68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
[68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
[68127.807663] Stack:
[68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
[68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
[68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
[68127.807663] Call Trace:
[68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
[68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
[68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
[68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
[68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
[68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
[68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
[68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
[68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
[68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
[68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
[68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
[68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
[68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
[68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663]  RSP <ffff8803f097fbb8>
[68127.807663] CR2: ffff8803f8947a50
[68127.807663] ---[ end trace d7045aac00a66cd8 ]---

This is due to a race that can happen in a very tiny time window and is
illustrated by the following sequence diagram:

         CPU 1                                                    CPU 2

                                                          btrfs_scrub_dev()
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       atomic_dec(&sctx->bios_in_flight)
                                                             wait sctx->bios_in_flight == 0
                                                             wait sctx->workers_pending == 0
                                                             mutex_lock(&fs_info->scrub_lock)
                                                             (...)
                                                             mutex_unlock(&fs_info->scrub_lock)
                                                             scrub_free_ctx(sctx)
                                                                kfree(sctx)
       wake_up(&sctx->list_wait)
          __wake_up()
              spin_lock_irqsave(&sctx->list_wait->lock, flags)

Another variation of this scenario that results in the same use-after-free
issue is:

         CPU 1                                                    CPU 2

                                                          btrfs_scrub_dev()
                                                             wait sctx->bios_in_flight == 0
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       __wake_up(&sctx->list_wait)
          spin_lock_irqsave(&sctx->list_wait->lock, flags)
          default_wake_function()
              wake up task at CPU 2
                                                             wait sctx->workers_pending == 0
                                                             mutex_lock(&fs_info->scrub_lock)
                                                             (...)
                                                             mutex_unlock(&fs_info->scrub_lock)
                                                             scrub_free_ctx(sctx)
                                                                kfree(sctx)
          spin_unlock_irqrestore(&sctx->list_wait->lock, flags)

Fix this by holding the scrub lock while doing the wakeup.
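
In code terms, both windows live in the pre-patch scrub_pending_bio_dec()
(its two statements are the unchanged context lines in the hunk below),
annotated against the diagrams above:

  static void scrub_pending_bio_dec(struct scrub_ctx *sctx)
  {
          /* 1st diagram: once this decrement is visible, btrfs_scrub_dev()
           * can run to completion and free sctx before the wakeup below. */
          atomic_dec(&sctx->bios_in_flight);
          /* 2nd diagram: the wakeup makes the waiter runnable, so sctx can
           * be freed before __wake_up() releases the wait queue's internal
           * spinlock again. */
          wake_up(&sctx->list_wait);
  }

Holding fs_info->scrub_lock across both statements closes both windows,
because btrfs_scrub_dev() must take that same lock before it can reach
scrub_free_ctx().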

This isn't a recent regression; the issue has been around since the scrub
feature was added (2011, commit a2de733c78fa7af51ba9670482fa7d392aa67c57).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: No code changes; only updated the commit message's sequence diagram and
    added a second one for a slightly different case.

 fs/btrfs/scrub.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 8f75b99..8252a5e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -306,8 +306,17 @@ static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
 
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx)
 {
+	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
+
+	/*
+	 * Hold the scrub_lock while doing the wakeup to ensure the
+	 * sctx (and its wait queue list_wait) isn't destroyed/freed
+	 * during the wakeup.
+	 */
+	mutex_lock(&fs_info->scrub_lock);
 	atomic_dec(&sctx->bios_in_flight);
 	wake_up(&sctx->list_wait);
+	mutex_unlock(&fs_info->scrub_lock);
 }
 
 static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
@@ -379,10 +388,15 @@ static void scrub_pending_trans_workers_dec(struct scrub_ctx *sctx)
 	mutex_lock(&fs_info->scrub_lock);
 	atomic_dec(&fs_info->scrubs_running);
 	atomic_dec(&fs_info->scrubs_paused);
-	mutex_unlock(&fs_info->scrub_lock);
 	atomic_dec(&sctx->workers_pending);
 	wake_up(&fs_info->scrub_pause_wait);
+	/*
+	 * Hold the scrub_lock while doing the wakeup to ensure the
+	 * sctx (and its wait queue list_wait) isn't destroyed/freed
+	 * during the wakeup.
+	 */
 	wake_up(&sctx->list_wait);
+	mutex_unlock(&fs_info->scrub_lock);
 }
 
 static void scrub_free_csums(struct scrub_ctx *sctx)
-- 
2.1.3



* Re: [PATCH] Btrfs: fix scrub race leading to use-after-free
  2015-01-27 16:11 [PATCH] Btrfs: fix scrub race leading to use-after-free Filipe Manana
  2015-01-27 23:06 ` [PATCH v2] " Filipe Manana
@ 2015-02-09 19:07 ` Chris Mason
  1 sibling, 0 replies; 3+ messages in thread
From: Chris Mason @ 2015-02-09 19:07 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On Tue, Jan 27, 2015 at 11:11 AM, Filipe Manana <fdmanana@suse.com> wrote:
> While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
> the following trace:

This actually trades one bug for another:

[ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621
[ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress
[ 1928.981324] INFO: lockdep is turned off.
[ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: G        W      3.19.0-rc7-mason+ #41
[ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 1929.022207]  ffffffff81a22cf8 ffff881076e03b78 ffffffff816b8dd9 ffff881076e03b78
[ 1929.037267]  ffff880d8e828710 ffff881076e03ba8 ffffffff810856c4 ffff881076e03bc8
[ 1929.052315]  0000000000000000 000000000000026d ffffffff81a22cf8 ffff881076e03bd8
[ 1929.067381] Call Trace:
[ 1929.072344]  <IRQ>  [<ffffffff816b8dd9>] dump_stack+0x4f/0x6e
[ 1929.083968]  [<ffffffff810856c4>] ___might_sleep+0x174/0x230
[ 1929.095352]  [<ffffffff810857d2>] __might_sleep+0x52/0x90
[ 1929.106223]  [<ffffffff816bb68f>] mutex_lock_nested+0x2f/0x3b0
[ 1929.117951]  [<ffffffff810ab37d>] ? trace_hardirqs_on+0xd/0x10
[ 1929.129708]  [<ffffffffa05dc838>] scrub_pending_bio_dec+0x38/0x70 [btrfs]
[ 1929.143370]  [<ffffffffa05dd0e0>] scrub_parity_bio_endio+0x50/0x70 [btrfs]
[ 1929.157191]  [<ffffffff812fa603>] bio_endio+0x53/0xa0
[ 1929.167382]  [<ffffffffa05f96bc>] rbio_orig_end_io+0x7c/0xa0 [btrfs]
[ 1929.180161]  [<ffffffffa05f97ba>] raid_write_parity_end_io+0x5a/0x80 [btrfs]
[ 1929.194318]  [<ffffffff812fa603>] bio_endio+0x53/0xa0
[ 1929.204496]  [<ffffffff8130401b>] blk_update_request+0x1eb/0x450
[ 1929.216569]  [<ffffffff81096e58>] ? trigger_load_balance+0x78/0x500
[ 1929.229176]  [<ffffffff8144c74d>] scsi_end_request+0x3d/0x1f0
[ 1929.240740]  [<ffffffff8144ccac>] scsi_io_completion+0xac/0x5b0
[ 1929.252654]  [<ffffffff81441c50>] scsi_finish_command+0xf0/0x150
[ 1929.264725]  [<ffffffff8144d317>] scsi_softirq_done+0x147/0x170
[ 1929.276635]  [<ffffffff8130ace6>] blk_done_softirq+0x86/0xa0
[ 1929.288014]  [<ffffffff8105d92e>] __do_softirq+0xde/0x600
[ 1929.298885]  [<ffffffff8105df6d>] irq_exit+0xbd/0xd0
[ 1929.308879]  [<ffffffff81034ea5>] smp_call_function_single_interrupt+0x35/0x40
[ 1929.323455]  [<ffffffff816c126f>] call_function_single_interrupt+0x6f/0x80
[ 1929.337270]  <EOI>  [<ffffffff811fc745>] ? sync_inodes_sb+0x1b5/0x2a0
[ 1929.350261]  [<ffffffff811fc728>] ? sync_inodes_sb+0x198/0x2a0
[ 1929.361991]  [<ffffffff816badcf>] ? wait_for_completion+0xef/0x120
[ 1929.374423]  [<ffffffff812028d0>] ? fdatawrite_one_bdev+0x20/0x20
[ 1929.386671]  [<ffffffff812028d0>] ? fdatawrite_one_bdev+0x20/0x20
[ 1929.398930]  [<ffffffff812028ed>] sync_inodes_one_sb+0x1d/0x30
[ 1929.410668]  [<ffffffff811cf4c6>] iterate_supers+0xb6/0xf0
[ 1929.421712]  [<ffffffff81202935>] sys_sync+0x35/0x90
[ 1929.431704]  [<ffffffff816bfed2>] system_call_fastpath+0x12/0x17

So we'll have to either put in a refcount or a spinlock instead.
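
A rough sketch of the refcount variant (scrub_put_ctx() and the refs field
are made-up names here, just to illustrate the idea): pin sctx for as long
as a bio is in flight, so the completion side never touches freed memory
and never has to sleep:

  static void scrub_put_ctx(struct scrub_ctx *sctx)
  {
          if (atomic_dec_and_test(&sctx->refs))
                  scrub_free_ctx(sctx);
  }

  static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
  {
          atomic_inc(&sctx->refs);           /* bio pins sctx */
          atomic_inc(&sctx->bios_in_flight);
  }

  static void scrub_pending_bio_dec(struct scrub_ctx *sctx)
  {
          atomic_dec(&sctx->bios_in_flight);
          wake_up(&sctx->list_wait);         /* sctx still pinned here */
          scrub_put_ctx(sctx);               /* atomic-safe, may free */
  }

with btrfs_scrub_dev() dropping its own initial reference via scrub_put_ctx()
instead of calling scrub_free_ctx() directly.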

-chris


