* [RFC] bcache: fix a circular deadlock between dc->writeback_lock and bch_register_lock
@ 2017-11-18  8:14 ` Coly Li
  0 siblings, 0 replies; 6+ messages in thread
From: Coly Li @ 2017-11-18  8:14 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When bcache is in writeback mode, and with heavy write I/O, a warning by
lockdep check reports a potential circular locking issue,

[   58.084940] ======================================================
[   58.084941] WARNING: possible circular locking dependency detected
[   58.084942] 4.14.0-1-default+ #3 Tainted: G        W
[   58.084943] ------------------------------------------------------
[   58.084944] kworker/0:3/1140 is trying to acquire lock:
[   58.084945]  (&bch_register_lock){+.+.}, at: [<ffffffffa069a29b>] update_writeback_rate+0x8b/0x290 [bcache]
[   58.084958]
               but task is already holding lock:
[   58.084958]  (&dc->writeback_lock){++++}, at: [<ffffffffa069a22f>] update_writeback_rate+0x1f/0x290 [bcache]
[   58.084966]
               which lock already depends on the new lock.

[   58.084966]
               the existing dependency chain (in reverse order) is:
[   58.084967]
               -> #1 (&dc->writeback_lock){++++}:
[   58.084972]        down_write+0x51/0xb0
[   58.084978]        bch_cached_dev_attach+0x239/0x500 [bcache]
[   58.084983]        run_cache_set+0x683/0x880 [bcache]
[   58.084987]        register_bcache+0xec7/0x1450 [bcache]
[   58.084990]        kernfs_fop_write+0x10e/0x1a0
[   58.084994]        __vfs_write+0x23/0x150
[   58.084995]        vfs_write+0xc2/0x1c0
[   58.084996]        SyS_write+0x45/0xa0
[   58.084997]        entry_SYSCALL_64_fastpath+0x23/0x9a
[   58.084998]
               -> #0 (&bch_register_lock){+.+.}:
[   58.085002]        lock_acquire+0xd4/0x220
[   58.085003]        __mutex_lock+0x70/0x950
[   58.085009]        update_writeback_rate+0x8b/0x290 [bcache]
[   58.085011]        process_one_work+0x1e5/0x5e0
[   58.085012]        worker_thread+0x4a/0x3f0
[   58.085014]        kthread+0x141/0x180
[   58.085015]        ret_from_fork+0x24/0x30
[   58.085015]
               other info that might help us debug this:

[   58.085015]  Possible unsafe locking scenario:

[   58.085016]        CPU0                    CPU1
[   58.085016]        ----                    ----
[   58.085016]   lock(&dc->writeback_lock);
[   58.085017]                                lock(&bch_register_lock);
[   58.085018]                                lock(&dc->writeback_lock);
[   58.085019]   lock(&bch_register_lock);
[   58.085019]
                *** DEADLOCK ***

This is a real circular locking issue. It may hold dc->writeback_lock
for a long time, block btree-related operations, and introduce long
latency for front-end I/O requests on the cache device.

The code path of bch_cached_dev_attach() first acquires
bch_register_lock and then acquires dc->writeback_lock, while the code
path of the kworker function update_writeback_rate() first acquires
dc->writeback_lock and then acquires bch_register_lock.
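
A minimal sketch of the two conflicting acquisition orders (simplified,
not the literal bcache call sites; per the lockdep trace above,
dc->writeback_lock is an rwsem and bch_register_lock is a mutex):

        /* Path 1: register_bcache() -> run_cache_set()
         *         -> bch_cached_dev_attach() */
        mutex_lock(&bch_register_lock);         /* taken first  */
        down_write(&dc->writeback_lock);        /* taken second */

        /* Path 2: kworker update_writeback_rate() */
        down_read(&dc->writeback_lock);         /* taken first  */
        mutex_lock(&bch_register_lock);         /* taken second */

If path 2 takes dc->writeback_lock while path 1 already holds
bch_register_lock, each side then waits forever for the lock the other
one holds.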

In the kworker function update_writeback_rate(), dc->writeback_lock is
acquired before calling __update_writeback_rate(). After reading the
code carefully, it seems that holding dc->writeback_lock in
update_writeback_rate() is unnecessary. Let me explain why.

In __update_writeback_rate(), when bcache_flash_devs_sectors_dirty()
is called, mutex bch_register_lock is acquired to prevent bcache
device changes (add/remove) in the cache set, which is necessary. But
the remaining global objects do not need writeback_lock protection.
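
For reference, here is a hedged sketch of the helper in question, an
approximate reconstruction of the 4.14-era
bcache_flash_devs_sectors_dirty(), so treat the details as
illustrative. It must hold bch_register_lock because it walks
c->devices[], which device registration and unregistration mutate:

        static uint64_t bcache_flash_devs_sectors_dirty(struct cache_set *c)
        {
                uint64_t i, ret = 0;

                mutex_lock(&bch_register_lock);

                for (i = 0; i < c->nr_uuids; i++) {
                        struct bcache_device *d = c->devices[i];

                        /* count only flash-only volumes */
                        if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
                                continue;
                        ret += bcache_dev_sectors_dirty(d);
                }

                mutex_unlock(&bch_register_lock);

                return ret;
        }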

Let's examine each global object referenced in
__update_writeback_rate():
- The following 3 objects are only read and always have the same
  value. They don't need to be protected by dc->writeback_lock:
	dc->disk.c
	c->nbuckets
	c->sb.bucket_size
- The following objects are only changed and referenced inside the
  non-reentrant function __update_writeback_rate(), so they don't need
  to be protected by dc->writeback_lock:
	dc->writeback_rate_p_term_inverse
	dc->writeback_rate_integral
	dc->writeback_rate_update_seconds
	dc->writeback_rate_i_term_inverse
	dc->writeback_rate_minimum
	dc->writeback_rate_proportional
	dc->writeback_rate_integral_scaled
	dc->writeback_rate_change
	dc->writeback_rate_target
- dc->writeback_percent
  Only changed via the sysfs interface at runtime, and since it is an
  8-bit variable, it is safe to access without dc->writeback_lock.
- c->cached_dev_sectors
  This is a 64-bit variable, updated in calc_cached_dev_sectors() and
  only read in __update_writeback_rate(). Changing it into atomic64_t
  makes it safe on both 32-bit and 64-bit hardware.
- bcache_dev_sectors_dirty()
  Inside this function, d->nr_stripes is a constant number at run
  time, and stripe_sectors_dirty on each stripe is atomic_t; they are
  updated in bcache_dev_sectors_dirty_add() and only read in
  bcache_dev_sectors_dirty() (see the sketch after this list). It is
  safe to access these variables without dc->writeback_lock. And if a
  bcache device is being removed from the cache set, its cached
  device's writeback rate update kworker should be canceled first, so
  we don't need to worry about a NULL pointer dereference if the
  bcache device is removed while bcache_dev_sectors_dirty() is
  executing.
- dc->writeback_rate.next
  writeback_rate.next is only read in __update_writeback_rate() and
  updated in bch_next_delay(). bch_next_delay() is reached via
  writeback_delay() <- read_dirty() <- bch_writeback_thread(), where
  dc->writeback_lock is not held. That is to say, the current bcache
  code does not protect writeback_rate.next against concurrent access
  at all. On 32-bit hardware this might be problematic. This patch
  doesn't fix the existing concurrent access issue on 32-bit hardware,
  but it doesn't make things worse either.
- dc->writeback_rate.rate
  writeback_rate.rate is only read in bch_next_delay() and updated in
  __update_writeback_rate(). Again, its concurrent access is not
  protected by dc->writeback_lock, but it is 32-bit and only modified
  in one thread, so it is safe to use for now.
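
For the bcache_dev_sectors_dirty() case above, a hedged sketch of the
lock-free read side (again an approximate reconstruction of the
4.14-era code, assuming these struct bcache_device fields):

        static uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
        {
                unsigned int i;
                uint64_t ret = 0;

                /* nr_stripes is fixed once the device is created */
                for (i = 0; i < d->nr_stripes; i++)
                        /* each per-stripe dirty counter is an atomic_t */
                        ret += atomic_read(d->stripe_sectors_dirty + i);

                return ret;
        }

Each atomic_read() is individually safe without any lock; the sum may
be slightly stale, which is acceptable for rate calculation.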

From the above analysis, the kworker function update_writeback_rate()
can work properly without the protection of dc->writeback_lock. The
writeback rate calculation might not be extremely accurate, but it is
good enough for writeback I/O throttling.

By removing dc->writeback_lock here, we can avoid the deadlock.
Furthermore, we avoid lock contention between the kworker
update_writeback_rate() and btree operations on dc->writeback_lock,
which means potentially better latency for front-end I/O requests,
because in writeback mode a front-end I/O request also needs to
acquire dc->writeback_lock for btree operations.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h    | 2 +-
 drivers/md/bcache/super.c     | 2 +-
 drivers/md/bcache/writeback.c | 6 +-----
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 843877e017e1..1b6964077100 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -489,7 +489,7 @@ struct cache_set {
 
 	struct bcache_device	**devices;
 	struct list_head	cached_devs;
-	uint64_t		cached_dev_sectors;
+	atomic64_t		cached_dev_sectors;
 	struct closure		caching;
 
 	struct closure		sb_write;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d28928dec5..879e1a135180 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -847,7 +847,7 @@ static void calc_cached_dev_sectors(struct cache_set *c)
 	list_for_each_entry(dc, &c->cached_devs, list)
 		sectors += bdev_sectors(dc->bdev);
 
-	c->cached_dev_sectors = sectors;
+	atomic64_set(&c->cached_dev_sectors, sectors);
 }
 
 void bch_cached_dev_run(struct cached_dev *dc)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 56a37884ca8b..cec10e6345af 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -27,7 +27,7 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	uint64_t cache_dirty_target =
 		div_u64(cache_sectors * dc->writeback_percent, 100);
 	int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
-				   c->cached_dev_sectors);
+				   atomic64_read(&c->cached_dev_sectors));
 
 	/*
 	 * PI controller:
@@ -92,14 +92,10 @@ static void update_writeback_rate(struct work_struct *work)
 					     struct cached_dev,
 					     writeback_rate_update);
 
-	down_read(&dc->writeback_lock);
-
 	if (atomic_read(&dc->has_dirty) &&
 	    dc->writeback_percent)
 		__update_writeback_rate(dc);
 
-	up_read(&dc->writeback_lock);
-
 	schedule_delayed_work(&dc->writeback_rate_update,
 			      dc->writeback_rate_update_seconds * HZ);
 }
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC] bcache: fix a circular deadlock between dc->writeback_lock and bch_register_lock
  2017-11-18  8:14 ` Coly Li
@ 2017-11-18 17:01 ` Michael Lyle
  2017-11-18 17:32     ` Coly Li
  -1 siblings, 1 reply; 6+ messages in thread
From: Michael Lyle @ 2017-11-18 17:01 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

Coly--

It's a real circular lock dep between registration (a very rare
operation) and writeback rate calculation (a fairly rare operation).
While it's definitely a real issue, it's a very unlikely one to be
actually encountered.

(Encountering it requires the every-5-second periodic task of
calculating writeback to be running, between writeback lock and register
lock [a pretty short window], and then for the user to register a new
cached volume)

This particular issue was introduced in a8394090a9129b40f9 ("bcache:
correct cache_dirty_target in __update_writeback_rate()") as part of
4.14. I'm concerned that if we keep putting in small local patches like
this (especially to lock ordering or registration/deregistration
behavior), we're just moving problems around.

I'm planning on working on the writeback lock this cycle to make it less
contended.  As part of this I plan to audit overall lock order, and to
broaden use of the registration lock to better control reg/dereg
ordering.  That is, there's already going to be a lot of change in this
path and I hope to catch/fix a lot of issues like this.

Mike

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] bcache: fix a circular deadlock between dc->writeback_lock and bch_register_lock
  2017-11-18 17:01 ` Michael Lyle
@ 2017-11-18 17:32     ` Coly Li
  0 siblings, 0 replies; 6+ messages in thread
From: Coly Li @ 2017-11-18 17:32 UTC (permalink / raw)
  To: Michael Lyle, linux-bcache; +Cc: linux-block

On 19/11/2017 1:01 AM, Michael Lyle wrote:
> Coly--
> 
> It's a real circular lock dep between registration (a very rare
> operation) and writeback rate calculation (a fairly rare operation).
> While it's definitely a real issue, it's a very unlikely one to be
> actually encountered.
> 
> (Encountering it requires the every-5-second periodic task of
> calculating writeback to be running, between writeback lock and register
> lock [a pretty short window], and then for the user to register a new
> cached volume)
> 

Hi Mike,

Thanks for the response. It seems I can almost 100% reproduce this
lockdep warning in my simple test environment: just compile a mainline
kernel with lock debugging enabled, attach a backing device to a cache
device, and use fio to write around 30G of data in 4KB blocks to the
bcache device. This is why this warning grabbed my eye.

From the lockdep warning it seems this should be very rare to happen,
but I don't know why it happens almost every time in my testing...

> This particular issue was introduced in a8394090a9129b40f9 ("bcache:
> correct cache_dirty_target in __update_writeback_rate()") as part of
> 4.14. I'm concerned that if we keep putting in small local patches like
> this (especially to lock ordering or registration/deregistration
> behavior), we're just moving problems around.
> 
> I'm planning on working on the writeback lock this cycle to make it less
> contended.  As part of this I plan to audit overall lock order, and to
> broaden use of the registration lock to better control reg/dereg
> ordering.  That is, there's already going to be a lot of change in this
> path and I hope to catch/fix a lot of issues like this.

It seems we use many mutex locks to protect critical code sections
rather than critical shared data. And sometimes shared data appears to
be protected by some kind of lock when in fact it is not (e.g. the
64-bit writeback_rate.next).
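
As an aside, one way a 64-bit value like writeback_rate.next could be
made tear-free on 32-bit kernels is the u64_stats_sync helpers. This
is only a sketch of a possible approach, not something this patch or
the current bcache code does, and the fields here are hypothetical:

        #include <linux/u64_stats_sync.h>

        static struct u64_stats_sync next_sync; /* hypothetical field */
        static u64 next;                /* e.g. writeback_rate.next */

        /* writer side (single writer); u64_stats_init(&next_sync)
         * must have been called at setup time */
        static void write_next(u64 new_value)
        {
                u64_stats_update_begin(&next_sync);
                next = new_value;
                u64_stats_update_end(&next_sync);
        }

        /* reader side: retry if a writer was active in between */
        static u64 read_next(void)
        {
                unsigned int start;
                u64 val;

                do {
                        start = u64_stats_fetch_begin(&next_sync);
                        val = next;
                } while (u64_stats_fetch_retry(&next_sync, start));

                return val;
        }

On 64-bit kernels these helpers compile away to plain loads and
stores, so the overhead only exists where the tearing problem exists.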

This is an RFC patch; maybe after the lock reordering and cleanup get
done by you, this patch won't be necessary any more.

Another warning is about nested locking across btree levels; so far I
have no idea what happens there yet ....


[   51.044531] ------------[ cut here ]------------
[   51.044536] WARNING: CPU: 0 PID: 1599 at kernel/locking/lockdep.c:735
register_lock_class+0x48e/0x4e0
[   51.044537] Modules linked in: ghash_clmulni_intel(+) snd_ens1371(+)
snd_ac97_codec ac97_bus snd_pcm uvcvideo bcache(O) videobuf2_vmalloc
aesni_intel videobuf2_memops vmw_balloon videobuf2_v4l2 aes_x86_64
crypto_simd cryptd glue_helper snd_timer snd_rawmidi videobuf2_core
videodev snd e1000(+) ppdev nfit joydev pcspkr vmw_vmci libnvdimm
soundcore gameport mptctl shpchp i2c_piix4 battery parport_pc parport
button ac ext4 crc16 mbcache jbd2 sr_mod cdrom ata_generic sd_mod
hid_generic usbhid ata_piix ahci libahci vmwgfx drm_kms_helper
syscopyarea sysfillrect sysimgblt ehci_pci fb_sys_fops uhci_hcd ehci_hcd
ttm mptspi scsi_transport_spi serio_raw usbcore mptscsih mptbase drm
libata sg scsi_mod autofs4
[   51.044559] CPU: 0 PID: 1599 Comm: bcache-register Tainted: G
  O    4.14.0-1-default+ #3
[   51.044560] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
Desktop Reference Platform, BIOS 6.00 05/19/2017
[   51.044560] task: ffff880223580140 task.stack: ffffc90001f3c000
[   51.044561] RIP: 0010:register_lock_class+0x48e/0x4e0
[   51.044562] RSP: 0018:ffffc90001f3f9c0 EFLAGS: 00010083
[   51.044563] RAX: ffffffff82e31240 RBX: 0000000000000000 RCX:
ffffffff8272cca0
[   51.044563] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
ffffffffa0606658
[   51.044563] RBP: ffff8802151790d8 R08: 0000000000000000 R09:
0000000000000000
[   51.044564] R10: ffff880223580f80 R11: 0000000000000000 R12:
ffff8802151790d8
[   51.044564] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
[   51.044565] FS:  00007f96e68ed780(0000) GS:ffff880227600000(0000)
knlGS:0000000000000000
[   51.044565] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.044566] CR2: 000056015e05d988 CR3: 0000000223974001 CR4:
00000000003606f0
[   51.044590] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[   51.044590] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[   51.044591] Call Trace:
[   51.044593]  ? check_prev_add+0x544/0x6d0
[   51.044594]  ? check_chain_key+0x160/0x160
[   51.044596]  __lock_acquire+0x76/0x13b0
[   51.044597]  lock_acquire+0xd4/0x220
[   51.044603]  ? mca_reap+0x4f/0x170 [bcache]
[   51.044605]  down_write_trylock+0x64/0x80
[   51.044608]  ? mca_reap+0x4f/0x170 [bcache]
[   51.044611]  mca_reap+0x4f/0x170 [bcache]
[   51.044615]  mca_alloc+0xbf/0x580 [bcache]
[   51.044618]  bch_btree_node_get+0x142/0x290 [bcache]
[   51.044623]  run_cache_set+0x22b/0x880 [bcache]
[   51.044628]  register_bcache+0xec7/0x1450 [bcache]
[   51.044631]  kernfs_fop_write+0x10e/0x1a0
[   51.044632]  __vfs_write+0x23/0x150
[   51.044634]  ? rcu_read_lock_sched_held+0x9b/0xb0
[   51.044635]  ? rcu_sync_lockdep_assert+0x2d/0x60
[   51.044636]  ? __sb_start_write+0x157/0x1e0
[   51.044637]  ? vfs_write+0x190/0x1c0
[   51.044638]  vfs_write+0xc2/0x1c0
[   51.044639]  SyS_write+0x45/0xa0
[   51.044641]  entry_SYSCALL_64_fastpath+0x23/0x9a
[   51.044642] RIP: 0033:0x7f96e5fe72d0
[   51.044643] RSP: 002b:00007fffb0860bf8 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[   51.044644] RAX: ffffffffffffffda RBX: 00007fffb0861be2 RCX:
00007f96e5fe72d0
[   51.044644] RDX: 0000000000000009 RSI: 00007f96e6915000 RDI:
0000000000000003
[   51.044644] RBP: 0000000000000008 R08: 00000000004009c0 R09:
0000000000000008
[   51.044645] R10: 00007fffb0861be2 R11: 0000000000000246 R12:
00007fffb0860c70
[   51.044645] R13: 0000000000000008 R14: 0000000000000008 R15:
00007fffb0860ec8
[   51.044646] Code: ff ff ff e8 d5 ec ff ff 85 c0 0f 1f 00 0f 84 62 fe
ff ff 48 c7 c7 b0 71 a5 81 e8 2f 64 00 00 e8 9a 8c 67 00 31 c0 e9 eb fd
ff ff <0f> ff 49 89 c4 e9 b4 fd ff ff e8 03 0a 32 00 85 c0 90 0f 84 32
[   51.044658] ---[ end trace bdc1e7194a439d0a ]---

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] bcache: fix a circular deadlock between dc->writeback_lock and bch_register_lock
  2017-11-18 17:32     ` Coly Li
@ 2017-11-19 19:59     ` Michael Lyle
  -1 siblings, 0 replies; 6+ messages in thread
From: Michael Lyle @ 2017-11-19 19:59 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 11/18/2017 09:32 AM, Coly Li wrote:
> On 19/11/2017 1:01 AM, Michael Lyle wrote:
>> Coly--
>>
>> It's a real circular lock dep between registration (a very rare
>> operation) and writeback rate calculation (a fairly rare operation).
>> While it's definitely a real issue, it's a very unlikely one to be
>> actually encountered.
>>
>> (Encountering it requires the every-5-second periodic task of
>> calculating writeback to be running, between writeback lock and register
>> lock [a pretty short window], and then for the user to register a new
>> cached volume)
>>
> 
> Hi Mike,
> 
> Thanks for the response. It seems I can almost 100% reproduce this
> lockdep warning in my simple test environment: just compile a mainline
> kernel with lock debugging enabled, attach a backing device to a cache
> device, and use fio to write around 30G of data in 4KB blocks to the
> bcache device. This is why this warning grabbed my eye.
>
> From the lockdep warning it seems this should be very rare to happen,
> but I don't know why it happens almost every time in my testing...

Coly--

Do you get an actual deadlock here, or just a lockdep warning?  The
lockdep warning just lets you know that locks were acquired in different
orders and that an issue is possible.  I'd expect a lockdep warning
every time, but almost never an actual deadlock.

> [snip] 
> Another warning is the nested locking in btree levels, so far I don't
> have no idea what happens here yet ....

I am guessing there is a missing free of a lock, which only creates
problems when lockdep is on, but we should fix it.

Mike

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread

Thread overview: 6+ messages
2017-11-18  8:14 [RFC] bcache: fix a circular deadlock between dc->writeback_lock and bch_register_lock Coly Li
2017-11-18 17:01 ` Michael Lyle
2017-11-18 17:32   ` Coly Li
2017-11-19 19:59     ` Michael Lyle
