[PATCH RESEND 1/3 v2.6.39-rc7] block: don't use non-syncing event blocking in disk_check

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RESEND 1/3 v2.6.39-rc7] block: don't use non-syncing event blocking in disk_check_events()
@ 2011-05-17 10:27 Tejun Heo
  2011-05-17 10:28 ` [PATCH RESEND 2/3 v2.6.39-rc7] block: remove non-syncing __disk_block_events() and fold it into disk_block_events() Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2011-05-17 10:27 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds
  Cc: Sitsofe Wheeler, Borislav Petkov, Meelis Roos, Andrew Morton,
	Kay Sievers, linux-kernel

This patch is part of fix for triggering of WARN_ON_ONCE() in
disk_clear_events() reported in bug#34662.

  https://bugzilla.kernel.org/show_bug.cgi?id=34662

disk_clear_events() blocks events, schedules and flushes the event
work.  It expects the work to have started execution on schedule and
finished on return from flush.  WARN_ON_ONCE() triggers if the event
work hasn't executed as expected.  This problem happens because
__disk_block_events() fails to guarantee that the event work item is
not in flight on return from the function in race-free manner.  The
problem is two-fold and this patch addresses one of them.

When __disk_block_events() is called with @sync == %false, it bumps
event block count, calls cancel_delayed_work() and return.  This makes
it impossible to guarantee that event polling is not in flight on
return from syncing __disk_block_events() - if the first blocker was
non-syncing, polling could still be in progress and later syncing ones
would assume that the first blocker already canceled it.

Making __disk_block_events() cancel_sync regardless of block count
isn't feasible either as it may race with forced event checking in
disk_clear_events().

As disk_check_events() is the only user of non-syncing
__disk_block_events(), updating it to directly cancel and schedule
event work is the easiest way to solve the issue.

Note that there's another bug in __disk_block_events() and this patch
doesn't fix the issue completely.  Later patch will fix the other bug.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
---
(sorry, forgot to cc lkml, resending)

This is the first of three patches which (finally) fix the
WARN_ON_ONCE() in disk_clear_events() triggering.  It was me being
stupid about synchronization around event blocking.

Given that we're very late in -rc cycle and, although the fix isn't
invasive, it isn't obvious one-liner either, and that the bug happens
sporadically with non-critical failure mode, it might be better to
route this through block for v2.6.40-rc1 and then back port to v2.6.39
via -stable, unless v2.6.39 is gonna go through another -rc cycle.

Jens, Linus, what do you guys think?

Thank you.

 block/genhd.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

Index: work/block/genhd.c
===================================================================
--- work.orig/block/genhd.c
+++ work/block/genhd.c
@@ -1508,10 +1508,18 @@ void disk_unblock_events(struct gendisk
  */
 void disk_check_events(struct gendisk *disk)
 {
-	if (disk->ev) {
-		__disk_block_events(disk, false);
-		__disk_unblock_events(disk, true);
+	struct disk_events *ev = disk->ev;
+	unsigned long flags;
+
+	if (!ev)
+		return;
+
+	spin_lock_irqsave(&ev->lock, flags);
+	if (!ev->block) {
+		cancel_delayed_work(&ev->dwork);
+		queue_delayed_work(system_nrt_wq, &ev->dwork, 0);
 	}
+	spin_unlock_irqrestore(&ev->lock, flags);
 }
 EXPORT_SYMBOL_GPL(disk_check_events);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH RESEND 2/3 v2.6.39-rc7] block: remove non-syncing __disk_block_events() and fold it into disk_block_events()
  2011-05-17 10:27 [PATCH RESEND 1/3 v2.6.39-rc7] block: don't use non-syncing event blocking in disk_check_events() Tejun Heo
@ 2011-05-17 10:28 ` Tejun Heo
  2011-05-17 10:28   ` [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2011-05-17 10:28 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds
  Cc: Sitsofe Wheeler, Borislav Petkov, Meelis Roos, Andrew Morton,
	Kay Sievers, linux-kernel

After the previous update to disk_check_events(), nobody is using
non-syncing __disk_block_events().  Remove @sync and, as this makes
__disk_block_events() virtually identical to disk_block_events(),
remove the underscore prefixed version.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/genhd.c |   55 ++++++++++++++++++++++++-------------------------------
 1 file changed, 24 insertions(+), 31 deletions(-)

Index: work/block/genhd.c
===================================================================
--- work.orig/block/genhd.c
+++ work/block/genhd.c
@@ -1414,22 +1414,36 @@ static unsigned long disk_events_poll_ji
 	return msecs_to_jiffies(intv_msecs);
 }
 
-static void __disk_block_events(struct gendisk *disk, bool sync)
+/**
+ * disk_block_events - block and flush disk event checking
+ * @disk: disk to block events for
+ *
+ * On return from this function, it is guaranteed that event checking
+ * isn't in progress and won't happen until unblocked by
+ * disk_unblock_events().  Events blocking is counted and the actual
+ * unblocking happens after the matching number of unblocks are done.
+ *
+ * Note that this intentionally does not block event checking from
+ * disk_clear_events().
+ *
+ * CONTEXT:
+ * Might sleep.
+ */
+void disk_block_events(struct gendisk *disk)
 {
 	struct disk_events *ev = disk->ev;
 	unsigned long flags;
 	bool cancel;
 
+	if (!ev)
+		return;
+
 	spin_lock_irqsave(&ev->lock, flags);
 	cancel = !ev->block++;
 	spin_unlock_irqrestore(&ev->lock, flags);
 
-	if (cancel) {
-		if (sync)
-			cancel_delayed_work_sync(&disk->ev->dwork);
-		else
-			cancel_delayed_work(&disk->ev->dwork);
-	}
+	if (cancel)
+		cancel_delayed_work_sync(&disk->ev->dwork);
 }
 
 static void __disk_unblock_events(struct gendisk *disk, bool check_now)
@@ -1461,27 +1475,6 @@ out_unlock:
 }
 
 /**
- * disk_block_events - block and flush disk event checking
- * @disk: disk to block events for
- *
- * On return from this function, it is guaranteed that event checking
- * isn't in progress and won't happen until unblocked by
- * disk_unblock_events().  Events blocking is counted and the actual
- * unblocking happens after the matching number of unblocks are done.
- *
- * Note that this intentionally does not block event checking from
- * disk_clear_events().
- *
- * CONTEXT:
- * Might sleep.
- */
-void disk_block_events(struct gendisk *disk)
-{
-	if (disk->ev)
-		__disk_block_events(disk, true);
-}
-
-/**
  * disk_unblock_events - unblock disk event checking
  * @disk: disk to unblock events for
  *
@@ -1554,7 +1547,7 @@ unsigned int disk_clear_events(struct ge
 	spin_unlock_irq(&ev->lock);
 
 	/* uncondtionally schedule event check and wait for it to finish */
-	__disk_block_events(disk, true);
+	disk_block_events(disk);
 	queue_delayed_work(system_nrt_wq, &ev->dwork, 0);
 	flush_delayed_work(&ev->dwork);
 	__disk_unblock_events(disk, false);
@@ -1672,7 +1665,7 @@ static ssize_t disk_events_poll_msecs_st
 	if (intv < 0 && intv != -1)
 		return -EINVAL;
 
-	__disk_block_events(disk, true);
+	disk_block_events(disk);
 	disk->ev->poll_msecs = intv;
 	__disk_unblock_events(disk, true);
 
@@ -1778,7 +1771,7 @@ static void disk_del_events(struct gendi
 	if (!disk->ev)
 		return;
 
-	__disk_block_events(disk, true);
+	disk_block_events(disk);
 
 	mutex_lock(&disk_events_mutex);
 	list_del_init(&disk->ev->node);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 10:28 ` [PATCH RESEND 2/3 v2.6.39-rc7] block: remove non-syncing __disk_block_events() and fold it into disk_block_events() Tejun Heo
@ 2011-05-17 10:28   ` Tejun Heo
  2011-05-17 14:46     ` Linus Torvalds
  2011-05-17 15:47     ` [PATCH UPDATED " Tejun Heo
  0 siblings, 2 replies; 16+ messages in thread
From: Tejun Heo @ 2011-05-17 10:28 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds
  Cc: Sitsofe Wheeler, Borislav Petkov, Meelis Roos, Andrew Morton,
	Kay Sievers, linux-kernel

disk_block_events() should guarantee that the event work is not in
flight on return and once blocked it shouldn't issue further
cancellations.

Because there was no synchronization between the first blocker doing
cancel_delayed_work_sync() and the following blockers, the following
blockers could finish before cancellation was complete, which broke
both guarantees - event work could be in flight and cancellation could
happen after return.

This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
bug#34662.

  https://bugzilla.kernel.org/show_bug.cgi?id=34662

Fix it by introducing DISK_EVENT_CANCELING bit which is set by the
first blocker while cancellation is in progress.  Further blockers
wait until the bit is cleared by the first blocker.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
---
 block/genhd.c |   36 +++++++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

Index: work/block/genhd.c
===================================================================
--- work.orig/block/genhd.c
+++ work/block/genhd.c
@@ -1371,7 +1371,7 @@ struct disk_events {
 	struct gendisk		*disk;		/* the associated disk */
 	spinlock_t		lock;
 
-	int			block;		/* event blocking depth */
+	unsigned int		block;		/* event blocking depth */
 	unsigned int		pending;	/* events already sent out */
 	unsigned int		clearing;	/* events being cleared */
 
@@ -1379,6 +1379,8 @@ struct disk_events {
 	struct delayed_work	dwork;
 };
 
+#define DISK_EVENT_CANCELING			0x80000000U
+
 static const char *disk_events_strs[] = {
 	[ilog2(DISK_EVENT_MEDIA_CHANGE)]	= "media_change",
 	[ilog2(DISK_EVENT_EJECT_REQUEST)]	= "eject_request",
@@ -1414,6 +1416,12 @@ static unsigned long disk_events_poll_ji
 	return msecs_to_jiffies(intv_msecs);
 }
 
+static int disk_block_wait_canceling(void *word)
+{
+	schedule();
+	return 0;
+}
+
 /**
  * disk_block_events - block and flush disk event checking
  * @disk: disk to block events for
@@ -1438,12 +1446,34 @@ void disk_block_events(struct gendisk *d
 	if (!ev)
 		return;
 
+	/*
+	 * Bump block count and set CANCELLING if we're the first blocker
+	 * and have to cancel the event work.
+	 */
 	spin_lock_irqsave(&ev->lock, flags);
-	cancel = !ev->block++;
+	if ((cancel = !ev->block++))
+		ev->block |= DISK_EVENT_CANCELING;
 	spin_unlock_irqrestore(&ev->lock, flags);
 
-	if (cancel)
+	if (cancel) {
+		/*
+		 * Cancel the event work, clear CANCELING and wake up
+		 * waiters.
+		 */
 		cancel_delayed_work_sync(&disk->ev->dwork);
+
+		spin_lock_irqsave(&ev->lock, flags);
+		ev->block &= ~DISK_EVENT_CANCELING;
+		spin_unlock_irqrestore(&ev->lock, flags);
+		wake_up_bit(&ev->block, ilog2(DISK_EVENT_CANCELING));
+	} else {
+		/*
+		 * The first blocker might not have finished canceling the
+		 * event work.  Wait for CANCELING to clear.
+		 */
+		wait_on_bit(&ev->block, ilog2(DISK_EVENT_CANCELING),
+			    disk_block_wait_canceling, TASK_UNINTERRUPTIBLE);
+	}
 }
 
 static void __disk_unblock_events(struct gendisk *disk, bool check_now)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 10:28   ` [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation Tejun Heo
@ 2011-05-17 14:46     ` Linus Torvalds
  2011-05-17 15:11       ` Tejun Heo
  2011-05-17 15:47     ` [PATCH UPDATED " Tejun Heo
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2011-05-17 14:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

This is pretty disgusting.

You're not using a real lock, and to compensate for that you use a
bloccking bit-lock hack. And to make that hack extra ugly, you define
the bit as a bitmask, and use the ilog2() macro to turn it into a bit
pos.

Horrid. Horrid.

Is there some fundamental reason why you cannot just turn the ev->lock
into a real semaphore (allowing blocking), and then doing the dwork
cancel under the semaphore - avoiding all the crazy bit-lock crud.

Or just _add_ a semaphore to the 'struct disk_events', for chrissake.

This is just too ugly to survive. And even if you fixed the ilog()
(hint: just define the bit, and then use (1u<<BIT) to define the
mask), it would be too ugly.

Don't do these kinds of ad-hock locks. They are WRONG.

              Linus

On Tue, May 17, 2011 at 3:28 AM, Tejun Heo <tj@kernel.org> wrote:
> disk_block_events() should guarantee that the event work is not in
> flight on return and once blocked it shouldn't issue further
> cancellations.
>
> Because there was no synchronization between the first blocker doing
> cancel_delayed_work_sync() and the following blockers, the following
> blockers could finish before cancellation was complete, which broke
> both guarantees - event work could be in flight and cancellation could
> happen after return.
>
> This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
> bug#34662.
>
>  https://bugzilla.kernel.org/show_bug.cgi?id=34662
>
> Fix it by introducing DISK_EVENT_CANCELING bit which is set by the
> first blocker while cancellation is in progress.  Further blockers
> wait until the bit is cleared by the first blocker.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
> Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
> Reported-by: Borislav Petkov <bp@alien8.de>
> Reported-by: Meelis Roos <mroos@linux.ee>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> ---
>  block/genhd.c |   36 +++++++++++++++++++++++++++++++++---
>  1 file changed, 33 insertions(+), 3 deletions(-)
>
> Index: work/block/genhd.c
> ===================================================================
> --- work.orig/block/genhd.c
> +++ work/block/genhd.c
> @@ -1371,7 +1371,7 @@ struct disk_events {
>        struct gendisk          *disk;          /* the associated disk */
>        spinlock_t              lock;
>
> -       int                     block;          /* event blocking depth */
> +       unsigned int            block;          /* event blocking depth */
>        unsigned int            pending;        /* events already sent out */
>        unsigned int            clearing;       /* events being cleared */
>
> @@ -1379,6 +1379,8 @@ struct disk_events {
>        struct delayed_work     dwork;
>  };
>
> +#define DISK_EVENT_CANCELING                   0x80000000U
> +
>  static const char *disk_events_strs[] = {
>        [ilog2(DISK_EVENT_MEDIA_CHANGE)]        = "media_change",
>        [ilog2(DISK_EVENT_EJECT_REQUEST)]       = "eject_request",
> @@ -1414,6 +1416,12 @@ static unsigned long disk_events_poll_ji
>        return msecs_to_jiffies(intv_msecs);
>  }
>
> +static int disk_block_wait_canceling(void *word)
> +{
> +       schedule();
> +       return 0;
> +}
> +
>  /**
>  * disk_block_events - block and flush disk event checking
>  * @disk: disk to block events for
> @@ -1438,12 +1446,34 @@ void disk_block_events(struct gendisk *d
>        if (!ev)
>                return;
>
> +       /*
> +        * Bump block count and set CANCELLING if we're the first blocker
> +        * and have to cancel the event work.
> +        */
>        spin_lock_irqsave(&ev->lock, flags);
> -       cancel = !ev->block++;
> +       if ((cancel = !ev->block++))
> +               ev->block |= DISK_EVENT_CANCELING;
>        spin_unlock_irqrestore(&ev->lock, flags);
>
> -       if (cancel)
> +       if (cancel) {
> +               /*
> +                * Cancel the event work, clear CANCELING and wake up
> +                * waiters.
> +                */
>                cancel_delayed_work_sync(&disk->ev->dwork);
> +
> +               spin_lock_irqsave(&ev->lock, flags);
> +               ev->block &= ~DISK_EVENT_CANCELING;
> +               spin_unlock_irqrestore(&ev->lock, flags);
> +               wake_up_bit(&ev->block, ilog2(DISK_EVENT_CANCELING));
> +       } else {
> +               /*
> +                * The first blocker might not have finished canceling the
> +                * event work.  Wait for CANCELING to clear.
> +                */
> +               wait_on_bit(&ev->block, ilog2(DISK_EVENT_CANCELING),
> +                           disk_block_wait_canceling, TASK_UNINTERRUPTIBLE);
> +       }
>  }
>
>  static void __disk_unblock_events(struct gendisk *disk, bool check_now)
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 14:46     ` Linus Torvalds
@ 2011-05-17 15:11       ` Tejun Heo
  2011-05-17 15:15         ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2011-05-17 15:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

Hello, Linus.

On Tue, May 17, 2011 at 07:46:25AM -0700, Linus Torvalds wrote:
> This is pretty disgusting.
> 
> You're not using a real lock, and to compensate for that you use a
> bloccking bit-lock hack. And to make that hack extra ugly, you define
> the bit as a bitmask, and use the ilog2() macro to turn it into a bit
> pos.
> 
> Horrid. Horrid.

Heh, okay.  It's not a lock tho.  It's multiple waiter waiting for a
single event - so either explicit waitqueue or completion.  I was
doing a waitqueue but bit waitqueue didn't seem to add too much
complexity, so...

> Is there some fundamental reason why you cannot just turn the ev->lock
> into a real semaphore (allowing blocking), and then doing the dwork
> cancel under the semaphore - avoiding all the crazy bit-lock crud.

disk_check_events() can be called from non-sleepable context so we
need a spinlock protecting block count.

> Or just _add_ a semaphore to the 'struct disk_events', for chrissake.

Alright, a completion then.

> This is just too ugly to survive. And even if you fixed the ilog()
> (hint: just define the bit, and then use (1u<<BIT) to define the
> mask), it would be too ugly.

In many cases there are more C bitops than atomic or wait_bit() type
operations which take bit position rather than mask.  I find it less
painful to define constants as bit masks and using ilog2() on those
ops than doing lots of 1 << BIT.

I don't know, this is constantly painful.  More are defined as bit
masks but some prominent ones use bit positions and some even define
both.  Given that the C bitops are there by default, I wish the
explicit bitops also took bitmask and just did ilog2() internally, but
well it's too late and we're stuck with the mixed situation.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 15:11       ` Tejun Heo
@ 2011-05-17 15:15         ` Linus Torvalds
  2011-05-17 15:27           ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2011-05-17 15:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

On Tue, May 17, 2011 at 8:11 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Heh, okay.  It's not a lock tho.  It's multiple waiter waiting for a
> single event - so either explicit waitqueue or completion.  I was
> doing a waitqueue but bit waitqueue didn't seem to add too much
> complexity, so...

No.

Semantically what it is is a LOCK.

Turning it into something else just screws up everything.  It just
means you have to have ANOTHER lock to protect the things that the
completion/waitqueue would use.

Don't f*&^ around. Just make it a lock, and don't do a
"lock+completion " or something crazy.

If you can do it with a completion but with no locking, go ahead. I
doubt you can. You want the locking for the whole "do I need to wait
for the completion" thing anyway, so why mess things up?

                   Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 15:15         ` Linus Torvalds
@ 2011-05-17 15:27           ` Tejun Heo
  2011-05-17 22:40             ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2011-05-17 15:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

Hello,

On Tue, May 17, 2011 at 08:15:38AM -0700, Linus Torvalds wrote:
> Semantically what it is is a LOCK.

Okay, an outer mutex then.

> You want the locking for the whole "do I need to wait for the
> completion" thing anyway, so why mess things up?

Spinlock inside mutex seemed a bit strange but yeah that probably is
the simpliest way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH UPDATED 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 10:28   ` [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation Tejun Heo
  2011-05-17 14:46     ` Linus Torvalds
@ 2011-05-17 15:47     ` Tejun Heo
  2011-05-17 19:34       ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2011-05-17 15:47 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds
  Cc: Sitsofe Wheeler, Borislav Petkov, Meelis Roos, Andrew Morton,
	Kay Sievers, linux-kernel

disk_block_events() should guarantee that the event work is not in
flight on return and once blocked it shouldn't issue further
cancellations.

Because there was no synchronization between the first blocker doing
cancel_delayed_work_sync() and the following blockers, the following
blockers could finish before cancellation was complete, which broke
both guarantees - event work could be in flight and cancellation could
happen after return.

This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
bug#34662.

  https://bugzilla.kernel.org/show_bug.cgi?id=34662

Fix it by adding an outer mutex which protects both block count
manipulation and work cancellation.

-v2: Use outer mutex instead of bit waitqueue per Linus.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
---
Yeap, it's so much simpler this way.  Thanks.

 block/genhd.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: work/block/genhd.c
===================================================================
--- work.orig/block/genhd.c
+++ work/block/genhd.c
@@ -1371,6 +1371,7 @@ struct disk_events {
 	struct gendisk		*disk;		/* the associated disk */
 	spinlock_t		lock;
 
+	struct mutex		block_mutex;	/* protects blocking */
 	int			block;		/* event blocking depth */
 	unsigned int		pending;	/* events already sent out */
 	unsigned int		clearing;	/* events being cleared */
@@ -1438,12 +1439,20 @@ void disk_block_events(struct gendisk *d
 	if (!ev)
 		return;
 
+	/*
+	 * Outer mutex ensures that the first blocker completes canceling
+	 * the event work before further blockers are allowed to finish.
+	 */
+	mutex_lock(&ev->block_mutex);
+
 	spin_lock_irqsave(&ev->lock, flags);
 	cancel = !ev->block++;
 	spin_unlock_irqrestore(&ev->lock, flags);
 
 	if (cancel)
 		cancel_delayed_work_sync(&disk->ev->dwork);
+
+	mutex_unlock(&ev->block_mutex);
 }
 
 static void __disk_unblock_events(struct gendisk *disk, bool check_now)
@@ -1751,6 +1760,7 @@ static void disk_add_events(struct gendi
 	INIT_LIST_HEAD(&ev->node);
 	ev->disk = disk;
 	spin_lock_init(&ev->lock);
+	mutex_init(&ev->block_mutex);
 	ev->block = 1;
 	ev->poll_msecs = -1;
 	INIT_DELAYED_WORK(&ev->dwork, disk_events_workfn);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH UPDATED 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 15:47     ` [PATCH UPDATED " Tejun Heo
@ 2011-05-17 19:34       ` Jens Axboe
  2011-05-17 20:22         ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2011-05-17 19:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

On 2011-05-17 17:47, Tejun Heo wrote:
> disk_block_events() should guarantee that the event work is not in
> flight on return and once blocked it shouldn't issue further
> cancellations.
> 
> Because there was no synchronization between the first blocker doing
> cancel_delayed_work_sync() and the following blockers, the following
> blockers could finish before cancellation was complete, which broke
> both guarantees - event work could be in flight and cancellation could
> happen after return.
> 
> This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
> bug#34662.
> 
>   https://bugzilla.kernel.org/show_bug.cgi?id=34662
> 
> Fix it by adding an outer mutex which protects both block count
> manipulation and work cancellation.
> 
> -v2: Use outer mutex instead of bit waitqueue per Linus.

Thanks, much cleaner indeed. I've rebased for-linus. BTW, this is patch
3/3, not 2/3. Had me confused for a while, the numbering on the initial
series was off as well.

I'll let this simmer until tomorrow, then send out the pull request.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH UPDATED 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 19:34       ` Jens Axboe
@ 2011-05-17 20:22         ` Borislav Petkov
  0 siblings, 0 replies; 16+ messages in thread
From: Borislav Petkov @ 2011-05-17 20:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Linus Torvalds, Sitsofe Wheeler, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

On Tue, May 17, 2011 at 09:34:58PM +0200, Jens Axboe wrote:
> BTW, this is patch 3/3, not 2/3. Had me confused for a while, the
> numbering on the initial series was off as well.

Same here, they don't apply cleanly for some reason on latest Linus
tree. I'm testing your #for-linus branch now, instead.

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 15:27           ` Tejun Heo
@ 2011-05-17 22:40             ` Linus Torvalds
  2011-05-18  5:07               ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2011-05-17 22:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

On Tue, May 17, 2011 at 8:27 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Spinlock inside mutex seemed a bit strange but yeah that probably is
> the simpliest way.

Do you really even need the spinlock at all?

Just make the semaphore protect the count - and you're done.

                Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-17 22:40             ` Linus Torvalds
@ 2011-05-18  5:07               ` Tejun Heo
  2011-05-18  9:46                 ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2011-05-18  5:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

On Tue, May 17, 2011 at 03:40:11PM -0700, Linus Torvalds wrote:
> On Tue, May 17, 2011 at 8:27 AM, Tejun Heo <tj@kernel.org> wrote:
> >
> > Spinlock inside mutex seemed a bit strange but yeah that probably is
> > the simpliest way.
> 
> Do you really even need the spinlock at all?

There were sites which called disk_unblock/check_events() with
bdevlock held, which was why it was made spinlock in the first place.
Hmmm... they're not there anymore.

> Just make the semaphore protect the count - and you're done.

Yeah, with that gone, we don't even need the open-coding inside
disk_check_events().  It can simply call syncing block and unblock.
But, do you want that in -rc7?  Unnecessarily complicated as the
current code may be, converting the lock to mutex is a larger change
than adding an outer mutex and I think it would be better to do that
during the next cycle.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-18  5:07               ` Tejun Heo
@ 2011-05-18  9:46                 ` Linus Torvalds
  2011-05-18 10:04                   ` Tejun Heo
  2011-05-18 10:26                   ` Jens Axboe
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2011-05-18  9:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

On Tue, May 17, 2011 at 10:07 PM, Tejun Heo <tj@kernel.org> wrote:
>
>> Just make the semaphore protect the count - and you're done.
>
> Yeah, with that gone, we don't even need the open-coding inside
> disk_check_events().  It can simply call syncing block and unblock.
> But, do you want that in -rc7?  Unnecessarily complicated as the
> current code may be, converting the lock to mutex is a larger change
> than adding an outer mutex and I think it would be better to do that
> during the next cycle.

Quite frankly. right now I think I need to just release 2.6.39, and
then for 2.6.40 merge the trivial

  mutex_lock(&ev->mutex);
  if (!ev->block++)
    cancel_delayed_work_sync(&ev->dwork);
  mutex_unlock(&ev->mutex);

with a cc: stable for backporting.

I'd _much_ prefer simple obvious code than have a outer mutex etc.
Just make the rule be that "blocked" is protected by the new
semaphore. I don't think it's used very often, and anybody who wants
to block disk events needs to be in blockable context in order to wait
for the delayed work cancel, right? So we can't be in some atomic
context inside some other spinlock anyway, afaik. And there can be no
lock order issues, since this would always be a new inner lock.

Hmm?

                          Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-18  9:46                 ` Linus Torvalds
@ 2011-05-18 10:04                   ` Tejun Heo
  2011-05-18 11:07                     ` Tejun Heo
  2011-05-18 10:26                   ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2011-05-18 10:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

Hello,

On Wed, May 18, 2011 at 02:46:30AM -0700, Linus Torvalds wrote:
> Quite frankly. right now I think I need to just release 2.6.39, and
> then for 2.6.40 merge the trivial
> 
>   mutex_lock(&ev->mutex);
>   if (!ev->block++)
>     cancel_delayed_work_sync(&ev->dwork);
>   mutex_unlock(&ev->mutex);
> 
> with a cc: stable for backporting.

This isn't super-critical and releasing without the fix but later
backporting via -stable definitely is an option.

> So we can't be in some atomic context inside some other spinlock
> anyway, afaik. And there can be no lock order issues, since this
> would always be a new inner lock.

Yeap, the problem was unblock/check being allowed to be called without
sleeping context, which isn't used anymore and was broken due to
cancellation race.  We can just enclose the whole thing inside per-ev
mutex and everything should be simple and fine.  I'll post patches
soon.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-18  9:46                 ` Linus Torvalds
  2011-05-18 10:04                   ` Tejun Heo
@ 2011-05-18 10:26                   ` Jens Axboe
  1 sibling, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2011-05-18 10:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

On 2011-05-18 11:46, Linus Torvalds wrote:
> On Tue, May 17, 2011 at 10:07 PM, Tejun Heo <tj@kernel.org> wrote:
>>
>>> Just make the semaphore protect the count - and you're done.
>>
>> Yeah, with that gone, we don't even need the open-coding inside
>> disk_check_events().  It can simply call syncing block and unblock.
>> But, do you want that in -rc7?  Unnecessarily complicated as the
>> current code may be, converting the lock to mutex is a larger change
>> than adding an outer mutex and I think it would be better to do that
>> during the next cycle.
> 
> Quite frankly. right now I think I need to just release 2.6.39, and
> then for 2.6.40 merge the trivial
> 
>   mutex_lock(&ev->mutex);
>   if (!ev->block++)
>     cancel_delayed_work_sync(&ev->dwork);
>   mutex_unlock(&ev->mutex);
> 
> with a cc: stable for backporting.

With these changes pushed to 2.6.40, I have the following sitting queued
up for 2.6.39, of which at least the performance regression and
blk-throttle fix really should go in. I'll let Tejun voice in on his
three changes. I would have pushed this a few days ago, but this thread
had it postponed a bit.

  git://git.kernel.dk/linux-2.6-block.git for-linus

Jens Axboe (1):
      scsi: remove performance regression due to async queue run

Shaohua Li (1):
      block: don't delay blk_run_queue_async

Tejun Heo (3):
      block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers
      cdrom: always check_disk_change() on open
      block: rescan partitions on invalidated devices on -ENOMEDIA too

Vivek Goyal (1):
      blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup

 block/blk-cgroup.c              |    7 +++++++
 block/blk-cgroup.h              |    3 +++
 block/blk-core.c                |    4 +++-
 block/blk-throttle.c            |    9 ++++-----
 block/cfq-iosched.c             |   11 +++++------
 drivers/block/DAC960.c          |    1 -
 drivers/block/amiflop.c         |    1 -
 drivers/block/ataflop.c         |    1 -
 drivers/block/floppy.c          |    1 -
 drivers/block/paride/pcd.c      |    1 -
 drivers/block/paride/pd.c       |    1 -
 drivers/block/paride/pf.c       |    1 -
 drivers/block/swim.c            |    1 -
 drivers/block/swim3.c           |    1 -
 drivers/block/ub.c              |    1 -
 drivers/block/xsysace.c         |    1 -
 drivers/cdrom/cdrom.c           |    6 +++---
 drivers/cdrom/gdrom.c           |    1 -
 drivers/cdrom/viocd.c           |    1 -
 drivers/message/i2o/i2o_block.c |    1 -
 drivers/s390/char/tape_block.c  |    1 -
 drivers/scsi/scsi_lib.c         |   20 ++++++++++++++++----
 drivers/scsi/scsi_scan.c        |    2 ++
 fs/block_dev.c                  |   27 ++++++++++++++++++---------
 include/scsi/scsi_device.h      |    1 +
 25 files changed, 62 insertions(+), 43 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation
  2011-05-18 10:04                   ` Tejun Heo
@ 2011-05-18 11:07                     ` Tejun Heo
  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2011-05-18 11:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Sitsofe Wheeler, Borislav Petkov, Meelis Roos,
	Andrew Morton, Kay Sievers, linux-kernel

On Wed, May 18, 2011 at 12:04:51PM +0200, Tejun Heo wrote:
> Yeap, the problem was unblock/check being allowed to be called without
> sleeping context, which isn't used anymore and was broken due to
> cancellation race.  We can just enclose the whole thing inside per-ev
> mutex and everything should be simple and fine.  I'll post patches
> soon.

Oops, spoke too soon.  Converting to mutex creates a circular
dependency.  disk_block_events() acquires ev->lock and waits for
ev->dwork to finish, but ev->dwork also needs to acquire ev->lock to
update event states.  So, no matter how we twist this, we need two
locks or a lock and a completion.  The mutex approach seems simple
enough, so I suggest leaving it like that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2011-05-18 11:07 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-17 10:27 [PATCH RESEND 1/3 v2.6.39-rc7] block: don't use non-syncing event blocking in disk_check_events() Tejun Heo
2011-05-17 10:28 ` [PATCH RESEND 2/3 v2.6.39-rc7] block: remove non-syncing __disk_block_events() and fold it into disk_block_events() Tejun Heo
2011-05-17 10:28   ` [PATCH RESEND 2/3 v2.6.39-rc7] block: make disk_block_events() properly wait for work cancellation Tejun Heo
2011-05-17 14:46     ` Linus Torvalds
2011-05-17 15:11       ` Tejun Heo
2011-05-17 15:15         ` Linus Torvalds
2011-05-17 15:27           ` Tejun Heo
2011-05-17 22:40             ` Linus Torvalds
2011-05-18  5:07               ` Tejun Heo
2011-05-18  9:46                 ` Linus Torvalds
2011-05-18 10:04                   ` Tejun Heo
2011-05-18 11:07                     ` Tejun Heo
2011-05-18 10:26                   ` Jens Axboe
2011-05-17 15:47     ` [PATCH UPDATED " Tejun Heo
2011-05-17 19:34       ` Jens Axboe
2011-05-17 20:22         ` Borislav Petkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.