* [PATCH RFC] md/raid1: fix deadlock between freeze_array() and wait_barrier().
@ 2016-06-26 19:18 Alexander Lyakas
  2016-07-07 23:41 ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Alexander Lyakas @ 2016-06-26 19:18 UTC (permalink / raw)
  To: NeilBrown, 马建朋, linux-raid; +Cc: Jes Sorensen

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the freeze_array call from
completing. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
have not yet been submitted to the lower level, because of the
generic_make_request logic that avoids recursive calls. In such a
case we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, thus freeze_array will
never complete
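
To illustrate the mechanism, here is a user-space toy model (not kernel
code; every name in it is invented for this sketch). generic_make_request
keeps a per-task FIFO of bios, so a bio that raid1 resubmits from inside
its make_request function is only queued, and gets dispatched only after
that function returns. If the function blocked in wait_barrier before
returning, the queued internal READ would never run:

#include <stdio.h>
#include <stddef.h>

struct toy_bio {
    const char *name;
    void (*handler)(struct toy_bio *bio);
    struct toy_bio *next;
};

static struct toy_bio *queue_head, *queue_tail; /* models current->bio_list */
static int queue_active;

/* models generic_make_request() */
static void submit_toy_bio(struct toy_bio *bio)
{
    if (queue_active) {
        /* recursive call from a handler: append at the TAIL and return */
        bio->next = NULL;
        if (queue_tail)
            queue_tail->next = bio;
        else
            queue_head = bio;
        queue_tail = bio;
        return;
    }
    queue_active = 1;
    do {
        /* top-level call: drain the per-task queue in FIFO order */
        printf("dispatching: %s\n", bio->name);
        bio->handler(bio);
        bio = queue_head;
        if (bio && !(queue_head = bio->next))
            queue_tail = NULL;
    } while (bio);
    queue_active = 0;
}

/* leaf device: nothing more to remap */
static void member_disk(struct toy_bio *bio) { (void)bio; }

static struct toy_bio internal_read =
    { "internal READ (raid1 -> member disk)", member_disk, NULL };

static void raid1_make_request(struct toy_bio *bio)
{
    (void)bio;
    /* The remapped child is only QUEUED here; it cannot run until this
     * handler returns. If this handler slept in wait_barrier() because
     * array_frozen==1, the internal READ would stay queued forever. */
    submit_toy_bio(&internal_read);
}

int main(void)
{
    struct toy_bio original =
        { "original READ (fs -> raid1)", raid1_make_request, NULL };

    submit_toy_bio(&original);
    return 0;
}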

This problem was originally fixed in commit:
d6b42dc md/raid1,raid10: avoid deadlock during resync/recovery.

But then it was broken in commit:
b364e3d raid1: Add a field array_frozen to indicate whether raid in
freeze state.

Notes:

- the problem cannot happen for WRITEs, because when a WRITE is
submitted to MD, there are two cases:
    - calling thread has a plug: in that case, when this thread goes to
sleep in wait_barrier (or anywhere else), all the plugged IOs will be
flushed via:
schedule()=>sched_submit_work()=>blk_schedule_flush_plug()
    - calling thread is not plugged: in that case internal WRITEs are
handed off to raid1d, which doesn't have current->bio_list issues

- current->bio_list may contain an internal READ bio from a different
MD! This can happen, for example, when two or more MDs are under
dm-stripe.

- the commit mentioned above that caused the issue also made
freeze_array() non-reentrant, because array_frozen can only be 0 or 1.
Thus if two threads call freeze_array() in parallel (and that can
happen), the first thread that calls unfreeze_array() sets
array_frozen=0, while the second thread still thinks the array is
frozen. This patch does not address that problem, it only adds a
warning; a trivial illustration follows below.
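
For reference, a user-space sketch of that accounting problem (the
thread names and the interleaving are invented for illustration, and
the "parallel" calls are shown as plain sequential calls):

#include <stdio.h>

static int array_frozen; /* models conf->array_frozen, only 0 or 1 */

static void freeze_array(const char *who)
{
    array_frozen = 1; /* a second caller silently "succeeds" as well */
    printf("%s: froze the array (array_frozen=%d)\n", who, array_frozen);
}

static void unfreeze_array(const char *who)
{
    array_frozen = 0;
    printf("%s: unfroze the array (array_frozen=%d)\n", who, array_frozen);
}

int main(void)
{
    freeze_array("thread A");   /* e.g. handle_read_error() */
    freeze_array("thread B");   /* e.g. raid1_reshape(), in parallel */

    unfreeze_array("thread A"); /* A is done... */

    /* ...but B still believes the array is frozen, although regular IO
     * is already allowed again: */
    printf("thread B continues, array_frozen=%d (B expected 1)\n",
           array_frozen);
    return 0;
}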

This patch demonstrates a rudimentary fix, which fixes the issue for me.
The patch is based on kernel 3.18.
---
 drivers/md/raid1.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/raid1.h |   1 +
 2 files changed, 127 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3c8ada4..92541b4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -59,20 +59,27 @@
 #define IO_MADE_GOOD ((struct bio *)2)

 #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)

 /* When there are this many requests queue to be written by
  * the raid1 thread, we become 'congested' to provide back-pressure
  * for writeback.
  */
 static int max_queued_requests = 1024;

+/* see wait_barrier_rescue */
+struct workqueue_struct *g_freeze_array_rescue_wq;
+struct work_struct g_freeze_array_rescue_work;
+struct bio_list g_bios_to_rescue;
+spinlock_t g_rescue_lock;
+
+
 static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
               sector_t bi_sector);
 static void lower_barrier(struct r1conf *conf);

 static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
     struct pool_info *pi = data;
     int size = offsetof(struct r1bio, bios[pi->raid_disks]);

     /* allocate a r1bio with room for raid_disks entries in the bios array */
@@ -883,26 +890,112 @@ static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
             (conf->next_resync + NEXT_NORMALIO_DISTANCE
              <= bio->bi_iter.bi_sector))
             wait = false;
         else
             wait = true;
     }

     return wait;
 }

+/*
+ * When we call wait_barrier, we might have some bios waiting
+ * in current->bio_list, which prevents the freeze_array call from
+ * completing. Those can only be internal READs, which have already
+ * passed the wait_barrier call (thus incrementing nr_pending), but
+ * have not yet been submitted to the lower level, because of the
+ * generic_make_request logic that avoids recursive calls. In such a
+ * case we have a deadlock:
+ * - array_frozen is already set to 1, so wait_barrier unconditionally
+ *   waits, so
+ * - internal READ bios will not be submitted, thus freeze_array will
+ *   never complete
+ *
+ * This problem was originally fixed in commit:
+ * d6b42dc md/raid1,raid10: avoid deadlock during resync/recovery.
+ *
+ * But then was broken in commit:
+ * b364e3d raid1: Add a field array_frozen to indicate whether raid in
+ * freeze state.
+ *
+ * Notes:
+ * - the problem cannot happen for WRITEs, because when a WRITE is
+ *   submitted to MD, there are two cases:
+ *   - calling thread has a plug: in that case, when this thread goes
+ *     to sleep in wait_barrier (or anywhere else), all the plugged IOs
+ *     will be flushed via:
+ *     schedule()=>sched_submit_work()=>blk_schedule_flush_plug()
+ *   - calling thread is not plugged: in that case internal WRITEs are
+ *     handed off to raid1d, which doesn't have current->bio_list issues
+ * - note that current->bio_list may contain an internal READ bio from
+ *   a different MD! This can happen, for example, when two or more MDs
+ *   are under dm-stripe.
+ */
+static void wait_barrier_rescue(struct r1conf *conf)
+{
+    struct bio_list bios_to_rescue;
+    struct bio *bio = NULL;
+
+    assert_spin_locked(&conf->resync_lock);
+
+    if (current->bio_list == NULL)
+        return;
+    if (!(conf->array_frozen && !conf->array_frozen_done))
+        return;
+
+    bio_list_init(&bios_to_rescue);
+
+    /* grab the whole current->bio_list locally */
+    bio = bio_list_get(current->bio_list);
+
+    /* iterate over the local list */
+    while (bio) {
+        struct bio *next = bio->bi_next;
+        bio->bi_next = NULL;
+
+        if (bio->bi_end_io == raid1_end_read_request)
+            /* this is a problematic bio, set it aside */
+            bio_list_add(&bios_to_rescue, bio);
+        else
+            /* return this bio back to current->bio_list */
+            bio_list_add(current->bio_list, bio);
+
+        bio = next;
+    }
+
+    /* do we have any bios to rescue? */
+    if (bios_to_rescue.head != NULL) {
+        spin_lock(&g_rescue_lock);
+        bio_list_merge(&g_bios_to_rescue, &bios_to_rescue);
+        spin_unlock(&g_rescue_lock);
+        queue_work(g_freeze_array_rescue_wq, &g_freeze_array_rescue_work);
+    }
+}
+
+static void raid1_rescue_work(struct work_struct *work)
+{
+    struct bio *bio = NULL;
+
+    spin_lock(&g_rescue_lock);
+    bio = bio_list_get(&g_bios_to_rescue);
+    spin_unlock(&g_rescue_lock);
+
+    while (bio) {
+        struct bio *next = bio->bi_next;
+
+        bio->bi_next = NULL;
+
+        generic_make_request(bio);
+
+        bio = next;
+    }
+}
+
 static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 {
     sector_t sector = 0;

     spin_lock_irq(&conf->resync_lock);
     if (need_to_wait_for_sync(conf, bio)) {
+        wait_barrier_rescue(conf);
         conf->nr_waiting++;
         /* Wait for the barrier to drop.
          * However if there are already pending
          * requests (preventing the barrier from
          * rising completely), and the
          * per-process bio queue isn't empty,
          * then don't wait, as we need to empty
          * that queue to allow conf->start_next_window
          * to increase.
          */
@@ -978,32 +1071,35 @@ static void freeze_array(struct r1conf *conf, int extra)
      * We wait until nr_pending match nr_queued+extra
      * This is called in the context of one normal IO request
      * that has failed. Thus any sync request that might be pending
      * will be blocked by nr_pending, and we need to wait for
      * pending IO requests to complete or be queued for re-try.
      * Thus the number queued (nr_queued) plus this request (extra)
      * must match the number of pending IOs (nr_pending) before
      * we continue.
      */
     spin_lock_irq(&conf->resync_lock);
+    WARN(conf->array_frozen, "%s-R1 re-entering freeze_array!!!",
+         mdname(conf->mddev));
     conf->array_frozen = 1;
     wait_event_lock_irq_cmd(conf->wait_barrier,
                 conf->nr_pending == conf->nr_queued+extra,
                 conf->resync_lock,
                 flush_pending_writes(conf));
+    conf->array_frozen_done = 1;
     spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(struct r1conf *conf)
 {
     /* reverse the effect of the freeze */
     spin_lock_irq(&conf->resync_lock);
     conf->array_frozen = 0;
+    conf->array_frozen_done = 0;
     wake_up(&conf->wait_barrier);
     spin_unlock_irq(&conf->resync_lock);
 }

 /* duplicate the data pages for behind I/O
  */
 static void alloc_behind_pages(struct bio *bio, struct r1bio *r1_bio)
 {
     int i;
     struct bio_vec *bvec;
@@ -3162,23 +3258,25 @@ static void *raid1_takeover(struct mddev *mddev)
 {
     /* raid1 can take over:
      *  raid5 with 2 devices, any layout or chunk size
      */
     if (mddev->level == 5 && mddev->raid_disks == 2) {
         struct r1conf *conf;
         mddev->new_level = 1;
         mddev->new_layout = 0;
         mddev->new_chunk_sectors = 0;
         conf = setup_conf(mddev);
-        if (!IS_ERR(conf))
+        if (!IS_ERR(conf)) {
             /* Array must appear to be quiesced */
             conf->array_frozen = 1;
+            conf->array_frozen_done = 1;
+        }
         return conf;
     }
     return ERR_PTR(-EINVAL);
 }

 static struct md_personality raid1_personality =
 {
     .name        = "raid1",
     .level        = 1,
     .owner        = THIS_MODULE,
@@ -3193,25 +3291,51 @@ static struct md_personality raid1_personality =
     .sync_request    = sync_request,
     .resize        = raid1_resize,
     .size        = raid1_size,
     .check_reshape    = raid1_reshape,
     .quiesce    = raid1_quiesce,
     .takeover    = raid1_takeover,
 };

 static int __init raid_init(void)
 {
-    return register_md_personality(&raid1_personality);
+    int ret = 0;
+
+    bio_list_init(&g_bios_to_rescue);
+    spin_lock_init(&g_rescue_lock);
+
+    INIT_WORK(&g_freeze_array_rescue_work, raid1_rescue_work);
+
+    /* we have only one work item, so we need only one active work at a time */
+    g_freeze_array_rescue_wq = alloc_workqueue("raid1_rescue_wq",
+            WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS,
+            1/*max_active*/);
+    if (g_freeze_array_rescue_wq == NULL) {
+        pr_err("alloc_workqueue(raid1_rescue_wq) failed\n");
+        ret = -ENOMEM;
+        goto out;
+    }
+
+    ret = register_md_personality(&raid1_personality);
+    if (ret != 0)
+        goto free_wq;
+
+    return 0;
+
+free_wq:
+    destroy_workqueue(g_freeze_array_rescue_wq);
+out:
+    return ret;
 }

 static void raid_exit(void)
 {
+    flush_workqueue(g_freeze_array_rescue_wq);
+    destroy_workqueue(g_freeze_array_rescue_wq);
     unregister_md_personality(&raid1_personality);
 }

 module_init(raid_init);
 module_exit(raid_exit);
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("RAID1 (mirroring) personality for MD");
 MODULE_ALIAS("md-personality-3"); /* RAID1 */
 MODULE_ALIAS("md-raid1");
 MODULE_ALIAS("md-level-1");
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 33bda55..a7d0bd4 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -72,20 +72,21 @@ struct r1conf {
      * is no other IO.  So when either is active, the other has to wait.
      * See more details description in raid1.c near raise_barrier().
      */
     wait_queue_head_t    wait_barrier;
     spinlock_t        resync_lock;
     int            nr_pending;
     int            nr_waiting;
     int            nr_queued;
     int            barrier;
     int            array_frozen;
+    int            array_frozen_done;

     /* Set to 1 if a full sync is needed, (fresh device added).
      * Cleared when a sync completes.
      */
     int            fullsync;

     /* When the same as mddev->recovery_disabled we don't allow
      * recovery to be attempted as we expect a read error.
      */
     int            recovery_disabled;
-- 
1.9.1

* Re: [PATCH RFC] md/raid1: fix deadlock between freeze_array() and wait_barrier().
  2016-06-26 19:18 [PATCH RFC] md/raid1: fix deadlock between freeze_array() and wait_barrier() Alexander Lyakas
@ 2016-07-07 23:41 ` NeilBrown
  2016-07-12 10:09   ` Alexander Lyakas
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2016-07-07 23:41 UTC (permalink / raw)
  To: Alexander Lyakas, 马建朋, linux-raid; +Cc: Jes Sorensen

On Mon, Jun 27 2016, Alexander Lyakas wrote:

> When we call wait_barrier, we might have some bios waiting
> in current->bio_list, which prevents the array_freeze call to
> complete. Those can only be internal READs, which have already
> passed the wait_barrier call (thus incrementing nr_pending), but
> still were not submitted to the lower level, due to generic_make_request
> logic to avoid recursive calls. In such case, we have a deadlock:
> - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
> - internal READ bios will not be submitted, thus freeze_array will
> never completes
>
> This problem was originally fixed in commit:
> d6b42dc md/raid1,raid10: avoid deadlock during resync/recovery.
>
> But then it was broken in commit:
> b364e3d raid1: Add a field array_frozen to indicate whether raid in
> freeze state.

Thanks for the great analysis.
I think this is primarily a problem in generic_make_request().  It queues
requests in the *wrong* order.

Please try the patch from
  https://lkml.org/lkml/2016/7/7/428

and see if it helps.  If two requests for a raid1 are in the
generic_make_request queue, this patch causes the sub-requests created
by the first to be handled before the second is attempted.
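
For context, the relevant part of generic_make_request() looks roughly
like this (paraphrased from memory of the 3.18-era code, so details may
differ slightly); the bio_list_add() in the recursive branch is the
tail-append that produces that ordering:

void generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack;

	if (!generic_make_request_checks(bio))
		return;

	if (current->bio_list) {
		/* called from within a make_request_fn: don't recurse,
		 * just queue the bio at the TAIL of the per-task list */
		bio_list_add(current->bio_list, bio);
		return;
	}

	/* top-level caller: drain the per-task list in FIFO order */
	bio_list_init(&bio_list_on_stack);
	current->bio_list = &bio_list_on_stack;
	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		q->make_request_fn(q, bio);

		bio = bio_list_pop(current->bio_list);
	} while (bio);
	current->bio_list = NULL;
}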

Thanks,
NeilBrown

* Re: [PATCH RFC] md/raid1: fix deadlock between freeze_array() and wait_barrier().
  2016-07-07 23:41 ` NeilBrown
@ 2016-07-12 10:09   ` Alexander Lyakas
  2016-07-12 22:14     ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Alexander Lyakas @ 2016-07-12 10:09 UTC (permalink / raw)
  To: NeilBrown; +Cc: 马建朋, linux-raid, Jes Sorensen

Hello Neil,

Thank you for your response. I read an email about you retiring from
MD/mdadm maintenance and delegating mdadm maintenance to Jes Sorensen.
But I was wondering who will be responsible for MD maintenance, and
was about to send an email asking that.

On Fri, Jul 8, 2016 at 2:41 AM, NeilBrown <neilb@suse.com> wrote:
> On Mon, Jun 27 2016, Alexander Lyakas wrote:
>
>> When we call wait_barrier, we might have some bios waiting
>> in current->bio_list, which prevents the array_freeze call to
>> complete. Those can only be internal READs, which have already
>> passed the wait_barrier call (thus incrementing nr_pending), but
>> still were not submitted to the lower level, due to generic_make_request
>> logic to avoid recursive calls. In such case, we have a deadlock:
>> - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
>> - internal READ bios will not be submitted, thus freeze_array will
>> never completes
>>
>> This problem was originally fixed in commit:
>> d6b42dc md/raid1,raid10: avoid deadlock during resync/recovery.
>>
>> But then it was broken in commit:
>> b364e3d raid1: Add a field array_frozen to indicate whether raid in
>> freeze state.
>
> Thanks for the great analysis.
> I think this primarily a problem in generic_make_request().  It queues
> requests in the *wrong* order.
>
> Please try the patch from
>   https://lkml.org/lkml/2016/7/7/428
>
> and see if it helps.  If two requests for a raid1 are in the
> generic_make_request queue, this patch causes the sub-requests created
> by the first to be handled before the second is attempted.
I have read this discussion and more or less (probably less than more)
understood that the second patch by Lars is supposed to address our
issue. However, we cannot easily apply that patch:
- The patch is based on structures added by an earlier patch, "[RFC]
block: fix blk_queue_split() resource exhaustion".
- Neither patch is in the mainline tree yet.
- Both patches are in block core, which requires recompiling the whole kernel.
- Not sure whether these patches are applicable to our production kernel
3.18 (long-term).

I am sure you understand that for production with our current kernel
3.18 (long term) we cannot go with these two patches.

Since this issue is a real deadlock we are hitting in a long-term 3.18
kernel, is there any chance of a cc: stable fix? Currently we have applied
the rudimentary fix I posted. It basically hands the problematic RAID1
READs off to a different context and runs them from there. With
this fix we don't see the deadlock anymore.

Also, can you please comment on another concern I expressed:
freeze_array() is now not reentrant: if two threads call
it in parallel (and it could happen for the same MD), the first thread
calling unfreeze_array() will mess things up for the second thread.

Thank you for your help,
Alex.


>
> Thanks,
> NeilBrown

* Re: [PATCH RFC] md/raid1: fix deadlock between freeze_array() and wait_barrier().
  2016-07-12 10:09   ` Alexander Lyakas
@ 2016-07-12 22:14     ` NeilBrown
  2016-07-14 13:35       ` Alexander Lyakas
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2016-07-12 22:14 UTC (permalink / raw)
  To: Alexander Lyakas
  Cc: 马建朋, linux-raid, Jes Sorensen, Shaohua Li

On Tue, Jul 12 2016, Alexander Lyakas wrote:

> Hello Neil,
>
> Thank you for your response. I read an email about you retiring from
> MD/mdadm maintenance and delegating mdadm maintenance to Jes Sorensen.
> But I was wondering who will be responsible for MD maintenance, and
> was about to send an email asking that.

Yes, I no longer have maintainership responsibilities, though I'm still
involved to some extent.  Jes Sorensen is looking after mdadm and
Shaohua Li is looking after the kernel driver (as listed in MAINTAINERS).


>
> On Fri, Jul 8, 2016 at 2:41 AM, NeilBrown <neilb@suse.com> wrote:
>> On Mon, Jun 27 2016, Alexander Lyakas wrote:
>>
>>> When we call wait_barrier, we might have some bios waiting
>>> in current->bio_list, which prevents the array_freeze call to
>>> complete. Those can only be internal READs, which have already
>>> passed the wait_barrier call (thus incrementing nr_pending), but
>>> still were not submitted to the lower level, due to generic_make_request
>>> logic to avoid recursive calls. In such case, we have a deadlock:
>>> - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
>>> - internal READ bios will not be submitted, thus freeze_array will
>>> never completes
>>>
>>> This problem was originally fixed in commit:
>>> d6b42dc md/raid1,raid10: avoid deadlock during resync/recovery.
>>>
>>> But then it was broken in commit:
>>> b364e3d raid1: Add a field array_frozen to indicate whether raid in
>>> freeze state.
>>
>> Thanks for the great analysis.
>> I think this primarily a problem in generic_make_request().  It queues
>> requests in the *wrong* order.
>>
>> Please try the patch from
>>   https://lkml.org/lkml/2016/7/7/428
>>
>> and see if it helps.  If two requests for a raid1 are in the
>> generic_make_request queue, this patch causes the sub-requests created
>> by the first to be handled before the second is attempted.
> I have read this discussion and more or less (probably less than more)
> understood that the second patch by Lars is supposed to address our
> issue. However, we cannot easily apply that patch:
> - The patch is based on structures added by earlier patch "[RFC]
> block: fix blk_queue_split() resource exhaustion".
> - Both patches are not in the mainline tree yet.
> - Both patches are in block core, which requires to recompile the whole kernel.
> - Not sure if these patches are applicable for our production kernel
> 3.18 (long term)
>
> I am sure you understand that for production with our current kernel
> 3.18 (long term) we cannot go with these two patches.

This patch takes the basic concept of those two and applies it just to
raid1 and raid10.  I think it should be sufficient.  Can you test?  The
patch is against 3.18

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 40b35be34f8d..99208aa2c1c8 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1229,7 +1229,7 @@ read_again:
 				sectors_handled;
 			goto read_again;
 		} else
-			generic_make_request(read_bio);
+			bio_list_add_head(current->bio_list, read_bio);
 		return;
 	}
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 32e282f4c83c..c528102b80b6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1288,7 +1288,7 @@ read_again:
 				sectors_handled;
 			goto read_again;
 		} else
-			generic_make_request(read_bio);
+			bio_list_add_head(current->bio_list, read_bio);
 		return;
 	}
 

>
> Since this issue is a real deadlock we are hitting in a long-term 3.18
> kernel, is there any chance for cc-stable fix? Currently we applied
> the rudimentary fix  I posted. It basically switches context for
> problematic RAID1 READs, and runs them from a different context. With
> this fix we don't see the deadlock anymore.
>
> Also, can you please comment on another concern I expressed:
> freeze_array() is now not reentrant. Meaning that if two threads call
> it in parallel (and it could happen for the same MD), the first thread
> calling unfreeze_array will mess up things for the second thread.

Yes, that is a regression.  This should be enough to fix it.  Do you
agree?

Thanks,
NeilBrown


diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 40b35be34f8d..5ad25c7d7453 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -984,7 +984,7 @@ static void freeze_array(struct r1conf *conf, int extra)
 	 * we continue.
 	 */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 1;
+	conf->array_frozen += 1;
 	wait_event_lock_irq_cmd(conf->wait_barrier,
 				conf->nr_pending == conf->nr_queued+extra,
 				conf->resync_lock,
@@ -995,7 +995,7 @@ static void unfreeze_array(struct r1conf *conf)
 {
 	/* reverse the effect of the freeze */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 0;
+	conf->array_frozen -= 1;
 	wake_up(&conf->wait_barrier);
 	spin_unlock_irq(&conf->resync_lock);
 }

* Re: [PATCH RFC] md/raid1: fix deadlock between freeze_array() and wait_barrier().
  2016-07-12 22:14     ` NeilBrown
@ 2016-07-14 13:35       ` Alexander Lyakas
  2016-07-14 23:18         ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Alexander Lyakas @ 2016-07-14 13:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: 马建朋, linux-raid, Jes Sorensen, Shaohua Li

Hello Neil, Shaohua,


On Wed, Jul 13, 2016 at 1:14 AM, NeilBrown <neilb@suse.com> wrote:
> On Tue, Jul 12 2016, Alexander Lyakas wrote:
>
>> Hello Neil,
>>
>> Thank you for your response. I read an email about you retiring from
>> MD/mdadm maintenance and delegating mdadm maintenance to Jes Sorensen.
>> But I was wondering who will be responsible for MD maintenance, and
>> was about to send an email asking that.
>
> Yes, I no longer have maintainership responsibilities, though I'm still
> involved to some extent.  Jes Sorensen is looking after mdadm and
> Shaohua Li is looking after the kernel driver (as listed in MAINTAINERS).
>
>
>>
>> On Fri, Jul 8, 2016 at 2:41 AM, NeilBrown <neilb@suse.com> wrote:
>>> On Mon, Jun 27 2016, Alexander Lyakas wrote:
>>>
>>>> When we call wait_barrier, we might have some bios waiting
>>>> in current->bio_list, which prevents the array_freeze call to
>>>> complete. Those can only be internal READs, which have already
>>>> passed the wait_barrier call (thus incrementing nr_pending), but
>>>> still were not submitted to the lower level, due to generic_make_request
>>>> logic to avoid recursive calls. In such case, we have a deadlock:
>>>> - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
>>>> - internal READ bios will not be submitted, thus freeze_array will
>>>> never completes
>>>>
>>>> This problem was originally fixed in commit:
>>>> d6b42dc md/raid1,raid10: avoid deadlock during resync/recovery.
>>>>
>>>> But then it was broken in commit:
>>>> b364e3d raid1: Add a field array_frozen to indicate whether raid in
>>>> freeze state.
>>>
>>> Thanks for the great analysis.
>>> I think this primarily a problem in generic_make_request().  It queues
>>> requests in the *wrong* order.
>>>
>>> Please try the patch from
>>>   https://lkml.org/lkml/2016/7/7/428
>>>
>>> and see if it helps.  If two requests for a raid1 are in the
>>> generic_make_request queue, this patch causes the sub-requests created
>>> by the first to be handled before the second is attempted.
>> I have read this discussion and more or less (probably less than more)
>> understood that the second patch by Lars is supposed to address our
>> issue. However, we cannot easily apply that patch:
>> - The patch is based on structures added by earlier patch "[RFC]
>> block: fix blk_queue_split() resource exhaustion".
>> - Both patches are not in the mainline tree yet.
>> - Both patches are in block core, which requires to recompile the whole kernel.
>> - Not sure if these patches are applicable for our production kernel
>> 3.18 (long term)
>>
>> I am sure you understand that for production with our current kernel
>> 3.18 (long term) we cannot go with these two patches.
>
> This patch takes the basic concept of those two and applies it just to
> raid1 and raid10.  I think it should be sufficient.  Can you test?  The
> patch is against 3.18
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 40b35be34f8d..99208aa2c1c8 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1229,7 +1229,7 @@ read_again:
>                                 sectors_handled;
>                         goto read_again;
>                 } else
> -                       generic_make_request(read_bio);
> +                       bio_list_add_head(current->bio_list, read_bio);
>                 return;
>         }
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 32e282f4c83c..c528102b80b6 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1288,7 +1288,7 @@ read_again:
>                                 sectors_handled;
>                         goto read_again;
>                 } else
> -                       generic_make_request(read_bio);
> +                       bio_list_add_head(&current->bio_list, read_bio);
>                 return;
>         }

Unfortunately, this patch doesn't work. It is super elegant, and seems
like it really should work. But the problem is that the "rdev", to
which we want to send the READ bio, might also be a remapping device
(dm-linear, for example). This device will create its own remapped-bio
and will call generic_make_request(), which will stick the bio onto
current->bio_list TAIL:(:(:( So we are back at square one. This patch
would work if *all* the remapping drivers in the stack were doing
bio_list_add_head() instead of generic_make_request()  :(:(:(

It seems the real fix should be for generic_make_request() to use
bio_list_add_head(), but as pointed out in
http://www.spinics.net/lists/raid/msg52756.html, there are some
concerns about changing the order of remapped bios.


>
>
>>
>> Since this issue is a real deadlock we are hitting in a long-term 3.18
>> kernel, is there any chance for cc-stable fix? Currently we applied
>> the rudimentary fix  I posted. It basically switches context for
>> problematic RAID1 READs, and runs them from a different context. With
>> this fix we don't see the deadlock anymore.
>>
>> Also, can you please comment on another concern I expressed:
>> freeze_array() is now not reentrant. Meaning that if two threads call
>> it in parallel (and it could happen for the same MD), the first thread
>> calling unfreeze_array will mess up things for the second thread.
>
> Yes, that is a regression.  This should be enough to fix it.  Do you
> agree?
>
> Thanks,
> NeilBrown
>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 40b35be34f8d..5ad25c7d7453 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -984,7 +984,7 @@ static void freeze_array(struct r1conf *conf, int extra)
>          * we continue.
>          */
>         spin_lock_irq(&conf->resync_lock);
> -       conf->array_frozen = 1;
> +       conf->array_frozen += 1;
>         wait_event_lock_irq_cmd(conf->wait_barrier,
>                                 conf->nr_pending == conf->nr_queued+extra,
>                                 conf->resync_lock,
> @@ -995,7 +995,7 @@ static void unfreeze_array(struct r1conf *conf)
>  {
>         /* reverse the effect of the freeze */
>         spin_lock_irq(&conf->resync_lock);
> -       conf->array_frozen = 0;
> +       conf->array_frozen -= 1;
>         wake_up(&conf->wait_barrier);
>         spin_unlock_irq(&conf->resync_lock);
>  }
I partially agree. The fix that you suggest makes proper accounting of
whether the array is considered frozen or not.
But the problem is that even with this fix, both threads will think
that they "own" the array. And both will do things, like writing data
or so, which might interfere. The proper fix would ensure that only
one thread "owns" the array, and the second thread waits until the
first calls unfreeze_array(), and then the second thread becomes
"owner" of the array. And of course while there are threads that want
to "freeze" the array, other threads that want to do raise_barrier
etc, should wait.

I am sorry for giving two negative responses in one email:)

Thanks,
Alex.

* Re: [PATCH RFC] md/raid1: fix deadlock between freeze_array() and wait_barrier().
  2016-07-14 13:35       ` Alexander Lyakas
@ 2016-07-14 23:18         ` NeilBrown
  2016-07-19  9:20           ` Alexander Lyakas
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2016-07-14 23:18 UTC (permalink / raw)
  To: Alexander Lyakas
  Cc: 马建朋, linux-raid, Jes Sorensen, Shaohua Li

On Thu, Jul 14 2016, Alexander Lyakas wrote:

>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 32e282f4c83c..c528102b80b6 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1288,7 +1288,7 @@ read_again:
>>                                 sectors_handled;
>>                         goto read_again;
>>                 } else
>> -                       generic_make_request(read_bio);
>> +                       bio_list_add_head(&current->bio_list, read_bio);
>>                 return;
>>         }
>
> Unfortunately, this patch doesn't work. It is super elegant, and seems
> like it really should work. But the problem is that the "rdev", to
> which we want to send the READ bio, might also be a remapping device
> (dm-linear, for example). This device will create its own remapped-bio
> and will call generic_make_request(), which will stick the bio onto
> current->bio_list TAIL:(:(:( So we are back at square one. This patch
> would work if *all* the remapping drivers in the stack were doing
> bio_list_add_head() instead of generic_make_request()  :(:(:(
>
> It seems the real fix should be that generic_make_request() would use
> bio_list_add_head(), but as pointed in
> http://www.spinics.net/lists/raid/msg52756.html, there are some
> concerns about changing the order of remapped bios.
>

While those concerns are valid, they are about hypothetical performance
issues rather than observed deadlock issues.  So I wouldn't be too
worried about them.
However I think you said that you didn't want to touch core code at all
(maybe I misunderstood) so that wouldn't help you anyway.

One option would be to punt the read requests to raidXd:

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 40b35be34f8d..f795e27b2124 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1229,7 +1229,7 @@ read_again:
 				sectors_handled;
 			goto read_again;
 		} else
-			generic_make_request(read_bio);
+			reschedule_retry(r1_bio);
 		return;
 	}
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 32e282f4c83c..eec38443075b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1288,7 +1288,7 @@ read_again:
 				sectors_handled;
 			goto read_again;
 		} else
-			generic_make_request(read_bio);
+			reschedule_retry(r10_bio);
 		return;
 	}
 

That might hurt performance, you would need to measure.
The other approach would be to revert the patch that caused the problem.
e.g.
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 40b35be34f8d..062bb86e5fd8 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -884,7 +884,8 @@ static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
 			wait = false;
 		else
 			wait = true;
-	}
+	} else if (conf->barrier)
+		wait = true;
 
 	return wait;
 }



>
>>
>>
>>>
>>> Since this issue is a real deadlock we are hitting in a long-term 3.18
>>> kernel, is there any chance for cc-stable fix? Currently we applied
>>> the rudimentary fix  I posted. It basically switches context for
>>> problematic RAID1 READs, and runs them from a different context. With
>>> this fix we don't see the deadlock anymore.
>>>
>>> Also, can you please comment on another concern I expressed:
>>> freeze_array() is now not reentrant. Meaning that if two threads call
>>> it in parallel (and it could happen for the same MD), the first thread
>>> calling unfreeze_array will mess up things for the second thread.
>>
>> Yes, that is a regression.  This should be enough to fix it.  Do you
>> agree?
>>
>> Thanks,
>> NeilBrown
>>
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 40b35be34f8d..5ad25c7d7453 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -984,7 +984,7 @@ static void freeze_array(struct r1conf *conf, int extra)
>>          * we continue.
>>          */
>>         spin_lock_irq(&conf->resync_lock);
>> -       conf->array_frozen = 1;
>> +       conf->array_frozen += 1;
>>         wait_event_lock_irq_cmd(conf->wait_barrier,
>>                                 conf->nr_pending == conf->nr_queued+extra,
>>                                 conf->resync_lock,
>> @@ -995,7 +995,7 @@ static void unfreeze_array(struct r1conf *conf)
>>  {
>>         /* reverse the effect of the freeze */
>>         spin_lock_irq(&conf->resync_lock);
>> -       conf->array_frozen = 0;
>> +       conf->array_frozen -= 1;
>>         wake_up(&conf->wait_barrier);
>>         spin_unlock_irq(&conf->resync_lock);
>>  }
> I partially agree. The fix that you suggest makes proper accounting of
> whether the array is considered frozen or not.
> But the problem is that even with this fix, both threads will think
> that they "own" the array. And both will do things, like writing data
> or so, which might interfere. The proper fix would ensure that only
> one thread "owns" the array, and the second thread waits until the
> first calls unfreeze_array(), and then the second thread becomes
> "owner" of the array. And of course while there are threads that want
> to "freeze" the array, other threads that want to do raise_barrier
> etc, should wait.

Is that really a problem?
A call to "freeze_array()" doesn't mean "I want to own the array", but
rather "No regular IO should be happening now".

Most callers of freeze_array():
    raid1_add_disk(), raid1_remove_disk(), stop(), raid1_reshape()
are called with the "mddev_lock()" mutex held, so they cannot interfere
with each other.

handle_read_error() is called with one pending request, which will block
any call on freeze_array(mddev, 0); - handle_read_error() itself calls
freeze_array(mddev, 1) so it gets access.
So these are already locked against the first four (as long as
->array_frozen doesn't get corrupted).

That just leaves raid1_quiesce().
That is mostly called under mddev_lock() so it won't interfere with the
others.
The one exception is where md_do_sync() calls
			mddev->pers->quiesce(mddev, 1);
			mddev->pers->quiesce(mddev, 0);

As this doesn't "claim" the array, but just needs to ensure all pending
IO completes, I don't think there is a problem.

So it seems to me that your concerns are not actually a problem.
Did I miss something?

>
> I am sorry for giving two negative responses in one email:)

Better a negative response than no response :-)

NeilBrown

* Re: [PATCH RFC] md/raid1: fix deadlock between freeze_array() and wait_barrier().
  2016-07-14 23:18         ` NeilBrown
@ 2016-07-19  9:20           ` Alexander Lyakas
  2016-07-19 22:19             ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Alexander Lyakas @ 2016-07-19  9:20 UTC (permalink / raw)
  To: NeilBrown; +Cc: 马建朋, linux-raid, Jes Sorensen, Shaohua Li

Hello Neil,
Thank you for your response.

On Fri, Jul 15, 2016 at 2:18 AM, NeilBrown <neilb@suse.com> wrote:
> On Thu, Jul 14 2016, Alexander Lyakas wrote:
>
>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>> index 32e282f4c83c..c528102b80b6 100644
>>> --- a/drivers/md/raid10.c
>>> +++ b/drivers/md/raid10.c
>>> @@ -1288,7 +1288,7 @@ read_again:
>>>                                 sectors_handled;
>>>                         goto read_again;
>>>                 } else
>>> -                       generic_make_request(read_bio);
>>> +                       bio_list_add_head(&current->bio_list, read_bio);
>>>                 return;
>>>         }
>>
>> Unfortunately, this patch doesn't work. It is super elegant, and seems
>> like it really should work. But the problem is that the "rdev", to
>> which we want to send the READ bio, might also be a remapping device
>> (dm-linear, for example). This device will create its own remapped-bio
>> and will call generic_make_request(), which will stick the bio onto
>> current->bio_list TAIL:(:(:( So we are back at square one. This patch
>> would work if *all* the remapping drivers in the stack were doing
>> bio_list_add_head() instead of generic_make_request()  :(:(:(
>>
>> It seems the real fix should be that generic_make_request() would use
>> bio_list_add_head(), but as pointed in
>> http://www.spinics.net/lists/raid/msg52756.html, there are some
>> concerns about changing the order of remapped bios.
>>
>
> While those concerns are valid, they are about hypothetical performance
> issues rather than observed deadlock issues.  So I wouldn't be too
> worried about them.
I am thinking of a hypothetical driver that splits, say, a 12KB WRITE
into 3x4KB WRITEs, and submits them in the proper order, hoping they
will get to the disk in the same order, and the disk will work
sequentially. But now we are deliberately hindering this. But I see
that people much smarter than me are in this discussion, so I will
leave it to them:)

> However I think you said that you didn't want to touch core code at all
> (maybe I misunderstood) so that wouldn't help you anyway.
Yes, this is correct. Recompiling the kernel is a bit of a pain for
us. We were smart enough to configure md_mod as a loadable module,
so at least now I can patch the MD code easily :)

>
> One option would be to punt the request requests to raidXd:
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 40b35be34f8d..f795e27b2124 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1229,7 +1229,7 @@ read_again:
>                                 sectors_handled;
>                         goto read_again;
>                 } else
> -                       generic_make_request(read_bio);
> +                       reschedule_retry(r1_bio);
>                 return;
>         }
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 32e282f4c83c..eec38443075b 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1288,7 +1288,7 @@ read_again:
>                                 sectors_handled;
>                         goto read_again;
>                 } else
> -                       generic_make_request(read_bio);
> +                       reschedule_retry(r10_bio);
>                 return;
>         }
This is more or less what my rudimentary patch is doing, except it is
doing it only when we really need to wait for the barrier.

>
>
> That might hurt performance, you would need to measure.
> The other approach would be to revert the patch that caused the problem.
> e.g.
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 40b35be34f8d..062bb86e5fd8 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -884,7 +884,8 @@ static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>                         wait = false;
>                 else
>                         wait = true;
> -       }
> +       } else if (conf->barrier)
> +               wait = true;
>
>         return wait;
>  }
>
>
I am not sure how this patch helps. You added another condition, and
now READs will also wait in some cases. But still, if array_frozen is
set, everybody will wait unconditionally, which I think is the root
cause of the deadlock.

I see that there will be no magic solution for this problem:(

>
>>
>>>
>>>
>>>>
>>>> Since this issue is a real deadlock we are hitting in a long-term 3.18
>>>> kernel, is there any chance for cc-stable fix? Currently we applied
>>>> the rudimentary fix  I posted. It basically switches context for
>>>> problematic RAID1 READs, and runs them from a different context. With
>>>> this fix we don't see the deadlock anymore.
>>>>
>>>> Also, can you please comment on another concern I expressed:
>>>> freeze_array() is now not reentrant. Meaning that if two threads call
>>>> it in parallel (and it could happen for the same MD), the first thread
>>>> calling unfreeze_array will mess up things for the second thread.
>>>
>>> Yes, that is a regression.  This should be enough to fix it.  Do you
>>> agree?
>>>
>>> Thanks,
>>> NeilBrown
>>>
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 40b35be34f8d..5ad25c7d7453 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -984,7 +984,7 @@ static void freeze_array(struct r1conf *conf, int extra)
>>>          * we continue.
>>>          */
>>>         spin_lock_irq(&conf->resync_lock);
>>> -       conf->array_frozen = 1;
>>> +       conf->array_frozen += 1;
>>>         wait_event_lock_irq_cmd(conf->wait_barrier,
>>>                                 conf->nr_pending == conf->nr_queued+extra,
>>>                                 conf->resync_lock,
>>> @@ -995,7 +995,7 @@ static void unfreeze_array(struct r1conf *conf)
>>>  {
>>>         /* reverse the effect of the freeze */
>>>         spin_lock_irq(&conf->resync_lock);
>>> -       conf->array_frozen = 0;
>>> +       conf->array_frozen -= 1;
>>>         wake_up(&conf->wait_barrier);
>>>         spin_unlock_irq(&conf->resync_lock);
>>>  }
>> I partially agree. The fix that you suggest makes proper accounting of
>> whether the array is considered frozen or not.
>> But the problem is that even with this fix, both threads will think
>> that they "own" the array. And both will do things, like writing data
>> or so, which might interfere. The proper fix would ensure that only
>> one thread "owns" the array, and the second thread waits until the
>> first calls unfreeze_array(), and then the second thread becomes
>> "owner" of the array. And of course while there are threads that want
>> to "freeze" the array, other threads that want to do raise_barrier
>> etc, should wait.
>
> Is that really a problem?
> A call to "freeze_array()" doesn't mean "I want to own the array", but
> rather "No regular IO should be happening now".
>
> Most callers of freeze_array():
>     raid1_add_disk(), raid1_remove_disk(), stop(), raid1_reshape()
> are called with the "mddev_lock()" mutex held, so they cannot interfere
> with each other.
>
> handle_read_error() is called with one pending request, which will block
> any call on freeze_array(mddev, 0); - handle_read_error() itself calls
> freeze_array(mddev, 1) so it gets access.
> So these are already locked against the first four (as lock that
> ->array_frozen doesn't get corrupted).
>
> That just leaves raid1_quiesce().
> That is mostly called under mddev_lock() so it won't interfere with the
> others.
> The one exception is where md_do_sync() calls
>                         mddev->pers->quiesce(mddev, 1);
>                         mddev->pers->quiesce(mddev, 0);
>
> As this doesn't "claim" the array, but just needs to ensure all pending
> IO completes, I don't think there is a problem.
>
> So it seems to me that your concerns are not actually a problem.
> Did I miss something?
I think you are correct. The only exception is mddev_suspend/resume,
which can be called from another module. But for my case, this is not
happening.

Thanks,
Alex.

>
>>
>> I am sorry for giving two negative responses in one email:)
>
> Better a negative response than no response :-)
>
> NeilBrown

* Re: [PATCH RFC] md/raid1: fix deadlock between freeze_array() and wait_barrier().
  2016-07-19  9:20           ` Alexander Lyakas
@ 2016-07-19 22:19             ` NeilBrown
  0 siblings, 0 replies; 8+ messages in thread
From: NeilBrown @ 2016-07-19 22:19 UTC (permalink / raw)
  To: Alexander Lyakas
  Cc: 马建朋, linux-raid, Jes Sorensen, Shaohua Li

On Tue, Jul 19 2016, Alexander Lyakas wrote:

> Hello Neil,
> Thank you for your response.
>
> On Fri, Jul 15, 2016 at 2:18 AM, NeilBrown <neilb@suse.com> wrote:
>> On Thu, Jul 14 2016, Alexander Lyakas wrote:
>>
>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>> index 32e282f4c83c..c528102b80b6 100644
>>>> --- a/drivers/md/raid10.c
>>>> +++ b/drivers/md/raid10.c
>>>> @@ -1288,7 +1288,7 @@ read_again:
>>>>                                 sectors_handled;
>>>>                         goto read_again;
>>>>                 } else
>>>> -                       generic_make_request(read_bio);
>>>> +                       bio_list_add_head(&current->bio_list, read_bio);
>>>>                 return;
>>>>         }
>>>
>>> Unfortunately, this patch doesn't work. It is super elegant, and seems
>>> like it really should work. But the problem is that the "rdev", to
>>> which we want to send the READ bio, might also be a remapping device
>>> (dm-linear, for example). This device will create its own remapped-bio
>>> and will call generic_make_request(), which will stick the bio onto
>>> current->bio_list TAIL:(:(:( So we are back at square one. This patch
>>> would work if *all* the remapping drivers in the stack were doing
>>> bio_list_add_head() instead of generic_make_request()  :(:(:(
>>>
>>> It seems the real fix should be that generic_make_request() would use
>>> bio_list_add_head(), but as pointed in
>>> http://www.spinics.net/lists/raid/msg52756.html, there are some
>>> concerns about changing the order of remapped bios.
>>>
>>
>> While those concerns are valid, they are about hypothetical performance
>> issues rather than observed deadlock issues.  So I wouldn't be too
>> worried about them.
> I am thinking of a hypothetical driver that splits say a 12Kb WRITE
> into 3x4kb WRITEs, and submits them in a proper order, hoping they
> will get to the disk in the same order, and the disk will work
> sequentially. But now we are deliberately hindering this. But I see
> that people much smarter than me are in this discussion, so I will
> leave it to them:)

Sure, that is the concern and ideally we would keep things in order.
But the elevator should re-order things in most cases so it shouldn't
matter too much.
For upstream, we obviously aim for best possible.  For stable backports,
we sometimes accept non-ideal code when the change is less intrusive.

If we do backport something to -stable, I would do it "properly" using
something very similar to the upstream version.  You don't seem to want
that for your code so I'm suggesting options that, while not 100% ideal,
should suit your needs - and obviously you will test your use cases.

>
>> However I think you said that you didn't want to touch core code at all
>> (maybe I misunderstood) so that wouldn't help you anyway.
> Yes, this is correct. Recompiling the kernel is a bit of a pain for
> us. We were smart enough to configure the md_mod as loadable module,
> so at least now I can patch MD code easily:)
>
>>
>> One option would be to punt the request requests to raidXd:
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 40b35be34f8d..f795e27b2124 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1229,7 +1229,7 @@ read_again:
>>                                 sectors_handled;
>>                         goto read_again;
>>                 } else
>> -                       generic_make_request(read_bio);
>> +                       reschedule_retry(r1_bio);
>>                 return;
>>         }
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 32e282f4c83c..eec38443075b 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1288,7 +1288,7 @@ read_again:
>>                                 sectors_handled;
>>                         goto read_again;
>>                 } else
>> -                       generic_make_request(read_bio);
>> +                       reschedule_retry(r10_bio);
>>                 return;
>>         }
> This is more or less what my rudimentary patch is doing, except it is
> doing it only when we really need to wait for the barrier.
>
>>
>>
>> That might hurt performance, you would need to measure.
>> The other approach would be to revert the patch that caused the problem.
>> e.g.
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 40b35be34f8d..062bb86e5fd8 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -884,7 +884,8 @@ static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>>                         wait = false;
>>                 else
>>                         wait = true;
>> -       }
>> +       } else if (conf->barrier)
>> +               wait = true;
>>
>>         return wait;
>>  }
>>
>>
> I am not sure how this patch helps. You added another condition, and
> now READs will also wait in some cases. But still if array_frozen is
> set, everybody will wait unconditionally, which is the root cause for
> the deadlock I think.

Maybe you're right.  I was thinking that array_frozen only causes
problems if there are read requests in the generic_make_request queue,
and this change would keep them out.  But I might have dropped a ball
somewhere.


>
> I see that there will be no magic solution for this problem:(
>

Not really.


NeilBrown
