All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] md: don't create mddev in md_open
@ 2021-04-01  1:34 Zhao Heming
  2021-04-01  6:17 ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Zhao Heming @ 2021-04-01  1:34 UTC (permalink / raw)
  To: linux-raid, song
  Cc: Zhao Heming, guoqing.jiang, lidong.zhong, xni, neilb, colyli

commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

about trigger every time with below script

```
1  node1="mdcluster1"
2  node2="mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..10}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
```

I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.

*** stack ***

```
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.

The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.

```
<thread 1>: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

<thread 2>: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
      -ERESTARTSYS.
```

In non-preempt kernel, <thread 2> is occupying on current CPU. and
mddev_delayed_delete which was created in <thread 1> also can't be
schedule.

In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after <thread 2> running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
the soft lockup is broken.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
v2:
- add missed code in mddev_find
- modify commit comment
  - remove SUSE product name in test script
  - soft lockup trigger time is incorrect, fix it. 
  - only leave one soft lockup stack
v1:
- create patch
---
 drivers/md/md.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 21da0c48f6c2..fe6d9f6e3c3b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -734,7 +734,7 @@ void mddev_init(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(mddev_init);
 
-static struct mddev *mddev_find(dev_t unit)
+static struct mddev *mddev_find(dev_t unit, bool create)
 {
 	struct mddev *mddev, *new = NULL;
 
@@ -793,7 +793,8 @@ static struct mddev *mddev_find(dev_t unit)
 	}
 	spin_unlock(&all_mddevs_lock);
 
-	new = kzalloc(sizeof(*new), GFP_KERNEL);
+	if (create)
+		new = kzalloc(sizeof(*new), GFP_KERNEL);
 	if (!new)
 		return NULL;
 
@@ -5644,7 +5645,7 @@ static int md_alloc(dev_t dev, char *name)
 	 * writing to /sys/module/md_mod/parameters/new_array.
 	 */
 	static DEFINE_MUTEX(disks_mutex);
-	struct mddev *mddev = mddev_find(dev);
+	struct mddev *mddev = mddev_find(dev, true);
 	struct gendisk *disk;
 	int partitioned;
 	int shift;
@@ -6523,7 +6524,7 @@ static void autorun_devices(int part)
 		}
 
 		md_probe(dev);
-		mddev = mddev_find(dev);
+		mddev = mddev_find(dev, true);
 		if (!mddev || !mddev->gendisk) {
 			if (mddev)
 				mddev_put(mddev);
@@ -7807,7 +7808,7 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 	 * Succeed if we can lock the mddev, which confirms that
 	 * it isn't being stopped right now.
 	 */
-	struct mddev *mddev = mddev_find(bdev->bd_dev);
+	struct mddev *mddev = mddev_find(bdev->bd_dev, false);
 	int err;
 
 	if (!mddev)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] md: don't create mddev in md_open
  2021-04-01  1:34 [PATCH v2] md: don't create mddev in md_open Zhao Heming
@ 2021-04-01  6:17 ` Christoph Hellwig
  2021-04-01 13:01   ` heming.zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2021-04-01  6:17 UTC (permalink / raw)
  To: Zhao Heming
  Cc: linux-raid, song, guoqing.jiang, lidong.zhong, xni, neilb, colyli

[-- Attachment #1: Type: text/plain, Size: 533 bytes --]

On Thu, Apr 01, 2021 at 09:34:56AM +0800, Zhao Heming wrote:
> commit d3374825ce57 ("md: make devices disappear when they are no longer
> needed.") introduced protection between mddev creating & removing. The
> md_open shouldn't create mddev when all_mddevs list doesn't contain
> mddev. With currently code logic, there will be very easy to trigger
> soft lockup in non-preempt env.

As mention below, please don't make this even more of a mess than it
needs to.  We can just pick the two patches doing this from the series
I sent:

[-- Attachment #2: 0001-md-factor-out-a-mddev_find_locked-helper-from-mddev_.patch --]
[-- Type: text/plain, Size: 1816 bytes --]

From 86dfff895d62495120d8d61ef2bc5d48db89fe20 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Wed, 3 Mar 2021 15:04:15 +0100
Subject: md: factor out a mddev_find_locked helper from mddev_find

Factor out a self-contained helper to just lookup a mddev by the dev_t
"unit".

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 368cad6cd53a6e..5692427e78ba37 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -734,6 +734,17 @@ void mddev_init(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(mddev_init);
 
+static struct mddev *mddev_find_locked(dev_t unit)
+{
+	struct mddev *mddev;
+
+	list_for_each_entry(mddev, &all_mddevs, all_mddevs)
+		if (mddev->unit == unit)
+			return mddev;
+
+	return NULL;
+}
+
 static struct mddev *mddev_find(dev_t unit)
 {
 	struct mddev *mddev, *new = NULL;
@@ -745,13 +756,13 @@ static struct mddev *mddev_find(dev_t unit)
 	spin_lock(&all_mddevs_lock);
 
 	if (unit) {
-		list_for_each_entry(mddev, &all_mddevs, all_mddevs)
-			if (mddev->unit == unit) {
-				mddev_get(mddev);
-				spin_unlock(&all_mddevs_lock);
-				kfree(new);
-				return mddev;
-			}
+		mddev = mddev_find_locked(unit);
+		if (mddev) {
+			mddev_get(mddev);
+			spin_unlock(&all_mddevs_lock);
+			kfree(new);
+			return mddev;
+		}
 
 		if (new) {
 			list_add(&new->all_mddevs, &all_mddevs);
@@ -777,12 +788,7 @@ static struct mddev *mddev_find(dev_t unit)
 				return NULL;
 			}
 
-			is_free = 1;
-			list_for_each_entry(mddev, &all_mddevs, all_mddevs)
-				if (mddev->unit == dev) {
-					is_free = 0;
-					break;
-				}
+			is_free = !mddev_find_locked(dev);
 		}
 		new->unit = dev;
 		new->md_minor = MINOR(dev);
-- 
2.30.1


[-- Attachment #3: 0002-md-split-mddev_find.patch --]
[-- Type: text/plain, Size: 6044 bytes --]

From f24462e5166808c4fd724f6b09233096d6ac1e56 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Wed, 3 Mar 2021 15:16:53 +0100
Subject: md: split mddev_find

Split mddev_find into a simple mddev_find that just finds an existing
mddev by the unit number, and a more complicated mddev_find that deals
with find or allocating a mddev.

This turns out to fix this bug reported by
Zhao Heming <heming.zhao@suse.com>:

------------------------------ snip ------------------------------
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

about trigger 1 time with 10 tests

```
1  node1="15sp3-mdcluster1"
2  node2="15sp3-mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..100}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
```

I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.

*** stack ***

```
ID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
 #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
 #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
 #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
 #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
 #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
 #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
 #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
 #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
 #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
 #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c

or:
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.

The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.

```
<thread 1>: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

<thread 2>: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
      -ERESTARTSYS.
```

In non-preempt kernel, <thread 2> is occupying on current CPU. and
mddev_delayed_delete which was created in <thread 1> also can't be
schedule.

In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after <thread 2> running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
the soft lockup is broken.
------------------------------ snip ------------------------------

Fixes: d3374825ce57 ("md: make devices disappear when they are no longer needed.")
Reported-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5692427e78ba37..08d24177bbd4a4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -746,6 +746,22 @@ static struct mddev *mddev_find_locked(dev_t unit)
 }
 
 static struct mddev *mddev_find(dev_t unit)
+{
+	struct mddev *mddev;
+
+	if (MAJOR(unit) != MD_MAJOR)
+		unit &= ~((1 << MdpMinorShift) - 1);
+
+	spin_lock(&all_mddevs_lock);
+	mddev = mddev_find_locked(unit);
+	if (mddev)
+		mddev_get(mddev);
+	spin_unlock(&all_mddevs_lock);
+
+	return mddev;
+}
+
+static struct mddev *mddev_find_or_alloc(dev_t unit)
 {
 	struct mddev *mddev, *new = NULL;
 
@@ -5650,7 +5666,7 @@ static int md_alloc(dev_t dev, char *name)
 	 * writing to /sys/module/md_mod/parameters/new_array.
 	 */
 	static DEFINE_MUTEX(disks_mutex);
-	struct mddev *mddev = mddev_find(dev);
+	struct mddev *mddev = mddev_find_or_alloc(dev);
 	struct gendisk *disk;
 	int partitioned;
 	int shift;
@@ -6530,11 +6546,9 @@ static void autorun_devices(int part)
 
 		md_probe(dev);
 		mddev = mddev_find(dev);
-		if (!mddev || !mddev->gendisk) {
-			if (mddev)
-				mddev_put(mddev);
+		if (!mddev)
 			break;
-		}
+
 		if (mddev_lock(mddev))
 			pr_warn("md: %s locked, cannot run\n", mdname(mddev));
 		else if (mddev->raid_disks || mddev->major_version
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] md: don't create mddev in md_open
  2021-04-01  6:17 ` Christoph Hellwig
@ 2021-04-01 13:01   ` heming.zhao
  2021-04-02  5:58     ` Song Liu
  0 siblings, 1 reply; 7+ messages in thread
From: heming.zhao @ 2021-04-01 13:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-raid, song, guoqing.jiang, lidong.zhong, xni, neilb, colyli

On 4/1/21 2:17 PM, Christoph Hellwig wrote:
> On Thu, Apr 01, 2021 at 09:34:56AM +0800, Zhao Heming wrote:
>> commit d3374825ce57 ("md: make devices disappear when they are no longer
>> needed.") introduced protection between mddev creating & removing. The
>> md_open shouldn't create mddev when all_mddevs list doesn't contain
>> mddev. With currently code logic, there will be very easy to trigger
>> soft lockup in non-preempt env.
> 
> As mention below, please don't make this even more of a mess than it
> needs to.  We can just pick the two patches doing this from the series
> I sent:
> 

Hi,

I already got your meaning on previously email.
I sent v2 patch for Song's review comment. My patch is bugfix, it may need
to back port into branch maintenance.

Your attachment patch files is partly my patch.
The left part is in md_open (response [PATCH 01/15] md: remove the code to flush
an old instance in md_open)
I still think you directly use bdev->bd_disk->private_data as mddev pointer is not safe.

Let's wait for other guys opinions.

Thanks,
heming


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] md: don't create mddev in md_open
  2021-04-01 13:01   ` heming.zhao
@ 2021-04-02  5:58     ` Song Liu
  2021-04-02 10:48       ` Christoph Hellwig
       [not found]       ` <b2288ab4-1da0-612d-8988-637cc33db99a@suse.com>
  0 siblings, 2 replies; 7+ messages in thread
From: Song Liu @ 2021-04-02  5:58 UTC (permalink / raw)
  To: heming.zhao
  Cc: Christoph Hellwig, linux-raid, Guoqing Jiang, lidong.zhong,
	Xiao Ni, NeilBrown, Coly Li

On Thu, Apr 1, 2021 at 6:03 AM heming.zhao@suse.com
<heming.zhao@suse.com> wrote:
>
> On 4/1/21 2:17 PM, Christoph Hellwig wrote:
> > On Thu, Apr 01, 2021 at 09:34:56AM +0800, Zhao Heming wrote:
> >> commit d3374825ce57 ("md: make devices disappear when they are no longer
> >> needed.") introduced protection between mddev creating & removing. The
> >> md_open shouldn't create mddev when all_mddevs list doesn't contain
> >> mddev. With currently code logic, there will be very easy to trigger
> >> soft lockup in non-preempt env.
> >
> > As mention below, please don't make this even more of a mess than it
> > needs to.  We can just pick the two patches doing this from the series
> > I sent:
> >
>
> Hi,
>
> I already got your meaning on previously email.
> I sent v2 patch for Song's review comment. My patch is bugfix, it may need
> to back port into branch maintenance.
>
> Your attachment patch files is partly my patch.
> The left part is in md_open (response [PATCH 01/15] md: remove the code to flush
> an old instance in md_open)
> I still think you directly use bdev->bd_disk->private_data as mddev pointer is not safe.
>

Hi Christoph and Heming,

Trying to understand the whole picture here. Please let me know if I
misunderstood anything.

IIUC, the primary goal of Christoph's set is to get rid of the
ERESTARTSYS hack from md,
and eventually move bd_mutext. 02/15 to 07/15 of this set cleans up
code in md.c, and they
are not very important for the rest of the set (is this correct?).

Heming, you mentioned "the solution may simply return -EBUSY (instead
of -ENODEV) to
fail the open path". Could you please show the code? Maybe that would
be enough to unblock
the second half of Christoph's set (08/15 to 15/15)?

Once this part is resolved, the bug fix (this patch thread) and the
clean up (Christoph's 02 - 07)
should be easier.

Would this work?

Thanks,
Song

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] md: don't create mddev in md_open
  2021-04-02  5:58     ` Song Liu
@ 2021-04-02 10:48       ` Christoph Hellwig
       [not found]       ` <b2288ab4-1da0-612d-8988-637cc33db99a@suse.com>
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2021-04-02 10:48 UTC (permalink / raw)
  To: Song Liu
  Cc: heming.zhao, Christoph Hellwig, linux-raid, Guoqing Jiang,
	lidong.zhong, Xiao Ni, NeilBrown, Coly Li

On Thu, Apr 01, 2021 at 10:58:54PM -0700, Song Liu wrote:
> Hi Christoph and Heming,
> 
> Trying to understand the whole picture here. Please let me know if I
> misunderstood anything.
> 
> IIUC, the primary goal of Christoph's set is to get rid of the
> ERESTARTSYS hack from md,
> and eventually move bd_mutext. 02/15 to 07/15 of this set cleans up
> code in md.c, and they
> are not very important for the rest of the set (is this correct?).

Yes, at least that is the long term goal.

But as posted in this thread we can just split the two patches out
to do the same thing that the patch from Heming did, without making more
of a mess of mddev_find.  I can resend them separately as a small series
instead of the attachment if that makes them easier to review.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] md: don't create mddev in md_open
       [not found]       ` <b2288ab4-1da0-612d-8988-637cc33db99a@suse.com>
@ 2021-04-02 16:01         ` Song Liu
  2021-04-03  2:45           ` heming.zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Song Liu @ 2021-04-02 16:01 UTC (permalink / raw)
  To: heming.zhao
  Cc: Christoph Hellwig, linux-raid, Guoqing Jiang, lidong.zhong,
	Xiao Ni, NeilBrown, Coly Li

On Fri, Apr 2, 2021 at 1:58 AM heming.zhao@suse.com
<heming.zhao@suse.com> wrote:
>
> On 4/2/21 1:58 PM, Song Liu wrote:
> > On Thu, Apr 1, 2021 at 6:03 AM heming.zhao@suse.com
> > <heming.zhao@suse.com> wrote:
> >>
> >> On 4/1/21 2:17 PM, Christoph Hellwig wrote:
> >>> On Thu, Apr 01, 2021 at 09:34:56AM +0800, Zhao Heming wrote:
> >>>> commit d3374825ce57 ("md: make devices disappear when they are no longer
> >>>> needed.") introduced protection between mddev creating & removing. The
> >>>> md_open shouldn't create mddev when all_mddevs list doesn't contain
> >>>> mddev. With currently code logic, there will be very easy to trigger
> >>>> soft lockup in non-preempt env.
> >>>
> >>> As mention below, please don't make this even more of a mess than it
> >>> needs to.  We can just pick the two patches doing this from the series
> >>> I sent:
> >>>
> >>
> >> Hi,
> >>
> >> I already got your meaning on previously email.
> >> I sent v2 patch for Song's review comment. My patch is bugfix, it may need
> >> to back port into branch maintenance.
> >>
> >> Your attachment patch files is partly my patch.
> >> The left part is in md_open (response [PATCH 01/15] md: remove the code to flush
> >> an old instance in md_open)
> >> I still think you directly use bdev->bd_disk->private_data as mddev pointer is not safe.
> >>
> >
> > Hi Christoph and Heming,
> >
> > Trying to understand the whole picture here. Please let me know if I
> > misunderstood anything.
> >
> > IIUC, the primary goal of Christoph's set is to get rid of the
> > ERESTARTSYS hack from md,
> > and eventually move bd_mutext. 02/15 to 07/15 of this set cleans up
> > code in md.c, and they
> > are not very important for the rest of the set (is this correct?).
> >
> > Heming, you mentioned "the solution may simply return -EBUSY (instead
> > of -ENODEV) to
> > fail the open path". Could you please show the code? Maybe that would
> > be enough to unblock
> > the second half of Christoph's set (08/15 to 15/15)?
> >
>
> I already sent the whole picture of md_open (email data: 2021-4-1,
> subject: Re: [PATCH] md: don't create mddev in md_open).
> My mail may fail to be delivered (at least, I got the "can't be delivered"
> info on my last mail for linux-raid & guoqing's address). I put the related
> email on attachment, anyone can read it again.
>
> For the convenience of discussion, I also pasted **patched** md_open below.
> (I added more comments than enclosed email)
>
> ```
> static int md_open(struct block_device *bdev, fmode_t mode)
> {
>     /* section 1 */
>     struct mddev *mddev = mddev_find(bdev->bd_dev); //hm: the new style, only do searching job
>     int err;
>
>     if (!mddev) //hm: this will cover freeing path 2.2 (refer enclosed file)
>         return -ENODEV;
>
>     if (mddev->gendisk != bdev->bd_disk) { //hm: for freeing path 2.1 (refer enclosed file)
>         /* we are racing with mddev_put which is discarding this
>          * bd_disk.
>          */
>         mddev_put(mddev);
>         /* Wait until bdev->bd_disk is definitely gone */
>         if (work_pending(&mddev->del_work))
>             flush_workqueue(md_misc_wq);
>         return -EBUSY; //hm: fail this path. it also makes __blkdev_get return fail, userspace can try later.
>                        //
>                        // the legacy code flow:
>                        //   return -ERESTARTSYS here, later __blkdev_get reentry.
>                        //   it will trigger first 'if' in this functioin, then
>                        //   return -ENODEV.
>     }
>
>     /* section 2 */
>     /* hm: below same as Christoph's [PATCH 01/15] */
>     err = mutex_lock_interruptible(&mddev->open_mutex);
>     if (err)
>         return err;
>
>     if (test_bit(MD_CLOSING, &mddev->flags)) {
>         mutex_unlock(&mddev->open_mutex);
>         return -ENODEV;
>     }
>
>     mddev_get(mddev);
>     atomic_inc(&mddev->openers);
>     mutex_unlock(&mddev->open_mutex);
>
>     bdev_check_media_change(bdev);
>     return 0;
> }
> ```
>
> I wrote again:
> > Christoph's patch [01/15] totally dropped <section 1>, and use bdev->bd_disk->private_data
> > to get mddev pointer, it's not safe.
>
> And with above **patched** md_open, Christoph's patches 02-07 can be work happily.
> He only needs to adjust/modify 01 patch.
> The md layer behavior will change from return -ENODEV to -EBUSY in some racing scenario.

Thanks for the explanation.

Could you please send official patch of modified 01/15 and your fix
(either in a set or merge into
one patch)? This patch(set) would go to stable. Then, we can apply
Christoph's 02 - 15 on top of
it.

Thanks,
Song

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] md: don't create mddev in md_open
  2021-04-02 16:01         ` Song Liu
@ 2021-04-03  2:45           ` heming.zhao
  0 siblings, 0 replies; 7+ messages in thread
From: heming.zhao @ 2021-04-03  2:45 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, linux-raid, Guoqing Jiang, lidong.zhong,
	Xiao Ni, NeilBrown, Coly Li

On 4/3/21 12:01 AM, Song Liu wrote:
> On Fri, Apr 2, 2021 at 1:58 AM heming.zhao@suse.com
> <heming.zhao@suse.com> wrote:
>>
>> On 4/2/21 1:58 PM, Song Liu wrote:
>>> On Thu, Apr 1, 2021 at 6:03 AM heming.zhao@suse.com
>>> <heming.zhao@suse.com> wrote:
>>>>
>>>> On 4/1/21 2:17 PM, Christoph Hellwig wrote:
>>>>> On Thu, Apr 01, 2021 at 09:34:56AM +0800, Zhao Heming wrote:
>>>>>> commit d3374825ce57 ("md: make devices disappear when they are no longer
>>>>>> needed.") introduced protection between mddev creating & removing. The
>>>>>> md_open shouldn't create mddev when all_mddevs list doesn't contain
>>>>>> mddev. With currently code logic, there will be very easy to trigger
>>>>>> soft lockup in non-preempt env.
>>>>>
>>>>> As mention below, please don't make this even more of a mess than it
>>>>> needs to.  We can just pick the two patches doing this from the series
>>>>> I sent:
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> I already got your meaning on previously email.
>>>> I sent v2 patch for Song's review comment. My patch is bugfix, it may need
>>>> to back port into branch maintenance.
>>>>
>>>> Your attachment patch files is partly my patch.
>>>> The left part is in md_open (response [PATCH 01/15] md: remove the code to flush
>>>> an old instance in md_open)
>>>> I still think you directly use bdev->bd_disk->private_data as mddev pointer is not safe.
>>>>
>>>
>>> Hi Christoph and Heming,
>>>
>>> Trying to understand the whole picture here. Please let me know if I
>>> misunderstood anything.
>>>
>>> IIUC, the primary goal of Christoph's set is to get rid of the
>>> ERESTARTSYS hack from md,
>>> and eventually move bd_mutext. 02/15 to 07/15 of this set cleans up
>>> code in md.c, and they
>>> are not very important for the rest of the set (is this correct?).
>>>
>>> Heming, you mentioned "the solution may simply return -EBUSY (instead
>>> of -ENODEV) to
>>> fail the open path". Could you please show the code? Maybe that would
>>> be enough to unblock
>>> the second half of Christoph's set (08/15 to 15/15)?
>>>
>>
>> I already sent the whole picture of md_open (email data: 2021-4-1,
>> subject: Re: [PATCH] md: don't create mddev in md_open).
>> My mail may fail to be delivered (at least, I got the "can't be delivered"
>> info on my last mail for linux-raid & guoqing's address). I put the related
>> email on attachment, anyone can read it again.
>>
>> For the convenience of discussion, I also pasted **patched** md_open below.
>> (I added more comments than enclosed email)
>>
>> ```
>> static int md_open(struct block_device *bdev, fmode_t mode)
>> {
>>      /* section 1 */
>>      struct mddev *mddev = mddev_find(bdev->bd_dev); //hm: the new style, only do searching job
>>      int err;
>>
>>      if (!mddev) //hm: this will cover freeing path 2.2 (refer enclosed file)
>>          return -ENODEV;
>>
>>      if (mddev->gendisk != bdev->bd_disk) { //hm: for freeing path 2.1 (refer enclosed file)
>>          /* we are racing with mddev_put which is discarding this
>>           * bd_disk.
>>           */
>>          mddev_put(mddev);
>>          /* Wait until bdev->bd_disk is definitely gone */
>>          if (work_pending(&mddev->del_work))
>>              flush_workqueue(md_misc_wq);
>>          return -EBUSY; //hm: fail this path. it also makes __blkdev_get return fail, userspace can try later.
>>                         //
>>                         // the legacy code flow:
>>                         //   return -ERESTARTSYS here, later __blkdev_get reentry.
>>                         //   it will trigger first 'if' in this functioin, then
>>                         //   return -ENODEV.
>>      }
>>
>>      /* section 2 */
>>      /* hm: below same as Christoph's [PATCH 01/15] */
>>      err = mutex_lock_interruptible(&mddev->open_mutex);
>>      if (err)
>>          return err;
>>
>>      if (test_bit(MD_CLOSING, &mddev->flags)) {
>>          mutex_unlock(&mddev->open_mutex);
>>          return -ENODEV;
>>      }
>>
>>      mddev_get(mddev);
>>      atomic_inc(&mddev->openers);
>>      mutex_unlock(&mddev->open_mutex);
>>
>>      bdev_check_media_change(bdev);
>>      return 0;
>> }
>> ```
>>
>> I wrote again:
>>> Christoph's patch [01/15] totally dropped <section 1>, and use bdev->bd_disk->private_data
>>> to get mddev pointer, it's not safe.
>>
>> And with above **patched** md_open, Christoph's patches 02-07 can be work happily.
>> He only needs to adjust/modify 01 patch.
>> The md layer behavior will change from return -ENODEV to -EBUSY in some racing scenario.
> 
> Thanks for the explanation.
> 
> Could you please send official patch of modified 01/15 and your fix
> (either in a set or merge into
> one patch)? This patch(set) would go to stable. Then, we can apply
> Christoph's 02 - 15 on top of
> it.
> 
> Thanks,
> Song
> 

Ok, I will resend a patch, which will replace my patch & Chistoph's patch 01.
Then Christoph could re-send 02-07 as v2 patch.
My patch need to work with Christoph's [PATCH 04/15] (md: split mddev_find) to fix soft lockup.
I will mention the relationship in my patch.

Thanks,
Heming


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-04-03  2:45 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-01  1:34 [PATCH v2] md: don't create mddev in md_open Zhao Heming
2021-04-01  6:17 ` Christoph Hellwig
2021-04-01 13:01   ` heming.zhao
2021-04-02  5:58     ` Song Liu
2021-04-02 10:48       ` Christoph Hellwig
     [not found]       ` <b2288ab4-1da0-612d-8988-637cc33db99a@suse.com>
2021-04-02 16:01         ` Song Liu
2021-04-03  2:45           ` heming.zhao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.