* [PATCH 0/2] two fixes for md
@ 2022-05-05  8:16 Guoqing Jiang
  2022-05-05  8:16 ` [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
                   ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-05  8:16 UTC (permalink / raw)
  To: song; +Cc: buczek, linux-raid, Guoqing Jiang

Hi,

The first patch addresses the issue reported by Donald in the link below.

https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

The second patch is added per the discussion of the v2 patch.

Thanks,
Guoqing

Guoqing Jiang (2):
  md: don't unregister sync_thread with reconfig_mutex held
  md: protect md_unregister_thread from reentrancy

 drivers/md/md.c | 24 ++++++++++++++++++------
 drivers/md/md.h |  5 +++++
 2 files changed, 23 insertions(+), 6 deletions(-)

-- 
2.31.1



* [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-05  8:16 [PATCH 0/2] two fixes for md Guoqing Jiang
@ 2022-05-05  8:16 ` Guoqing Jiang
  2022-05-05 14:02   ` kernel test robot
  2022-05-05 18:04   ` kernel test robot
  2022-05-05  8:16 ` [PATCH 2/2] md: protect md_unregister_thread from reentrancy Guoqing Jiang
  2022-05-06 11:36 ` [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
  2 siblings, 2 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-05  8:16 UTC (permalink / raw)
  To: song; +Cc: buczek, linux-raid, Guoqing Jiang, Guoqing Jiang

From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

Unregistering the sync_thread doesn't need to hold reconfig_mutex since
it doesn't reconfigure the array.

And it could cause a deadlock problem for raid5 as follows:

1. Process A tried to reap the sync thread with reconfig_mutex held after
   echoing idle to sync_action.
2. The raid5 sync thread was blocked because there were too many active
   stripes.
3. SB_CHANGE_PENDING was set (because write IO came from the upper layer),
   which prevented the number of active stripes from being decreased.
4. SB_CHANGE_PENDING couldn't be cleared since md_check_recovery was not
   able to take reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

Let's call md_unregister_thread between mddev_unlock and mddev_lock if
reconfig_mutex is held, and mddev_is_locked is introduced accordingly.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
v2: https://lore.kernel.org/all/1613177399-22024-1-git-send-email-guoqing.jiang@cloud.ionos.com/#t
v1: https://lore.kernel.org/all/1612923676-18294-1-git-send-email-guoqing.jiang@cloud.ionos.com/

 drivers/md/md.c | 9 ++++++++-
 drivers/md/md.h | 5 +++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 309b3af906ad..a70e7f0f9268 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9432,10 +9432,17 @@ void md_reap_sync_thread(struct mddev *mddev)
 {
 	struct md_rdev *rdev;
 	sector_t old_dev_sectors = mddev->dev_sectors;
-	bool is_reshaped = false;
+	bool is_reshaped = false, is_locked = false;
 
+	if (mddev_is_locked(mddev)) {
+		is_locked = true;
+		mddev_unlock(mddev);
+	}
 	/* resync has finished, collect result */
 	md_unregister_thread(&mddev->sync_thread);
+	if (is_locked)
+		mddev_lock(mddev);
+
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
 	    mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 6ac283864533..af6f3978b62b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -549,6 +549,11 @@ static inline int mddev_trylock(struct mddev *mddev)
 }
 extern void mddev_unlock(struct mddev *mddev);
 
+static inline int mddev_is_locked(struct mddev *mddev)
+{
+	return mutex_is_locked(&mddev->reconfig_mutex);
+}
+
 static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
 {
 	atomic_add(nr_sectors, &bdev->bd_disk->sync_io);
-- 
2.31.1



* [PATCH 2/2] md: protect md_unregister_thread from reentrancy
  2022-05-05  8:16 [PATCH 0/2] two fixes for md Guoqing Jiang
  2022-05-05  8:16 ` [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
@ 2022-05-05  8:16 ` Guoqing Jiang
  2022-05-09  6:39   ` Song Liu
  2022-05-06 11:36 ` [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
  2 siblings, 1 reply; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-05  8:16 UTC (permalink / raw)
  To: song; +Cc: buczek, linux-raid, Guoqing Jiang, Guoqing Jiang

From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

Generally, md_unregister_thread is called with reconfig_mutex held, but
raid_message in dm-raid doesn't hold reconfig_mutex to unregister the
thread, so md_unregister_thread can in theory be called simultaneously
from two call sites.

And after the previous commit, which removed the protection of
reconfig_mutex for md_unregister_thread completely, the potential issue
could be worse than before.

Let's take pers_lock at the beginning of the function to protect against
reentrancy.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/md.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a70e7f0f9268..c401e063bec8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7962,17 +7962,22 @@ EXPORT_SYMBOL(md_register_thread);
 
 void md_unregister_thread(struct md_thread **threadp)
 {
-	struct md_thread *thread = *threadp;
-	if (!thread)
-		return;
-	pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
-	/* Locking ensures that mddev_unlock does not wake_up a
+	struct md_thread *thread;
+
+	/*
+	 * Locking ensures that mddev_unlock does not wake_up a
 	 * non-existent thread
 	 */
 	spin_lock(&pers_lock);
+	thread = *threadp;
+	if (!thread) {
+		spin_unlock(&pers_lock);
+		return;
+	}
 	*threadp = NULL;
 	spin_unlock(&pers_lock);
 
+	pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
 	kthread_stop(thread->tsk);
 	kfree(thread);
 }
-- 
2.31.1



* Re: [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-05  8:16 ` [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
@ 2022-05-05 14:02   ` kernel test robot
  2022-05-05 18:04   ` kernel test robot
  1 sibling, 0 replies; 45+ messages in thread
From: kernel test robot @ 2022-05-05 14:02 UTC (permalink / raw)
  To: Guoqing Jiang, song; +Cc: kbuild-all, buczek, linux-raid, Guoqing Jiang

Hi Guoqing,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on song-md/md-next]
[also build test WARNING on v5.18-rc5 next-20220505]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Guoqing-Jiang/two-fixes-for-md/20220505-162202
base:   git://git.kernel.org/pub/scm/linux/kernel/git/song/md.git md-next
config: mips-allyesconfig (https://download.01.org/0day-ci/archive/20220505/202205052148.TTOFRBQx-lkp@intel.com/config)
compiler: mips-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Guoqing-Jiang/two-fixes-for-md/20220505-162202
        git checkout e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash drivers/md/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/md/md.c: In function 'md_reap_sync_thread':
>> drivers/md/md.c:9448:17: warning: ignoring return value of 'mddev_lock' declared with attribute 'warn_unused_result' [-Wunused-result]
    9448 |                 mddev_lock(mddev);
         |                 ^~~~~~~~~~~~~~~~~


vim +9448 drivers/md/md.c

  9434	
  9435	void md_reap_sync_thread(struct mddev *mddev)
  9436	{
  9437		struct md_rdev *rdev;
  9438		sector_t old_dev_sectors = mddev->dev_sectors;
  9439		bool is_reshaped = false, is_locked = false;
  9440	
  9441		if (mddev_is_locked(mddev)) {
  9442			is_locked = true;
  9443			mddev_unlock(mddev);
  9444		}
  9445		/* resync has finished, collect result */
  9446		md_unregister_thread(&mddev->sync_thread);
  9447		if (is_locked)
> 9448			mddev_lock(mddev);
  9449	
  9450		if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
  9451		    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
  9452		    mddev->degraded != mddev->raid_disks) {
  9453			/* success...*/
  9454			/* activate any spares */
  9455			if (mddev->pers->spare_active(mddev)) {
  9456				sysfs_notify_dirent_safe(mddev->sysfs_degraded);
  9457				set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
  9458			}
  9459		}
  9460		if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
  9461		    mddev->pers->finish_reshape) {
  9462			mddev->pers->finish_reshape(mddev);
  9463			if (mddev_is_clustered(mddev))
  9464				is_reshaped = true;
  9465		}
  9466	
  9467		/* If array is no-longer degraded, then any saved_raid_disk
  9468		 * information must be scrapped.
  9469		 */
  9470		if (!mddev->degraded)
  9471			rdev_for_each(rdev, mddev)
  9472				rdev->saved_raid_disk = -1;
  9473	
  9474		md_update_sb(mddev, 1);
  9475		/* MD_SB_CHANGE_PENDING should be cleared by md_update_sb, so we can
  9476		 * call resync_finish here if MD_CLUSTER_RESYNC_LOCKED is set by
  9477		 * clustered raid */
  9478		if (test_and_clear_bit(MD_CLUSTER_RESYNC_LOCKED, &mddev->flags))
  9479			md_cluster_ops->resync_finish(mddev);
  9480		clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
  9481		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
  9482		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
  9483		clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
  9484		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
  9485		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
  9486		/*
  9487		 * We call md_cluster_ops->update_size here because sync_size could
  9488		 * be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared,
  9489		 * so it is time to update size across cluster.
  9490		 */
  9491		if (mddev_is_clustered(mddev) && is_reshaped
  9492					      && !test_bit(MD_CLOSING, &mddev->flags))
  9493			md_cluster_ops->update_size(mddev, old_dev_sectors);
  9494		wake_up(&resync_wait);
  9495		/* flag recovery needed just to double check */
  9496		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
  9497		sysfs_notify_dirent_safe(mddev->sysfs_action);
  9498		md_new_event();
  9499		if (mddev->event_work.func)
  9500			queue_work(md_misc_wq, &mddev->event_work);
  9501	}
  9502	EXPORT_SYMBOL(md_reap_sync_thread);
  9503	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-05  8:16 ` [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
  2022-05-05 14:02   ` kernel test robot
@ 2022-05-05 18:04   ` kernel test robot
  2022-05-06  2:34       ` Guoqing Jiang
  1 sibling, 1 reply; 45+ messages in thread
From: kernel test robot @ 2022-05-05 18:04 UTC (permalink / raw)
  To: Guoqing Jiang, song; +Cc: llvm, kbuild-all, buczek, linux-raid, Guoqing Jiang

Hi Guoqing,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on song-md/md-next]
[also build test WARNING on v5.18-rc5 next-20220505]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Guoqing-Jiang/two-fixes-for-md/20220505-162202
base:   git://git.kernel.org/pub/scm/linux/kernel/git/song/md.git md-next
config: hexagon-randconfig-r045-20220505 (https://download.01.org/0day-ci/archive/20220506/202205060116.J42KgtCW-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 5e004fb787698440a387750db7f8028e7cb14cfc)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Guoqing-Jiang/two-fixes-for-md/20220505-162202
        git checkout e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash drivers/md/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/md/md.c:9448:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
                   mddev_lock(mddev);
                   ^~~~~~~~~~ ~~~~~
   1 warning generated.


vim +/warn_unused_result +9448 drivers/md/md.c

  9434	
  9435	void md_reap_sync_thread(struct mddev *mddev)
  9436	{
  9437		struct md_rdev *rdev;
  9438		sector_t old_dev_sectors = mddev->dev_sectors;
  9439		bool is_reshaped = false, is_locked = false;
  9440	
  9441		if (mddev_is_locked(mddev)) {
  9442			is_locked = true;
  9443			mddev_unlock(mddev);
  9444		}
  9445		/* resync has finished, collect result */
  9446		md_unregister_thread(&mddev->sync_thread);
  9447		if (is_locked)
> 9448			mddev_lock(mddev);
  9449	
  9450		if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
  9451		    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
  9452		    mddev->degraded != mddev->raid_disks) {
  9453			/* success...*/
  9454			/* activate any spares */
  9455			if (mddev->pers->spare_active(mddev)) {
  9456				sysfs_notify_dirent_safe(mddev->sysfs_degraded);
  9457				set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
  9458			}
  9459		}
  9460		if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
  9461		    mddev->pers->finish_reshape) {
  9462			mddev->pers->finish_reshape(mddev);
  9463			if (mddev_is_clustered(mddev))
  9464				is_reshaped = true;
  9465		}
  9466	
  9467		/* If array is no-longer degraded, then any saved_raid_disk
  9468		 * information must be scrapped.
  9469		 */
  9470		if (!mddev->degraded)
  9471			rdev_for_each(rdev, mddev)
  9472				rdev->saved_raid_disk = -1;
  9473	
  9474		md_update_sb(mddev, 1);
  9475		/* MD_SB_CHANGE_PENDING should be cleared by md_update_sb, so we can
  9476		 * call resync_finish here if MD_CLUSTER_RESYNC_LOCKED is set by
  9477		 * clustered raid */
  9478		if (test_and_clear_bit(MD_CLUSTER_RESYNC_LOCKED, &mddev->flags))
  9479			md_cluster_ops->resync_finish(mddev);
  9480		clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
  9481		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
  9482		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
  9483		clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
  9484		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
  9485		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
  9486		/*
  9487		 * We call md_cluster_ops->update_size here because sync_size could
  9488		 * be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared,
  9489		 * so it is time to update size across cluster.
  9490		 */
  9491		if (mddev_is_clustered(mddev) && is_reshaped
  9492					      && !test_bit(MD_CLOSING, &mddev->flags))
  9493			md_cluster_ops->update_size(mddev, old_dev_sectors);
  9494		wake_up(&resync_wait);
  9495		/* flag recovery needed just to double check */
  9496		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
  9497		sysfs_notify_dirent_safe(mddev->sysfs_action);
  9498		md_new_event();
  9499		if (mddev->event_work.func)
  9500			queue_work(md_misc_wq, &mddev->event_work);
  9501	}
  9502	EXPORT_SYMBOL(md_reap_sync_thread);
  9503	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-05 18:04   ` kernel test robot
@ 2022-05-06  2:34       ` Guoqing Jiang
  0 siblings, 0 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-06  2:34 UTC (permalink / raw)
  To: kernel test robot, song; +Cc: llvm, kbuild-all, buczek, linux-raid



On 5/6/22 2:04 AM, kernel test robot wrote:
> Hi Guoqing,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on song-md/md-next]
> [also build test WARNING on v5.18-rc5 next-20220505]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Guoqing-Jiang/two-fixes-for-md/20220505-162202
> base:   git://git.kernel.org/pub/scm/linux/kernel/git/song/md.git md-next
> config: hexagon-randconfig-r045-20220505 (https://download.01.org/0day-ci/archive/20220506/202205060116.J42KgtCW-lkp@intel.com/config)
> compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 5e004fb787698440a387750db7f8028e7cb14cfc)
> reproduce (this is a W=1 build):
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # https://github.com/intel-lab-lkp/linux/commit/e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
>          git remote add linux-review https://github.com/intel-lab-lkp/linux
>          git fetch --no-tags linux-review Guoqing-Jiang/two-fixes-for-md/20220505-162202
>          git checkout e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
>          # save the config file
>          mkdir build_dir && cp config build_dir/.config
>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash drivers/md/
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
>>> drivers/md/md.c:9448:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
>                     mddev_lock(mddev);
>                     ^~~~~~~~~~ ~~~~~
>     1 warning generated.

Thanks! I should call mddev_lock_nointr here.
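
Something like this (untested, relative to this patch):

	/* resync has finished, collect result */
	md_unregister_thread(&mddev->sync_thread);
	if (is_locked)
		mddev_lock_nointr(mddev);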

Guoqing


* [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-05  8:16 [PATCH 0/2] two fixes for md Guoqing Jiang
  2022-05-05  8:16 ` [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
  2022-05-05  8:16 ` [PATCH 2/2] md: protect md_unregister_thread from reentrancy Guoqing Jiang
@ 2022-05-06 11:36 ` Guoqing Jiang
  2022-05-09  6:37   ` Song Liu
  2022-05-09  8:18   ` Donald Buczek
  2 siblings, 2 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-06 11:36 UTC (permalink / raw)
  To: song; +Cc: buczek, linux-raid

From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

Unregistering the sync_thread doesn't need to hold reconfig_mutex since
it doesn't reconfigure the array.

And it could cause a deadlock problem for raid5 as follows:

1. Process A tried to reap the sync thread with reconfig_mutex held after
   echoing idle to sync_action.
2. The raid5 sync thread was blocked because there were too many active
   stripes.
3. SB_CHANGE_PENDING was set (because write IO came from the upper layer),
   which prevented the number of active stripes from being decreased.
4. SB_CHANGE_PENDING couldn't be cleared since md_check_recovery was not
   able to take reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

Let's call md_unregister_thread between mddev_unlock and mddev_lock_nointr
(thanks to the kernel test robot <lkp@intel.com> for the report) if
reconfig_mutex is held, and mddev_is_locked is introduced accordingly.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/md.c | 9 ++++++++-
 drivers/md/md.h | 5 +++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 309b3af906ad..525f65682356 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9432,10 +9432,17 @@ void md_reap_sync_thread(struct mddev *mddev)
 {
 	struct md_rdev *rdev;
 	sector_t old_dev_sectors = mddev->dev_sectors;
-	bool is_reshaped = false;
+	bool is_reshaped = false, is_locked = false;
 
+	if (mddev_is_locked(mddev)) {
+		is_locked = true;
+		mddev_unlock(mddev);
+	}
 	/* resync has finished, collect result */
 	md_unregister_thread(&mddev->sync_thread);
+	if (is_locked)
+		mddev_lock_nointr(mddev);
+
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
 	    mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 6ac283864533..af6f3978b62b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -549,6 +549,11 @@ static inline int mddev_trylock(struct mddev *mddev)
 }
 extern void mddev_unlock(struct mddev *mddev);
 
+static inline int mddev_is_locked(struct mddev *mddev)
+{
+	return mutex_is_locked(&mddev->reconfig_mutex);
+}
+
 static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
 {
 	atomic_add(nr_sectors, &bdev->bd_disk->sync_io);
-- 
2.31.1



* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-06 11:36 ` [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
@ 2022-05-09  6:37   ` Song Liu
  2022-05-09  8:09     ` Guoqing Jiang
  2022-05-09  8:18   ` Donald Buczek
  1 sibling, 1 reply; 45+ messages in thread
From: Song Liu @ 2022-05-09  6:37 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Donald Buczek, linux-raid

On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
> From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>
> Unregister sync_thread doesn't need to hold reconfig_mutex since it
> doesn't reconfigure array.
>
> And it could cause deadlock problem for raid5 as follows:
>
> 1. process A tried to reap sync thread with reconfig_mutex held after echo
>    idle to sync_action.
> 2. raid5 sync thread was blocked if there were too many active stripes.
> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
>    which causes the number of active stripes can't be decreased.
> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
>    to hold reconfig_mutex.
>
> More details in the link:
> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
>
> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
> (thanks for the report from kernel test robot <lkp@intel.com>) if the
> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.

mddev_is_locked() feels really hacky to me. It cannot tell whether mddev
is locked by the current thread. So technically, we could unlock
reconfig_mutex for another thread by accident, no?
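
For example (hypothetical interleaving, just from reading the code):

    T1                                  T2 (e.g. dm-raid, no reconfig_mutex)
    mddev_lock(mddev);
    ... some other reconfiguration ...
                                        md_reap_sync_thread(mddev);
                                          mddev_is_locked() -> true (T1's lock)
                                          mddev_unlock(mddev);  /* drops T1's mutex */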


>
> Reported-by: Donald Buczek <buczek@molgen.mpg.de>
> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
> ---
>  drivers/md/md.c | 9 ++++++++-
>  drivers/md/md.h | 5 +++++
>  2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 309b3af906ad..525f65682356 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9432,10 +9432,17 @@ void md_reap_sync_thread(struct mddev *mddev)
>  {
>         struct md_rdev *rdev;
>         sector_t old_dev_sectors = mddev->dev_sectors;
> -       bool is_reshaped = false;
> +       bool is_reshaped = false, is_locked = false;
>
> +       if (mddev_is_locked(mddev)) {
> +               is_locked = true;
> +               mddev_unlock(mddev);
> +       }
>         /* resync has finished, collect result */
>         md_unregister_thread(&mddev->sync_thread);
> +       if (is_locked)
> +               mddev_lock_nointr(mddev);
> +
>         if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>             !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
>             mddev->degraded != mddev->raid_disks) {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 6ac283864533..af6f3978b62b 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -549,6 +549,11 @@ static inline int mddev_trylock(struct mddev *mddev)
>  }
>  extern void mddev_unlock(struct mddev *mddev);
>
> +static inline int mddev_is_locked(struct mddev *mddev)
> +{
> +       return mutex_is_locked(&mddev->reconfig_mutex);
> +}
> +
>  static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
>  {
>         atomic_add(nr_sectors, &bdev->bd_disk->sync_io);
> --
> 2.31.1
>


* Re: [PATCH 2/2] md: protect md_unregister_thread from reentrancy
  2022-05-05  8:16 ` [PATCH 2/2] md: protect md_unregister_thread from reentrancy Guoqing Jiang
@ 2022-05-09  6:39   ` Song Liu
  2022-05-09  8:12     ` Guoqing Jiang
  0 siblings, 1 reply; 45+ messages in thread
From: Song Liu @ 2022-05-09  6:39 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Donald Buczek, linux-raid, Guoqing Jiang

On Thu, May 5, 2022 at 1:18 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
> From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>
> Generally, the md_unregister_thread is called with reconfig_mutex, but
> raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread,
> so md_unregister_thread can be called simulitaneously from two call sites
> in theory.

Can we add lock/unlock into raid_message? Are there some constraints here?
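
E.g. something along these lines (untested sketch; the shape of the
idle/frozen branch in raid_message() is only assumed here):

	if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
		if (mddev->sync_thread) {
			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
			/* take reconfig_mutex like the md sysfs path does */
			mddev_lock_nointr(mddev);
			md_reap_sync_thread(mddev);
			mddev_unlock(mddev);
		}
	}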

Thanks,
Song

>
> Then after previous commit which remove the protection of reconfig_mutex
> for md_unregister_thread completely, the potential issue could be worse
> than before.
>
> Let's take pers_lock at the beginning of function to ensure reentrancy.
>
> Reported-by: Donald Buczek <buczek@molgen.mpg.de>
> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
> ---
>  drivers/md/md.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index a70e7f0f9268..c401e063bec8 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7962,17 +7962,22 @@ EXPORT_SYMBOL(md_register_thread);
>
>  void md_unregister_thread(struct md_thread **threadp)
>  {
> -       struct md_thread *thread = *threadp;
> -       if (!thread)
> -               return;
> -       pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
> -       /* Locking ensures that mddev_unlock does not wake_up a
> +       struct md_thread *thread;
> +
> +       /*
> +        * Locking ensures that mddev_unlock does not wake_up a
>          * non-existent thread
>          */
>         spin_lock(&pers_lock);
> +       thread = *threadp;
> +       if (!thread) {
> +               spin_unlock(&pers_lock);
> +               return;
> +       }
>         *threadp = NULL;
>         spin_unlock(&pers_lock);
>
> +       pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
>         kthread_stop(thread->tsk);
>         kfree(thread);
>  }
> --
> 2.31.1
>


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-09  6:37   ` Song Liu
@ 2022-05-09  8:09     ` Guoqing Jiang
  2022-05-09  9:32       ` Wols Lists
  2022-05-10  6:44       ` Song Liu
  0 siblings, 2 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-09  8:09 UTC (permalink / raw)
  To: Song Liu; +Cc: Donald Buczek, linux-raid



On 5/9/22 2:37 PM, Song Liu wrote:
> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev>  wrote:
>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com>
>>
>> Unregister sync_thread doesn't need to hold reconfig_mutex since it
>> doesn't reconfigure array.
>>
>> And it could cause deadlock problem for raid5 as follows:
>>
>> 1. process A tried to reap sync thread with reconfig_mutex held after echo
>>     idle to sync_action.
>> 2. raid5 sync thread was blocked if there were too many active stripes.
>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
>>     which causes the number of active stripes can't be decreased.
>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
>>     to hold reconfig_mutex.
>>
>> More details in the link:
>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
>>
>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
>> (thanks for the report from kernel test robot<lkp@intel.com>) if the
>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
> mddev_is_locked() feels really hacky to me. It cannot tell whether
> mddev is locked
> by current thread. So technically, we can unlock reconfigure_mutex for
> other thread
> by accident, no?

I can switch back to V2 if you think that is the correct way to go, though
no one commented on the change in dm-raid.

Thanks,
Guoqing


* Re: [PATCH 2/2] md: protect md_unregister_thread from reentrancy
  2022-05-09  6:39   ` Song Liu
@ 2022-05-09  8:12     ` Guoqing Jiang
  0 siblings, 0 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-09  8:12 UTC (permalink / raw)
  To: Song Liu; +Cc: Donald Buczek, linux-raid, Guoqing Jiang



On 5/9/22 2:39 PM, Song Liu wrote:
> On Thu, May 5, 2022 at 1:18 AM Guoqing Jiang<guoqing.jiang@linux.dev>  wrote:
>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com>
>>
>> Generally, the md_unregister_thread is called with reconfig_mutex, but
>> raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread,
>> so md_unregister_thread can be called simulitaneously from two call sites
>> in theory.
> Can we add lock/unlock into raid_message? Are there some constraints here?

Honestly, I don't know dm-raid well, and there was no reply from the dm
people as you might know, so I prefer to leave it as it is.

Thanks,
Guoqing


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-06 11:36 ` [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
  2022-05-09  6:37   ` Song Liu
@ 2022-05-09  8:18   ` Donald Buczek
  2022-05-09  8:48     ` Guoqing Jiang
  1 sibling, 1 reply; 45+ messages in thread
From: Donald Buczek @ 2022-05-09  8:18 UTC (permalink / raw)
  To: Guoqing Jiang, song; +Cc: linux-raid

On 5/6/22 1:36 PM, Guoqing Jiang wrote:
> From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
> 
> Unregister sync_thread doesn't need to hold reconfig_mutex since it
> doesn't reconfigure array.
> 
> And it could cause deadlock problem for raid5 as follows:
> 
> 1. process A tried to reap sync thread with reconfig_mutex held after echo
>     idle to sync_action.
> 2. raid5 sync thread was blocked if there were too many active stripes.
> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
>     which causes the number of active stripes can't be decreased.
> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
>     to hold reconfig_mutex.
> 
> More details in the link:
> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
> 
> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
> (thanks for the report from kernel test robot <lkp@intel.com>) if the
> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
> 
> Reported-by: Donald Buczek <buczek@molgen.mpg.de>
> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
> ---
>   drivers/md/md.c | 9 ++++++++-
>   drivers/md/md.h | 5 +++++
>   2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 309b3af906ad..525f65682356 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9432,10 +9432,17 @@ void md_reap_sync_thread(struct mddev *mddev)
>   {
>   	struct md_rdev *rdev;
>   	sector_t old_dev_sectors = mddev->dev_sectors;
> -	bool is_reshaped = false;
> +	bool is_reshaped = false, is_locked = false;
>   
> +	if (mddev_is_locked(mddev)) {
> +		is_locked = true;
> +		mddev_unlock(mddev);


Hmmm. Can it be excluded that another task is holding the mutex?

Best
Donald


> +	}
>   	/* resync has finished, collect result */
>   	md_unregister_thread(&mddev->sync_thread);
> +	if (is_locked)
> +		mddev_lock_nointr(mddev);
> +
>   	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>   	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
>   	    mddev->degraded != mddev->raid_disks) {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 6ac283864533..af6f3978b62b 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -549,6 +549,11 @@ static inline int mddev_trylock(struct mddev *mddev)
>   }
>   extern void mddev_unlock(struct mddev *mddev);
>   
> +static inline int mddev_is_locked(struct mddev *mddev)
> +{
> +	return mutex_is_locked(&mddev->reconfig_mutex);
> +}
> +
>   static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
>   {
>   	atomic_add(nr_sectors, &bdev->bd_disk->sync_io);
> 


-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-09  8:18   ` Donald Buczek
@ 2022-05-09  8:48     ` Guoqing Jiang
  0 siblings, 0 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-09  8:48 UTC (permalink / raw)
  To: Donald Buczek, song; +Cc: linux-raid



On 5/9/22 4:18 PM, Donald Buczek wrote:
>
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -9432,10 +9432,17 @@ void md_reap_sync_thread(struct mddev *mddev)
>>   {
>>       struct md_rdev *rdev;
>>       sector_t old_dev_sectors = mddev->dev_sectors;
>> -    bool is_reshaped = false;
>> +    bool is_reshaped = false, is_locked = false;
>>   +    if (mddev_is_locked(mddev)) {
>> +        is_locked = true;
>> +        mddev_unlock(mddev);
>
>
> Hmmm. Can it be excluded, that another task is holding the mutex?

Yes, it is hacky as Song said, so we will probably go back to v2.

Thanks,
Guoqing


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-09  8:09     ` Guoqing Jiang
@ 2022-05-09  9:32       ` Wols Lists
  2022-05-09 10:37         ` Guoqing Jiang
  2022-05-10  6:44       ` Song Liu
  1 sibling, 1 reply; 45+ messages in thread
From: Wols Lists @ 2022-05-09  9:32 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu; +Cc: Donald Buczek, linux-raid

On 09/05/2022 09:09, Guoqing Jiang wrote:
> I can switch back to V2 if you think that is the correct way to do 
> though no
> one comment about the change in dm-raid.

DON'T QUOTE ME ON THIS

but it is very confusing, as I believe dm-raid is part of Device
Management, which actually has nothing to do with md-raid (apart from
the fact that, like many block layer drivers, md-raid uses dm to manage
the device beneath it).

Cheers,
Wol


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-09  9:32       ` Wols Lists
@ 2022-05-09 10:37         ` Guoqing Jiang
  2022-05-09 11:19           ` Wols Lists
  0 siblings, 1 reply; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-09 10:37 UTC (permalink / raw)
  To: Wols Lists, Song Liu; +Cc: Donald Buczek, linux-raid



On 5/9/22 5:32 PM, Wols Lists wrote:
> On 09/05/2022 09:09, Guoqing Jiang wrote:
>> I can switch back to V2 if you think that is the correct way to do 
>> though no
>> one comment about the change in dm-raid.
>
> DON'T QUOTE ME ON THIS
>
> but it is very confusing, as I believe dm-raid is part of Device 
> Managment, which is actually nothing to do with md-raid


No, dm-raid (I mean dm-raid.c here) is a bridge between the dm framework
and md raid, which lets dm reuse the md raid code to avoid
re-implementing the raid logic inside dm itself.
> (apart from, like many block layer drivers, md-raid uses dm to manage 
> the device beneath it.)

The above statement is very confusing.

Thanks,
Guoqing


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-09 10:37         ` Guoqing Jiang
@ 2022-05-09 11:19           ` Wols Lists
  2022-05-09 11:26             ` Guoqing Jiang
  0 siblings, 1 reply; 45+ messages in thread
From: Wols Lists @ 2022-05-09 11:19 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu; +Cc: Donald Buczek, linux-raid

On 09/05/2022 11:37, Guoqing Jiang wrote:
> 
> 
> On 5/9/22 5:32 PM, Wols Lists wrote:
>> On 09/05/2022 09:09, Guoqing Jiang wrote:
>>> I can switch back to V2 if you think that is the correct way to do 
>>> though no
>>> one comment about the change in dm-raid.
>>
>> DON'T QUOTE ME ON THIS
>>
>> but it is very confusing, as I believe dm-raid is part of Device 
>> Managment, which is actually nothing to do with md-raid
> 
> 
> No, dm-raid (I mean dm-raid.c here) is a bridge between dm framwork and 
> md raid, which
> makes dm can reuse md raid code to avoid re-implementation raid logic 
> inside dm itself.

Hmm...

Which backs up what I thought - if it's used by the dm people to call
md-raid, then asking for info/comments about it on the md list is likely
to get you nowhere...

>> (apart from, like many block layer drivers, md-raid uses dm to manage 
>> the device beneath it.)
> 
> The above statement is very confusing.
> 
Yup. I've never understood it. But, from all my time on the list, it is 
noticeable that nobody seems to know or care much about dm, and this is 
simply one more data point that agrees with that impression. If you want 
to know about dm-raid, you probably need to go where the dm people hang out.

Cheers,
Wol


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-09 11:19           ` Wols Lists
@ 2022-05-09 11:26             ` Guoqing Jiang
  0 siblings, 0 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-09 11:26 UTC (permalink / raw)
  To: Wols Lists, Song Liu; +Cc: Donald Buczek, linux-raid



On 5/9/22 7:19 PM, Wols Lists wrote:
>
> Hmm...
>
> Which backs up what I thought - if it's used by the dm people to call 
> md-raid, then asking for info/comments about it on the md list is 
> likely to get you no-where...

I don't think you are aware of V2, which had been cc'ed to the dm list; just FYI.

https://lore.kernel.org/all/1613177399-22024-1-git-send-email-guoqing.jiang@cloud.ionos.com/#t

Thanks,
Guoqing


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-09  8:09     ` Guoqing Jiang
  2022-05-09  9:32       ` Wols Lists
@ 2022-05-10  6:44       ` Song Liu
  2022-05-10 12:01         ` Donald Buczek
  2022-05-20 18:27         ` Logan Gunthorpe
  1 sibling, 2 replies; 45+ messages in thread
From: Song Liu @ 2022-05-10  6:44 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Donald Buczek, linux-raid

On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 5/9/22 2:37 PM, Song Liu wrote:
> > On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev>  wrote:
> >> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com>
> >>
> >> Unregister sync_thread doesn't need to hold reconfig_mutex since it
> >> doesn't reconfigure array.
> >>
> >> And it could cause deadlock problem for raid5 as follows:
> >>
> >> 1. process A tried to reap sync thread with reconfig_mutex held after echo
> >>     idle to sync_action.
> >> 2. raid5 sync thread was blocked if there were too many active stripes.
> >> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
> >>     which causes the number of active stripes can't be decreased.
> >> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
> >>     to hold reconfig_mutex.
> >>
> >> More details in the link:
> >> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
> >>
> >> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
> >> (thanks for the report from kernel test robot<lkp@intel.com>) if the
> >> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
> > mddev_is_locked() feels really hacky to me. It cannot tell whether
> > mddev is locked
> > by current thread. So technically, we can unlock reconfigure_mutex for
> > other thread
> > by accident, no?
>
> I can switch back to V2 if you think that is the correct way to do though no
> one comment about the change in dm-raid.

I guess v2 is the best at the moment. I pushed a slightly modified v2 to
md-next.

Thanks for working on this.
Song


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-10  6:44       ` Song Liu
@ 2022-05-10 12:01         ` Donald Buczek
  2022-05-10 12:09           ` Guoqing Jiang
  2022-05-20 18:27         ` Logan Gunthorpe
  1 sibling, 1 reply; 45+ messages in thread
From: Donald Buczek @ 2022-05-10 12:01 UTC (permalink / raw)
  To: Song Liu, Guoqing Jiang; +Cc: linux-raid

On 5/10/22 8:44 AM, Song Liu wrote:
> On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>>
>>
>>
>> On 5/9/22 2:37 PM, Song Liu wrote:
>>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev>  wrote:
>>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com>
>>>>
>>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it
>>>> doesn't reconfigure array.
>>>>
>>>> And it could cause deadlock problem for raid5 as follows:
>>>>
>>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo
>>>>      idle to sync_action.
>>>> 2. raid5 sync thread was blocked if there were too many active stripes.
>>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
>>>>      which causes the number of active stripes can't be decreased.
>>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
>>>>      to hold reconfig_mutex.
>>>>
>>>> More details in the link:
>>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
>>>>
>>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
>>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the
>>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
>>> mddev_is_locked() feels really hacky to me. It cannot tell whether
>>> mddev is locked
>>> by current thread. So technically, we can unlock reconfigure_mutex for
>>> other thread
>>> by accident, no?
>>
>> I can switch back to V2 if you think that is the correct way to do though no
>> one comment about the change in dm-raid.
> 
> I guess v2 is the best at the moment. I pushed a slightly modified v2 to
> md-next.

I think this can be used to get a double-free from md_unregister_thread.

Please review

https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/
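
In short, the interleaving I have in mind (hypothetical, two unsynchronized
callers of the current md_unregister_thread()):

    caller A                            caller B
    thread = *threadp;   /* non-NULL */
                                        thread = *threadp;   /* same pointer */
    spin_lock(&pers_lock);
    *threadp = NULL;
    spin_unlock(&pers_lock);
    kthread_stop(thread->tsk);
    kfree(thread);
                                        kthread_stop(thread->tsk);  /* use after free */
                                        kfree(thread);              /* double free */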

Best
   Donald

> 
> Thanks for working on this.
> Song
> 


-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-10 12:01         ` Donald Buczek
@ 2022-05-10 12:09           ` Guoqing Jiang
  2022-05-10 12:35             ` Donald Buczek
  0 siblings, 1 reply; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-10 12:09 UTC (permalink / raw)
  To: Donald Buczek, Song Liu; +Cc: linux-raid



On 5/10/22 8:01 PM, Donald Buczek wrote:
>
>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to
>> md-next.
>
> I think, this can be used to get a double-free from md_unregister_thread.
>
> Please review
>
> https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/

That is supposed to be addressed by the second one; please consider it too.

[PATCH 2/2] md: protect md_unregister_thread from reentrancy

Thanks,
Guoqing


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-10 12:09           ` Guoqing Jiang
@ 2022-05-10 12:35             ` Donald Buczek
  2022-05-10 18:02               ` Song Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Donald Buczek @ 2022-05-10 12:35 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu; +Cc: linux-raid

On 5/10/22 2:09 PM, Guoqing Jiang wrote:
> 
> 
> On 5/10/22 8:01 PM, Donald Buczek wrote:
>>
>>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to
>>> md-next.
>>
>> I think, this can be used to get a double-free from md_unregister_thread.
>>
>> Please review
>>
>> https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/
> 
> That is supposed to be addressed by the second one, pls consider it too.

Right, but this has not been pulled into md-next. I just wanted to note that the current state of md-next has this problem.

If the other patch is taken, too, and works as intended, that would be solved.

> [PATCH 2/2] md: protect md_unregister_thread from reentrancy

Looks good to me.

Best

   Donald

> 
> Thanks,
> Guoqing


-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-10 12:35             ` Donald Buczek
@ 2022-05-10 18:02               ` Song Liu
  2022-05-11  8:10                 ` Guoqing Jiang
  0 siblings, 1 reply; 45+ messages in thread
From: Song Liu @ 2022-05-10 18:02 UTC (permalink / raw)
  To: Donald Buczek; +Cc: Guoqing Jiang, linux-raid

On Tue, May 10, 2022 at 5:35 AM Donald Buczek <buczek@molgen.mpg.de> wrote:
>
> On 5/10/22 2:09 PM, Guoqing Jiang wrote:
> >
> >
> > On 5/10/22 8:01 PM, Donald Buczek wrote:
> >>
> >>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to
> >>> md-next.
> >>
> >> I think, this can be used to get a double-free from md_unregister_thread.
> >>
> >> Please review
> >>
> >> https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/
> >
> > That is supposed to be addressed by the second one, pls consider it too.
>
> Right, but this has not been pulled into md-next. I just wanted to note, that the current state of md-next has this problem.
>
> If the other patch is taken, too, and works as intended, that would be solved.
>
> > [PATCH 2/2] md: protect md_unregister_thread from reentrancy

Good catch!

Guoqing, current 2/2 doesn't apply cleanly. Could you please resend it on top of
md-next?

Thanks,
Song


* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-10 18:02               ` Song Liu
@ 2022-05-11  8:10                 ` Guoqing Jiang
  2022-05-11 21:45                   ` Song Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-11  8:10 UTC (permalink / raw)
  To: Song Liu, Donald Buczek; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1621 bytes --]



On 5/11/22 2:02 AM, Song Liu wrote:
> On Tue, May 10, 2022 at 5:35 AM Donald Buczek <buczek@molgen.mpg.de> wrote:
>> On 5/10/22 2:09 PM, Guoqing Jiang wrote:
>>>
>>> On 5/10/22 8:01 PM, Donald Buczek wrote:
>>>>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to
>>>>> md-next.
>>>> I think, this can be used to get a double-free from md_unregister_thread.
>>>>
>>>> Please review
>>>>
>>>> https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/
>>> That is supposed to be addressed by the second one, pls consider it too.
>> Right, but this has not been pulled into md-next. I just wanted to note, that the current state of md-next has this problem.

Thanks for reminder.

>> If the other patch is taken, too, and works as intended, that would be solved.
>>
>>> [PATCH 2/2] md: protect md_unregister_thread from reentrancy
> Good catch!
>
> Guoqing, current 2/2 doesn't apply cleanly. Could you please resend it on top of
> md-next?

Hmm, no issue from my side.

~/source/md> git am 0001-md-protect-md_unregister_thread-from-reentrancy.patch
Applying: md: protect md_unregister_thread from reentrancy

~/source/md> git log --oneline |head -5
dc7147a88766 md: protect md_unregister_thread from reentrancy
5a36c493dc82 md: don't unregister sync_thread with reconfig_mutex held
49c3b9266a71 block: null_blk: Improve device creation with configfs
db060f54e0c5 block: null_blk: Cleanup messages
b3a0a73e8a79 block: null_blk: Cleanup device creation and deletion

Anyway, it is attached. I will rebase it onto your latest tree if
something goes wrong.

Thanks,
Guoqing

[-- Attachment #2: 0001-md-protect-md_unregister_thread-from-reentrancy.patch --]
[-- Type: text/x-patch, Size: 1751 bytes --]

From a2da80f62f15023e3fee7a02488c143dfff647b3 Mon Sep 17 00:00:00 2001
From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Date: Fri, 29 Apr 2022 16:49:09 +0800
Subject: [PATCH 2/2] md: protect md_unregister_thread from reentrancy

Generally, md_unregister_thread is called with reconfig_mutex held, but
raid_message in dm-raid doesn't hold reconfig_mutex to unregister the
thread, so md_unregister_thread can in theory be called simultaneously
from two call sites.

And after the previous commit, which removed the protection of
reconfig_mutex for md_unregister_thread completely, the potential issue
could be worse than before.

Let's take pers_lock at the beginning of the function to protect against
reentrancy.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/md.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a70e7f0f9268..c401e063bec8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7962,17 +7962,22 @@ EXPORT_SYMBOL(md_register_thread);
 
 void md_unregister_thread(struct md_thread **threadp)
 {
-	struct md_thread *thread = *threadp;
-	if (!thread)
-		return;
-	pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
-	/* Locking ensures that mddev_unlock does not wake_up a
+	struct md_thread *thread;
+
+	/*
+	 * Locking ensures that mddev_unlock does not wake_up a
 	 * non-existent thread
 	 */
 	spin_lock(&pers_lock);
+	thread = *threadp;
+	if (!thread) {
+		spin_unlock(&pers_lock);
+		return;
+	}
 	*threadp = NULL;
 	spin_unlock(&pers_lock);
 
+	pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
 	kthread_stop(thread->tsk);
 	kfree(thread);
 }
-- 
2.31.1



* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-11  8:10                 ` Guoqing Jiang
@ 2022-05-11 21:45                   ` Song Liu
  0 siblings, 0 replies; 45+ messages in thread
From: Song Liu @ 2022-05-11 21:45 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Donald Buczek, linux-raid

On Wed, May 11, 2022 at 1:10 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 5/11/22 2:02 AM, Song Liu wrote:
> > On Tue, May 10, 2022 at 5:35 AM Donald Buczek <buczek@molgen.mpg.de> wrote:
> >> On 5/10/22 2:09 PM, Guoqing Jiang wrote:
> >>>
> >>> On 5/10/22 8:01 PM, Donald Buczek wrote:
> >>>>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to
> >>>>> md-next.
> >>>> I think, this can be used to get a double-free from md_unregister_thread.
> >>>>
> >>>> Please review
> >>>>
> >>>> https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/
> >>> That is supposed to be addressed by the second one, pls consider it too.
> >> Right, but this has not been pulled into md-next. I just wanted to note, that the current state of md-next has this problem.
>
> Thanks for reminder.
>
> >> If the other patch is taken, too, and works as intended, that would be solved.
> >>
> >>> [PATCH 2/2] md: protect md_unregister_thread from reentrancy
> > Good catch!
> >
> > Guoqing, current 2/2 doesn't apply cleanly. Could you please resend it on top of
> > md-next?
>
> Hmm, no issue from my side.
>
> ~/source/md> git am
> 0001-md-protect-md_unregister_thread-from-reentrancy.patch
> Applying: md: protect md_unregister_thread from reentrancy
>
> ~/source/md> git log --oneline |head -5
> dc7147a88766 md: protect md_unregister_thread from reentrancy
> 5a36c493dc82 md: don't unregister sync_thread with reconfig_mutex held
> 49c3b9266a71 block: null_blk: Improve device creation with configfs
> db060f54e0c5 block: null_blk: Cleanup messages
> b3a0a73e8a79 block: null_blk: Cleanup device creation and deletion
>
> Anyway, it is attached. I will rebase it to your latest tree if
> something gets wrong.

Applied to md-next. Thanks!

Song

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-10  6:44       ` Song Liu
  2022-05-10 12:01         ` Donald Buczek
@ 2022-05-20 18:27         ` Logan Gunthorpe
  2022-05-21 18:23           ` Donald Buczek
  2022-06-02  8:12           ` Xiao Ni
  1 sibling, 2 replies; 45+ messages in thread
From: Logan Gunthorpe @ 2022-05-20 18:27 UTC (permalink / raw)
  To: Song Liu, Guoqing Jiang; +Cc: Donald Buczek, linux-raid


Hi,

On 2022-05-10 00:44, Song Liu wrote:
> On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>> On 5/9/22 2:37 PM, Song Liu wrote:
>>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev>  wrote:
>>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com>
>>>>
>>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it
>>>> doesn't reconfigure array.
>>>>
>>>> And it could cause deadlock problem for raid5 as follows:
>>>>
>>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo
>>>>     idle to sync_action.
>>>> 2. raid5 sync thread was blocked if there were too many active stripes.
>>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
>>>>     which causes the number of active stripes can't be decreased.
>>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
>>>>     to hold reconfig_mutex.
>>>>
>>>> More details in the link:
>>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
>>>>
>>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
>>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the
>>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
>>> mddev_is_locked() feels really hacky to me. It cannot tell whether
>>> mddev is locked
>>> by current thread. So technically, we can unlock reconfigure_mutex for
>>> other thread
>>> by accident, no?
>>
>> I can switch back to V2 if you think that is the correct way to do though no
>> one comment about the change in dm-raid.
> 
> I guess v2 is the best at the moment. I pushed a slightly modified v2 to
> md-next.
> 

I noticed a clear regression with mdadm tests with this patch in md-next
(7e6ba434cc6080).

Before the patch, tests 07reshape5intr and 07revert-grow would fail
fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
for the latter).

After this patch, both tests always fail.

I don't have time to dig into why this is, but it would be nice if
someone can at least fix the regression. It is hard to make any progress
on these tests if we are continuing to further break them.

Logan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-20 18:27         ` Logan Gunthorpe
@ 2022-05-21 18:23           ` Donald Buczek
  2022-05-23  1:08             ` Guoqing Jiang
  2022-05-24 15:51             ` Logan Gunthorpe
  2022-06-02  8:12           ` Xiao Ni
  1 sibling, 2 replies; 45+ messages in thread
From: Donald Buczek @ 2022-05-21 18:23 UTC (permalink / raw)
  To: Logan Gunthorpe, Song Liu, Guoqing Jiang; +Cc: linux-raid

On 20.05.22 20:27, Logan Gunthorpe wrote:
> 
> Hi,
> 
> On 2022-05-10 00:44, Song Liu wrote:
>> On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>>> On 5/9/22 2:37 PM, Song Liu wrote:
>>>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev>  wrote:
>>>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com>
>>>>>
>>>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it
>>>>> doesn't reconfigure array.
>>>>>
>>>>> And it could cause deadlock problem for raid5 as follows:
>>>>>
>>>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo
>>>>>      idle to sync_action.
>>>>> 2. raid5 sync thread was blocked if there were too many active stripes.
>>>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
>>>>>      which causes the number of active stripes can't be decreased.
>>>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
>>>>>      to hold reconfig_mutex.
>>>>>
>>>>> More details in the link:
>>>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
>>>>>
>>>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
>>>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the
>>>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
>>>> mddev_is_locked() feels really hacky to me. It cannot tell whether
>>>> mddev is locked
>>>> by current thread. So technically, we can unlock reconfigure_mutex for
>>>> other thread
>>>> by accident, no?
>>>
>>> I can switch back to V2 if you think that is the correct way to do though no
>>> one comment about the change in dm-raid.
>>
>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to
>> md-next.
>>
> 
> I noticed a clear regression with mdadm tests with this patch in md-next
> (7e6ba434cc6080).
> 
> Before the patch, tests 07reshape5intr and 07revert-grow would fail
> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
> for the latter).
> 
> After this patch, both tests always fail.
> 
> I don't have time to dig into why this is, but it would be nice if
> someone can at least fix the regression. It is hard to make any progress
> on these tests if we are continuing to further break them.

Hmmm. I wanted to try to help a bit by reproducing and digging into this.

But it seems that more or less ALL tests hang my system one way or another.

I've used a qemu/kvm machine with md-next and mdraid master.

Is this supposed to work?

I can investigate the bugs I see, but probably that is a waste of time because I'm doing something wrong fundamentally?

This is an example from 00raid0:

[   57.434064] md: md0 stopped.
[   57.586951] md0: detected capacity change from 0 to 107520
[   57.618454] BUG: kernel NULL pointer dereference, address: 0000000000000094
[   57.620830] #PF: supervisor read access in kernel mode
[   57.622554] #PF: error_code(0x0000) - not-present page
[   57.624273] PGD 800000010d5ee067 P4D 800000010d5ee067 PUD 10df28067 PMD 0
[   57.626548] Oops: 0000 [#1] PREEMPT SMP PTI
[   57.627942] CPU: 3 PID: 1064 Comm: mkfs.ext3 Not tainted 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
[   57.630952] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[   57.635927] RIP: 0010:bfq_bio_bfqg+0x26/0x80
[   57.638027] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 5c c2 08 <80> bb 94 00 00 00 00 70
[   57.645295] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006
[   57.647414] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 0000000000000001
[   57.650039] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: ffff8881032ba180
[   57.652541] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: ffff88810318cb00
[   57.654852] R10: 0000000000000000 R11: ffff8881032ba180 R12: ffff88810318cae0
[   57.657128] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: ffffc90001c27c00
[   57.659316] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) knlGS:0000000000000000
[   57.661700] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   57.663461] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 0000000000170ee0
[   57.665453] Call Trace:
[   57.666479]  <TASK>
[   57.667382]  bfq_bic_update_cgroup+0x28/0x1b0
[   57.668724]  bfq_insert_requests+0x233/0x2340
[   57.670049]  ? ioc_find_get_icq+0x21c/0x2a0
[   57.671315]  ? bfq_prepare_request+0x11/0x30
[   57.672565]  blk_mq_sched_insert_requests+0x5c/0x150
[   57.673891]  blk_mq_flush_plug_list+0xe1/0x2a0
[   57.675140]  __blk_flush_plug+0xdf/0x120
[   57.676259]  io_schedule_prepare+0x3d/0x50
[   57.677373]  io_schedule_timeout+0xf/0x40
[   57.678465]  wait_for_completion_io+0x78/0x140
[   57.679578]  submit_bio_wait+0x5b/0x80
[   57.680575]  blkdev_issue_discard+0x65/0xb0
[   57.681640]  blkdev_common_ioctl+0x391/0x8f0
[   57.682712]  blkdev_ioctl+0x216/0x2a0
[   57.683648]  __x64_sys_ioctl+0x76/0xb0
[   57.684607]  do_syscall_64+0x42/0x90
[   57.685527]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   57.686645] RIP: 0033:0x7fdfce56dc17
[   57.687535] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 08
[   57.691055] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   57.692537] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 00007fdfce56dc17
[   57.693905] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 0000000000000003
[   57.695288] RBP: 0000000000460960 R08: 0000000000000400 R09: 0000000000000000
[   57.696645] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[   57.697954] R13: 000000000000d200 R14: 0000000000000000 R15: 0000000000000000
[   57.699281]  </TASK>
[   57.699901] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm drm_kms_helper kvm drm fb_sys_fops vi4
[   57.705955] CR2: 0000000000000094
[   57.706710] ---[ end trace 0000000000000000 ]---
[   57.707599] RIP: 0010:bfq_bio_bfqg+0x26/0x80
[   57.708434] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 5c c2 08 <80> bb 94 00 00 00 00 70
[   57.711426] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006
[   57.712391] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 0000000000000001
[   57.713605] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: ffff8881032ba180
[   57.714811] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: ffff88810318cb00
[   57.716018] R10: 0000000000000000 R11: ffff8881032ba180 R12: ffff88810318cae0
[   57.717236] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: ffffc90001c27c00
[   57.718438] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) knlGS:0000000000000000
[   57.719778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   57.720808] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 0000000000170ee0
[   57.722019] note: mkfs.ext3[1064] exited with preempt_count 1
[   57.723067] ------------[ cut here ]------------
[   57.723960] WARNING: CPU: 3 PID: 1064 at kernel/exit.c:741 do_exit+0x8cb/0xbc0
[   57.725196] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm drm_kms_helper kvm drm fb_sys_fops vi4
[   57.731011] CPU: 3 PID: 1064 Comm: mkfs.ext3 Tainted: G      D           5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
[   57.732704] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[   57.734853] RIP: 0010:do_exit+0x8cb/0xbc0
[   57.735711] Code: e9 13 ff ff ff 48 8b bb e0 04 00 00 31 f6 e8 4c db ff ff e9 98 fd ff ff 4c 89 e6 bf 05 06 00 00 e8 8a c8 00 00 e9 41 f8 ff ff <0f> 0b e9 6b f7 ff ff 4b
[   57.738851] RSP: 0018:ffffc90001c27ee8 EFLAGS: 00010082
[   57.739899] RAX: 0000000000000000 RBX: ffff888101e48000 RCX: 0000000000000000
[   57.741196] RDX: 0000000000000001 RSI: ffffffff8220a969 RDI: 0000000000000009
[   57.742485] RBP: 0000000000000009 R08: 0000000000000000 R09: c0000000ffffbfff
[   57.743777] R10: 00007fdfce47d440 R11: ffffc90001c27d60 R12: 0000000000000009
[   57.745081] R13: 0000000000000046 R14: 0000000000000000 R15: 0000000000000000
[   57.746388] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) knlGS:0000000000000000
[   57.747806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   57.748931] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 0000000000170ee0
[   57.750225] Call Trace:
[   57.750894]  <TASK>
[   57.751535]  make_task_dead+0x41/0xf0
[   57.752369]  rewind_stack_and_make_dead+0x17/0x17
[   57.753336] RIP: 0033:0x7fdfce56dc17
[   57.754155] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 08
[   57.757318] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   57.758669] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 00007fdfce56dc17
[   57.759956] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 0000000000000003
[   57.761256] RBP: 0000000000460960 R08: 0000000000000400 R09: 0000000000000000
[   57.762531] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[   57.763806] R13: 000000000000d200 R14: 0000000000000000 R15: 0000000000000000
[   57.765177]  </TASK>
[   57.765813] ---[ end trace 0000000000000000 ]---
[   57.790046] md0: detected capacity change from 107520 to 0
[   57.792834] md: md0 stopped.
[   78.843853] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   78.845334] rcu: 	10-...0: (0 ticks this GP) idle=07b/1/0x4000000000000000 softirq=1140/1140 fqs=4805
[   78.847246] 	(detected by 13, t=21005 jiffies, g=9013, q=1419)
[   78.848619] Sending NMI from CPU 13 to CPUs 10:
[   78.849810] NMI backtrace for cpu 10
[   78.849813] CPU: 10 PID: 1081 Comm: mdadm Tainted: G      D W         5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
[   78.849816] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[   78.849817] RIP: 0010:queued_spin_lock_slowpath+0x4c/0x1d0
[   78.849832] Code: 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 75 0f b8 01 00 00 00 66 89 07 5b 5d 41 5c c3 f3 90 <8b> 07 84 c0 75 f8 eb e7
[   78.849834] RSP: 0018:ffffc90001c9f9e0 EFLAGS: 00000002
[   78.849837] RAX: 0000000000040101 RBX: ffff88810c914fc8 RCX: 0000000000000000
[   78.849838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888102177c30
[   78.849840] RBP: 0000000000000000 R08: ffff88810c914fc8 R09: ffff888106a4ed10
[   78.849841] R10: ffffc90001c9fae8 R11: ffff888101b048d8 R12: ffff888103833000
[   78.849842] R13: ffff888102177800 R14: ffffc90001c9fb20 R15: ffffc90001c9fa78
[   78.849844] FS:  00007fd3d66c4340(0000) GS:ffff8882b5c80000(0000) knlGS:0000000000000000
[   78.849847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   78.849848] CR2: 00000000004a5b58 CR3: 000000010d438001 CR4: 0000000000170ee0
[   78.849850] Call Trace:
[   78.849853]  <TASK>
[   78.849855]  bfq_insert_requests+0xae/0x2340
[   78.849862]  ? submit_bio_noacct_nocheck+0x225/0x2b0
[   78.849868]  blk_mq_sched_insert_requests+0x5c/0x150
[   78.849872]  blk_mq_flush_plug_list+0xe1/0x2a0
[   78.849876]  __blk_flush_plug+0xdf/0x120
[   78.849879]  blk_finish_plug+0x27/0x40
[   78.849882]  read_pages+0x15b/0x360
[   78.849891]  page_cache_ra_unbounded+0x120/0x170
[   78.849894]  filemap_get_pages+0xdd/0x5f0
[   78.849899]  filemap_read+0xbf/0x350
[   78.849902]  ? __mod_memcg_lruvec_state+0x72/0xc0
[   78.849907]  ? __mod_lruvec_page_state+0xb4/0x160
[   78.849909]  ? folio_add_lru+0x51/0x80
[   78.849912]  ? _raw_spin_unlock+0x12/0x30
[   78.849916]  ? __handle_mm_fault+0xdee/0x14d0
[   78.849921]  blkdev_read_iter+0xa9/0x180
[   78.849924]  new_sync_read+0x109/0x180
[   78.849929]  vfs_read+0x187/0x1b0
[   78.849932]  ksys_read+0xa1/0xe0
[   78.849935]  do_syscall_64+0x42/0x90
[   78.849938]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   78.849941] RIP: 0033:0x7fd3d6322f8e
[   78.849944] Code: c0 e9 c6 fe ff ff 48 8d 3d a7 07 0a 00 48 83 ec 08 e8 b6 e1 01 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 59
[   78.849945] RSP: 002b:00007ffe92d46ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   78.849948] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd3d6322f8e
[   78.849949] RDX: 0000000000001000 RSI: 00000000004a3000 RDI: 0000000000000003
[   78.849950] RBP: 0000000000000003 R08: 00000000004a3000 R09: 0000000000000003
[   78.849951] R10: 00007fd3d623d0a8 R11: 0000000000000246 R12: 00000000004a2a60
[   78.849952] R13: 0000000000000000 R14: 00000000004a3000 R15: 000000000048a4a0
[   78.849954]  </TASK>

Best
   Donald

> Logan

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-21 18:23           ` Donald Buczek
@ 2022-05-23  1:08             ` Guoqing Jiang
  2022-05-23  5:41               ` Donald Buczek
  2022-05-24 15:51             ` Logan Gunthorpe
  1 sibling, 1 reply; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-23  1:08 UTC (permalink / raw)
  To: Donald Buczek, Logan Gunthorpe, Song Liu; +Cc: linux-raid



On 5/22/22 2:23 AM, Donald Buczek wrote:
> On 20.05.22 20:27, Logan Gunthorpe wrote:
>>
>> Hi,
>>
>> On 2022-05-10 00:44, Song Liu wrote:
>>> On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang 
>>> <guoqing.jiang@linux.dev> wrote:
>>>> On 5/9/22 2:37 PM, Song Liu wrote:
>>>>> On Fri, May 6, 2022 at 4:37 AM Guoqing 
>>>>> Jiang<guoqing.jiang@linux.dev>  wrote:
>>>>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com>
>>>>>>
>>>>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it
>>>>>> doesn't reconfigure array.
>>>>>>
>>>>>> And it could cause deadlock problem for raid5 as follows:
>>>>>>
>>>>>> 1. process A tried to reap sync thread with reconfig_mutex held 
>>>>>> after echo
>>>>>>      idle to sync_action.
>>>>>> 2. raid5 sync thread was blocked if there were too many active 
>>>>>> stripes.
>>>>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from 
>>>>>> upper layer)
>>>>>>      which causes the number of active stripes can't be decreased.
>>>>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was 
>>>>>> not able
>>>>>>      to hold reconfig_mutex.
>>>>>>
>>>>>> More details in the link:
>>>>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t 
>>>>>>
>>>>>>
>>>>>> Let's call unregister thread between mddev_unlock and 
>>>>>> mddev_lock_nointr
>>>>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the
>>>>>> reconfig_mutex is held, and mddev_is_locked is introduced 
>>>>>> accordingly.
>>>>> mddev_is_locked() feels really hacky to me. It cannot tell whether
>>>>> mddev is locked
>>>>> by current thread. So technically, we can unlock reconfigure_mutex 
>>>>> for
>>>>> other thread
>>>>> by accident, no?
>>>>
>>>> I can switch back to V2 if you think that is the correct way to do 
>>>> though no
>>>> one comment about the change in dm-raid.
>>>
>>> I guess v2 is the best at the moment. I pushed a slightly modified 
>>> v2 to
>>> md-next.
>>>
>>
>> I noticed a clear regression with mdadm tests with this patch in md-next
>> (7e6ba434cc6080).
>>
>> Before the patch, tests 07reshape5intr and 07revert-grow would fail
>> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
>> for the latter).
>>
>> After this patch, both tests always fail.
>>
>> I don't have time to dig into why this is, but it would be nice if
>> someone can at least fix the regression. It is hard to make any progress
>> on these tests if we are continuing to further break them.
>
> Hmmm. I wanted to try to help a bit by reproducing and digging into this.
>
> But it seems that more or less ALL tests hang my system one way or 
> another.
>
> I've used a qemu/kvm machine with md-next and mdraid master.
>
> Is this supposed to work?
>
> I can investigate the bugs I see, but probably that is a waste of time 
> because I'm doing something wrong fundamentally?
>
> This is an example from 00raid0:
>
> [   57.434064] md: md0 stopped.
> [   57.586951] md0: detected capacity change from 0 to 107520
> [   57.618454] BUG: kernel NULL pointer dereference, address: 
> 0000000000000094
> [   57.620830] #PF: supervisor read access in kernel mode
> [   57.622554] #PF: error_code(0x0000) - not-present page
> [   57.624273] PGD 800000010d5ee067 P4D 800000010d5ee067 PUD 10df28067 
> PMD 0
> [   57.626548] Oops: 0000 [#1] PREEMPT SMP PTI
> [   57.627942] CPU: 3 PID: 1064 Comm: mkfs.ext3 Not tainted 
> 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
> [   57.630952] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
> [   57.635927] RIP: 0010:bfq_bio_bfqg+0x26/0x80
> [   57.638027] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 
> 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 
> 5c c2 08 <80> bb 94 00 00 00 00 70
> [   57.645295] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006
> [   57.647414] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 
> 0000000000000001
> [   57.650039] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: 
> ffff8881032ba180
> [   57.652541] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: 
> ffff88810318cb00
> [   57.654852] R10: 0000000000000000 R11: ffff8881032ba180 R12: 
> ffff88810318cae0
> [   57.657128] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: 
> ffffc90001c27c00
> [   57.659316] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) 
> knlGS:0000000000000000
> [   57.661700] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   57.663461] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 
> 0000000000170ee0
> [   57.665453] Call Trace:
> [   57.666479]  <TASK>
> [   57.667382]  bfq_bic_update_cgroup+0x28/0x1b0
> [   57.668724]  bfq_insert_requests+0x233/0x2340
> [   57.670049]  ? ioc_find_get_icq+0x21c/0x2a0
> [   57.671315]  ? bfq_prepare_request+0x11/0x30
> [   57.672565]  blk_mq_sched_insert_requests+0x5c/0x150
> [   57.673891]  blk_mq_flush_plug_list+0xe1/0x2a0
> [   57.675140]  __blk_flush_plug+0xdf/0x120
> [   57.676259]  io_schedule_prepare+0x3d/0x50
> [   57.677373]  io_schedule_timeout+0xf/0x40
> [   57.678465]  wait_for_completion_io+0x78/0x140
> [   57.679578]  submit_bio_wait+0x5b/0x80
> [   57.680575]  blkdev_issue_discard+0x65/0xb0
> [   57.681640]  blkdev_common_ioctl+0x391/0x8f0
> [   57.682712]  blkdev_ioctl+0x216/0x2a0
> [   57.683648]  __x64_sys_ioctl+0x76/0xb0
> [   57.684607]  do_syscall_64+0x42/0x90
> [   57.685527]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   57.686645] RIP: 0033:0x7fdfce56dc17
> [   57.687535] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 
> 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 
> 00 0f 05 <48> 3d 01 f0 ff ff 73 08
> [   57.691055] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000010
> [   57.692537] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 
> 00007fdfce56dc17
> [   57.693905] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 
> 0000000000000003
> [   57.695288] RBP: 0000000000460960 R08: 0000000000000400 R09: 
> 0000000000000000
> [   57.696645] R10: 0000000000000000 R11: 0000000000000246 R12: 
> 0000000000000000
> [   57.697954] R13: 000000000000d200 R14: 0000000000000000 R15: 
> 0000000000000000
> [   57.699281]  </TASK>
> [   57.699901] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q garp 
> stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm 
> drm_kms_helper kvm drm fb_sys_fops vi4
> [   57.705955] CR2: 0000000000000094
> [   57.706710] ---[ end trace 0000000000000000 ]---
> [   57.707599] RIP: 0010:bfq_bio_bfqg+0x26/0x80
> [   57.708434] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 
> 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 
> 5c c2 08 <80> bb 94 00 00 00 00 70
> [   57.711426] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006
> [   57.712391] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 
> 0000000000000001
> [   57.713605] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: 
> ffff8881032ba180
> [   57.714811] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: 
> ffff88810318cb00
> [   57.716018] R10: 0000000000000000 R11: ffff8881032ba180 R12: 
> ffff88810318cae0
> [   57.717236] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: 
> ffffc90001c27c00
> [   57.718438] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) 
> knlGS:0000000000000000
> [   57.719778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   57.720808] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 
> 0000000000170ee0
> [   57.722019] note: mkfs.ext3[1064] exited with preempt_count 1
> [   57.723067] ------------[ cut here ]------------
> [   57.723960] WARNING: CPU: 3 PID: 1064 at kernel/exit.c:741 
> do_exit+0x8cb/0xbc0
> [   57.725196] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q garp 
> stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm 
> drm_kms_helper kvm drm fb_sys_fops vi4
> [   57.731011] CPU: 3 PID: 1064 Comm: mkfs.ext3 Tainted: G D           
> 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
> [   57.732704] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
> [   57.734853] RIP: 0010:do_exit+0x8cb/0xbc0
> [   57.735711] Code: e9 13 ff ff ff 48 8b bb e0 04 00 00 31 f6 e8 4c 
> db ff ff e9 98 fd ff ff 4c 89 e6 bf 05 06 00 00 e8 8a c8 00 00 e9 41 
> f8 ff ff <0f> 0b e9 6b f7 ff ff 4b
> [   57.738851] RSP: 0018:ffffc90001c27ee8 EFLAGS: 00010082
> [   57.739899] RAX: 0000000000000000 RBX: ffff888101e48000 RCX: 
> 0000000000000000
> [   57.741196] RDX: 0000000000000001 RSI: ffffffff8220a969 RDI: 
> 0000000000000009
> [   57.742485] RBP: 0000000000000009 R08: 0000000000000000 R09: 
> c0000000ffffbfff
> [   57.743777] R10: 00007fdfce47d440 R11: ffffc90001c27d60 R12: 
> 0000000000000009
> [   57.745081] R13: 0000000000000046 R14: 0000000000000000 R15: 
> 0000000000000000
> [   57.746388] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) 
> knlGS:0000000000000000
> [   57.747806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   57.748931] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 
> 0000000000170ee0
> [   57.750225] Call Trace:
> [   57.750894]  <TASK>
> [   57.751535]  make_task_dead+0x41/0xf0
> [   57.752369]  rewind_stack_and_make_dead+0x17/0x17
> [   57.753336] RIP: 0033:0x7fdfce56dc17
> [   57.754155] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 
> 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 
> 00 0f 05 <48> 3d 01 f0 ff ff 73 08
> [   57.757318] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000010
> [   57.758669] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 
> 00007fdfce56dc17
> [   57.759956] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 
> 0000000000000003
> [   57.761256] RBP: 0000000000460960 R08: 0000000000000400 R09: 
> 0000000000000000
> [   57.762531] R10: 0000000000000000 R11: 0000000000000246 R12: 
> 0000000000000000
> [   57.763806] R13: 000000000000d200 R14: 0000000000000000 R15: 
> 0000000000000000
> [   57.765177]  </TASK>
> [   57.765813] ---[ end trace 0000000000000000 ]---
> [   57.790046] md0: detected capacity change from 107520 to 0
> [   57.792834] md: md0 stopped.
> [   78.843853] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [   78.845334] rcu:     10-...0: (0 ticks this GP) 
> idle=07b/1/0x4000000000000000 softirq=1140/1140 fqs=4805
> [   78.847246]     (detected by 13, t=21005 jiffies, g=9013, q=1419)
> [   78.848619] Sending NMI from CPU 13 to CPUs 10:
> [   78.849810] NMI backtrace for cpu 10
> [   78.849813] CPU: 10 PID: 1081 Comm: mdadm Tainted: G      D 
> W         5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
> [   78.849816] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
> [   78.849817] RIP: 0010:queued_spin_lock_slowpath+0x4c/0x1d0
> [   78.849832] Code: 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 
> a9 00 01 ff ff 75 1b 85 c0 75 0f b8 01 00 00 00 66 89 07 5b 5d 41 5c 
> c3 f3 90 <8b> 07 84 c0 75 f8 eb e7
> [   78.849834] RSP: 0018:ffffc90001c9f9e0 EFLAGS: 00000002
> [   78.849837] RAX: 0000000000040101 RBX: ffff88810c914fc8 RCX: 
> 0000000000000000
> [   78.849838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
> ffff888102177c30
> [   78.849840] RBP: 0000000000000000 R08: ffff88810c914fc8 R09: 
> ffff888106a4ed10
> [   78.849841] R10: ffffc90001c9fae8 R11: ffff888101b048d8 R12: 
> ffff888103833000
> [   78.849842] R13: ffff888102177800 R14: ffffc90001c9fb20 R15: 
> ffffc90001c9fa78
> [   78.849844] FS:  00007fd3d66c4340(0000) GS:ffff8882b5c80000(0000) 
> knlGS:0000000000000000
> [   78.849847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   78.849848] CR2: 00000000004a5b58 CR3: 000000010d438001 CR4: 
> 0000000000170ee0
> [   78.849850] Call Trace:
> [   78.849853]  <TASK>
> [   78.849855]  bfq_insert_requests+0xae/0x2340
> [   78.849862]  ? submit_bio_noacct_nocheck+0x225/0x2b0
> [   78.849868]  blk_mq_sched_insert_requests+0x5c/0x150
> [   78.849872]  blk_mq_flush_plug_list+0xe1/0x2a0
> [   78.849876]  __blk_flush_plug+0xdf/0x120
> [   78.849879]  blk_finish_plug+0x27/0x40
> [   78.849882]  read_pages+0x15b/0x360
> [   78.849891]  page_cache_ra_unbounded+0x120/0x170
> [   78.849894]  filemap_get_pages+0xdd/0x5f0
> [   78.849899]  filemap_read+0xbf/0x350
> [   78.849902]  ? __mod_memcg_lruvec_state+0x72/0xc0
> [   78.849907]  ? __mod_lruvec_page_state+0xb4/0x160
> [   78.849909]  ? folio_add_lru+0x51/0x80
> [   78.849912]  ? _raw_spin_unlock+0x12/0x30
> [   78.849916]  ? __handle_mm_fault+0xdee/0x14d0
> [   78.849921]  blkdev_read_iter+0xa9/0x180
> [   78.849924]  new_sync_read+0x109/0x180
> [   78.849929]  vfs_read+0x187/0x1b0
> [   78.849932]  ksys_read+0xa1/0xe0
> [   78.849935]  do_syscall_64+0x42/0x90
> [   78.849938]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   78.849941] RIP: 0033:0x7fd3d6322f8e
> [   78.849944] Code: c0 e9 c6 fe ff ff 48 8d 3d a7 07 0a 00 48 83 ec 
> 08 e8 b6 e1 01 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 
> 14 0f 05 <48> 3d 00 f0 ff ff 77 59
> [   78.849945] RSP: 002b:00007ffe92d46ea8 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000000
> [   78.849948] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 
> 00007fd3d6322f8e
> [   78.849949] RDX: 0000000000001000 RSI: 00000000004a3000 RDI: 
> 0000000000000003
> [   78.849950] RBP: 0000000000000003 R08: 00000000004a3000 R09: 
> 0000000000000003
> [   78.849951] R10: 00007fd3d623d0a8 R11: 0000000000000246 R12: 
> 00000000004a2a60
> [   78.849952] R13: 0000000000000000 R14: 00000000004a3000 R15: 
> 000000000048a4a0
> [   78.849954]  </TASK>

Looks like a bfq or block layer issue, I will try it from my side.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-23  1:08             ` Guoqing Jiang
@ 2022-05-23  5:41               ` Donald Buczek
  2022-05-23  9:51                 ` Guoqing Jiang
                                   ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Donald Buczek @ 2022-05-23  5:41 UTC (permalink / raw)
  To: Guoqing Jiang, Logan Gunthorpe, Song Liu, Logan Gunthorpe; +Cc: linux-raid

On 23.05.22 03:08, Guoqing Jiang wrote:
> 
> 
> On 5/22/22 2:23 AM, Donald Buczek wrote:
>> On 20.05.22 20:27, Logan Gunthorpe wrote:
>>>
>>> Hi,
>>>
>>> On 2022-05-10 00:44, Song Liu wrote:
>>>> On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>>>>> On 5/9/22 2:37 PM, Song Liu wrote:
>>>>>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev>  wrote:
>>>>>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com>
>>>>>>>
>>>>>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it
>>>>>>> doesn't reconfigure array.
>>>>>>>
>>>>>>> And it could cause deadlock problem for raid5 as follows:
>>>>>>>
>>>>>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo
>>>>>>>      idle to sync_action.
>>>>>>> 2. raid5 sync thread was blocked if there were too many active stripes.
>>>>>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
>>>>>>>      which causes the number of active stripes can't be decreased.
>>>>>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
>>>>>>>      to hold reconfig_mutex.
>>>>>>>
>>>>>>> More details in the link:
>>>>>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
>>>>>>>
>>>>>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
>>>>>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the
>>>>>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
>>>>>> mddev_is_locked() feels really hacky to me. It cannot tell whether
>>>>>> mddev is locked
>>>>>> by current thread. So technically, we can unlock reconfigure_mutex for
>>>>>> other thread
>>>>>> by accident, no?
>>>>>
>>>>> I can switch back to V2 if you think that is the correct way to do though no
>>>>> one comment about the change in dm-raid.
>>>>
>>>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to
>>>> md-next.
>>>>
>>>
>>> I noticed a clear regression with mdadm tests with this patch in md-next
>>> (7e6ba434cc6080).
>>>
>>> Before the patch, tests 07reshape5intr and 07revert-grow would fail
>>> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
>>> for the latter).
>>>
>>> After this patch, both tests always fail.
>>>
>>> I don't have time to dig into why this is, but it would be nice if
>>> someone can at least fix the regression. It is hard to make any progress
>>> on these tests if we are continuing to further break them.
>>
>> Hmmm. I wanted to try to help a bit by reproducing and digging into this.
>>
>> But it seems that more or less ALL tests hang my system one way or another.
>>
>> I've used a qemu/kvm machine with md-next and mdraid master.
>>
>> Is this supposed to work?
>>
>> I can investigate the bugs I see, but probably that is a waste of time because I'm doing something wrong fundamentally?
>>
>> This is an example from 00raid0:
>>
>> [   57.434064] md: md0 stopped.
>> [   57.586951] md0: detected capacity change from 0 to 107520
>> [   57.618454] BUG: kernel NULL pointer dereference, address: 0000000000000094
>> [   57.620830] #PF: supervisor read access in kernel mode
>> [   57.622554] #PF: error_code(0x0000) - not-present page
>> [   57.624273] PGD 800000010d5ee067 P4D 800000010d5ee067 PUD 10df28067 PMD 0
>> [   57.626548] Oops: 0000 [#1] PREEMPT SMP PTI
>> [   57.627942] CPU: 3 PID: 1064 Comm: mkfs.ext3 Not tainted 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
>> [   57.630952] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
>> [   57.635927] RIP: 0010:bfq_bio_bfqg+0x26/0x80
>> [   57.638027] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 5c c2 08 <80> bb 94 00 00 00 00 70
>> [   57.645295] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006
>> [   57.647414] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 0000000000000001
>> [   57.650039] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: ffff8881032ba180
>> [   57.652541] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: ffff88810318cb00
>> [   57.654852] R10: 0000000000000000 R11: ffff8881032ba180 R12: ffff88810318cae0
>> [   57.657128] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: ffffc90001c27c00
>> [   57.659316] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) knlGS:0000000000000000
>> [   57.661700] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   57.663461] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 0000000000170ee0
>> [   57.665453] Call Trace:
>> [   57.666479]  <TASK>
>> [   57.667382]  bfq_bic_update_cgroup+0x28/0x1b0
>> [   57.668724]  bfq_insert_requests+0x233/0x2340
>> [   57.670049]  ? ioc_find_get_icq+0x21c/0x2a0
>> [   57.671315]  ? bfq_prepare_request+0x11/0x30
>> [   57.672565]  blk_mq_sched_insert_requests+0x5c/0x150
>> [   57.673891]  blk_mq_flush_plug_list+0xe1/0x2a0
>> [   57.675140]  __blk_flush_plug+0xdf/0x120
>> [   57.676259]  io_schedule_prepare+0x3d/0x50
>> [   57.677373]  io_schedule_timeout+0xf/0x40
>> [   57.678465]  wait_for_completion_io+0x78/0x140
>> [   57.679578]  submit_bio_wait+0x5b/0x80
>> [   57.680575]  blkdev_issue_discard+0x65/0xb0
>> [   57.681640]  blkdev_common_ioctl+0x391/0x8f0
>> [   57.682712]  blkdev_ioctl+0x216/0x2a0
>> [   57.683648]  __x64_sys_ioctl+0x76/0xb0
>> [   57.684607]  do_syscall_64+0x42/0x90
>> [   57.685527]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [   57.686645] RIP: 0033:0x7fdfce56dc17
>> [   57.687535] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 08
>> [   57.691055] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> [   57.692537] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 00007fdfce56dc17
>> [   57.693905] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 0000000000000003
>> [   57.695288] RBP: 0000000000460960 R08: 0000000000000400 R09: 0000000000000000
>> [   57.696645] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> [   57.697954] R13: 000000000000d200 R14: 0000000000000000 R15: 0000000000000000
>> [   57.699281]  </TASK>
>> [   57.699901] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm drm_kms_helper kvm drm fb_sys_fops vi4
>> [   57.705955] CR2: 0000000000000094
>> [   57.706710] ---[ end trace 0000000000000000 ]---
>> [   57.707599] RIP: 0010:bfq_bio_bfqg+0x26/0x80
>> [   57.708434] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 5c c2 08 <80> bb 94 00 00 00 00 70
>> [   57.711426] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006
>> [   57.712391] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 0000000000000001
>> [   57.713605] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: ffff8881032ba180
>> [   57.714811] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: ffff88810318cb00
>> [   57.716018] R10: 0000000000000000 R11: ffff8881032ba180 R12: ffff88810318cae0
>> [   57.717236] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: ffffc90001c27c00
>> [   57.718438] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) knlGS:0000000000000000
>> [   57.719778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   57.720808] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 0000000000170ee0
>> [   57.722019] note: mkfs.ext3[1064] exited with preempt_count 1
>> [   57.723067] ------------[ cut here ]------------
>> [   57.723960] WARNING: CPU: 3 PID: 1064 at kernel/exit.c:741 do_exit+0x8cb/0xbc0
>> [   57.725196] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm drm_kms_helper kvm drm fb_sys_fops vi4
>> [   57.731011] CPU: 3 PID: 1064 Comm: mkfs.ext3 Tainted: G D 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
>> [   57.732704] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
>> [   57.734853] RIP: 0010:do_exit+0x8cb/0xbc0
>> [   57.735711] Code: e9 13 ff ff ff 48 8b bb e0 04 00 00 31 f6 e8 4c db ff ff e9 98 fd ff ff 4c 89 e6 bf 05 06 00 00 e8 8a c8 00 00 e9 41 f8 ff ff <0f> 0b e9 6b f7 ff ff 4b
>> [   57.738851] RSP: 0018:ffffc90001c27ee8 EFLAGS: 00010082
>> [   57.739899] RAX: 0000000000000000 RBX: ffff888101e48000 RCX: 0000000000000000
>> [   57.741196] RDX: 0000000000000001 RSI: ffffffff8220a969 RDI: 0000000000000009
>> [   57.742485] RBP: 0000000000000009 R08: 0000000000000000 R09: c0000000ffffbfff
>> [   57.743777] R10: 00007fdfce47d440 R11: ffffc90001c27d60 R12: 0000000000000009
>> [   57.745081] R13: 0000000000000046 R14: 0000000000000000 R15: 0000000000000000
>> [   57.746388] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) knlGS:0000000000000000
>> [   57.747806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   57.748931] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 0000000000170ee0
>> [   57.750225] Call Trace:
>> [   57.750894]  <TASK>
>> [   57.751535]  make_task_dead+0x41/0xf0
>> [   57.752369]  rewind_stack_and_make_dead+0x17/0x17
>> [   57.753336] RIP: 0033:0x7fdfce56dc17
>> [   57.754155] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 08
>> [   57.757318] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> [   57.758669] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 00007fdfce56dc17
>> [   57.759956] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 0000000000000003
>> [   57.761256] RBP: 0000000000460960 R08: 0000000000000400 R09: 0000000000000000
>> [   57.762531] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> [   57.763806] R13: 000000000000d200 R14: 0000000000000000 R15: 0000000000000000
>> [   57.765177]  </TASK>
>> [   57.765813] ---[ end trace 0000000000000000 ]---
>> [   57.790046] md0: detected capacity change from 107520 to 0
>> [   57.792834] md: md0 stopped.
>> [   78.843853] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [   78.845334] rcu:     10-...0: (0 ticks this GP) idle=07b/1/0x4000000000000000 softirq=1140/1140 fqs=4805
>> [   78.847246]     (detected by 13, t=21005 jiffies, g=9013, q=1419)
>> [   78.848619] Sending NMI from CPU 13 to CPUs 10:
>> [   78.849810] NMI backtrace for cpu 10
>> [   78.849813] CPU: 10 PID: 1081 Comm: mdadm Tainted: G      D W         5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
>> [   78.849816] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
>> [   78.849817] RIP: 0010:queued_spin_lock_slowpath+0x4c/0x1d0
>> [   78.849832] Code: 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 75 0f b8 01 00 00 00 66 89 07 5b 5d 41 5c c3 f3 90 <8b> 07 84 c0 75 f8 eb e7
>> [   78.849834] RSP: 0018:ffffc90001c9f9e0 EFLAGS: 00000002
>> [   78.849837] RAX: 0000000000040101 RBX: ffff88810c914fc8 RCX: 0000000000000000
>> [   78.849838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888102177c30
>> [   78.849840] RBP: 0000000000000000 R08: ffff88810c914fc8 R09: ffff888106a4ed10
>> [   78.849841] R10: ffffc90001c9fae8 R11: ffff888101b048d8 R12: ffff888103833000
>> [   78.849842] R13: ffff888102177800 R14: ffffc90001c9fb20 R15: ffffc90001c9fa78
>> [   78.849844] FS:  00007fd3d66c4340(0000) GS:ffff8882b5c80000(0000) knlGS:0000000000000000
>> [   78.849847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   78.849848] CR2: 00000000004a5b58 CR3: 000000010d438001 CR4: 0000000000170ee0
>> [   78.849850] Call Trace:
>> [   78.849853]  <TASK>
>> [   78.849855]  bfq_insert_requests+0xae/0x2340
>> [   78.849862]  ? submit_bio_noacct_nocheck+0x225/0x2b0
>> [   78.849868]  blk_mq_sched_insert_requests+0x5c/0x150
>> [   78.849872]  blk_mq_flush_plug_list+0xe1/0x2a0
>> [   78.849876]  __blk_flush_plug+0xdf/0x120
>> [   78.849879]  blk_finish_plug+0x27/0x40
>> [   78.849882]  read_pages+0x15b/0x360
>> [   78.849891]  page_cache_ra_unbounded+0x120/0x170
>> [   78.849894]  filemap_get_pages+0xdd/0x5f0
>> [   78.849899]  filemap_read+0xbf/0x350
>> [   78.849902]  ? __mod_memcg_lruvec_state+0x72/0xc0
>> [   78.849907]  ? __mod_lruvec_page_state+0xb4/0x160
>> [   78.849909]  ? folio_add_lru+0x51/0x80
>> [   78.849912]  ? _raw_spin_unlock+0x12/0x30
>> [   78.849916]  ? __handle_mm_fault+0xdee/0x14d0
>> [   78.849921]  blkdev_read_iter+0xa9/0x180
>> [   78.849924]  new_sync_read+0x109/0x180
>> [   78.849929]  vfs_read+0x187/0x1b0
>> [   78.849932]  ksys_read+0xa1/0xe0
>> [   78.849935]  do_syscall_64+0x42/0x90
>> [   78.849938]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [   78.849941] RIP: 0033:0x7fd3d6322f8e
>> [   78.849944] Code: c0 e9 c6 fe ff ff 48 8d 3d a7 07 0a 00 48 83 ec 08 e8 b6 e1 01 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 59
>> [   78.849945] RSP: 002b:00007ffe92d46ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> [   78.849948] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd3d6322f8e
>> [   78.849949] RDX: 0000000000001000 RSI: 00000000004a3000 RDI: 0000000000000003
>> [   78.849950] RBP: 0000000000000003 R08: 00000000004a3000 R09: 0000000000000003
>> [   78.849951] R10: 00007fd3d623d0a8 R11: 0000000000000246 R12: 00000000004a2a60
>> [   78.849952] R13: 0000000000000000 R14: 00000000004a3000 R15: 000000000048a4a0
>> [   78.849954]  </TASK>
> 
> Looks like bfq or block issue, will try it from my side.

FYI: I've used loop devices on a virtio disk.

I later discovered Logan's patches [1], which I was not aware of, as I'm not subscribed to the lists.

[1]: https://lore.kernel.org/linux-raid/20220519191311.17119-6-logang@deltatee.com/T/#u

The series seems to acknowledge that there are open problems and tries to fix them.
So I've used his md-bug branch from https://github.com/sbates130272/linux-p2pmem but it didn't look better.

So do I understand correctly that the mdadm tests *are* supposed to work, and that every bug I see here is worth analyzing? Or is Logan hunting down everything anyway?

Best
   Donald
-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-23  5:41               ` Donald Buczek
@ 2022-05-23  9:51                 ` Guoqing Jiang
  2022-05-24 16:13                   ` Logan Gunthorpe
  2022-05-30  9:55                   ` Guoqing Jiang
  2022-05-24 15:58                 ` Logan Gunthorpe
  2022-05-25  9:17                 ` Guoqing Jiang
  2 siblings, 2 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-23  9:51 UTC (permalink / raw)
  To: Donald Buczek, Logan Gunthorpe, Song Liu; +Cc: linux-raid



>>>> I noticed a clear regression with mdadm tests with this patch in 
>>>> md-next
>>>> (7e6ba434cc6080).
>>>>
>>>> Before the patch, tests 07reshape5intr and 07revert-grow would fail
>>>> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
>>>> for the latter).
>>>>
>>>> After this patch, both tests always fail.
>>>>
>>>> I don't have time to dig into why this is, but it would be nice if
>>>> someone can at least fix the regression. It is hard to make any 
>>>> progress
>>>> on these tests if we are continuing to further break them.

I have tried with both the Ubuntu 22.04 kernel, which is 5.15, and vanilla
5.12; neither of them can pass the tests you mentioned.

[root@localhost mdadm]# lsblk|grep vd
vda          252:0    0    1G  0 disk
vdb          252:16   0    1G  0 disk
vdc          252:32   0    1G  0 disk
vdd          252:48   0    1G  0 disk
[root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} 
--tests=05r1-add-internalbitmap
Testing on linux-5.12.0-default kernel
/root/mdadm/tests/05r1-add-internalbitmap... succeeded
[root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} 
--tests=07reshape5intr
Testing on linux-5.12.0-default kernel
/root/mdadm/tests/07reshape5intr... FAILED - see 
/var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details
[root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} 
--tests=07revert-grow
Testing on linux-5.12.0-default kernel
/root/mdadm/tests/07revert-grow... FAILED - see 
/var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details
[root@localhost mdadm]# head -10  /var/tmp/07revert-grow.log | grep mdadm
+ . /root/mdadm/tests/07revert-grow
++ mdadm -CR --assume-clean /dev/md0 -l5 -n4 -x1 /dev/vda /dev/vdb
/dev/vdc /dev/vdd /dev/vda /dev/vdb /dev/vdc /dev/vdd --metadata=0.9

The above line is clearly wrong from my understanding.

And let's check ubuntu 22.04.

root@vm:/home/gjiang/mdadm# lsblk|grep vd
vda    252:0    0     1G  0 disk
vdb    252:16   0     1G  0 disk
vdc    252:32   0     1G  0 disk
root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..d} 
--tests=05r1-failfast
Testing on linux-5.15.0-30-generic kernel
/home/gjiang/mdadm/tests/05r1-failfast... succeeded
root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c}   
--tests=07reshape5intr
Testing on linux-5.15.0-30-generic kernel
/home/gjiang/mdadm/tests/07reshape5intr... FAILED - see 
/var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details
root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} 
--tests=07revert-grow
Testing on linux-5.15.0-30-generic kernel
/home/gjiang/mdadm/tests/07revert-grow... FAILED - see 
/var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details

So I would not consider it a regression.

[ ... ]

> FYI: I've used loop devices on a virtio disk.
>
> I later discovered Logans patches [1], which I were not aware of, as 
> I'm not subscribed to the lists.
>
> [1]: 
> https://lore.kernel.org/linux-raid/20220519191311.17119-6-logang@deltatee.com/T/#u
>
> The series seems to acknowledge that there are open problems and tries 
> to fix them.
> So I've used his md-bug branch from 
> https://github.com/sbates130272/linux-p2pmem but it didn't look better.

Thanks for your effort.

> So I understand, the mdadm tests *are* supposed to work and every bug 
> I see here is worth analyzing? Or is Logan hunting down everything 
> anyway?

Yes, it is supposed to work. But unfortunately it is kind of broken; the
good news is that people are aware of it and are trying to make it work
better, please see the links below.

[1] 
https://lore.kernel.org/linux-raid/EA6887B4-2A44-49D0-ACF9-C04CC92AFD87@oracle.com/T/#t
[2] 
https://lore.kernel.org/linux-raid/CALTww2-mbfZRcHu_95Q+WANXZ9ayRwjXvyvqP5Gseeb86dEy=w@mail.gmail.com/T/#t

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-21 18:23           ` Donald Buczek
  2022-05-23  1:08             ` Guoqing Jiang
@ 2022-05-24 15:51             ` Logan Gunthorpe
  1 sibling, 0 replies; 45+ messages in thread
From: Logan Gunthorpe @ 2022-05-24 15:51 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, Guoqing Jiang; +Cc: linux-raid



On 2022-05-21 12:23, Donald Buczek wrote:
>> I noticed a clear regression with mdadm tests with this patch in md-next
>> (7e6ba434cc6080).
>>
>> Before the patch, tests 07reshape5intr and 07revert-grow would fail
>> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
>> for the latter).
>>
>> After this patch, both tests always fail.
>>
>> I don't have time to dig into why this is, but it would be nice if
>> someone can at least fix the regression. It is hard to make any progress
>> on these tests if we are continuing to further break them.
> 
> Hmmm. I wanted to try to help a bit by reproducing and digging into this.
> 
> But it seems that more or less ALL tests hang my system one way or another.
> 
> I've used a qemu/kvm machine with md-next and mdraid master.
> 
> Is this supposed to work?

I know those tests are full of bugs; I've been trying to remedy that
somewhat. You can run with --tests=07reshape5intr to run a specific
test. And I have a branch[1] with some fixes which also adds a --loop
option to run tests multiple times.

I expect 00raid0 to fail with the master branch; however, I have not seen
that specific traceback. Might be worth debugging that one if you can.

Logan

[1] https://github.com/lsgunth/mdadm bugfixes2

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-23  5:41               ` Donald Buczek
  2022-05-23  9:51                 ` Guoqing Jiang
@ 2022-05-24 15:58                 ` Logan Gunthorpe
  2022-05-24 18:16                   ` Song Liu
  2022-05-25  9:17                 ` Guoqing Jiang
  2 siblings, 1 reply; 45+ messages in thread
From: Logan Gunthorpe @ 2022-05-24 15:58 UTC (permalink / raw)
  To: Donald Buczek, Guoqing Jiang, Song Liu; +Cc: linux-raid



On 2022-05-22 23:41, Donald Buczek wrote:
>> Looks like bfq or block issue, will try it from my side.
> 
> FYI: I've used loop devices on a virtio disk.
> 
> I later discovered Logans patches [1], which I were not aware of, as I'm not subscribed to the lists.
> 
> [1]: https://lore.kernel.org/linux-raid/20220519191311.17119-6-logang@deltatee.com/T/#u
> 
> The series seems to acknowledge that there are open problems and tries to fix them.
> So I've used his md-bug branch from https://github.com/sbates130272/linux-p2pmem but it didn't look better.
> 
> So I understand, the mdadm tests *are* supposed to work and every bug I see here is worth analyzing? Or is Logan hunting down everything anyway?

I'm not hunting down everything. There's too much brokenness. I've done
a bunch of work: there's that series plus an mdadm branch I'll send to
the list later (I just linked it in my previous email). But even after all
that, I still have ~25 broken tests; I've marked those tests so
they shouldn't stop the test script from running everything.

Logan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-23  9:51                 ` Guoqing Jiang
@ 2022-05-24 16:13                   ` Logan Gunthorpe
  2022-05-25  9:04                     ` Guoqing Jiang
  2022-05-30  9:55                   ` Guoqing Jiang
  1 sibling, 1 reply; 45+ messages in thread
From: Logan Gunthorpe @ 2022-05-24 16:13 UTC (permalink / raw)
  To: Guoqing Jiang, Donald Buczek, Song Liu; +Cc: linux-raid



On 2022-05-23 03:51, Guoqing Jiang wrote:
> I have tried with both ubuntu 22.04 kernel which is 5.15 and vanilla 
> 5.12, none of them
> can pass your mentioned tests.
> 
> [root@localhost mdadm]# lsblk|grep vd
> vda          252:0    0    1G  0 disk
> vdb          252:16   0    1G  0 disk
> vdc          252:32   0    1G  0 disk
> vdd          252:48   0    1G  0 disk
> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} 
> --tests=05r1-add-internalbitmap
> Testing on linux-5.12.0-default kernel
> /root/mdadm/tests/05r1-add-internalbitmap... succeeded
> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} 
> --tests=07reshape5intr
> Testing on linux-5.12.0-default kernel
> /root/mdadm/tests/07reshape5intr... FAILED - see 
> /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details
> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} 
> --tests=07revert-grow
> Testing on linux-5.12.0-default kernel
> /root/mdadm/tests/07revert-grow... FAILED - see 
> /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details
> [root@localhost mdadm]# head -10  /var/tmp/07revert-grow.log | grep mdadm
> + . /root/mdadm/tests/07revert-grow
> ++ mdadm -CR --assume-clean /dev/md0 -l5 -n4 -x1 /dev/vda /dev/vdb
> /dev/vdc /dev/vdd /dev/vda /dev/vdb /dev/vdc /dev/vdd --metadata=0.9
> 
> The above line is clearly wrong from my understanding.
> 
> And let's check ubuntu 22.04.
> 
> root@vm:/home/gjiang/mdadm# lsblk|grep vd
> vda    252:0    0     1G  0 disk
> vdb    252:16   0     1G  0 disk
> vdc    252:32   0     1G  0 disk
> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..d} 
> --tests=05r1-failfast
> Testing on linux-5.15.0-30-generic kernel
> /home/gjiang/mdadm/tests/05r1-failfast... succeeded
> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c}   
> --tests=07reshape5intr
> Testing on linux-5.15.0-30-generic kernel
> /home/gjiang/mdadm/tests/07reshape5intr... FAILED - see 
> /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details
> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} 
> --tests=07revert-grow
> Testing on linux-5.15.0-30-generic kernel
> /home/gjiang/mdadm/tests/07revert-grow... FAILED - see 
> /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details
> 
> So I would not consider it is regression.

I definitely had those tests working (at least some of the time) before I
rebased on md-next, or with 7e6ba434cc6080 reverted. You might need to try
my branch (plus that patch reverted) and my mdadm branch, as there are a
number of fixes that may have helped with that specific test.

https://github.com/lsgunth/mdadm/ bugfixes2
https://github.com/sbates130272/linux-p2pmem md-bug

Thanks,

Logan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-24 15:58                 ` Logan Gunthorpe
@ 2022-05-24 18:16                   ` Song Liu
  0 siblings, 0 replies; 45+ messages in thread
From: Song Liu @ 2022-05-24 18:16 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: Donald Buczek, Guoqing Jiang, linux-raid

On Tue, May 24, 2022 at 8:58 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2022-05-22 23:41, Donald Buczek wrote:
> >> Looks like bfq or block issue, will try it from my side.
> >
> > FYI: I've used loop devices on a virtio disk.
> >
> > I later discovered Logan's patches [1], which I was not aware of, as I'm not subscribed to the lists.
> >
> > [1]: https://lore.kernel.org/linux-raid/20220519191311.17119-6-logang@deltatee.com/T/#u
> >
> > The series seems to acknowledge that there are open problems and tries to fix them.
> > So I've used his md-bug branch from https://github.com/sbates130272/linux-p2pmem but it didn't look better.
> >
> > So I understand, the mdadm tests *are* supposed to work and every bug I see here is worth analyzing? Or is Logan hunting down everything anyway?
>
> I'm not hunting down everything. There's too much brokenness. I've done
> a bunch of work: there's that series plus an mdadm branch I'll send to
> the list later (I just linked it in my previous email). But even after all
> that, I still have ~25 broken tests, though I've marked those tests so
> they shouldn't stop the test script from running everything.

Thanks for your fixes so far. I will work with Jes to look into the remaining
failures and develop a plan to fix them.

Song

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-24 16:13                   ` Logan Gunthorpe
@ 2022-05-25  9:04                     ` Guoqing Jiang
  2022-05-25 18:22                       ` Logan Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-25  9:04 UTC (permalink / raw)
  To: Logan Gunthorpe, Donald Buczek, Song Liu; +Cc: linux-raid



On 5/25/22 12:13 AM, Logan Gunthorpe wrote:
> On 2022-05-23 03:51, Guoqing Jiang wrote:
>> I have tried with both the Ubuntu 22.04 kernel, which is 5.15, and vanilla
>> 5.12; neither of them
>> can pass the tests you mentioned.
>>
>> [root@localhost mdadm]# lsblk|grep vd
>> vda          252:0    0    1G  0 disk
>> vdb          252:16   0    1G  0 disk
>> vdc          252:32   0    1G  0 disk
>> vdd          252:48   0    1G  0 disk
>> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d}
>> --tests=05r1-add-internalbitmap
>> Testing on linux-5.12.0-default kernel
>> /root/mdadm/tests/05r1-add-internalbitmap... succeeded
>> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d}
>> --tests=07reshape5intr
>> Testing on linux-5.12.0-default kernel
>> /root/mdadm/tests/07reshape5intr... FAILED - see
>> /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details
>> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d}
>> --tests=07revert-grow
>> Testing on linux-5.12.0-default kernel
>> /root/mdadm/tests/07revert-grow... FAILED - see
>> /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details
>> [root@localhost mdadm]# head -10  /var/tmp/07revert-grow.log | grep mdadm
>> + . /root/mdadm/tests/07revert-grow
>> ++ mdadm -CR --assume-clean /dev/md0 -l5 -n4 -x1 /dev/vda /dev/vdb
>> /dev/vdc /dev/vdd /dev/vda /dev/vdb /dev/vdc /dev/vdd --metadata=0.9
>> The above line is clearly wrong from my understanding.
>>
>> And let's check ubuntu 22.04.
>>
>> root@vm:/home/gjiang/mdadm# lsblk|grep vd
>> vda    252:0    0     1G  0 disk
>> vdb    252:16   0     1G  0 disk
>> vdc    252:32   0     1G  0 disk
>> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..d}
>> --tests=05r1-failfast
>> Testing on linux-5.15.0-30-generic kernel
>> /home/gjiang/mdadm/tests/05r1-failfast... succeeded
>> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c}
>> --tests=07reshape5intr
>> Testing on linux-5.15.0-30-generic kernel
>> /home/gjiang/mdadm/tests/07reshape5intr... FAILED - see
>> /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details
>> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c}
>> --tests=07revert-grow
>> Testing on linux-5.15.0-30-generic kernel
>> /home/gjiang/mdadm/tests/07revert-grow... FAILED - see
>> /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details
>>
>> So I would not consider it a regression.
> I definitely had those tests working (at least some of the time) before I
> rebased on md-next, or if I revert 7e6ba434cc6080. You might need to try
> my branch (plus that patch reverted) and my mdadm branch as there are a
> number of fixes that may have helped with that specific test.
>
> https://github.com/lsgunth/mdadm/ bugfixes2
> https://github.com/sbates130272/linux-p2pmem md-bug

I would prefer to focus on the block tree or the md tree. With the latest
block tree (commit 44d8538d7e7dbee7246acda3b706c8134d15b9cb), I get an issue
similar to the one Donald reported; it happened with the command below
(which did work with the 5.12 kernel).

vm79:~/mdadm> sudo ./test --dev=loop --tests=05r1-add-internalbitmap

May 25 04:48:51 vm79 kernel: Call Trace:
May 25 04:48:51 vm79 kernel:  <TASK>
May 25 04:48:51 vm79 kernel:  bfq_bic_update_cgroup+0x28/0x1b0
May 25 04:48:51 vm79 kernel:  bfq_insert_requests+0x29d/0x22d0
May 25 04:48:51 vm79 kernel:  ? ioc_find_get_icq+0x21c/0x2a0
May 25 04:48:51 vm79 kernel:  ? bfq_prepare_request+0x11/0x30
May 25 04:48:51 vm79 kernel:  blk_mq_sched_insert_request+0x8b/0x100
May 25 04:48:51 vm79 kernel:  blk_mq_submit_bio+0x44c/0x540
May 25 04:48:51 vm79 kernel:  __submit_bio+0xe8/0x160
May 25 04:48:51 vm79 kernel:  submit_bio_noacct_nocheck+0xf0/0x2b0
May 25 04:48:51 vm79 kernel:  ? submit_bio+0x3e/0xd0
May 25 04:48:51 vm79 kernel:  submit_bio+0x3e/0xd0
May 25 04:48:51 vm79 kernel:  submit_bh_wbc+0x117/0x140
May 25 04:48:51 vm79 kernel:  block_read_full_page+0x1eb/0x4f0
May 25 04:48:51 vm79 kernel:  ? blkdev_llseek+0x60/0x60
May 25 04:48:51 vm79 kernel:  ? folio_add_lru+0x51/0x80
May 25 04:48:51 vm79 kernel:  do_read_cache_folio+0x3b4/0x5e0
May 25 04:48:51 vm79 kernel:  ? kmem_cache_alloc_node+0x183/0x2e0
May 25 04:48:51 vm79 kernel:  ? alloc_vmap_area+0x9f/0x8a0
May 25 04:48:51 vm79 kernel:  read_cache_page+0x15/0x80
May 25 04:48:51 vm79 kernel:  read_part_sector+0x38/0x140
May 25 04:48:51 vm79 kernel:  read_lba+0x105/0x220
May 25 04:48:51 vm79 kernel:  efi_partition+0xed/0x7f0

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-23  5:41               ` Donald Buczek
  2022-05-23  9:51                 ` Guoqing Jiang
  2022-05-24 15:58                 ` Logan Gunthorpe
@ 2022-05-25  9:17                 ` Guoqing Jiang
  2 siblings, 0 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-25  9:17 UTC (permalink / raw)
  To: Donald Buczek, Logan Gunthorpe, Song Liu; +Cc: linux-raid



On 5/23/22 1:41 PM, Donald Buczek wrote:
>>>
>>> [   57.434064] md: md0 stopped.
>>> [   57.586951] md0: detected capacity change from 0 to 107520
>>> [   57.618454] BUG: kernel NULL pointer dereference, address: 
>>> 0000000000000094
>>> [   57.620830] #PF: supervisor read access in kernel mode
>>> [   57.622554] #PF: error_code(0x0000) - not-present page
>>> [   57.624273] PGD 800000010d5ee067 P4D 800000010d5ee067 PUD 
>>> 10df28067 PMD 0
>>> [   57.626548] Oops: 0000 [#1] PREEMPT SMP PTI
>>> [   57.627942] CPU: 3 PID: 1064 Comm: mkfs.ext3 Not tainted 
>>> 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
>>> [   57.630952] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>> 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
>>> [   57.635927] RIP: 0010:bfq_bio_bfqg+0x26/0x80
>>> [   57.638027] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 
>>> 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 
>>> 5c c2 08 <80> bb 94 00 00 00 00 70
>>> [   57.645295] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006
>>> [   57.647414] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 
>>> 0000000000000001
>>> [   57.650039] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: 
>>> ffff8881032ba180
>>> [   57.652541] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: 
>>> ffff88810318cb00
>>> [   57.654852] R10: 0000000000000000 R11: ffff8881032ba180 R12: 
>>> ffff88810318cae0
>>> [   57.657128] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: 
>>> ffffc90001c27c00
>>> [   57.659316] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) 
>>> knlGS:0000000000000000
>>> [   57.661700] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   57.663461] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 
>>> 0000000000170ee0
>>> [   57.665453] Call Trace:
>>> [   57.666479]  <TASK>
>>> [   57.667382]  bfq_bic_update_cgroup+0x28/0x1b0
>>> [   57.668724]  bfq_insert_requests+0x233/0x2340
>>> [   57.670049]  ? ioc_find_get_icq+0x21c/0x2a0
>>> [   57.671315]  ? bfq_prepare_request+0x11/0x30
>>> [   57.672565]  blk_mq_sched_insert_requests+0x5c/0x150
>>> [   57.673891]  blk_mq_flush_plug_list+0xe1/0x2a0
>>> [   57.675140]  __blk_flush_plug+0xdf/0x120
>>> [   57.676259]  io_schedule_prepare+0x3d/0x50
>>> [   57.677373]  io_schedule_timeout+0xf/0x40
>>> [   57.678465]  wait_for_completion_io+0x78/0x140
>>> [   57.679578]  submit_bio_wait+0x5b/0x80
>>> [   57.680575]  blkdev_issue_discard+0x65/0xb0
>>> [   57.681640]  blkdev_common_ioctl+0x391/0x8f0
>>> [   57.682712]  blkdev_ioctl+0x216/0x2a0
>>> [   57.683648]  __x64_sys_ioctl+0x76/0xb0
>>> [   57.684607]  do_syscall_64+0x42/0x90
>>> [   57.685527]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [   57.686645] RIP: 0033:0x7fdfce56dc17
>>> [   57.687535] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 
>>> 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 
>>> 00 0f 05 <48> 3d 01 f0 ff ff 73 08
>>> [   57.691055] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 
>>> 0000000000000010
>>> [   57.692537] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 
>>> 00007fdfce56dc17
>>> [   57.693905] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 
>>> 0000000000000003
>>> [   57.695288] RBP: 0000000000460960 R08: 0000000000000400 R09: 
>>> 0000000000000000
>>> [   57.696645] R10: 0000000000000000 R11: 0000000000000246 R12: 
>>> 0000000000000000
>>> [   57.697954] R13: 000000000000d200 R14: 0000000000000000 R15: 
>>> 0000000000000000
>>> [   57.699281]  </TASK>
>>> [   57.699901] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q 
>>> garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm 
>>> drm_kms_helper kvm drm fb_sys_fops vi4
>>> [   57.705955] CR2: 0000000000000094
>>> [   57.706710] ---[ end trace 0000000000000000 ]---
>>> [   57.707599] RIP: 0010:bfq_bio_bfqg+0x26/0x80
>>> [   57.708434] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 
>>> 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 
>>> 5c c2 08 <80> bb 94 00 00 00 00 70
>>> [   57.711426] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006
>>> [   57.712391] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 
>>> 0000000000000001
>>> [   57.713605] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: 
>>> ffff8881032ba180
>>> [   57.714811] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: 
>>> ffff88810318cb00
>>> [   57.716018] R10: 0000000000000000 R11: ffff8881032ba180 R12: 
>>> ffff88810318cae0
>>> [   57.717236] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: 
>>> ffffc90001c27c00
>>> [   57.718438] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) 
>>> knlGS:0000000000000000
>>> [   57.719778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   57.720808] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 
>>> 0000000000170ee0
>>> [   57.722019] note: mkfs.ext3[1064] exited with preempt_count 1
>>> [   57.723067] ------------[ cut here ]------------
>>> [   57.723960] WARNING: CPU: 3 PID: 1064 at kernel/exit.c:741 
>>> do_exit+0x8cb/0xbc0
>>> [   57.725196] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q 
>>> garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm 
>>> drm_kms_helper kvm drm fb_sys_fops vi4
>>> [   57.731011] CPU: 3 PID: 1064 Comm: mkfs.ext3 Tainted: G D 
>>> 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
>>> [   57.732704] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>> 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
>>> [   57.734853] RIP: 0010:do_exit+0x8cb/0xbc0
>>> [   57.735711] Code: e9 13 ff ff ff 48 8b bb e0 04 00 00 31 f6 e8 4c 
>>> db ff ff e9 98 fd ff ff 4c 89 e6 bf 05 06 00 00 e8 8a c8 00 00 e9 41 
>>> f8 ff ff <0f> 0b e9 6b f7 ff ff 4b
>>> [   57.738851] RSP: 0018:ffffc90001c27ee8 EFLAGS: 00010082
>>> [   57.739899] RAX: 0000000000000000 RBX: ffff888101e48000 RCX: 
>>> 0000000000000000
>>> [   57.741196] RDX: 0000000000000001 RSI: ffffffff8220a969 RDI: 
>>> 0000000000000009
>>> [   57.742485] RBP: 0000000000000009 R08: 0000000000000000 R09: 
>>> c0000000ffffbfff
>>> [   57.743777] R10: 00007fdfce47d440 R11: ffffc90001c27d60 R12: 
>>> 0000000000000009
>>> [   57.745081] R13: 0000000000000046 R14: 0000000000000000 R15: 
>>> 0000000000000000
>>> [   57.746388] FS:  00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) 
>>> knlGS:0000000000000000
>>> [   57.747806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   57.748931] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 
>>> 0000000000170ee0
>>> [   57.750225] Call Trace:
>>> [   57.750894]  <TASK>
>>> [   57.751535]  make_task_dead+0x41/0xf0
>>> [   57.752369]  rewind_stack_and_make_dead+0x17/0x17
>>> [   57.753336] RIP: 0033:0x7fdfce56dc17
>>> [   57.754155] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 
>>> 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 
>>> 00 0f 05 <48> 3d 01 f0 ff ff 73 08
>>> [   57.757318] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 
>>> 0000000000000010
>>> [   57.758669] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 
>>> 00007fdfce56dc17
>>> [   57.759956] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 
>>> 0000000000000003
>>> [   57.761256] RBP: 0000000000460960 R08: 0000000000000400 R09: 
>>> 0000000000000000
>>> [   57.762531] R10: 0000000000000000 R11: 0000000000000246 R12: 
>>> 0000000000000000
>>> [   57.763806] R13: 000000000000d200 R14: 0000000000000000 R15: 
>>> 0000000000000000
>>> [   57.765177]  </TASK>
>>> [   57.765813] ---[ end trace 0000000000000000 ]---
>>> [   57.790046] md0: detected capacity change from 107520 to 0
>>> [   57.792834] md: md0 stopped.
>>> [   78.843853] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>> [   78.845334] rcu:     10-...0: (0 ticks this GP) 
>>> idle=07b/1/0x4000000000000000 softirq=1140/1140 fqs=4805
>>> [   78.847246]     (detected by 13, t=21005 jiffies, g=9013, q=1419)
>>> [   78.848619] Sending NMI from CPU 13 to CPUs 10:
>>> [   78.849810] NMI backtrace for cpu 10
>>> [   78.849813] CPU: 10 PID: 1081 Comm: mdadm Tainted: G      D 
>>> W         5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77
>>> [   78.849816] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>> 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
>>> [   78.849817] RIP: 0010:queued_spin_lock_slowpath+0x4c/0x1d0
>>> [   78.849832] Code: 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 
>>> a9 00 01 ff ff 75 1b 85 c0 75 0f b8 01 00 00 00 66 89 07 5b 5d 41 5c 
>>> c3 f3 90 <8b> 07 84 c0 75 f8 eb e7
>>> [   78.849834] RSP: 0018:ffffc90001c9f9e0 EFLAGS: 00000002
>>> [   78.849837] RAX: 0000000000040101 RBX: ffff88810c914fc8 RCX: 
>>> 0000000000000000
>>> [   78.849838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
>>> ffff888102177c30
>>> [   78.849840] RBP: 0000000000000000 R08: ffff88810c914fc8 R09: 
>>> ffff888106a4ed10
>>> [   78.849841] R10: ffffc90001c9fae8 R11: ffff888101b048d8 R12: 
>>> ffff888103833000
>>> [   78.849842] R13: ffff888102177800 R14: ffffc90001c9fb20 R15: 
>>> ffffc90001c9fa78
>>> [   78.849844] FS:  00007fd3d66c4340(0000) GS:ffff8882b5c80000(0000) 
>>> knlGS:0000000000000000
>>> [   78.849847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   78.849848] CR2: 00000000004a5b58 CR3: 000000010d438001 CR4: 
>>> 0000000000170ee0
>>> [   78.849850] Call Trace:
>>> [   78.849853]  <TASK>
>>> [   78.849855]  bfq_insert_requests+0xae/0x2340
>>> [   78.849862]  ? submit_bio_noacct_nocheck+0x225/0x2b0
>>> [   78.849868]  blk_mq_sched_insert_requests+0x5c/0x150
>>> [   78.849872]  blk_mq_flush_plug_list+0xe1/0x2a0
>>> [   78.849876]  __blk_flush_plug+0xdf/0x120
>>> [   78.849879]  blk_finish_plug+0x27/0x40
>>> [   78.849882]  read_pages+0x15b/0x360
>>> [   78.849891]  page_cache_ra_unbounded+0x120/0x170
>>> [   78.849894]  filemap_get_pages+0xdd/0x5f0
>>> [   78.849899]  filemap_read+0xbf/0x350
>>> [   78.849902]  ? __mod_memcg_lruvec_state+0x72/0xc0
>>> [   78.849907]  ? __mod_lruvec_page_state+0xb4/0x160
>>> [   78.849909]  ? folio_add_lru+0x51/0x80
>>> [   78.849912]  ? _raw_spin_unlock+0x12/0x30
>>> [   78.849916]  ? __handle_mm_fault+0xdee/0x14d0
>>> [   78.849921]  blkdev_read_iter+0xa9/0x180
>>> [   78.849924]  new_sync_read+0x109/0x180
>>> [   78.849929]  vfs_read+0x187/0x1b0
>>> [   78.849932]  ksys_read+0xa1/0xe0
>>> [   78.849935]  do_syscall_64+0x42/0x90
>>> [   78.849938]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [   78.849941] RIP: 0033:0x7fd3d6322f8e
>>> [   78.849944] Code: c0 e9 c6 fe ff ff 48 8d 3d a7 07 0a 00 48 83 ec 
>>> 08 e8 b6 e1 01 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 
>>> 14 0f 05 <48> 3d 00 f0 ff ff 77 59
>>> [   78.849945] RSP: 002b:00007ffe92d46ea8 EFLAGS: 00000246 ORIG_RAX: 
>>> 0000000000000000
>>> [   78.849948] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 
>>> 00007fd3d6322f8e
>>> [   78.849949] RDX: 0000000000001000 RSI: 00000000004a3000 RDI: 
>>> 0000000000000003
>>> [   78.849950] RBP: 0000000000000003 R08: 00000000004a3000 R09: 
>>> 0000000000000003
>>> [   78.849951] R10: 00007fd3d623d0a8 R11: 0000000000000246 R12: 
>>> 00000000004a2a60
>>> [   78.849952] R13: 0000000000000000 R14: 00000000004a3000 R15: 
>>> 000000000048a4a0
>>> [   78.849954]  </TASK>
>>
>> Looks like bfq or block issue, will try it from my side.

Hmm, it could be an md-specific issue, because I found the stack below after
a similar call trace happened:

vm79:~> ps aux|grep "\[md"|grep D|awk '{print $2}'
2087
vm79:~> sudo cat /proc/2087/stack
[<0>] raid1_sync_request+0x65e/0xb60 [raid1]
[<0>] md_do_sync+0xa13/0xf50 [md_mod]
[<0>] md_thread+0x131/0x180 [md_mod]
[<0>] kthread+0xe8/0x110
[<0>] ret_from_fork+0x22/0x30

Donald, could you share the md process stack when you see the call trace?

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-25  9:04                     ` Guoqing Jiang
@ 2022-05-25 18:22                       ` Logan Gunthorpe
  2022-05-26  9:46                         ` Jan Kara
  2022-05-26 11:53                         ` Jan Kara
  0 siblings, 2 replies; 45+ messages in thread
From: Logan Gunthorpe @ 2022-05-25 18:22 UTC (permalink / raw)
  To: Guoqing Jiang, Donald Buczek, Song Liu, Jan Kara, Jens Axboe,
	Paolo Valente
  Cc: linux-raid



On 2022-05-25 03:04, Guoqing Jiang wrote:
> I would prefer to focus on block tree or md tree. With latest block tree
> (commit 44d8538d7e7dbee7246acda3b706c8134d15b9cb), I get below
> similar issue as Donald reported, it happened with the cmd (which did
> work with 5.12 kernel).
> 
> vm79:~/mdadm> sudo ./test --dev=loop --tests=05r1-add-internalbitmap

Ok, so this test passes for me, but my VM was not running with bfq. It
also seems we have layers upon layers of different bugs to untangle.
Perhaps you can try the tests with bfq disabled to make progress on the
other regression I reported.

If I enable bfq and set the loop devices to the bfq scheduler, then I
hit the same bug as you and Donald. It's clearly a NULL pointer
dereference in the bfq code, which seems to be triggered on the
partition read after mdadm opens a block device (not sure if it's the md
device or the loop device, but I suspect the latter since it's not going
through any md code).

Simplifying things down a bit, the NULL pointer dereference can be
triggered by creating an md device with loop devices that have the bfq
scheduler set:

  mdadm --create --run /dev/md0 --level=1 -n2 /dev/loop0 /dev/loop1

The crash occurs in bfq_bio_bfqg() with blkg_to_bfqg() returning NULL.
It's hard to trace where the NULL comes from in there -- the code is a
bit complex.

I've found that the bfq bug exists in current md-next (42b805af102) but
did not trigger in the base tag of v5.18-rc3. Bisecting revealed the bug
was introduced by:

  4e54a2493e58 ("bfq: Get rid of __bio_blkcg() usage")

Reverting that commit and the next commit (075a53b7) on top of md-next
was confirmed to fix the bug.

I've copied Jan, Jens and Paolo who can hopefully help with this. A
cleaned up stack trace follows this email for their benefit.

Logan

--

 BUG: KASAN: null-ptr-deref in bfq_bio_bfqg+0x65/0xf0
 Read of size 1 at addr 0000000000000094 by task mdadm/850

 CPU: 1 PID: 850 Comm: mdadm Not tainted
5.18.0-rc3-eid-vmlocalyes-dbg-00005-g42b805af1024 #2113
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2
04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5a/0x74
  kasan_report.cold+0x5f/0x1a9
  __asan_load1+0x4d/0x50
  bfq_bio_bfqg+0x65/0xf0
  bfq_bic_update_cgroup+0x2f/0x340
  bfq_insert_requests+0x568/0x5800
  blk_mq_sched_insert_request+0x180/0x230
  blk_mq_submit_bio+0x9f0/0xe50
  __submit_bio+0xeb/0x100
  submit_bio_noacct_nocheck+0x1fd/0x470
  submit_bio_noacct+0x350/0xa80
  submit_bio+0x84/0xf0
  submit_bh_wbc+0x27a/0x2b0
  block_read_full_page+0x578/0xb60
  blkdev_readpage+0x18/0x20
  do_read_cache_folio+0x290/0x430
  read_cache_page+0x41/0x130
  read_part_sector+0x7a/0x3d0
  read_lba+0x161/0x340
  efi_partition+0x1ce/0xdd0
  bdev_disk_changed+0x2e9/0x6a0
  blkdev_get_whole+0xd5/0x140
  blkdev_get_by_dev.part.0+0x37f/0x570
  blkdev_get_by_dev+0x51/0x60
  blkdev_open+0xa4/0x140
  do_dentry_open+0x2a7/0x6d0
  vfs_open+0x58/0x60
  path_openat+0x77e/0x13f0
  do_filp_open+0x154/0x280
  do_sys_openat2+0x119/0x2c0
  __x64_sys_openat+0xe7/0x160
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

--

bfq_bio_bfqg+0x65/0xf0:

bfq_bio_bfqg at block/bfq-cgroup.c:619
 614 		struct blkcg_gq *blkg = bio->bi_blkg;
 615 		struct bfq_group *bfqg;
 616 	
 617 		while (blkg) {
 618 			bfqg = blkg_to_bfqg(blkg);
>619<			if (bfqg->online) {
 620 				bio_associate_blkg_from_css(bio,
 621 				return bfqg;
 622 			}
 623 			blkg = blkg->parent;
 624 		
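
For reference, a rough sketch of what the blkg_to_bfqg() lookup above boils
down to. This is my own simplification of the helpers in block/bfq-cgroup.c,
for illustration only: the lookup yields NULL when the blkg carries no bfq
policy data, i.e. when bfq was never set up on the queue that blkg belongs
to, and the subsequent bfqg->online read then faults at a small offset
(the 0x94 address in the KASAN report above).

/* Illustration only -- simplified from blkg_to_pd()/pd_to_bfqg(). */
static inline struct bfq_group *blkg_to_bfqg_sketch(struct blkcg_gq *blkg)
{
	/* per-policy data slot; empty if bfq never attached to blkg->q */
	struct blkg_policy_data *pd = blkg->pd[blkcg_policy_bfq.plid];

	/*
	 * 'pd' is the first member of struct bfq_group, so an empty slot
	 * yields a NULL bfqg and the caller's bfqg->online dereference
	 * faults just past NULL.
	 */
	return pd ? container_of(pd, struct bfq_group, pd) : NULL;
}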

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-25 18:22                       ` Logan Gunthorpe
@ 2022-05-26  9:46                         ` Jan Kara
  2022-05-26 11:53                         ` Jan Kara
  1 sibling, 0 replies; 45+ messages in thread
From: Jan Kara @ 2022-05-26  9:46 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Guoqing Jiang, Donald Buczek, Song Liu, Jan Kara, Jens Axboe,
	Paolo Valente, linux-raid

On Wed 25-05-22 12:22:06, Logan Gunthorpe wrote:
> 
> 
> On 2022-05-25 03:04, Guoqing Jiang wrote:
> > I would prefer to focus on block tree or md tree. With latest block tree
> > (commit 44d8538d7e7dbee7246acda3b706c8134d15b9cb), I get below
> > similar issue as Donald reported, it happened with the cmd (which did
> > work with 5.12 kernel).
> > 
> > vm79:~/mdadm> sudo ./test --dev=loop --tests=05r1-add-internalbitmap
> 
> Ok, so this test passes for me, but my VM was not running with bfq. It
> also seems we have layers upon layers of different bugs to untangle.
> Perhaps you can try the tests with bfq disabled to make progress on the
> other regression I reported.
> 
> If I enable bfq and set the loop devices to the bfq scheduler, then I
> hit the same bug as you and Donald. It's clearly a NULL pointer
> de-reference in the bfq code, which seems to be triggered on the
> partition read after mdadm opens a block device (not sure if it's the md
> device or the loop device but I suspect the latter seeing it's not going
> through any md code).
> 
> Simplifying things down a bit, the null pointer dereference can be
> triggered by creating an md device with loop devices that have bfq
> scheduler set:
> 
>   mdadm --create --run /dev/md0 --level=1 -n2 /dev/loop0 /dev/loop1
> 
> The crash occurs in bfq_bio_bfqg() with blkg_to_bfqg() returning NULL.
> It's hard to trace where the NULL comes from in there -- the code is a
> bit complex.
> 
> I've found that the bfq bug exists in current md-next (42b805af102) but
> did not trigger in the base tag of v5.18-rc3. Bisecting revealed the bug
> was introduced by:
> 
>   4e54a2493e58 ("bfq: Get rid of __bio_blkcg() usage")
> 
> Reverting that commit and the next commit (075a53b7) on top of md-next
> was confirmed to fix the bug.
> 
> I've copied Jan, Jens and Paolo who can hopefully help with this. A
> cleaned up stack trace follows this email for their benefit.

Indeed. Thanks for the easy reproduction steps. I've reproduced the issue
locally and I'm looking into it, because from a quick look at the code
this shouldn't be possible, which means I'm missing something obvious :).

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-25 18:22                       ` Logan Gunthorpe
  2022-05-26  9:46                         ` Jan Kara
@ 2022-05-26 11:53                         ` Jan Kara
  2022-05-31  6:11                           ` Christoph Hellwig
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kara @ 2022-05-26 11:53 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Guoqing Jiang, Donald Buczek, Song Liu, Jan Kara, Jens Axboe,
	Paolo Valente, linux-raid, Tejun Heo, Christoph Hellwig,
	linux-block

[Added couple of CCs since this seems to be an issue in the generic block
layer]

On Wed 25-05-22 12:22:06, Logan Gunthorpe wrote:
> On 2022-05-25 03:04, Guoqing Jiang wrote:
> > I would prefer to focus on block tree or md tree. With latest block tree
> > (commit 44d8538d7e7dbee7246acda3b706c8134d15b9cb), I get below
> > similar issue as Donald reported, it happened with the cmd (which did
> > work with 5.12 kernel).
> > 
> > vm79:~/mdadm> sudo ./test --dev=loop --tests=05r1-add-internalbitmap
> 
> Ok, so this test passes for me, but my VM was not running with bfq. It
> also seems we have layers upon layers of different bugs to untangle.
> Perhaps you can try the tests with bfq disabled to make progress on the
> other regression I reported.
> 
> If I enable bfq and set the loop devices to the bfq scheduler, then I
> hit the same bug as you and Donald. It's clearly a NULL pointer
> de-reference in the bfq code, which seems to be triggered on the
> partition read after mdadm opens a block device (not sure if it's the md
> device or the loop device but I suspect the latter seeing it's not going
> through any md code).
> 
> Simplifying things down a bit, the null pointer dereference can be
> triggered by creating an md device with loop devices that have bfq
> scheduler set:
> 
>   mdadm --create --run /dev/md0 --level=1 -n2 /dev/loop0 /dev/loop1
> 
> The crash occurs in bfq_bio_bfqg() with blkg_to_bfqg() returning NULL.
> It's hard to trace where the NULL comes from in there -- the code is a
> bit complex.
> 
> I've found that the bfq bug exists in current md-next (42b805af102) but
> did not trigger in the base tag of v5.18-rc3. Bisecting revealed the bug
> was introduced by:
> 
>   4e54a2493e58 ("bfq: Get rid of __bio_blkcg() usage")
> 
> Reverting that commit and the next commit (075a53b7) on top of md-next
> was confirmed to fix the bug.
> 
> I've copied Jan, Jens and Paolo who can hopefully help with this. A
> cleaned up stack trace follows this email for their benefit.

So I've debugged this. The crash happens on the very first bio submitted to
the md0 device. The problem is that this bio gets remapped to loop0 - this
happens through bio_alloc_clone() -> __bio_clone(), which ends up calling
bio_clone_blkg_association(). Now the resulting bio is inconsistent - its
dst_bio->bi_bdev points to loop0 while dst_bio->bi_blkg points to the
blkcg_gq associated with the md0 request queue. And this breaks BFQ because
when this bio is inserted into the loop0 request queue, BFQ looks at
bio->bi_blkg->q (it is a bit more complex than that, but this is the gist
of the problem) and expects its data there, but BFQ is not initialized for
the md0 request_queue.

Now I think this is a bug in __bio_clone(), but the inconsistency in the bio
is very much what we asked bio_clone_blkg_association() to do, so maybe I'm
missing something and bios that are associated with one bdev but pointing
to the blkg of another bdev are fine, and controllers are supposed to handle
that (although I'm not sure how they should do that). So I'm asking here
before I just go and delete bio_clone_blkg_association() from
__bio_clone()...
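
To make the inconsistency concrete, here is a minimal sketch - my own
illustration, not the actual md remapping code - assuming only the v5.18-era
bio_alloc_clone()/__bio_clone() behaviour described above; "lower" stands in
for the loop0 block_device:

/* Illustration only: how the cloned bio ends up inconsistent. */
static struct bio *remap_sketch(struct bio *src, struct block_device *lower)
{
	/* src was submitted to md0, so src->bi_blkg is md0's blkcg_gq */
	struct bio *dst = bio_alloc_clone(lower, src, GFP_NOIO, &fs_bio_set);

	/* (NULL check omitted for brevity)
	 *
	 * __bio_clone() called bio_clone_blkg_association(dst, src), so now:
	 *   dst->bi_bdev -> lower (e.g. loop0)
	 *   dst->bi_blkg -> md0's blkcg_gq
	 * When BFQ on lower's queue follows dst->bi_blkg, it lands on a
	 * request_queue it never initialized itself for.
	 */
	return dst;
}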

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-23  9:51                 ` Guoqing Jiang
  2022-05-24 16:13                   ` Logan Gunthorpe
@ 2022-05-30  9:55                   ` Guoqing Jiang
  2022-05-30 16:35                     ` Logan Gunthorpe
  1 sibling, 1 reply; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-30  9:55 UTC (permalink / raw)
  To: Donald Buczek, Logan Gunthorpe, Song Liu; +Cc: linux-raid



On 5/23/22 5:51 PM, Guoqing Jiang wrote:
>
>
>>>>> I noticed a clear regression with mdadm tests with this patch in 
>>>>> md-next
>>>>> (7e6ba434cc6080).
>>>>>
>>>>> Before the patch, tests 07reshape5intr and 07revert-grow would fail
>>>>> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 
>>>>> runs
>>>>> for the latter).
>>>>>
>>>>> After this patch, both tests always fail.
>>>>>
>>>>> I don't have time to dig into why this is, but it would be nice if
>>>>> someone can at least fix the regression. It is hard to make any 
>>>>> progress
>>>>> on these tests if we are continuing to further break them.
>
> I have tried with both the Ubuntu 22.04 kernel, which is 5.15, and vanilla
> 5.12; neither of them
> can pass the tests you mentioned.
>
> [root@localhost mdadm]# lsblk|grep vd
> vda          252:0    0    1G  0 disk
> vdb          252:16   0    1G  0 disk
> vdc          252:32   0    1G  0 disk
> vdd          252:48   0    1G  0 disk
> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} 
> --tests=05r1-add-internalbitmap
> Testing on linux-5.12.0-default kernel
> /root/mdadm/tests/05r1-add-internalbitmap... succeeded
> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} 
> --tests=07reshape5intr
> Testing on linux-5.12.0-default kernel
> /root/mdadm/tests/07reshape5intr... FAILED - see 
> /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for 
> details
> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} 
> --tests=07revert-grow
> Testing on linux-5.12.0-default kernel
> /root/mdadm/tests/07revert-grow... FAILED - see 
> /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details
> [root@localhost mdadm]# head -10  /var/tmp/07revert-grow.log | grep mdadm
> + . /root/mdadm/tests/07revert-grow
> ++ mdadm -CR --assume-clean /dev/md0 -l5 -n4 -x1 /dev/vda /dev/vdb
> /dev/vdc /dev/vdd /dev/vda /dev/vdb /dev/vdc /dev/vdd --metadata=0.9
> The above line is clearly wrong from my understanding.
>
> And let's check ubuntu 22.04.
>
> root@vm:/home/gjiang/mdadm# lsblk|grep vd
> vda    252:0    0     1G  0 disk
> vdb    252:16   0     1G  0 disk
> vdc    252:32   0     1G  0 disk
> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..d} 
> --tests=05r1-failfast
> Testing on linux-5.15.0-30-generic kernel
> /home/gjiang/mdadm/tests/05r1-failfast... succeeded
> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c}   
> --tests=07reshape5intr
> Testing on linux-5.15.0-30-generic kernel
> /home/gjiang/mdadm/tests/07reshape5intr... FAILED - see 
> /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for 
> details
> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} 
> --tests=07revert-grow
> Testing on linux-5.15.0-30-generic kernel
> /home/gjiang/mdadm/tests/07revert-grow... FAILED - see 
> /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details
>
> So I would not consider it a regression.

I tried with 5.18.0-rc3: no problem for 07reshape5intr (I will investigate
why it failed with this patch), but 07revert-grow still failed without
applying this one.

From fail07revert-grow.log, it shows the issues below.

[ 7856.233515] mdadm[25246]: segfault at 0 ip 000000000040fe56 sp 
00007ffdcf252800 error 4 in mdadm[400000+81000]
[ 7856.233544] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 
f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 
de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48

[ 7866.871747] mdadm[25463]: segfault at 0 ip 000000000040fe56 sp 
00007ffe91e39800 error 4 in mdadm[400000+81000]
[ 7866.871760] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 
f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 
de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48

[ 7876.779855] ======================================================
[ 7876.779858] WARNING: possible circular locking dependency detected
[ 7876.779861] 5.18.0-rc3-57-default #28 Tainted: G            E
[ 7876.779864] ------------------------------------------------------
[ 7876.779867] mdadm/25444 is trying to acquire lock:
[ 7876.779870] ffff991817749938 ((wq_completion)md_misc){+.+.}-{0:0}, 
at: flush_workqueue+0x87/0x470
[ 7876.779879]
                but task is already holding lock:
[ 7876.779882] ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3}, 
at: action_store+0x11a/0x2c0 [md_mod]
[ 7876.779892]
                which lock already depends on the new lock.

[ 7876.779896]
                the existing dependency chain (in reverse order) is:
[ 7876.779899]
                -> #3 (&mddev->reconfig_mutex){+.+.}-{3:3}:
[ 7876.779904]        __mutex_lock+0x8f/0x920
[ 7876.779909]        layout_store+0x47/0x120 [md_mod]
[ 7876.779914]        md_attr_store+0x7a/0xc0 [md_mod]
[ 7876.779919]        kernfs_fop_write_iter+0x135/0x1b0
[ 7876.779924]        new_sync_write+0x10c/0x190
[ 7876.779927]        vfs_write+0x30e/0x370
[ 7876.779930]        ksys_write+0xa4/0xe0
[ 7876.779933]        do_syscall_64+0x3a/0x80
[ 7876.779936]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7876.779940]
                -> #2 (kn->active#359){++++}-{0:0}:
[ 7876.779945]        __kernfs_remove+0x28c/0x2e0
[ 7876.779948]        kernfs_remove_by_name_ns+0x52/0x90
[ 7876.779952]        remove_files.isra.1+0x30/0x70
[ 7876.779955]        sysfs_remove_group+0x3d/0x80
[ 7876.779958]        sysfs_remove_groups+0x29/0x40
[ 7876.779962]        __kobject_del+0x1b/0x80
[ 7876.779965]        kobject_del+0xf/0x20
[ 7876.779968]        mddev_delayed_delete+0x15/0x20 [md_mod]
[ 7876.779973]        process_one_work+0x2d8/0x650
[ 7876.779976]        worker_thread+0x2a/0x3b0
[ 7876.779979]        kthread+0xe8/0x110
[ 7876.779981]        ret_from_fork+0x22/0x30
[ 7876.779985]
                -> #1 ((work_completion)(&mddev->del_work)#2){+.+.}-{0:0}:
[ 7876.779990]        process_one_work+0x2af/0x650
[ 7876.779993]        worker_thread+0x2a/0x3b0
[ 7876.779996]        kthread+0xe8/0x110
[ 7876.779998]        ret_from_fork+0x22/0x30
[ 7876.780001]
                -> #0 ((wq_completion)md_misc){+.+.}-{0:0}:
[ 7876.780005]        __lock_acquire+0x12a0/0x1770
[ 7876.780009]        lock_acquire+0x277/0x310
[ 7876.780011]        flush_workqueue+0xae/0x470
[ 7876.780014]        action_store+0x188/0x2c0 [md_mod]
[ 7876.780019]        md_attr_store+0x7a/0xc0 [md_mod]
[ 7876.780024]        kernfs_fop_write_iter+0x135/0x1b0
[ 7876.780027]        new_sync_write+0x10c/0x190
[ 7876.780030]        vfs_write+0x30e/0x370
[ 7876.780033]        ksys_write+0xa4/0xe0
[ 7876.780036]        do_syscall_64+0x3a/0x80
[ 7876.780038]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7876.780042]
                other info that might help us debug this:

[ 7876.780045] Chain exists of:
                  (wq_completion)md_misc --> kn->active#359 --> 
&mddev->reconfig_mutex

[ 7876.780052]  Possible unsafe locking scenario:

[ 7876.780054]        CPU0                    CPU1
[ 7876.780056]        ----                    ----
[ 7876.780059]   lock(&mddev->reconfig_mutex);
[ 7876.780061] lock(kn->active#359);
[ 7876.780064] lock(&mddev->reconfig_mutex);
[ 7876.780068]   lock((wq_completion)md_misc);
[ 7876.780070]
                 *** DEADLOCK ***

[ 7876.780073] 5 locks held by mdadm/25444:
[ 7876.780075]  #0: ffff9917c769a458 (sb_writers#3){.+.+}-{0:0}, at: 
ksys_write+0xa4/0xe0
[ 7876.780082]  #1: ffff9917de1a6518 (&of->prealloc_mutex){+.+.}-{3:3}, 
at: kernfs_fop_write_iter+0x5c/0x1b0
[ 7876.780088]  #2: ffff9917de1a6488 (&of->mutex){+.+.}-{3:3}, at: 
kernfs_fop_write_iter+0x103/0x1b0
[ 7876.780094]  #3: ffff99181b0d2158 (kn->active#287){++++}-{0:0}, at: 
kernfs_fop_write_iter+0x10c/0x1b0
[ 7876.780100]  #4: ffff9917c5c1c2c0 
(&mddev->reconfig_mutex){+.+.}-{3:3}, at: action_store+0x11a/0x2c0 [md_mod]
[ 7876.780108]
                stack backtrace:
[ 7876.780111] CPU: 1 PID: 25444 Comm: mdadm Tainted: G E     
5.18.0-rc3-57-default #28
[ 7876.780115] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[ 7876.780120] Call Trace:
[ 7876.780122]  <TASK>
[ 7876.780124]  dump_stack_lvl+0x55/0x6d
[ 7876.780128]  check_noncircular+0x105/0x120
[ 7876.780133]  ? __lock_acquire+0x12a0/0x1770
[ 7876.780136]  ? lockdep_lock+0x21/0x90
[ 7876.780139]  __lock_acquire+0x12a0/0x1770
[ 7876.780143]  lock_acquire+0x277/0x310
[ 7876.780146]  ? flush_workqueue+0x87/0x470
[ 7876.780149]  ? __raw_spin_lock_init+0x3b/0x60
[ 7876.780152]  ? lockdep_init_map_type+0x58/0x250
[ 7876.780156]  flush_workqueue+0xae/0x470
[ 7876.780158]  ? flush_workqueue+0x87/0x470
[ 7876.780161]  ? _raw_spin_unlock+0x29/0x40
[ 7876.780166]  ? action_store+0x188/0x2c0 [md_mod]
[ 7876.780171]  action_store+0x188/0x2c0 [md_mod]
[ 7876.780176]  md_attr_store+0x7a/0xc0 [md_mod]
[ 7876.780182]  kernfs_fop_write_iter+0x135/0x1b0
[ 7876.780186]  new_sync_write+0x10c/0x190
[ 7876.780190]  vfs_write+0x30e/0x370
[ 7876.780193]  ksys_write+0xa4/0xe0
[ 7876.780197]  do_syscall_64+0x3a/0x80
[ 7876.780200]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7876.780203] RIP: 0033:0x7f7cd43f9c03
[ 7876.780206] Code: 2d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 
1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 
05 <48> 3d 00 f0 ff ff 77 55 f3 c3 0f 1f 00 41 54 55 49 89 d4 53 48 89
[ 7876.780213] RSP: 002b:00007fff35b600b8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000001
[ 7876.780218] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 
00007f7cd43f9c03
[ 7876.780221] RDX: 0000000000000004 RSI: 00000000004669a7 RDI: 
0000000000000003
[ 7876.780224] RBP: 00000000004669a7 R08: 000000000000000b R09: 
00000000ffffffff
[ 7876.780227] R10: 0000000000000000 R11: 0000000000000246 R12: 
000000000183d730
[ 7876.780231] R13: 00000000ffffffff R14: 0000000000001800 R15: 
0000000000000000
[ 7876.780236]  </TASK>
[ 7876.781007] md: reshape of RAID array md0
[ 7876.785949] md: md0: reshape interrupted.
[ 7877.193686] md0: detected capacity change from 107520 to 0
[ 7877.193694] md: md0 stopped.
[ 7877.314438] debugfs: Directory 'md0' with parent 'block' already present!
[ 7877.649549] md: array md0 already has disks!
[ 7877.665188] md/raid:md0: device loop0 operational as raid disk 0
[ 7877.665193] md/raid:md0: device loop3 operational as raid disk 3
[ 7877.665197] md/raid:md0: device loop2 operational as raid disk 2
[ 7877.665199] md/raid:md0: device loop1 operational as raid disk 1
[ 7877.665202] md/raid:md0: device loop4 operational as raid disk 4
[ 7877.665205] md/raid:md0: force stripe size 512 for reshape
[ 7877.667172] md/raid:md0: raid level 5 active with 4 out of 4 devices, 
algorithm 2
[ 7877.667360] md0: detected capacity change from 0 to 107520
[ 7878.107865] mdadm[25693]: segfault at 0 ip 000000000040fe56 sp 
00007fff7a845d50 error 4 in mdadm[400000+81000]
[ 7878.107878] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 
f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 
de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
[ 7878.428298] md: reshape of RAID array md0
[ 7880.552543] md: md0: reshape done.
[ 7882.053117] md0: detected capacity change from 107520 to 0
[ 7882.053124] md: md0 stopped.
[ ... ]
[ 7904.978417] md0: detected capacity change from 0 to 107520
[ 7905.346749] mdadm[26328]: segfault at 0 ip 000000000040fe56 sp 
00007ffc28540c20 error 4 in mdadm[400000+81000]
[ 7905.346764] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 
f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 
de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
[ 7905.476488] md: reshape of RAID array md0
[ 7907.530152] md: md0: reshape done.
[ 7909.149524] md0: detected capacity change from 107520 to 0
[ 7909.149533] md: md0 stopped.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-30  9:55                   ` Guoqing Jiang
@ 2022-05-30 16:35                     ` Logan Gunthorpe
  2022-05-31  8:13                       ` Guoqing Jiang
  0 siblings, 1 reply; 45+ messages in thread
From: Logan Gunthorpe @ 2022-05-30 16:35 UTC (permalink / raw)
  To: Guoqing Jiang, Donald Buczek, Song Liu; +Cc: linux-raid



On 2022-05-30 03:55, Guoqing Jiang wrote:
> I tried with 5.18.0-rc3, no problem for 07reshape5intr (will investigate 
> why it failed with this patch), but 07revert-grow still failed without
> apply this one.
> 
>  From fail07revert-grow.log, it shows below issues.
> 
> [ 7856.233515] mdadm[25246]: segfault at 0 ip 000000000040fe56 sp 
> 00007ffdcf252800 error 4 in mdadm[400000+81000]
> [ 7856.233544] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 
> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 
> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
> 
> [ 7866.871747] mdadm[25463]: segfault at 0 ip 000000000040fe56 sp 
> 00007ffe91e39800 error 4 in mdadm[400000+81000]
> [ 7866.871760] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 
> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 
> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
> 
> [ 7876.779855] ======================================================
> [ 7876.779858] WARNING: possible circular locking dependency detected
> [ 7876.779861] 5.18.0-rc3-57-default #28 Tainted: G            E
> [ 7876.779864] ------------------------------------------------------
> [ 7876.779867] mdadm/25444 is trying to acquire lock:
> [ 7876.779870] ffff991817749938 ((wq_completion)md_misc){+.+.}-{0:0}, 
> at: flush_workqueue+0x87/0x470
> [ 7876.779879]
>                 but task is already holding lock:
> [ 7876.779882] ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3}, 
> at: action_store+0x11a/0x2c0 [md_mod]
> [ 7876.779892]
>                 which lock already depends on the new lock.
> 

Hmm, strange. I'm definitely running with lockdep, and even when I try the
test on my machine on v5.18-rc3, I don't get this error. Not sure why.

In any case it looks like we recently added a
flush_workqueue(md_misc_wq) call in action_store() which runs with the
mddev_lock() held. According to your lockdep warning, that can deadlock.

That call was added in this commit:

Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")

Can we maybe run flush_workqueue() before we take mddev_lock()?
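
A minimal sketch of that idea, written against the v5.18-era "idle"/"frozen"
branch of action_store() - illustration only, not a tested patch, and whether
the flush can safely happen before taking the lock is exactly the open
question (del_work could be re-queued between the flush and the lock):

/* Sketch of the suggested reordering, for discussion only. */
static void idle_frozen_sketch(struct mddev *mddev)
{
	/*
	 * Flush pending del_work *before* taking reconfig_mutex, so we
	 * never wait on md_misc_wq while holding the lock from the
	 * lockdep chain above.
	 */
	if (work_pending(&mddev->del_work))
		flush_workqueue(md_misc_wq);

	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
	    mddev_lock(mddev) == 0) {
		if (mddev->sync_thread) {
			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
			md_reap_sync_thread(mddev);
		}
		mddev_unlock(mddev);
	}
}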

Logan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-26 11:53                         ` Jan Kara
@ 2022-05-31  6:11                           ` Christoph Hellwig
  2022-05-31  7:43                             ` Jan Kara
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-31  6:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Logan Gunthorpe, Guoqing Jiang, Donald Buczek, Song Liu,
	Jens Axboe, Paolo Valente, linux-raid, Tejun Heo,
	Christoph Hellwig, linux-block

On Thu, May 26, 2022 at 01:53:36PM +0200, Jan Kara wrote:
> So I've debugged this. The crash happens on the very first bio submitted to
> the md0 device. The problem is that this bio gets remapped to loop0 - this
> happens through bio_alloc_clone() -> __bio_clone() which ends up calling
> bio_clone_blkg_association(). Now the resulting bio is inconsistent - it's
> dst_bio->bi_bdev is pointing to loop0 while dst_bio->bi_blkg is pointing to
> blkcg_gq associated with md0 request queue. And this breaks BFQ because
> when this bio is inserted to loop0 request queue, BFQ looks at
> bio->bi_blkg->q (it is a bit more complex than that but this is the gist
> of the problem), expects its data there but BFQ is not initialized for md0
> request_queue.
> 
> Now I think this is a bug in __bio_clone() but the inconsistency in the bio
> is very much what we asked bio_clone_blkg_association() to do so maybe I'm
> missing something and bios that are associated with one bdev but pointing
> to blkg of another bdev are fine and controllers are supposed to handle
> that (although I'm not sure how should they do that). So I'm asking here
> before I just go and delete bio_clone_blkg_association() from
> __bio_clone()...

This behavior probably goes back to my commit here:

commit d92c370a16cbe0276954c761b874bd024a7e4fac
Author: Christoph Hellwig <hch@lst.de>
Date:   Sat Jun 27 09:31:48 2020 +0200

    block: really clone the block cgroup in bio_clone_blkg_association

and it seems everyone else was fine with that behavior so far.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-31  6:11                           ` Christoph Hellwig
@ 2022-05-31  7:43                             ` Jan Kara
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kara @ 2022-05-31  7:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Logan Gunthorpe, Guoqing Jiang, Donald Buczek,
	Song Liu, Jens Axboe, Paolo Valente, linux-raid, Tejun Heo,
	linux-block

On Mon 30-05-22 23:11:47, Christoph Hellwig wrote:
> On Thu, May 26, 2022 at 01:53:36PM +0200, Jan Kara wrote:
> > So I've debugged this. The crash happens on the very first bio submitted to
> > the md0 device. The problem is that this bio gets remapped to loop0 - this
> > happens through bio_alloc_clone() -> __bio_clone() which ends up calling
> > bio_clone_blkg_association(). Now the resulting bio is inconsistent - it's
> > dst_bio->bi_bdev is pointing to loop0 while dst_bio->bi_blkg is pointing to
> > blkcg_gq associated with md0 request queue. And this breaks BFQ because
> > when this bio is inserted to loop0 request queue, BFQ looks at
> > bio->bi_blkg->q (it is a bit more complex than that but this is the gist
> > of the problem), expects its data there but BFQ is not initialized for md0
> > request_queue.
> > 
> > Now I think this is a bug in __bio_clone() but the inconsistency in the bio
> > is very much what we asked bio_clone_blkg_association() to do so maybe I'm
> > missing something and bios that are associated with one bdev but pointing
> > to blkg of another bdev are fine and controllers are supposed to handle
> > that (although I'm not sure how should they do that). So I'm asking here
> > before I just go and delete bio_clone_blkg_association() from
> > __bio_clone()...
> 
> This behavior probably goes back to my commit here:
> 
> ommit d92c370a16cbe0276954c761b874bd024a7e4fac
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Sat Jun 27 09:31:48 2020 +0200
> 
>     block: really clone the block cgroup in bio_clone_blkg_association
> 
> and it seems everyone else was fine with that behavior so far.

Yes, thanks for the pointer. I agree nobody else was crashing due to this,
but it will be causing interesting inconsistencies in accounting and
throttling in all the IO controllers - e.g. usually both the original
and the cloned bio will get accounted to the md0 device, and the loop0
device settings will be completely ignored. From my experience these things
do not really get tested much until some customer tries to use them :). So I
think we have to effectively revert your above change. I'll send a patch.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-30 16:35                     ` Logan Gunthorpe
@ 2022-05-31  8:13                       ` Guoqing Jiang
  0 siblings, 0 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-31  8:13 UTC (permalink / raw)
  To: Logan Gunthorpe, Donald Buczek, Song Liu; +Cc: linux-raid



On 5/31/22 12:35 AM, Logan Gunthorpe wrote:
> On 2022-05-30 03:55, Guoqing Jiang wrote:
>> I tried with 5.18.0-rc3, no problem for 07reshape5intr (will investigate
>> why it failed with this patch), but 07revert-grow still failed without
>> apply this one.
>>
>>   From fail07revert-grow.log, it shows below issues.
>>
>> [ 7856.233515] mdadm[25246]: segfault at 0 ip 000000000040fe56 sp
>> 00007ffdcf252800 error 4 in mdadm[400000+81000]
>> [ 7856.233544] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31
>> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89
>> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>>
>> [ 7866.871747] mdadm[25463]: segfault at 0 ip 000000000040fe56 sp
>> 00007ffe91e39800 error 4 in mdadm[400000+81000]
>> [ 7866.871760] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31
>> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89
>> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>>
>> [ 7876.779855] ======================================================
>> [ 7876.779858] WARNING: possible circular locking dependency detected
>> [ 7876.779861] 5.18.0-rc3-57-default #28 Tainted: G            E
>> [ 7876.779864] ------------------------------------------------------
>> [ 7876.779867] mdadm/25444 is trying to acquire lock:
>> [ 7876.779870] ffff991817749938 ((wq_completion)md_misc){+.+.}-{0:0},
>> at: flush_workqueue+0x87/0x470
>> [ 7876.779879]
>>                  but task is already holding lock:
>> [ 7876.779882] ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3},
>> at: action_store+0x11a/0x2c0 [md_mod]
>> [ 7876.779892]
>>                  which lock already depends on the new lock.
>>
> Hmm, strange. I'm definitely running with lockdep and even if I try the
> test on my machine, on v5.18-rc3, I don't get this error. Not sure why.
>
> In any case it looks like we recently added a
> flush_workqueue(md_misc_wq) call in action_store() which runs with the
> mddev_lock() held. According to your lockdep warning, that can deadlock.

It was originally added by f851b60db, if I am not mistaken.

> That call was added in this commit:
>
> Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")

The above Fixes commit didn't add it. And cc1ffe61c026 was added to avoid
other lockdep warnings; IIRC it just added a work_pending() check before
the flush.

> Can we maybe run flush_workqueue() before we take mddev_lock()?

Currently I am not sure; I need to investigate and test. Anyway, it is on
my todo list unless someone beats me to it 😉.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-20 18:27         ` Logan Gunthorpe
  2022-05-21 18:23           ` Donald Buczek
@ 2022-06-02  8:12           ` Xiao Ni
  1 sibling, 0 replies; 45+ messages in thread
From: Xiao Ni @ 2022-06-02  8:12 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: Song Liu, Guoqing Jiang, Donald Buczek, linux-raid

Hi all

I've tried the test and got the same result. 8b48ec23cc51 ("md: don't
unregister sync_thread with reconfig_mutex held") introduces this problem.

My environment:
kernel: latest md-next branch
mdadm: Logan's bugfixes2 branch
(https://github.com/lsgunth/mdadm/), because Logan fixed some
errors in the test cases. I'm also trying to fix other failures based
on that tree.

Case1:
mdadm -CR /dev/md0 -amd -l5 -c 1024 -n2 --assume-clean /dev/loop1 /dev/loop2
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/
mdadm /dev/md0 -a /dev/loop0
mdadm /dev/md0 --grow -n3
cat /proc/mdstat
umount  /mnt/
mdadm --stop /dev/md0
mdadm -A /dev/md0 /dev/loop[0-2]
cat /proc/mdstat (reshape doesn't happen)
mount /dev/md0 /mnt/  (fails here)
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/md0,
missing codepage or helper program, or other error.

Case2:
I commented out some code, built the md module, and insmod'ed the new module:
        //if (reconfig_mutex_held)
        //      mddev_unlock(mddev);
        /* resync has finished, collect result */
        md_unregister_thread(&mddev->sync_thread);
        //if (reconfig_mutex_held)
        //      mddev_lock_nointr(mddev);

Test Case 1 then passes.

I need to look more to find the root cause. It looks like some
information in the superblock doesn't get synced to the member devices,
so after assembly it can't do the reshape again. I need more time to look at this.

Best Regards
Xiao

On Sat, May 21, 2022 at 2:27 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> Hi,
>
> On 2022-05-10 00:44, Song Liu wrote:
> > On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> >> On 5/9/22 2:37 PM, Song Liu wrote:
> >>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev>  wrote:
> >>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com>
> >>>>
> >>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it
> >>>> doesn't reconfigure array.
> >>>>
> >>>> And it could cause deadlock problem for raid5 as follows:
> >>>>
> >>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo
> >>>>     idle to sync_action.
> >>>> 2. raid5 sync thread was blocked if there were too many active stripes.
> >>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
> >>>>     which causes the number of active stripes can't be decreased.
> >>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
> >>>>     to hold reconfig_mutex.
> >>>>
> >>>> More details in the link:
> >>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
> >>>>
> >>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
> >>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the
> >>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
> >>> mddev_is_locked() feels really hacky to me. It cannot tell whether
> >>> mddev is locked
> >>> by current thread. So technically, we can unlock reconfigure_mutex for
> >>> other thread
> >>> by accident, no?
> >>
> >> I can switch back to V2 if you think that is the correct way to do though no
> >> one comment about the change in dm-raid.
> >
> > I guess v2 is the best at the moment. I pushed a slightly modified v2 to
> > md-next.
> >
>
> I noticed a clear regression with mdadm tests with this patch in md-next
> (7e6ba434cc6080).
>
> Before the patch, tests 07reshape5intr and 07revert-grow would fail
> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
> for the latter).
>
> After this patch, both tests always fail.
>
> I don't have time to dig into why this is, but it would be nice if
> someone can at least fix the regression. It is hard to make any progress
> on these tests if we are continuing to further break them.
>
> Logan
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2022-06-02  8:13 UTC | newest]

Thread overview: 45+ messages
2022-05-05  8:16 [PATCH 0/2] two fixes for md Guoqing Jiang
2022-05-05  8:16 ` [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
2022-05-05 14:02   ` kernel test robot
2022-05-05 18:04   ` kernel test robot
2022-05-06  2:34     ` Guoqing Jiang
2022-05-06  2:34       ` Guoqing Jiang
2022-05-05  8:16 ` [PATCH 2/2] md: protect md_unregister_thread from reentrancy Guoqing Jiang
2022-05-09  6:39   ` Song Liu
2022-05-09  8:12     ` Guoqing Jiang
2022-05-06 11:36 ` [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
2022-05-09  6:37   ` Song Liu
2022-05-09  8:09     ` Guoqing Jiang
2022-05-09  9:32       ` Wols Lists
2022-05-09 10:37         ` Guoqing Jiang
2022-05-09 11:19           ` Wols Lists
2022-05-09 11:26             ` Guoqing Jiang
2022-05-10  6:44       ` Song Liu
2022-05-10 12:01         ` Donald Buczek
2022-05-10 12:09           ` Guoqing Jiang
2022-05-10 12:35             ` Donald Buczek
2022-05-10 18:02               ` Song Liu
2022-05-11  8:10                 ` Guoqing Jiang
2022-05-11 21:45                   ` Song Liu
2022-05-20 18:27         ` Logan Gunthorpe
2022-05-21 18:23           ` Donald Buczek
2022-05-23  1:08             ` Guoqing Jiang
2022-05-23  5:41               ` Donald Buczek
2022-05-23  9:51                 ` Guoqing Jiang
2022-05-24 16:13                   ` Logan Gunthorpe
2022-05-25  9:04                     ` Guoqing Jiang
2022-05-25 18:22                       ` Logan Gunthorpe
2022-05-26  9:46                         ` Jan Kara
2022-05-26 11:53                         ` Jan Kara
2022-05-31  6:11                           ` Christoph Hellwig
2022-05-31  7:43                             ` Jan Kara
2022-05-30  9:55                   ` Guoqing Jiang
2022-05-30 16:35                     ` Logan Gunthorpe
2022-05-31  8:13                       ` Guoqing Jiang
2022-05-24 15:58                 ` Logan Gunthorpe
2022-05-24 18:16                   ` Song Liu
2022-05-25  9:17                 ` Guoqing Jiang
2022-05-24 15:51             ` Logan Gunthorpe
2022-06-02  8:12           ` Xiao Ni
2022-05-09  8:18   ` Donald Buczek
2022-05-09  8:48     ` Guoqing Jiang
